Open-access computationally efficient models for Slovenian

SloLLaMai

Extremely large language models such as ChatGPT and GPT-4 have recently shown remarkable progress on certain tasks, but their use faces numerous practical challenges: closed access and a lack of transparency, high computational requirements, and a cost of customization and broader usage that is out of reach for most research organizations and companies. Further scaling up these models is no longer practical. Smaller, open-access models such as LLaMA, Alpaca, GPT4All, and Koala have also emerged; they can be trained or adapted for specific tasks on ordinary GPU-equipped computers and achieve performance close to that of the largest models.

In the SloLLaMai project, we will develop an open-access, computationally efficient generative language model for Slovenian. This model will be the first of its kind for a morphologically rich language with limited resources, which presents a significant research challenge. The new large general model will serve as fundamental infrastructure for industrial projects and for all new products requiring natural language processing. Previously developed models (e.g., SloT5, SloBERTa, CroSloEn BERT) have already enabled technologies that, just a few years ago, could not be developed for Slovenian with accuracy comparable to that achieved for larger languages (e.g., machine translation, summarization, question answering). Such technologies could not have been developed had models for Slovenian not existed.

The results of the first project (RRP1) are necessary for the development of the next generation of large language models. This will make it possible to prepare general language models that can be specialized for specific natural language processing tasks.

Specific objectives:

  1. Development and construction of modern, computationally efficient large generative language models of the GPT type (variants such as LLaMA, Alpaca, Koala) and their adaptation for instruction following, dialogue, and the Slovenian language. The resulting large models form the foundational infrastructure for the rest of the project and for applied projects.
  2. Improvement of the constructed large generative language models by incorporating additional knowledge to enhance their logical reasoning, common-sense reasoning, handling of the linguistic and morphological peculiarities of the Slovenian language, and adherence to ethical and legal norms.
  3. Adaptation of large language models for devices with limited computational capacity and for industrial applications; research into compression and distillation of large language models, quantization, and approximate computing, taking into account the morphological richness of the target language.

Results:

  • D2.1: Open-access large generative language model with one billion parameters, adapted for dialogue and instruction following (August 2024).
  • D2.2: Open-access large generative language model with 10 billion parameters, adapted for dialogue and instruction following (February 2025).
  • D2.3: Open-access computationally lightweight generative language model, adapted for dialogue and instruction following (August 2025).
  • D2.4: Open-access large language model with embedded additional knowledge (February 2026).

Project partners:

Project leader:

Partners: