Projects

In the first project, SloSBZ: General Knowledge Base for Slovenian (RRP1), we will first build the relevant language resources (a corpus of dialogues and command-tracking data). In the second project, SloLLaMai: Open-access computationally efficient models for Slovenian (RRP2), we will then develop an open-access, computationally efficient generative language model for Slovenian. This will be the first model of its kind for a morphologically rich language with limited resources, which presents a significant research challenge. The command-tracking corpus will serve as the foundation for further adapting the Slovenian generative language model to the specific needs of applications and will be available for broader academic and industrial use.

In the applied projects (RRP3 to RRP6), we will adapt the base language model with additional command and dialogue training sets for specific applications. This will enable significant improvements and the practical integration of speech technologies into advanced industrial applications.

The scientific research partners leading projects RRP1 and RRP2 provide the necessary tools and frameworks for building large language models. At least one general language model will be developed that is suitable for adaptation to various natural language processing applications (e.g., smart assistants, question answering, translation, common-sense reasoning). The performance of this model will be comparable to that of the GPT-4 model, which serves as the basis for large customized models (e.g., ChatGPT) used in various applications. Current publicly available large models have been trained on only a small fraction of Slovenian text, which results in weaker performance for Slovenian than for larger languages. Creating such a model will give the Slovenian research and business sectors a competitive advantage, putting them on a par with major economies in the markets of large language groups.

The industrial partners leading projects RRP3, RRP4, RRP5, and RRP6 will prepare specialized corpora and use them to customize the large language model from RRP2. This customization will enable the creation of innovative new products or upgrades to existing ones. Each partner will develop at least one innovation related to natural language processing and offer it to its existing customers, with the aim of further improving its position in the market.

RRP1: SloSBZ

General Knowledge Base for Slovenian

RRP2: SloLLaMai

Open-access computationally efficient models for Slovenian

RRP3: VeMo-Digi

Large language models for advanced domain-specific digitization

RRP4: VeMo-Med

Adaptation of large language models for the use of Slovenian in medicine

RRP5: VeMo-Ind

Adaptation of large language models for use in industrial environments

RRP6: VeMo-IaC

Adaptation of large language models for generating infrastructure descriptions in code