ABOUT THE PROGRAMME

The key objective of the long-term Research, Development and Innovation (RDI) programme Adaptive Natural Language Processing with Large Language Models (PoVeJMo) is the development of large language models, which affect almost the entire field of artificial intelligence and machine learning and have a significant influence on many other areas and on society as a whole. The new open-access, computationally efficient language models will serve as the foundation for advanced applications in medicine, the humanities, industrial environments, and software development. Large generative language models, adapted for instruction following and dialogue, will also provide the fundamental infrastructure for artificial intelligence applications in the Slovenian language.

Summary

Extremely large language models, such as ChatGPT and GPT-4, have shown remarkable performance on many tasks and have triggered an avalanche of new AI applications. Unfortunately, because of their closed and non-transparent nature, high computational demands, and high cost of customisation, they are out of reach for most research organisations and companies. On the other hand, much smaller open-source models such as LLaMA have recently emerged; these can be trained or customised for individual tasks on conventional GPU computers and achieve (almost) the same performance.

The project will develop several computationally efficient open-source large language models. The open-source SloLLaMa model for Slovene will be the first such model for a morphologically rich language. The developed instruction dataset will form the basis for further adaptations of the SloLLaMa model to specific applications, and will be available for wider academic and industrial use.

In application projects, we will adapt the basic SloLLaMa model in the following ways:

  1. For preparing presentation materials and data descriptions in museums, and for advanced interactive museum applications.
  2. For use in Slovenian speech recognition and synthesis, where a computationally efficient version of the SloLLaMa model will enable the integration of speech technologies into advanced industrial applications.
  3. For medical applications, where we will adapt the SloLLaMa model to medical texts and instructions in clinical use.
  4. For infrastructure code generation, which will build on the computationally efficient large language model technology and the developed pipelines for instruction-following datasets.

In addition to the applications, the project will build a core infrastructure for AI applications in Slovene and develop solutions that will be useful for other less-resourced languages.