Work Package 1: SloSBZ

General Knowledge Base for Slovenian

SloSBZ

The project aims to develop Slovenian language resources for building large generative language models, together with supporting tools for model preparation and, where necessary, for use during training. To successfully prepare the models in the SloLLaMai project (RRP2), adapt them, and demonstrate them within the other four industrial projects (RRP3-6), we require large amounts of high-quality dialogues and command requests with ranked responses, as well as foundational large language corpora covering conversational language and specialized terminological areas.

The project is one of two foundational projects within the program and provides the basic language infrastructure needed for model training. Work in the project will therefore focus primarily on preparing tools and databases, which will then be used to prepare the base SloLLaMai model and, later, in the experimental development of demonstrations of individual technologies within the industrial projects. The results of the other projects will be fed back to the SloLLaMai project through its requirements, guiding the further development of corpora. Three versions of a digital dictionary base will also be released within the project.

Specific objectives:

Provide corpora suitable for training large models that can be used within the program and beyond.
Provide tools that will support the preparation of the mentioned corpora.
Ensure language support for Slovenian by adapting existing corpora and creating new ones based on the needs of industrial research and experimental development during the project.
Provide upgrades to the pipelines of extended libraries for training and using large language models.
Provide validation and test sets for validating the generated models.

Expected results:

D1.1: Open-access Slovenian training set for dialogues and command requests (August 2024).
D1.2: Large language corpus for conversational language and addressed terminological areas – first version (August 2024).
D1.3: Validation corpus for large language models (September 2024).
D1.4: Tools for preparing dictionary bases for model training and components for integrating open dictionary forms (February 2025).
D1.5: Dedicated tokenizers for the Slovenian language – first version (February 2025).
D1.6: Large language corpus for conversational language and addressed terminological areas – second version (August 2025).
D1.7: Dedicated tokenizers for the Slovenian language – final version (March 2026).
D1.8: Large language corpus for conversational language and addressed terminological areas – final version (June 2026).
D1.9: Knowledge base created based on the Digital Dictionary Base (July 2026).