Large language models for advanced domain-specific digitization

VeMo-Digi

The project will apply large language models to digitization across various domains. Semantika d.o.o. already markets a range of solutions internationally (e.g., the Museums platform), and one of its significant challenges is preparing multilingual presentation materials. Developing a machine translator and integrating it into the company's products will make it easier for Slovenian users of those products to reach an international audience, multiplying the effect of the results. These solutions can also be extended to other markets with less-represented languages (e.g., Bosnia and Herzegovina).

Specific objectives:

  1. Create an instruction or dialogue dataset containing dialogues and instructions from the humanities.
  2. Upgrade and expand the existing infrastructure for digitizing museum collections and archival materials, using optical character recognition enhanced with context-sensitive spelling correction based on the SloLLaMai model.
  3. Demonstrate a semantic search engine that supports natural-language interaction, implemented as an internal document search engine with proofs of concept (PoC) in digital humanities and highly regulated industries.
  4. Demonstrate automatic generation, editing, and optimization of texts for online publication of museum documentation and materials.
  5. Demonstrate automatic searching, extraction, and summarization of relevant information from freely available sources for a given topic (e.g., preparing a short description of a work's author based on Wikipedia).
  6. Demonstrate machine interpretation of sequences of natural-language instructions for operating an application.
  7. Demonstrate machine translation with large language models between Slovenian and a) other languages and b) older forms of Slovenian, including translation of archival materials from INZ into contemporary Slovenian.
  8. Demonstrate a speech interface, using an electronic guide as the example.
  9. Demonstrate automatic extraction of entities (people, places, etc.) using large language models, and log anonymization based on the extracted entities.

Results:

  • D3.1: Training dataset with at least 10,000 examples (August 2023).
  • D3.2: Fine-tuned SloLLaMai models for the humanities and for instruction following (February 2025).
  • D3.3: Demonstration application for OCR (August 2025).
  • D3.4: Demonstration application for semantic search (August 2025).
  • D3.5: Demonstration application for automatic generation of collection descriptions (February 2026).
  • D3.6: Demonstration application for summarization (February 2026).
  • D3.7: Demonstration application for translation (February 2026).
  • D3.8: Demonstration application for translating natural-language instructions into a command language (February 2026).
  • D3.9: Development of an application for machine entity extraction and document anonymization, with a demonstration on Semantika’s datasets (February 2026).

Project partners:

Project leader:
Partners: