{"id":1448,"date":"2025-04-28T13:20:27","date_gmt":"2025-04-28T11:20:27","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?page_id=1448"},"modified":"2025-04-28T13:21:01","modified_gmt":"2025-04-28T11:21:01","slug":"behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","title":{"rendered":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data"},"content":{"rendered":"

Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data<\/b><\/h1>\n<\/div><\/section><\/div>\n

By Dr. \u0160pela Arhar Holdt and Ga\u0161per Jelov\u010dan<\/b><\/h3>\n<\/div><\/section><\/div>

This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.<\/span><\/p>\n<\/div><\/section><\/div><\/p>\n

<\/div><\/div><\/div><\/div>\n

The Slovenian language, with its community of two million speakers, is a good example of a less-resourced language. What does it mean for a language to be less-resourced? It means that the language has limited digital, educational, and\/or institutional support compared to widely used and well-documented languages, such as English (Laumann, 2022).<\/span><\/p>\n

This poses a major challenge for the development of large language models (LLMs), which rely on large amounts of high-quality training data. In contrast to widely used languages with extensive datasets, Slovenian has limited language resources, which makes it difficult to train models for tasks such as spelling and grammar correction (Arhar Holdt et al., 2025).<\/span><\/p>\n

In developing data for spelling and grammar correction LLMs for Slovenian, we based our work on findings from the Slovenian developmental corpus \u0160olar 3.0 (Arhar Holdt & Kosem, 2024), in which 180 types of the most typical language errors in Slovenian were identified. We estimated that we would need at least 50 examples of each error type for the dataset, which amounts to at least 9,000 examples (180 types times 50), or around 10,000 in round numbers.<\/span><\/p>\n

To reach this number, we will combine authentic and synthetic data.<\/span><\/p>\n<\/div><\/section><\/div>\n
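As a rough illustration of how the two data sources could complement each other, the sketch below counts the authentic examples available per error type and reports how many synthetic examples would still be needed to reach the minimum of 50. The function name, the error-type labels, and the counts are hypothetical placeholders for illustration, not the project's actual inventory or tooling.

```python
from collections import Counter

TARGET_PER_TYPE = 50  # minimum number of examples per error type, as stated above


def synthetic_needed(authentic_labels, error_types, target=TARGET_PER_TYPE):
    '''Return, per error type, how many synthetic examples are still missing.'''
    counts = Counter(authentic_labels)
    # Types that already meet the target need no synthetic examples.
    return {t: max(0, target - counts[t]) for t in error_types}


# Hypothetical error-type labels and authentic counts, for illustration only.
error_types = ['capitalisation', 'comma', 'word_order']
authentic = ['comma'] * 70 + ['capitalisation'] * 12

gaps = synthetic_needed(authentic, error_types)
print(gaps)
```

With no authentic examples at all, the same function applied to 180 error types would report a shortfall of 9,000 examples in total, which is the lower bound mentioned above.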

Leveraging authentic language data<\/b><\/h5>\n<\/div><\/section>
\n

For Slovenian, three corpora with language corrections are available: the \u0160olar corpus, which contains texts by primary and secondary school students together with corrections by teachers (Kosem et al., 2016); the Lektor corpus, which contains texts by adult native speakers together with corrections by language editors (Popi\u010d, 2014); and the KOST corpus, which contains texts by learners of Slovenian as a second\/foreign language (Stritar Ku\u010duk, 2022).<\/span><\/p>\n

In addition, the reference corpus of written Slovene Gigafida 2.0 (Krek et al., 2020) provides a large number of authentic examples that can be used in the dataset preparation. Sentences from Gigafida 2.0 could then be manually corrupted, e.g. by changing uppercase letters to lowercase, in order to obtain more examples of erroneous text paired with the correct original.<\/span><\/p>\n
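The corruption step described above is manual, but the uppercase-to-lowercase case can also be sketched programmatically. The following is a minimal sketch under stated assumptions: the function name and the sample sentence are hypothetical, and it covers only this single error type, not the full set of corruptions considered in the project.

```python
import random


def corrupt_capitalisation(sentence, rng):
    '''Lowercase the first letter of one randomly chosen capitalised word.

    Returns a (corrupted, correct) pair, or None if the sentence
    contains no capitalised word to corrupt.
    '''
    words = sentence.split(' ')
    candidates = [i for i, w in enumerate(words) if w[:1].isupper()]
    if not candidates:
        return None
    i = rng.choice(candidates)
    words[i] = words[i][0].lower() + words[i][1:]
    return ' '.join(words), sentence


# Hypothetical sample sentence; a fixed seed keeps the sketch reproducible.
rng = random.Random(0)
pair = corrupt_capitalisation('Ljubljana je glavno mesto Slovenije.', rng)
print(pair)
```

Each returned pair holds the corrupted sentence and its untouched original, so it can serve directly as one (erroneous, correct) example in the dataset.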

We will use corpus-linguistic methods to extract as many authentic examples as possible from these corpora. In a second step, we will use these examples to generate additional ones.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n

Generating synthetic data<\/b><\/h5>\n<\/div><\/section>
\n

For this part of the task, we will rely on existing LLMs such as ChatGPT and Gemini. We will create a synthetic dataset using carefully designed prompts that instruct ChatGPT and Gemini to generate examples of Slovenian texts containing grammatical and spelling errors. The generated data will serve as the basis for training our correction model.<\/span><\/p>\n

To develop the prompts, we organised a small workshop that brought together scholars from the fields of computer science and lexicography. The goal was to design prompts that generate additional examples of grammar and spelling errors, which we will then use to train our grammar and spelling correction LLM. Some ideas from the workshop:<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n