Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data

Published 28 April 2025 at https://www.cjvt.si/llm4dh/en/blog/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data/
The Slovenian language, with its community of two million speakers, is a good example of a less-resourced language. What does it mean for a language to be less-resourced? It is a language with limited digital, educational, and/or institutional support compared to widely used and well-documented languages such as English (Laumann, 2022).

This poses a major challenge for the development of large language models (LLMs), which rely on large amounts of high-quality training data. In contrast to large languages with extensive datasets, Slovene has limited language resources, which makes it difficult to train models for tasks such as spelling and grammar correction (Arhar Holdt et al., 2025).

In developing data for a spelling and grammar correction LLM for Slovenian, we based our work on findings from the Slovenian developmental corpus Šolar 3.0 (Arhar Holdt & Kosem, 2024), in which 180 different types of the most typical language errors in Slovenian were identified. We estimated that we would need at least 50 examples of each error type, i.e. a total of around 10,000 examples for the dataset.

To reach this number, we will combine both synthetic and authentic data.

For Slovene, three resources with language corrections can be mentioned: the Šolar corpus, which contains texts by primary and secondary school students together with corrections by their teachers (Kosem et al., 2016); the Lektor corpus, which contains texts by adult native speakers together with corrections by professional proofreaders (Popič, 2014); and the KOST corpus, with texts by learners of Slovene as a second/foreign language (Stritar Kučuk, 2022).

In addition, the reference corpus of written Slovene, Gigafida 2.0 (Krek et al., 2020), provides a large number of authentic examples that can be used in dataset preparation. This authentic language from Gigafida 2.0 could then be manually corrupted, e.g. by changing uppercase letters to lowercase, in order to obtain more examples of corrupted text.

We will use corpus-linguistic approaches to obtain as many authentic examples as possible from these corpora. In the second step, we will use them to generate additional examples.

Leveraging authentic language data
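To illustrate the corruption step mentioned above, the minimal Python sketch below lowercases each capitalized word of an authentic sentence in turn, yielding (corrupted, correct) pairs. The function name, the regular expression, and the example sentence are our own illustrative assumptions; the project's actual corruption of Gigafida 2.0 material is described as manual, not automated.

```python
import re

def corrupt_capitalization(sentence: str) -> list[tuple[str, str]]:
    """Create (corrupted, correct) pairs from one correct sentence by
    lowercasing each capitalized word in turn (one error per pair)."""
    pairs = []
    # Match words starting with an uppercase letter, including the
    # Slovenian letters Č, Š, Ž (illustrative, not exhaustive).
    for match in re.finditer(r"\b[A-ZČŠŽ][a-zčšž]+", sentence):
        word = match.group(0)
        corrupted = (sentence[:match.start()]
                     + word.lower()
                     + sentence[match.end():])
        pairs.append((corrupted, sentence))
    return pairs

# Illustrative sentence, not taken from Gigafida 2.0:
for bad, good in corrupt_capitalization("Ljubljana je glavno mesto Slovenije."):
    print(bad, "->", good)
```

Each pair keeps exactly one artificial error, which mirrors the goal of collecting a fixed number of examples per error type; other error types from the Šolar 3.0 typology would need their own corruption rules.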
\n