Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data

By Dr. Špela Arhar Holdt, Gašper Jelovčan and Tinca Lukan

This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.

The Slovenian language, with its community of two million speakers, is a good example of a less-resourced language. What does it mean for a language to be less-resourced? It is a language with limited digital, educational, and/or institutional support compared to widely used and well-documented languages such as English (Laumann, 2022).

This poses a major challenge for the development of large language models (LLMs), which rely on large amounts of high-quality training data. Unlike widely spoken languages with extensive datasets, Slovene has limited language resources, which makes training models for tasks such as spelling and grammar correction difficult (Arhar Holdt et al., 2025).

In developing data for spelling and grammar correction LLMs for Slovenian, we based our work on findings from the Slovenian developmental corpus Šolar 3.0 (Arhar Holdt & Kosem, 2024), in which 180 types of the most typical Slovenian language errors were identified. We estimated that we would need at least 50 examples of each error type, i.e. around 10,000 examples in total.

To achieve this number, we will combine both synthetic and authentic data.

Leveraging authentic language data

For Slovene, three resources with language corrections are available: the Šolar corpus, which contains texts by primary and secondary school students together with teachers' corrections (Kosem et al., 2016); the Lektor corpus, which contains texts by adult native speakers together with proofreaders' corrections (Popič, 2014); and the KOST corpus, which contains texts by learners of Slovene as a second/foreign language (Stritar Kučuk, 2022).

In addition, the reference corpus of written Slovene, Gigafida 2.0 (Krek et al., 2020), provides a large number of authentic examples that can be used in dataset preparation. This authentic language material can then be deliberately corrupted, e.g. by changing uppercase letters to lowercase, in order to obtain more examples of erroneous text.
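A corruption step of this kind could be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the corruption rule (lowercasing capitalised words with some probability) and the sample sentence are hypothetical placeholders.

```python
import random
import re

def corrupt_capitalisation(sentence: str, rate: float = 1.0) -> str:
    """Lowercase capitalised words to create a synthetic casing error.

    Each word starting with an uppercase letter is lowercased with
    probability `rate`, yielding an erroneous variant whose known
    correction is the original sentence.
    """
    def maybe_lower(match: re.Match) -> str:
        word = match.group(0)
        return word.lower() if random.random() < rate else word

    # Target words that begin with an uppercase letter,
    # including Slovenian characters Č, Š, Ž.
    return re.sub(r"\b[A-ZČŠŽ]\w*", maybe_lower, sentence)

# Each corpus sentence yields a (corrupted, correct) training pair.
correct = "Ljubljana je glavno mesto Slovenije."
pair = (corrupt_capitalisation(correct), correct)
```

Applied over a large corpus such as Gigafida 2.0, a handful of such rules (casing, punctuation, common misspellings) can multiply the number of error–correction pairs at essentially no annotation cost.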

We will use corpus linguistic approaches to obtain as many authentic examples as possible from these corpora. In the second step, we will use them to generate additional examples.

Prompting workshop bringing together scholars from the fields of computer science and lexicography.

Generating synthetic data

For this part of the task, we will rely on existing LLMs such as ChatGPT and Gemini. We will create a synthetic dataset with carefully designed prompts: input into these models, the prompts will elicit examples of Slovenian texts with grammatical and spelling errors. The generated data will serve as a basis for training our correction model.

To develop the prompts, we organised a small workshop bringing together scholars from the fields of computer science and lexicography. Their goal was to design prompts that generate additional examples of grammar and spelling errors that we will use to train our grammar and spelling checker LLM. Some ideas from the workshop:

  • We will provide the LLM with an example of an authentic grammatically incorrect sentence and ask it to generate more sentences with the same type of error.
  • We will describe the grammatical error and instruct the LLM to produce additional examples.
  • We will provide the LLM with orthography rules and prompt it to generate correct and erroneous sentences based on that knowledge.
  • We will input uncorrected essays from the Šolar corpus and ask the LLM to correct them.
  • We will instruct the LLM to exaggerate different writing styles, such as concise, artistic, and simple.
  • We will provide guidelines on common essay-writing mistakes and ask the LLM to write essays containing these errors, followed by their corrected versions.
  • We will request the LLM to generate typical texts written by middle school, high school, and university students, incorporating common grammatical mistakes for each group.
  • We will prompt the LLM to prepare grammar tests that focus on typical errors and provide the correct answer, together with incorrect answer(s).
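The first two ideas above, pairing an error-type description with an authentic erroneous sentence, can be sketched as a simple prompt-construction function. The error-type label, description, and example sentence below are illustrative placeholders, not actual Šolar 3.0 categories.

```python
def build_prompt(error_type: str, description: str, example: str, n: int = 10) -> str:
    """Compose a generation prompt for an existing LLM (e.g. ChatGPT or Gemini).

    Combines an error-type description with one authentic erroneous
    sentence and asks for `n` new sentences exhibiting the same error,
    each paired with its correction.
    """
    return (
        f"You are generating training data for a Slovenian grammar checker.\n"
        f"Error type: {error_type}\n"
        f"Description: {description}\n"
        f"Authentic example of the error: {example}\n"
        f"Write {n} new Slovenian sentences that each contain this error "
        f"type, followed by their corrected versions."
    )

# Illustrative placeholder values:
prompt = build_prompt(
    error_type="capitalisation of proper nouns",
    description="A proper noun is written with a lowercase initial letter.",
    example="Rad bi obiskal ljubljano.",
)
```

One prompt per error type, seeded with authentic examples from Šolar, would in principle cover all 180 error types while keeping the generated data anchored to real learner errors.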

Our research team will experiment with the methods described above to generate synthetic data using various existing LLMs. After evaluating their effectiveness, we will combine the most successful approaches to generate as many corrupted text examples as possible, ideally ones that closely resemble authentic data.

We will use the dataset to develop a spelling and grammar correction LLM for Slovenian and make the dataset openly available in the CLARIN.SI repository. The model itself will also be made openly accessible.

These research activities are part of the project Large Language Models for Digital Humanities (LLM4DH), project code GC-0002, funded by the Slovenian Research and Innovation Agency.

Citation: 

Arhar Holdt, Š., & Jelovčan, G. (2025). Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data. Zenodo. https://doi.org/10.5281/zenodo.15282208

References:

Arhar Holdt, Š., & Kosem, I. (2024). Šolar, the developmental corpus of Slovene. Language Resources and Evaluation, 1–27. https://doi.org/10.1007/s10579-024-09758-4

Arhar Holdt, Š., Antloga, Š., Munda, T., Pori, E., & Krek, S. (2025). From words to action: A national initiative to overcome data scarcity for the Slovene LLM. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025) (pp. 130–136). University of Tartu Library. https://aclanthology.org/2025.resourceful.

Krek, S., Arhar Holdt, Š., Erjavec, T., Čibej, J., Repar, A., Gantar, P., Ljubešić, N., Kosem, I., & Dobrovoljc, K. (2020). Gigafida 2.0: The reference corpus of written standard Slovene. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3340–3345).

Laumann, M. (2022). Low-resource language: What does it mean? Medium. https://medium.com/neuralspace/low-resource-language-what-does-it-mean-d067ec85dea5

Papers with Code. (2025). Grammatical error correction. https://paperswithcode.com/task/grammatical-error-correction

Popič, D. (2014). Revising translation revision in Slovenia. In T. Mikolič Južnič, K. Koskinen, & N. Kocijančič Pokorn (Eds.), New horizons in translation research and education 2 (pp. 72–89). University of Eastern Finland. https://erepo.uef.fi/handle/123456789/14340

Stritar Kučuk, M., et al. (2023). Slovene learner corpus KOST 2.0 [Language resource]. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1887