{"id":1462,"date":"2025-04-28T13:35:56","date_gmt":"2025-04-28T11:35:56","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?p=1462"},"modified":"2025-05-05T13:43:58","modified_gmt":"2025-05-05T11:43:58","slug":"behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data","status":"publish","type":"post","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","title":{"rendered":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data"},"content":{"rendered":"<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-0  el_before_av_one_full  avia-builder-el-first  \" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h1><b>Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data<\/b><\/h1>\n<\/div><\/section><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-2  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>By Dr. 
\u0160pela Arhar Holdt and Ga\u0161per Jelov\u010dan<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h4><em><strong>This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.<\/strong><\/em><\/h4>\n<\/div><\/section><\/p><\/div><div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-5  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">The Slovenian language, with its community of two million speakers, is a good example of a less-resourced language. What does it mean for a language to be less-resourced? Such a language has limited digital, educational, and\/or institutional support compared to widely used and well-documented languages, such as English (Laumann, 2022).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This poses a major challenge for the development of large language models (LLMs), which rely on large amounts of high-quality training data. Unlike widely spoken languages with extensive datasets, Slovene has limited language resources, which makes it difficult to train models for tasks such as spelling and grammar correction (Arhar Holdt et al., 2025).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In developing data for spelling and grammar correction LLMs for Slovenian, we based our work on findings from the Slovenian developmental corpus \u0160olar 3.0 (Arhar Holdt &amp; Kosem, 2024). There, 180 different types of the most typical language errors in Slovenian were identified. 
We estimated that we would need at least 50 examples of each error type for the dataset, i.e. at least 9,000 examples (180 \u00d7 50), which we round to a target of around 10,000.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To reach this number, we will combine synthetic and authentic data.<\/span><\/p>\n<\/div><\/section><\/div><\/p>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-7  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Leveraging authentic language data<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">For Slovene, three resources with language corrections are available: the \u0160olar corpus, which contains texts by primary and secondary school students together with corrections by teachers (Kosem et al., 2016); the Lektor corpus, which contains texts by adult native speakers together with corrections by language editors (Popi\u010d, 2014); and the KOST corpus, which contains texts by learners of Slovene as a second\/foreign language (Stritar Ku\u010duk et al., 2023).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition, the reference corpus of written Slovene Gigafida 2.0 (Krek et al., 2020) provides a large number of authentic examples that can be used in preparing the dataset. This authentic language from Gigafida 2.0 could then be manually corrupted, e.g. 
by changing uppercase letters to lowercase, in order to obtain more examples of corrupted text.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We will use corpus linguistic approaches to obtain as many authentic examples as possible from these corpora. In the second step, we will use them to generate additional examples.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-10  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><p><div  class='avia-image-container  av-styling-    avia-builder-el-11  el_before_av_textblock  avia-builder-el-first  avia-align-center '  itemprop=\"image\" itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/ImageObject\"  ><div class='avia-image-container-inner'><div class='avia-image-overlay-wrap'><img class='avia_image' src='https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png' alt='' title='IMG_2696' height=\"3024\" width=\"4032\"  itemprop=\"thumbnailUrl\"  \/><\/div><\/div><\/div><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">Prompting workshop bringing together scholars from the fields of computer science and lexicography. 
<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-13  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Generating synthetic data<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">For this part of the task, we will rely on existing LLMs such as ChatGPT and Gemini. We will create a synthetic dataset with carefully designed prompts. These prompts will be input into ChatGPT and Gemini, which will then generate examples of Slovenian texts with grammatical and spelling errors. The generated data will serve as the basis for training our correction model.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To develop the prompts, we organised a small workshop bringing together scholars from the fields of computer science and lexicography. Their goal was to design prompts that generate additional examples of grammar and spelling errors that we will use to train our grammar and spelling checker LLM. 
Some ideas from the workshop:<\/span><\/p>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will provide the LLM with an example of an authentic grammatically incorrect sentence and ask it to generate more sentences with the same type of error.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will describe the grammatical error and instruct the LLM to produce additional examples.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will provide the LLM with orthography rules and prompt it to generate correct and erroneous sentences based on that knowledge.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will input uncorrected essays from the \u0160olar corpus and ask the LLM to correct them.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will instruct the LLM to exaggerate different writing styles, such as concise, artistic, and simple.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will provide guidelines on common essay-writing mistakes and ask the LLM to write essays containing these errors, followed by their corrected versions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will ask the LLM to generate typical texts written by middle school, high school, and university students, incorporating common grammatical mistakes for each group.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We will prompt the LLM to prepare grammar tests that focus on 
typical errors and provide the correct answer, together with incorrect answer(s).<\/span><\/li>\n<\/ul>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">Our research team will experiment with the methods described above to generate synthetic data using various existing LLMs. After evaluating their effectiveness, we will combine the most successful approaches to generate as many corrupted text examples as possible, ideally ones that closely resemble authentic data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We will use the dataset to develop a spelling and grammar correction LLM for Slovenian and make the dataset openly available in the CLARIN.SI repository. The model itself will also be made openly accessible.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These research activities are part of the project <\/span><i><span style=\"font-weight: 400;\">Large Language Models for Digital Humanities (LLM4DH)<\/span><\/i><span style=\"font-weight: 400;\">, project code GC-0002, funded by the Slovenian Research and Innovation Agency.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-18  el_after_av_one_full  avia-builder-el-last  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Citation:\u00a0<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 
400;\">Arhar Holdt, \u0160., &amp; Jelov\u010dan, G. (2025). Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data. Zenodo. <\/span><a href=\"https:\/\/doi.org\/10.5281\/zenodo.15282208\"><span style=\"font-weight: 400;\">https:\/\/doi.org\/10.5281\/zenodo.15282208<\/span><\/a><\/p>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>References:<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">Arhar Holdt, \u0160., &amp; Kosem, I. (2024). \u0160olar, the developmental corpus of Slovene. <\/span><i><span style=\"font-weight: 400;\">Language Resources and Evaluation<\/span><\/i><span style=\"font-weight: 400;\">, 1\u201327. https:\/\/doi.org\/10.1007\/s10579-024-09758-4<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Arhar Holdt, \u0160., Antloga, \u0160., Munda, T., Pori, E., &amp; Krek, S. (2025). From words to action: A national initiative to overcome data scarcity for the Slovene LLM. In <\/span><i><span style=\"font-weight: 400;\">Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 130\u2013136). University of Tartu Library. https:\/\/aclanthology.org\/2025.resourceful.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Krek, S., Arhar Holdt, \u0160., Erjavec, T., \u010cibej, J., Repar, A., Gantar, P., Ljube\u0161i\u0107, N., Kosem, I., &amp; Dobrovoljc, K. (2020). Gigafida 2.0: The reference corpus of written standard Slovene. 
In <\/span><i><span style=\"font-weight: 400;\">Proceedings of the Twelfth Language Resources and Evaluation Conference<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 3340\u20133345).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Laumann, M. (2022). <\/span><i><span style=\"font-weight: 400;\">Low-resource language: What does it mean?<\/span><\/i><span style=\"font-weight: 400;\"> Medium.<\/span><a href=\"https:\/\/medium.com\/neuralspace\/low-resource-language-what-does-it-mean-d067ec85dea5\"> <span style=\"font-weight: 400;\">https:\/\/medium.com\/neuralspace\/low-resource-language-what-does-it-mean-d067ec85dea5<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Papers with Code. (2025). <\/span><i><span style=\"font-weight: 400;\">Grammatical error correction<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><a href=\"https:\/\/paperswithcode.com\/task\/grammatical-error-correction\"> <span style=\"font-weight: 400;\">https:\/\/paperswithcode.com\/task\/grammatical-error-correction<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Popi\u010d, D. (2014). Revising translation revision in Slovenia. In T. Mikoli\u010d Ju\u017eni\u010d, K. Koskinen, &amp; N. Kocijan\u010di\u010d Pokorn (Eds.), <\/span><i><span style=\"font-weight: 400;\">New horizons in translation research and education 2<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 72\u201389). University of Eastern Finland.<\/span><a href=\"https:\/\/erepo.uef.fi\/handle\/123456789\/14340\"> <span style=\"font-weight: 400;\">https:\/\/erepo.uef.fi\/handle\/123456789\/14340<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Stritar Ku\u010duk, M., et al. (2023). <\/span><i><span style=\"font-weight: 400;\">Slovene learner corpus KOST 2.0<\/span><\/i><span style=\"font-weight: 400;\"> [Language resource]. 
Slovenian language resource repository CLARIN.SI.<\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1887\"> <span style=\"font-weight: 400;\">http:\/\/hdl.handle.net\/11356\/1887<\/span><\/a><\/p>\n<\/div><\/section><\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.<\/p>\n","protected":false},"author":19,"featured_media":1452,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"","_relevanssi_noindex_reason":"","inline_featured_image":false,"episode_type":"","audio_file":"","podmotor_file_id":"","podmotor_episode_id":"","cover_image":"","cover_image_id":"","duration":"","filesize":"","filesize_raw":"","date_recorded":"","explicit":"","block":"","itunes_episode_number":"","itunes_title":"","itunes_season_number":"","itunes_episode_type":"","footnotes":""},"categories":[84,83],"tags":[],"class_list":["post-1462","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-posts","category-novice-2"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH<\/title>\n<meta name=\"description\" content=\"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.\" \/>\n<meta 
name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH\" \/>\n<meta property=\"og:description\" content=\"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/\" \/>\n<meta property=\"og:site_name\" content=\"LLM4DH\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-28T11:35:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-05-05T11:43:58+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696-1030x773.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1030\" \/>\n\t<meta property=\"og:image:height\" content=\"773\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"saras\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"saras\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/\"},\"author\":{\"name\":\"saras\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\"},\"headline\":\"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data\",\"datePublished\":\"2025-04-28T11:35:56+00:00\",\"dateModified\":\"2025-05-05T11:43:58+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/\"},\"wordCount\":2022,\"image\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"articleSection\":[\"Blog 
Posts\",\"Novice\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/\",\"name\":\"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"datePublished\":\"2025-04-28T11:35:56+00:00\",\"dateModified\":\"2025-05-05T11:43:58+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\"},\"description\":\"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian 
language.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"contentUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"width\":4032,\"height\":3024},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/\",\"name\":\"LLM4DH\",\"description\":\"Work 
site\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\",\"name\":\"saras\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/en\\\/blog\\\/author\\\/saras\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH","description":"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian language.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","og_locale":"en_US","og_type":"article","og_title":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH","og_description":"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian 
language.","og_url":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","og_site_name":"LLM4DH","article_published_time":"2025-04-28T11:35:56+00:00","article_modified_time":"2025-05-05T11:43:58+00:00","og_image":[{"width":1030,"height":773,"url":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696-1030x773.png","type":"image\/png"}],"author":"saras","twitter_card":"summary_large_image","twitter_misc":{"Written by":"saras","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#article","isPartOf":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/"},"author":{"name":"saras","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7"},"headline":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data","datePublished":"2025-04-28T11:35:56+00:00","dateModified":"2025-05-05T11:43:58+00:00","mainEntityOfPage":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/"},"wordCount":2022,"image":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","articleSection":["Blog 
Posts","Novice"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","url":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/","name":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data - LLM4DH","isPartOf":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#primaryimage"},"image":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","datePublished":"2025-04-28T11:35:56+00:00","dateModified":"2025-05-05T11:43:58+00:00","author":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7"},"description":"This article briefly explains the methods and prompts used to develop spelling and grammar correction LLMs for the Slovenian 
language.","breadcrumb":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#primaryimage","url":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","contentUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","width":4032,"height":3024},{"@type":"BreadcrumbList","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.cjvt.si\/llm4dh\/en\/"},{"@type":"ListItem","position":2,"name":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data"}]},{"@type":"WebSite","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/#website","url":"https:\/\/www.cjvt.si\/llm4dh\/en\/","name":"LLM4DH","description":"Work 
site","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.cjvt.si\/llm4dh\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.cjvt.si\/llm4dh\/en\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7","name":"saras","url":"https:\/\/www.cjvt.si\/llm4dh\/en\/blog\/author\/saras\/"}]}},"_links":{"self":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/posts\/1462","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/comments?post=1462"}],"version-history":[{"count":8,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/posts\/1462\/revisions"}],"predecessor-version":[{"id":1622,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/posts\/1462\/revisions\/1622"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/media\/1452"}],"wp:attachment":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/media?parent=1462"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/categories?post=1462"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/en\/wp-json\/wp\/v2\/tags?post=1462"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}