{"id":1501,"date":"2025-04-28T14:22:10","date_gmt":"2025-04-28T12:22:10","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?p=1501"},"modified":"2025-05-15T12:36:26","modified_gmt":"2025-05-15T10:36:26","slug":"behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data","status":"publish","type":"post","link":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/","title":{"rendered":"Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data"},"content":{"rendered":"<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-0  el_before_av_one_full  avia-builder-el-first  \" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h1><b>Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data<\/b><\/h1>\n<\/div><\/section><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-2  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Authors: Dr. 
\u0160pela Arhar Holdt and Ga\u0161per Jelov\u010dan<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h4><em><strong>This article briefly presents the methods and prompts we use in developing large language models for spelling and grammar correction in Slovenian.<\/strong><\/em><\/h4>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-5  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>With two million speakers, Slovenian is a good example of a low-resource language. But what does it mean for a language to be low-resource? It is a language with limited digital, educational, and\/or institutional support compared with widely used and well-documented languages such as English (Laumann, 2022).<\/p>\n<p>This poses a major challenge for developing large language models, which depend on large quantities of high-quality training data. Unlike major languages with extensive datasets, Slovenian has limited language resources, which makes it harder to train models for tasks such as spelling and grammar correction (Arhar Holdt et al., 2025).<\/p>\n<p>In developing data for large language models for Slovenian spelling and grammar correction, we drew on findings from the Slovenian developmental corpus \u0160olar 3.0 (Arhar Holdt and Kosem, 2024). 
It identified 180 different types of the most characteristic language errors in Slovenian. We estimated that the dataset would need at least 50 examples of each error type, or around 10,000 examples in total.<\/p>\n<p>To reach this number, we will combine synthetic and authentic data.<\/p>\n<\/div><\/section><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-7  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3>Using authentic language data<\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>Three sources of language corrections for Slovenian are worth mentioning: the \u0160olar corpus, which contains texts by primary and secondary school students together with teacher corrections (Kosem et al., 2016); the Lektor corpus, which contains texts by adult native speakers together with proofreader corrections (Popi\u010d, 2014); and the KOST corpus, which contains texts by learners of Slovenian as a second\/foreign language (Stritar Ku\u010duk, 2022).<\/p>\n<p>In addition, the Gigafida 2.0 reference corpus of written Slovenian (Krek et al., 2020) contains a large number of authentic examples that we can use in preparing the dataset. We can then manually introduce errors into this authentic language from Gigafida 2.0, e.g. by changing uppercase letters to lowercase, to obtain more examples of text with errors.<\/p>\n<p>To extract as many authentic examples as possible from these corpora, we will use corpus-linguistic approaches. 
In a second step, we will use them to create additional examples.<\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-10  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><p><div  class='avia-image-container  av-styling-    avia-builder-el-11  el_before_av_textblock  avia-builder-el-first  avia-align-center '  itemprop=\"image\" itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/ImageObject\"  ><div class='avia-image-container-inner'><div class='avia-image-overlay-wrap'><img class='avia_image' src='https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png' alt='' title='IMG_2696' height=\"3024\" width=\"4032\"  itemprop=\"thumbnailUrl\"  \/><\/div><\/div><\/div><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">A prompt-creation workshop bringing together experts in computer science and lexicography.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-13  el_after_av_one_full  el_before_av_one_full  column-top-margin\" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Generating synthetic data<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>For this part of the task, we will rely on existing large language models such as ChatGPT and 
Gemini. Using carefully designed prompts, we will create a synthetic dataset. We will enter the prompts into ChatGPT and Gemini, which will then generate examples of Slovenian texts containing grammatical and spelling errors. The generated data will serve as the basis for training the correction model.<\/p>\n<p>We also organized a small prompt-creation workshop attended by experts in computer science and lexicography. Their goal was to design prompts that would generate additional examples of grammatical and spelling errors. We will then use these to train a large language model for grammar and spelling checking.<\/p>\n<p>Here are some ideas from the workshop:<\/p>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will give the large language model an example of an authentic grammatically incorrect sentence and ask it to create more sentences with the same type of error.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will describe the grammatical error and ask the model for additional examples.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will give the model spelling rules and instruct it to create correct and incorrect sentences based on those rules.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will instruct the large language model to correct uncorrected essays from the \u0160olar corpus.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will instruct the large language model to exaggerate a particular writing style, for example by writing in an overly concise, artistic, or simple way.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will instruct the model 
to use guidelines on common essay-writing errors to write essays containing such errors, followed by corrected versions of those essays.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will instruct the model to generate texts of the kind typically written by primary school students, secondary school students, and university students, and to include the grammatical errors common to each group.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">We will instruct the model to prepare grammar tests focused on typical errors, each containing the correct answer together with one or more incorrect answers.<\/li>\n<\/ul>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>We should stress that the research team will test the synthetic data generation methods described above on a range of existing large language models. After evaluating their effectiveness, we will use the most successful approaches to generate as many examples of texts with errors as possible; ideally, these would closely resemble authentic data.<\/p>\n<p>The dataset, which will be freely available in the CLARIN.SI repository, will be used to develop a large language model for Slovenian spelling and grammar correction. 
The model itself will also be openly available.<\/p>\n<p>These research activities are part of the project Large Language Models for Digital Humanities (LLM4DH), project code GC-0002, funded by the Slovenian Research Agency (Javna agencija za raziskovalno dejavnost Republike Slovenije).<\/p>\n<\/div><\/section><\/p><\/div>\n<div class=\"flex_column av_one_full  flex_column_div av-zero-column-padding first  avia-builder-el-18  el_after_av_one_full  avia-builder-el-last  column-top-margin\" style='border-radius:0px; '><p><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>This news item is based on an English-language article published on Zenodo.<\/p>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Citation:<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">Arhar Holdt, \u0160., &amp; Jelov\u010dan, G. (2025). Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data. Zenodo. 
<\/span><a href=\"https:\/\/doi.org\/10.5281\/zenodo.15282208\"><span style=\"font-weight: 400;\">https:\/\/doi.org\/10.5281\/zenodo.15282208<\/span><\/a><\/p>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3><b>Reference:<\/b><\/h3>\n<\/div><\/section><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/BlogPosting\" itemprop=\"blogPost\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><span style=\"font-weight: 400;\">Arhar Holdt, \u0160., &amp; Kosem, I. (2024). \u0160olar, the developmental corpus of Slovene. <\/span><i><span style=\"font-weight: 400;\">Language Resources and Evaluation<\/span><\/i><span style=\"font-weight: 400;\">, 1\u201327. https:\/\/doi.org\/10.1007\/s10579-024-09758-4<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Arhar Holdt, \u0160., Antloga, \u0160., Munda, T., Pori, E., &amp; Krek, S. (2025). From words to action: A national initiative to overcome data scarcity for the Slovene LLM. In <\/span><i><span style=\"font-weight: 400;\">Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 130\u2013136). University of Tartu Library. https:\/\/aclanthology.org\/2025.resourceful.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Krek, S., Arhar Holdt, \u0160., Erjavec, T., \u010cibej, J., Repar, A., Gantar, P., Ljube\u0161i\u0107, N., Kosem, I., &amp; Dobrovoljc, K. (2020). Gigafida 2.0: The reference corpus of written standard Slovene. In <\/span><i><span style=\"font-weight: 400;\">Proceedings of the Twelfth Language Resources and Evaluation Conference<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 3340\u20133345).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Laumann, M. 
(2022). <\/span><i><span style=\"font-weight: 400;\">Low-resource language: What does it mean?<\/span><\/i><span style=\"font-weight: 400;\"> Medium.<\/span><a href=\"https:\/\/medium.com\/neuralspace\/low-resource-language-what-does-it-mean-d067ec85dea5\"> <span style=\"font-weight: 400;\">https:\/\/medium.com\/neuralspace\/low-resource-language-what-does-it-mean-d067ec85dea5<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Papers with Code. (2025). <\/span><i><span style=\"font-weight: 400;\">Grammatical error correction<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><a href=\"https:\/\/paperswithcode.com\/task\/grammatical-error-correction\"> <span style=\"font-weight: 400;\">https:\/\/paperswithcode.com\/task\/grammatical-error-correction<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Popi\u010d, D. (2014). Revising translation revision in Slovenia. In T. Mikoli\u010d Ju\u017eni\u010d, K. Koskinen, &amp; N. Kocijan\u010di\u010d Pokorn (Eds.), <\/span><i><span style=\"font-weight: 400;\">New horizons in translation research and education 2<\/span><\/i><span style=\"font-weight: 400;\"> (pp. 72\u201389). University of Eastern Finland.<\/span><a href=\"https:\/\/erepo.uef.fi\/handle\/123456789\/14340\"> <span style=\"font-weight: 400;\">https:\/\/erepo.uef.fi\/handle\/123456789\/14340<\/span><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Stritar Ku\u010duk, M., et al. (2023). <\/span><i><span style=\"font-weight: 400;\">Slovene learner corpus KOST 2.0<\/span><\/i><span style=\"font-weight: 400;\"> [Language resource]. 
Slovenian language resource repository CLARIN.SI.<\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1887\"> <span style=\"font-weight: 400;\">http:\/\/hdl.handle.net\/11356\/1887<\/span><\/a><\/p>\n<\/div><\/section><\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>V \u010dlanku so na kratko predstavljene metode in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v sloven\u0161\u010dini.<\/p>\n","protected":false},"author":19,"featured_media":1452,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"","_relevanssi_noindex_reason":"","inline_featured_image":false,"episode_type":"","audio_file":"","podmotor_file_id":"","podmotor_episode_id":"","cover_image":"","cover_image_id":"","duration":"","filesize":"","filesize_raw":"","date_recorded":"","explicit":"","block":"","itunes_episode_number":"","itunes_title":"","itunes_season_number":"","itunes_episode_type":"","footnotes":""},"categories":[82],"tags":[],"class_list":["post-1501","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-novice"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH<\/title>\n<meta name=\"description\" content=\"V \u010dlanku so na kratko predstavljene metode in 
izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v sloven\u0161\u010dini.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/\" \/>\n<meta property=\"og:locale\" content=\"sl_SI\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH\" \/>\n<meta property=\"og:description\" content=\"V \u010dlanku so na kratko predstavljene metode in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v sloven\u0161\u010dini.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/\" \/>\n<meta property=\"og:site_name\" content=\"LLM4DH\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-28T12:22:10+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-05-15T10:36:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696-1030x773.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1030\" \/>\n\t<meta property=\"og:image:height\" content=\"773\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"saras\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" 
content=\"saras\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minut\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/\"},\"author\":{\"name\":\"saras\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\"},\"headline\":\"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov\",\"datePublished\":\"2025-04-28T12:22:10+00:00\",\"dateModified\":\"2025-05-15T10:36:26+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/\"},\"wordCount\":2086,\"image\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"articleSection\":[\"Novice\"],\"inLanguage\":\"sl-SI\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-sc
enes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/\",\"name\":\"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"datePublished\":\"2025-04-28T12:22:10+00:00\",\"dateModified\":\"2025-05-15T10:36:26+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\"},\"description\":\"V \u010dlanku so na kratko predstavljene metode in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v 
sloven\u0161\u010dini.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#breadcrumb\"},\"inLanguage\":\"sl-SI\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"sl-SI\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"contentUrl\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/wp-content\\\/uploads\\\/sites\\\/32\\\/2025\\\/04\\\/IMG_2696.png\",\"width\":4032,\"height\":3024},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/#website\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/\",\"name\":\"LLM4DH\",\"description\":\"Work 
site\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"sl-SI\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/#\\\/schema\\\/person\\\/4d451cdaaa7aa1b00f756029e4b54aa7\",\"name\":\"saras\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/llm4dh\\\/blog\\\/author\\\/saras\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH","description":"V \u010dlanku so na kratko predstavljene metode in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v sloven\u0161\u010dini.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/","og_locale":"sl_SI","og_type":"article","og_title":"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH","og_description":"V \u010dlanku so na kratko predstavljene metode in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v 
sloven\u0161\u010dini.","og_url":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/","og_site_name":"LLM4DH","article_published_time":"2025-04-28T12:22:10+00:00","article_modified_time":"2025-05-15T10:36:26+00:00","og_image":[{"width":1030,"height":773,"url":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696-1030x773.png","type":"image\/png"}],"author":"saras","twitter_card":"summary_large_image","twitter_misc":{"Written by":"saras","Est. reading time":"9 minut"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#article","isPartOf":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/"},"author":{"name":"saras","@id":"https:\/\/www.cjvt.si\/llm4dh\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7"},"headline":"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih 
podatkov","datePublished":"2025-04-28T12:22:10+00:00","dateModified":"2025-05-15T10:36:26+00:00","mainEntityOfPage":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/"},"wordCount":2086,"image":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","articleSection":["Novice"],"inLanguage":"sl-SI"},{"@type":"WebPage","@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/","url":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/","name":"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov - LLM4DH","isPartOf":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#primaryimage"},"image":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","datePublished":"2025-04-28T12:22:10+00:00","dateModified":"2025-05-15T10:36:26+00:00","author":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7"},"description":"V \u010dlanku so na kratko predstavljene metode 
in izto\u010dnice, ki jih uporabljamo pri razvoju velikih jezikovnih modelov za pravopisno in slovni\u010dno popravljanje v sloven\u0161\u010dini.","breadcrumb":{"@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#breadcrumb"},"inLanguage":"sl-SI","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/"]}]},{"@type":"ImageObject","inLanguage":"sl-SI","@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#primaryimage","url":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","contentUrl":"https:\/\/www.cjvt.si\/llm4dh\/wp-content\/uploads\/sites\/32\/2025\/04\/IMG_2696.png","width":4032,"height":3024},{"@type":"BreadcrumbList","@id":"https:\/\/www.cjvt.si\/llm4dh\/blog\/behind-the-scenes-of-developing-spell-and-grammar-correction-llm-for-slovenian-combining-authentic-and-synthetic-data-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.cjvt.si\/llm4dh\/"},{"@type":"ListItem","position":2,"name":"Zakulisje razvoja velikega jezikovnega modela za popravljanje \u010drkovanja in slovnice za sloven\u0161\u010dino: kombinacija avtenti\u010dnih in sinteti\u010dnih podatkov"}]},{"@type":"WebSite","@id":"https:\/\/www.cjvt.si\/llm4dh\/#website","url":"https:\/\/www.cjvt.si\/llm4dh\/","name":"LLM4DH","description":"Work 
site","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.cjvt.si\/llm4dh\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"sl-SI"},{"@type":"Person","@id":"https:\/\/www.cjvt.si\/llm4dh\/#\/schema\/person\/4d451cdaaa7aa1b00f756029e4b54aa7","name":"saras","url":"https:\/\/www.cjvt.si\/llm4dh\/blog\/author\/saras\/"}]}},"_links":{"self":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/posts\/1501","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/comments?post=1501"}],"version-history":[{"count":6,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/posts\/1501\/revisions"}],"predecessor-version":[{"id":1606,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/posts\/1501\/revisions\/1606"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/media\/1452"}],"wp:attachment":[{"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/media?parent=1501"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/categories?post=1501"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.cjvt.si\/llm4dh\/wp-json\/wp\/v2\/tags?post=1501"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}