About the project - Empirical foundations for digitally-supported development of writing skills (Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti)

Project scope

The aim of the project Empirical foundations for digitally-supported development of writing skills (PROP) is to support teachers who correct and grade student writing. The development of writing competency involves practising writing skills – however, more writing also means more work for teachers. Research has shown that giving individualised, goal-oriented and formative feedback leads to the best literacy results. On the other hand, such feedback is time-consuming and demands support in the form of adequate descriptors, indicators and, not least, information about modern language use in various communication situations. In many languages, including Slovene, these conditions are not met, which is why there is (too) little school writing, while feedback often remains limited to surface corrections of errors, such as grammatical mistakes.

We believe the solution lies in digital support of teachers’ work. On the one hand, automatic identification and substantive categorisation of grammatical errors would free teachers from routine corrections and give them more time to pursue advanced teaching objectives. On the other, a digitally-supported model of providing feedback, based on empirically founded indicators and descriptors, would ease the preparation of corrective study materials and allow for peer assessment and long-term development monitoring. Advances in the field of natural language processing, machine learning, and corpus linguistics make this an attainable goal, which is attested by a variety of digital tools, prototypes, and portals currently being designed. The innovative aspect of the project, which necessitates interdisciplinary collaboration, lies in its proposal of solutions that are based on empirical data: authentic teacher practices coupled with data on real, modern language use.

Pertaining to the latter, Slovene may have a relative advantage over other languages, since it already possesses a corpus of student texts from Slovene primary and secondary schools, which also includes teacher corrections categorised by type of language problem. This corpus represents untapped potential for empirical analyses of authentic school written production and language corrections and for the development of a tool that would automatically categorise language problems based on real-life principles of teacher correction.

During the course of the project, we will use automatic extraction and analyses of richly annotated corpora to collect empirical data needed for specifying developmental indicators and descriptors for various educational stages. We will then use this information to design feedback scenarios for different language levels: spelling, morphology, vocabulary, and syntax. We will expand the corpus of school texts by including examples of student writing from the tertiary level and empirically research the specifics of providing feedback in higher education settings. Next, we will develop a tool that automatically identifies language problems in a given text, taking into account the level of the writer. Given the rarity of language resources such as the Šolar corpus, we will also test the performance of the tool on other comparable training datasets and adapt it to depend less on such resources, thus making the methodology applicable to other languages.

User research is a key part of our project: we will examine existing practices of providing teacher feedback in the development of writing skills, first by conducting a web survey and then by recording teachers’ screens while they are correcting in a digital environment. Teachers and students will furthermore conduct user evaluations of the solutions developed in the project. Lastly, we intend to combine findings in formative assessment with crowdsourcing and apply this to language didactics in order to develop a strategy of digitally-supported development of writing skills, which will take into consideration all the necessary didactical and ethical issues related to the field.

Project goals

Corpora and corpus data: richly annotate a corpus of student writing, a corpus of school textbooks, and a corpus of literature aimed at youngsters and young adults; use data extraction and corpus analyses to lay the empirical foundations for developmental indicators and descriptors for various educational stages; use these results to develop feedback scenarios on the levels of spelling, morphology, vocabulary, and syntax; build a pilot corpus of student academic writing and include it in all of the above steps.

Software module: develop a software module which automatically identifies and categorises language errors on different language levels; adapt the software to teachers’ needs and the specifics of providing feedback at different educational stages; create the foundation for applying the methodology to other languages.

User research: empirically research existing teaching practices of providing feedback for the development of writing skills with the aid of a) an online survey and b) screen recording of teacher corrections in a digital environment; create the basis for comparable research in other languages; include teachers/lecturers in the evaluation of the software module and teachers/lecturers and pupils/students in the evaluation of the corpus-based feedback scenarios.

Models and strategies: combine findings from the fields of formative assessment and crowdsourcing for the needs of language education and create a model for providing digitally-supported feedback to help the development of writing skills; form a strategy of digitally-supported development of writing skills.

Dissemination of research results: ensure that the results are published in keeping with the National strategy for open access to scientific publications and research data in Slovenia; inform the scientific and general public about project results (scientific publications, events, website) and encourage further exploitation of the results.

Project group

University of Ljubljana, Faculty of Arts
– Špela Arhar Holdt, 27674
– Iztok Kosem, 33796
– Polona Gantar, 16313
– Marko Stabej, 11651
– Teja Goli, 52176
– Magdalena Gapsa, 53628
– Mija Bon, 51891
– Eva Pori, 51456
– Tina Munda

University of Ljubljana, Faculty of Computer and Information Science
– Marko Robnik-Šikonja, 15295
– Simon Krek, 26166
– Matej Ulčar, 55173
– Aleš Žagar, 56007
– Matej Klemen, 55754
– Martin Božič, 58277
– Gašper Jelovčan, 59561
– Tadej Škvorc, 50769
– Sara Sever
– Tinca Lukan

University of Ljubljana, Faculty of Public Administration
– Tadeja Rozman, 25578

University of Ljubljana, Faculty of Education
– Karmen Pižorn, 21612
– Alenka Rot Vrhovec, 34816
– Lara Godec Soršak, 25590
– Milena Košak Babuder, 26199
– Tomaž Petek, 32433

Project timeline

1. Corpus analysis of written production at various educational stages
– Corpus data preparation for linguistic and machine tasks [M1-6]
– Compiling a pilot corpus of student academic texts [M1-12]
– Quantitative and qualitative linguistic analyses of student writing [M1-18]
– Empirical data for developmental indicators on levels of vocabulary and syntax [M7-18]
2. Practice-based digitally-supported development of writing skills
– Questionnaire survey about teaching practices used to develop writing skills [M1-18]
– Recording teacher corrections of student writing and semi-structured interviews [M10-24]
– Designing a strategy for digitally-supported development of writing skills [M19-35]
3. Development and evaluation of automatic identification and categorisation of language problems
– Designing a model for automatic error annotation [M7-35]
– Testing the applicability of methodology to other languages [M25-35]
– Linguistic and teacher evaluation of automatic annotation of texts [M13-35]
4. Providing feedback in digital environment
– Developing a model combining formative assessment and crowdsourcing [M7-18]
– Corpus-based and scaffolded feedback scenarios [M13-24]
– Testing feedback with target user groups [M22-35]
5. Coordination and dissemination
– Coordination, reporting and dissemination [M1-36]
– Scientific publications and research data [M1-36]

Corpus analysis of written production at various educational stages

Corpus data preparation for linguistic and machine tasks

As part of the project, we prepared three text corpora specialized for educational use: the corpus of student writing Šolar 3.0, the open-access school textbook corpus ccUčbeniki 1.0, and the youth literature corpus ccMaks 1.0. Prior to the project, Šolar was available in version 2.0, Maks was only accessible via concordancers (not as a full database), and the Učbeniki corpus was not publicly available. All three resources had previously been linguistically annotated, but using older automatic tagging tools. In this project, we uniformly re-annotated the corpora using the CLASSLA v1.1.1 tagger, covering tokenization, sentence segmentation, lemmatization, and morphosyntactic tagging according to the MULTEXT-East v6 standard, as well as dependency syntax (JOS) and named entity recognition. In collaboration with other projects, we also ensured that two additional corpora for Slovene as a second language were prepared using the same standards: the KUUS textbook corpus and the KOST corpus of Slovene as a foreign language. This uniformity allows for more reliable data comparison, while higher-level linguistic annotations support more advanced linguistic and computational analyses, as well as better data usability for machine learning applications.
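The annotation layers listed above can be illustrated with a minimal reading sketch. It assumes the annotated text is available in a CoNLL-U-style tab-separated layout, with the lemma in the third column and the MULTEXT-East tag in the fifth; the sample sentence and its tags are hand-written for illustration and are not actual tagger output, and the `Token` class and `parse_conllu` function are hypothetical names. The syntax columns are left empty in the sample.

```python
from dataclasses import dataclass

# Hand-written illustrative sample in a CoNLL-U-style layout (an assumption,
# not real tagger output). HEAD and DEPREL are left unspecified ("_").
SAMPLE = (
    "# text = Učenci pišejo spise.\n"
    "1\tUčenci\tučenec\tNOUN\tNcmpn\t_\t_\t_\t_\t_\n"
    "2\tpišejo\tpisati\tVERB\tVmpr3p\t_\t_\t_\t_\t_\n"
    "3\tspise\tspis\tNOUN\tNcmpa\t_\t_\t_\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\tZ\t_\t_\t_\t_\t_\n"
    "\n"
)


@dataclass
class Token:
    index: int   # position of the token in the sentence
    form: str    # surface form as written
    lemma: str   # lemmatization layer
    upos: str    # coarse part of speech
    msd: str     # morphosyntactic tag (assumed to sit in the XPOS column)


def parse_conllu(text):
    """Return a list of sentences, each a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges, empty nodes, malformed rows
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token.form, token.lemma, token.msd)
```

Because all corpora share one annotation standard, a single reader of this kind could in principle be reused across them, which is the practical benefit of the uniform re-annotation.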

The corpora are openly available in the CLARIN.SI repository:

ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Developmental corpus Šolar 3.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1589. [COBISS.SI-ID 124160003]

KOSEM, Iztok, PORI, Eva, ŽAGAR, Aleš, ARHAR HOLDT, Špela. Corpus of Slovenian textbooks ccUčbeniki 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1693. [COBISS.SI-ID 129443843]

VERDONIK, Darinka, MAJNINGER, Sandi, DOBROVOLJC, Kaja, ANTLOGA, Špela, ZÖGLING MARKUŠ, Aleksandra, VORŠIČ, Ines, ZEMLJAK JONTES, Melita, KOLETNIK, Mihaela, VALH LOPERT, Alenka, ŠEK, Polonca, KOSEM, Iztok, MAJHENIČ, Simona, FERME, Marko, ŽAGAR, Aleš, ARHAR HOLDT, Špela. Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1692. [COBISS.SI-ID 129467395]

The corpus preparation process—which is particularly demanding for corpora containing linguistic corrections, such as Šolar 3.0—was presented at conferences, in a monograph, and in the prestigious journal Language Resources and Evaluation.

ARHAR HOLDT, Špela, KOSEM, Iztok. Šolar, the developmental corpus of Slovene. Language resources and evaluation. 2024, str. 1-27. DOI: 10.1007/s10579-024-09758-4. [COBISS.SI-ID 204228867]

ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca. Metode in orodja za lažjo pripravo korpusov usvajanja jezika. V: PIRIH SVETINA, Nataša (ur.), FERBEŽAR, Ina (ur.). Na stičišču svetov : slovenščina kot drugi in tuji jezik. 1. natis. Ljubljana: Založba Univerze, 2022. Str. 23-30. Zbirka Obdobja, 41. DOI: 10.4312/Obdobja.41.23-30. [COBISS.SI-ID 129063939]

ARHAR HOLDT, Špela, PORI, Eva, KOSEM, Iztok. Prihodnost korpusa Šolar. V: ARHAR HOLDT, Špela (ur.), KREK, Simon (ur.). Razvoj slovenščine v digitalnem okolju. Ljubljana: Založba Univerze, 2023. Str. 61-91. Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/522/852/9442. [COBISS.SI-ID 185543683]

KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja. Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. V: PIRIH SVETINA, Nataša (ur.), FERBEŽAR, Ina (ur.). Na stičišču svetov : slovenščina kot drugi in tuji jezik. Ljubljana: Založba Univerze, 2022. Str. 165-174. Zbirka Obdobja, 41. DOI: 10.4312/Obdobja.41.165-174. [COBISS.SI-ID 129975811]

Compiling a pilot corpus of student academic texts

We developed a new pilot corpus of student writing, KOŠ, which includes written texts by students from the Faculty of Public Administration and the Faculty of Education at the University of Ljubljana. The corpus contains 426 texts (542,066 tokens). The texts were collected following the Šolar corpus preparation methodology, which involves recording all relevant metadata, including teacher-provided language corrections, ensuring legal compliance for open access, and storing the data in a compatible format.

Text collection agreement for students: digital signature / manual signature.

Text collection agreement for professors: digital signature / manual signature.

ROZMAN, Tadeja, ARHAR HOLDT, Špela, ŽAGAR, Aleš, STABEJ, Marko, PERME, Kaja, ZUPAN, Neža, GODEC SORŠAK, Lara. Pilot corpus of student academic texts KOŠ 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/2048. [COBISS.SI-ID 258382339]

ROZMAN, Tadeja, ARHAR HOLDT, Špela. Gradnja Korpusa študentskih besedil KOŠ. V: FIŠER, Darja (ur.), ERJAVEC, Tomaž (ur.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 15.-16. september 2022, Ljubljana, Slovenija. Ljubljana: Inštitut za novejšo zgodovino, 2022. Str. 267-270. https://nl.ijs.si/jtdh22/pdf/JTDH2022_Rozman_ArharHoldt_Gradnja-Korpusa-studentskih-besedil-KOS.pdf. [COBISS.SI-ID 131012099] Recording of the presentation.

ROZMAN, Tadeja. Pilotni korpus KOŠ in smernice za gradnjo korpusa študentskih besedil. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2025, letn. 13, št. 1, str. 120-137, DOI: 10.4312/slo2.0.2025.1.120-137. [COBISS.SI-ID 263403523]

Since writing and the provision of feedback at the tertiary level differ somewhat from the secondary level—captured in the Šolar corpus—we upgraded the methodology accordingly. To further explore current feedback practices, we conducted a survey involving as many as 459 educators teaching at Slovenian public universities and independent higher education institutions. The survey results were presented to the public, and the research data were published in open access.

ROZMAN, Tadeja, ARHAR HOLDT, Špela, STABEJ, Marko. Podajanje povratnih informacij o študentskih besedilih: raziskovalni podatki = Feedback on student writing: research data. Ljubljana: [s. n.], 2023. Repozitorij Univerze v Ljubljani – RUL, Handle: 20.500.12556/RUL-152767. [COBISS.SI-ID 176945923]

ROZMAN, Tadeja, STABEJ, Marko. Univerzitetno pisanje in popravljanje besedil: prakse in stališča. V: ŠTUMBERGER, Saška (ur.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. Str. 285-292. Zbirka Obdobja, 43. DOI: 10.4312/Obdobja.43.285-292. [COBISS.SI-ID 215458307]

ROZMAN, Tadeja. Slovenske empirične raziskave o razvoju strokovne sporazumevalne zmožnosti na univerzi: kje smo in kako naprej?. V: KOVAČEVIĆ, Borko (ur.). Modern approaches to old and new challenges: book of abstracts: the 8th international congress of Applied Linguistics Today: Faculty of Philology, University of Belgrade: 23–25 May 2025. Belgrade: University, Faculty of Philology, 2025. Str. 161-162. https://alt8.fil.bg.ac.rs/bookOfAbstracts. [COBISS.SI-ID 245534211]

Quantitative and qualitative linguistic analyses of student writing

For both quantitative and qualitative analyses of school writing, high-quality and reliable annotation of language corrections in student texts is essential. In this project, we improved the annotation methodology, particularly by upgrading the Šolar annotation scheme and the annotation tool CJVT Svala. These advancements were presented at the established national symposium Obdobja and at the international LREC conference.

ARHAR HOLDT, Špela, LAVRIČ, Polona, ROBLEK, Rebeka, GOLI, Teja, BON, Mija, 2023: Categorizing Teachers’ Corrections: Guidelines for Annotating the Šolar Corpus. Version 1.2. Prepared in the project Empirical foundations for digitally-supported development of writing skills. https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines.

ARHAR HOLDT, Špela, ERJAVEC, Tomaž, KOSEM, Iztok, VOLODINA, Elena. Towards an ideal tool for learner error annotation. V: CALZOLARI, Nicoletta (ur.). The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): main conference proceedings: 20-25 May, 2024, Torino, Italia. [Paris]: ELRA Language Resources Association (ELRA); [Stroudsburg]: International Committee on Computational Linguistics, cop. 2024. Str. 16392-16398. https://aclanthology.org/2024.lrec-main.1424.pdf. [COBISS.SI-ID 199958019]. POSTER PDF

ARHAR HOLDT, Špela, POPIČ, Damjan, STRITAR KUČUK, Mojca. Primerjava sistemov za označevanje jezikovnih popravkov v štirih slovenskih besedilnih korpusih. V: ŠTUMBERGER, Saška (ur.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. Str. 11-20. Zbirka Obdobja, 43. DOI: 10.4312/Obdobja.43.11-20. [COBISS.SI-ID 215306243]

Using advanced data extraction from the Šolar 3.0 corpus, we compiled a frequency list of language issues containing 36,570 sentences from student writing, each corrected by a teacher. The corrections were manually categorized into 180 distinct types based on their content. Each sentence is annotated with metadata such as the type of source text, the educational level of the author, and the type and region of the school where the text was produced. The dataset reveals which issues teachers at various educational levels focus on most, how they correct them, which problems are most frequent, and which are regionally conditioned. We conducted statistical analyses of the data and presented the most persistent language difficulties—those that remain present in student writing up to the end of secondary school—at the TALC conference. For the qualitative linguistic analysis, which focused both on typical writing difficulties and correction patterns, we selected two topics: comma usage and related errors, and language variants in the use of multi-word expressions.
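The kind of aggregation described above can be sketched as follows. This is a minimal illustration, not the project's actual analysis: the records, the category labels, and the function names are all hypothetical, and the real dataset carries further metadata (source text type, school type, region) that could be grouped in the same way.

```python
from collections import Counter, defaultdict

# Hypothetical records: (educational level, correction category).
# The category labels are invented for illustration only.
RECORDS = [
    ("primary", "punctuation/comma"),
    ("primary", "spelling/capitalisation"),
    ("secondary", "punctuation/comma"),
    ("secondary", "punctuation/comma"),
    ("secondary", "vocabulary/word choice"),
]


def category_counts_by_level(records):
    """Count how often each correction category occurs at each level."""
    counts = defaultdict(Counter)
    for level, category in records:
        counts[level][category] += 1
    return counts


def relative_frequencies(counter):
    """Turn raw counts into shares of all corrections at one level."""
    total = sum(counter.values())
    return {category: n / total for category, n in counter.items()}


def persistent_categories(counts, levels):
    """Return categories attested at every one of the given levels."""
    sets = [set(counts[level]) for level in levels]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    counts = category_counts_by_level(RECORDS)
    for level, counter in counts.items():
        print(level, counter.most_common())
    print(persistent_categories(counts, ["primary", "secondary"]))
```

Relative rather than raw frequencies matter here because the number of corrected sentences differs between educational levels, so raw counts alone would not be comparable.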

ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, ŽAGAR, Aleš, KOSEM, Iztok. Frequency list of language problems from Šolar 3.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1716. [COBISS.SI-ID 130413571]

ARHAR HOLDT, Špela. Leveraging frequency list of language problems from Šolar 3.0. V: TaLC 2024: 16th Teaching and Language Corpora Conference: July 7th to 10th 2024, Manchester Metropolitan University, Manchester, UK: book of abstracts. [Manchester: Manchester Metropolitan University], 2024. Str. [121]. https://talc2024.co.uk/wp-content/uploads/2024/07/book-of-abstracts-talc-2024_final-4.pdf. [COBISS.SI-ID 204247555] POSTER PDF

BON, Mija, GAPSA, Magdalena. Analiza napak pri rabi vejice v šolskih spisih. V: MARUŠIČ, Franc (ur.), et al. Škrabčevi dnevi 13: zbornik prispevkov s simpozija 2023. Nova Gorica. Nova Gorica: Založba univerze, 2025. Str. 1–15. https://ung.si/media/publishing/2025/03/12/08/24/17/Zbornik-SD13-2025-koncna.pdf. [COBISS.SI-ID 247838723]

GANTAR, Polona, BON, Mija. Dati skozi ali prestati?: Napake in jezikovne variante v rabi večbesednih enot pri samostojnem tvorjenju besedil v osnovni in srednji šoli. Sodobna pedagogika, okt. 2025, letn. 76 = 142, št. 3, str. 39-58, DOI: 10.63384/sptB5_z789s. [COBISS.SI-ID 259193859]

Empirical data for developmental indicators on levels of vocabulary and syntax

We established a methodology for extracting core vocabulary lists from pedagogical corpora, which included updating the corpus extraction tool LIST to version 1.3. We generated frequency lists of lemmas from the textbook corpus and compiled core vocabulary lists for levels A1, A2, and B1, based on the Common European Framework of Reference for Languages (CEFR). Special attention was given to vocabulary at the A1 level, for which we developed a lexical description concept that includes both authentic and pedagogically adapted corpus examples and collocations. All resources and tools were published openly in the CLARIN.SI repository, and the results were presented at an international conference.
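As a rough illustration of how a core vocabulary list might be derived from lemma frequencies, the sketch below keeps lemmas that are both frequent overall and attested in several textbooks. The selection criteria (a minimum total frequency and a minimum number of textbooks) are an assumption made for this example, since the text above does not spell out the project's criteria; the sample data and function names are hypothetical, and this is not the LIST tool.

```python
from collections import Counter

# Hypothetical lemmatised textbooks: name -> list of lemmas.
TEXTBOOKS = {
    "book_a": ["biti", "hiša", "biti", "pes", "imeti"],
    "book_b": ["biti", "imeti", "šola", "imeti"],
    "book_c": ["biti", "hiša", "imeti", "fotosinteza"],
}


def lemma_frequencies(docs):
    """Total number of occurrences of each lemma across all textbooks."""
    freq = Counter()
    for lemmas in docs.values():
        freq.update(lemmas)
    return freq


def document_frequencies(docs):
    """Number of textbooks in which each lemma occurs at least once."""
    doc_freq = Counter()
    for lemmas in docs.values():
        doc_freq.update(set(lemmas))
    return doc_freq


def core_vocabulary(docs, min_docs, min_freq):
    """Lemmas meeting both thresholds, most frequent first, ties alphabetical."""
    freq = lemma_frequencies(docs)
    doc_freq = document_frequencies(docs)
    selected = [
        lemma for lemma in freq
        if doc_freq[lemma] >= min_docs and freq[lemma] >= min_freq
    ]
    return sorted(selected, key=lambda lemma: (-freq[lemma], lemma))


if __name__ == "__main__":
    print(core_vocabulary(TEXTBOOKS, min_docs=2, min_freq=2))
```

Requiring a lemma to appear in more than one textbook is one simple way to keep topic-specific words such as "fotosinteza" out of a general core list.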

KRSNIK, Luka, ARHAR HOLDT, Špela, ČIBEJ, Jaka, DOBROVOLJC, Kaja, KLJUČEVŠEK, Aleksander, KREK, Simon, ROBNIK ŠIKONJA, Marko. Corpus extraction tool LIST 1.3. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1964. [COBISS.SI-ID 218014211]

KOSEM, Iztok, PORI, Eva, ARHAR HOLDT, Špela. Frequency list of textbook vocabulary by level of education in elementary and secondary schools. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1719. [COBISS.SI-ID 192040707]

KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja. Core vocabulary for Slovenian as L2 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1697. [COBISS.SI-ID 130844419]

PORI, Eva, KNEZ, Mihaela, KOSEM, Iztok, ARHAR HOLDT, Špela, KLEMEN, Matej, GANTAR, Polona, ZGAGA, Karolina, ROBLEK, Rebeka. A1 core vocabulary with lexical information for Slovenian 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1896. [COBISS.SI-ID 192040963]

KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, PORI, Eva, GANTAR, Polona, KNEZ, Mihaela. Building a CEFR-labeled core vocabulary and developing a lexical resource for Slovenian as a second and foreign language. V: MEDVEĎ, Marek (ur.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference: [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 654-668. Electronic lexicography in the 21st century, https://elex.link/elex2023/wp-content/uploads/118.pdf. [COBISS.SI-ID 158856451]

For syntactic studies, we developed a methodology for extracting syntactic information from pedagogical corpora. We published openly accessible frequency lists of collocations from the Šolar 3.0 corpus and the Učbeniki 1.0 corpus in the CLARIN.SI repository, as well as frequency lists of syntactic structures from both corpora. These data can serve as a basis for developing empirically grounded developmental benchmarks, descriptors, and other materials for learning Slovene. As part of the project, we also prepared two linguistic analyses comparing the characteristics of student writing and textbooks across educational levels, focusing on both vocabulary and syntax.

MUNDA, Tina, ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Frequency list of collocations from the Šolar 3.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2011. [COBISS.SI-ID 225465859]

MUNDA, Tina, ARHAR HOLDT, Špela, KOSEM, Iztok, PORI, Eva, KREK, Simon. Frequency list of collocations from the Učbeniki 1.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2012. [COBISS.SI-ID 225461251]

MUNDA, Tina, ARHAR HOLDT, Špela, DOBROVOLJC, Kaja, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Frequency lists of syntactic structures from the Šolar 3.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2009. [COBISS.SI-ID 225469443]

MUNDA, Tina, ARHAR HOLDT, Špela, DOBROVOLJC, Kaja, KOSEM, Iztok, PORI, Eva, KREK, Simon. Frequency lists of syntactic structures from the Učbeniki 1.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2010. [COBISS.SI-ID 225467395]

MUNDA, Tina, ARHAR HOLDT, Špela. Na poti k skladenjskim analizam šolskega pisanja: skladenjski vzorci v korpusu Šolar 3.0. V: ARHAR HOLDT, Špela (ur.), ERJAVEC, Tomaž (ur.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 19.-20. september 2024, Ljubljana, Slovenija = Language technologies and digital humanities: proceedings of the conference: 19-20 September 2024, Ljubljana, Slovenia. Ljubljana: Inštitut za novejšo zgodovino: = Institute of Contemporary History, 2024. Str. 577-588. https://zenodo.org/records/13912515. [COBISS.SI-ID 212016387] POSTER PDF

MUNDA, Tina, ARHAR HOLDT, Špela. First Insights into the Syntax of Slovene Student Writing: A Statistical Analysis of Šolar 3.0 vs. Učbeniki 1.0. In Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025), pp. 105–114, Ljubljana, Slovenia. Association for Computational Linguistics. https://aclanthology.org/2025.quasy-1.13/

KOSEM, Iztok, PORI, Eva. Prvi koraki do seznama temeljnega šolskega besedišča. Sodobna pedagogika, okt. 2025, letn. 76 = 142, št. 3, str. 9-38, DOI: 10.63384/sptB53z794s. [COBISS.SI-ID 259189507]

Practice-based digitally-supported development of writing skills

Questionnaire survey about teaching practices used to develop writing skills

We conducted a large-scale study among teachers who guide the production of written texts through language corrections and other forms of feedback. We investigated practices related to the correction of written texts, including: how much time participants devote to correction on a weekly or monthly basis; what types of feedback they provide; the format of texts and corrections (written or digital); which tools and resources they use for correction; and which aspects of their current practices they consider most problematic. The questionnaire was prepared in two language versions—Slovene and English—which will enable comparable studies in other countries. The project collected a total of 1,024 valid responses, including 609 fully completed questionnaires. The results were statistically analyzed, and the appropriately anonymized research data were published in open access. The findings were presented to both the research and teaching communities, and a journal article is in preparation.

Validated survey questionnaire in Slovene and English.

ROT VRHOVEC, Alenka, ARHAR HOLDT, Špela, PIŽORN, Karmen, GODEC SORŠAK, Lara. Popravljanje pisnih besedil učencev/dijakov: raziskovalni podatki. Ljubljana: Pedagoška fakulteta: Filozofska fakulteta, 2024. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=153481. [COBISS.SI-ID 206148867]

ROT VRHOVEC, Alenka. Kako učitelji popravljajo besedila učencev/dijakov?: predavanje na [konferenci] Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi, Fakulteta za upravo, Univerza v Ljubljani, 5. 4. 2023. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/500/832/9274. [COBISS.SI-ID 149745667]

GODEC SORŠAK, Lara, ROT VRHOVEC, Alenka. Popravljanje pisnih besedil učencev in dijakov pri različnih predmetih: vpogled v rezultate ankete učiteljev. Sodobna pedagogika, okt. 2025, letn. 76, št. 3, str. 59–85, DOI: 10.63384/sptB53z792s. [COBISS.SI-ID 257058819]

Recording teacher corrections of student writing and semi-structured interviews

This study focused on the use of existing digital tools for providing feedback to students. We recruited 18 Slovene language teachers from different types of schools (6 from primary schools, 6 from vocational and technical schools, and 6 from gymnasiums). As part of the study, they corrected two pre-selected authentic student texts, during which we recorded their screen, minimal work environment (face), and think-aloud commentary. This independent work with texts was followed by interviews, where we explored the characteristics of their work, the capabilities and limitations of correction tools, and the participants’ wishes regarding additional functionalities. The results were transcribed, and the appropriately anonymized data were published in open access. An article for a scientific journal is also in preparation, presenting the results together with evaluations of automated error correction conducted as part of the work package Development and evaluation of automatic identification and categorization of language problems (see below).

Teacher interview questionnaire in Slovene and English.

ARHAR HOLDT, Špela, MUNDA, Tina. Učiteljsko popravljanje šolskih besedil v digitalnem okolju: intervjuji z učitelji slovenskih OŠ in SŠ. Ljubljana: Zaključena znanstvena zbirka raziskovalnih podatkov. 2025. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=169549.

ARHAR HOLDT, Špela, MUNDA, Tina. Jezikovno popravljanje v digitalnem okolju: kvalitativna študija z učiteljicami in učitelji slovenščine. Sodobna pedagogika, okt. 2025, letn. 76 = 142, št. 3, str. 86-106, DOI: 10.63384/sptB53s791s. [COBISS.SI-ID 259204611]

Designing a strategy for digitally-supported development of writing skills

An increasingly important part of contemporary language use takes place in digital environments. However, the integration of digital media into teaching must be age-appropriate, inclusive for all groups of learners, and effective in order to avoid unnecessary screen time. During the course of the project, generative artificial intelligence tools for text co-creation also became easily accessible, raising numerous new questions. The strategy we developed—in dialogue with the guiding principles for the renewal of Slovene language curricula in primary and secondary schools—emphasizes the need for the school system to be prepared for the impact of generative AI. It also underscores the importance of open educational materials and thoughtfully designed, problem-oriented digital solutions.

Teachers must be empowered not only to use but also to co-create digital linguistic resources, tools, and technologies. Shifts in communication practices—such as increasing informality and anonymity—call for new understandings and approaches to language education, including the development of stylistic awareness and the ability to critically evaluate texts. We also highlight the need for a thorough reform of Slovene teacher education, so that in the future, teachers can implement new approaches to teaching and assessment more effectively and inclusively.

The strategy was published in a thematic issue of the scientific journal Jezik in slovstvo, dedicated to curriculum reform. Selected topics were also presented in a panel discussion marking the publication of the thematic issue, and at the professional conference “Correcting Language and Texts – Teacher Feedback in School Practice.”

ARHAR HOLDT, Špela, FERBEŽAR, Ina, KALIN GOLOB, Monika, KREK, Simon, PAVLE, Andreja, ROZMAN, Tadeja, STABEJ, Marko. Nova slovenščina. Jezik in slovstvo. [Tiskana izd.]. 2024, letn. 69, št. 3, str. 117-138. DOI: 10.4312/jis.69.3.117-138. [COBISS.SI-ID 210323971]

ŽBOGAR, Alenka, AHAČIČ, Kozma, HARAMIJA, Dragica, MIKOLIČ, Vesna, STABEJ, Marko, TIVADAR, Hotimir. Slovenščina v šoli: izzivi in priložnosti slovenščine kot materinščine na primarni in sekundarni stopnji vzgoje in izobraževanja: okrogla miza Oddelka za slovenistiko Filozofske fakultete ob izidu tematske številke revije Jezik in slovstvo (69/3, 2024), posvečene prenovi učnih načrtov za slovenščino v osnovnih in srednjih šolah v okviru Tedna Univerze v Ljubljani, Filozofska fakulteta, Ljubljana, 3. 12. 2024. [COBISS.SI-ID 219169539]

STABEJ, Marko. Kdo ali kaj naj koga ali kaj, kako in zakaj?: (prosti spis o popravljanju in povratni informaciji). V: PORI, Eva (ur.), ARHAR HOLDT, Špela (ur.). Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi: zbornik konference. Ljubljana: Založba Univerze, 2023. Str. 22-24. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/500/832/9312. [COBISS.SI-ID 193563139]

Development and evaluation of automatic identification and categorisation of language problems

Designing a model for automatic error annotation

We explored the potential of new methodologies for the automatic correction of Slovene texts. The machine learning models are based on large pre-trained language models, which we adapted to the task of correcting student writing using selected authentic and synthetically prepared datasets. We first examined the applicability of the language models multilingual BERT, CroSloEngual BERT, and SloBERTa. For the task of machine question answering, we tested several pre-trained encoder-decoder models of the T5 type. T5-type models were also tested for the automatic generation of the correct form of misspelt words, and we investigated the effectiveness of procedures for machine-generated explanations. Using an optimized neural methodology, we addressed spelling, orthographic, morphological, and syntactic errors, achieving particularly strong results for the first two categories. We then developed SloNSpell, a neural spellchecker for Slovene that currently delivers the best results for the language. Its key advantage is the ability to detect not only traditional spelling errors but also instances where a misspelling results in a legitimate word form.
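The idea of synthetically prepared training data, including misspellings that happen to yield a legitimate word form, can be illustrated with a minimal sketch. It is not the project's actual data pipeline: the tiny `LEXICON`, the function names, and the `error_rate` parameter are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy lexicon; a real setup would use a full lexicon of Slovene word forms.
LEXICON = {"miza", "mize", "mizo", "pes", "les", "lep", "lepa"}
ALPHABET = "abcčdefghijklmnoprsštuvzž"


def corrupt_word(word, rng):
    """Apply at most one random character edit to a non-empty word."""
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "substitute", "insert", "swap"])
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # the chosen edit was not applicable; leave the word unchanged


def label_token(original, candidate):
    """Label a token as unchanged, a non-word error, or a real-word error."""
    if candidate == original:
        return "ok"
    # A corrupted form that is still in the lexicon is a "real-word" error.
    return "real-word" if candidate in LEXICON else "non-word"


def make_example(sentence, rng, error_rate=0.3):
    """Return (noisy_tokens, labels) for one whitespace-tokenised sentence."""
    noisy, labels = [], []
    for token in sentence.split():
        candidate = corrupt_word(token, rng) if rng.random() < error_rate else token
        noisy.append(candidate)
        labels.append(label_token(token, candidate))
    return noisy, labels


if __name__ == "__main__":
    rng = random.Random(42)
    print(make_example("pes je lep in miza je lepa", rng))
```

Pairs of noisy and original sentences produced this way could serve as training material for a correction model. Real-word errors are labelled separately here because a dictionary lookup alone cannot detect them.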

The various stages of model development were documented in conference and journal publications.

ULČAR, Matej, ROBNIK ŠIKONJA, Marko. Sequence-to-sequence pretraining for a less-resourced Slovenian language. Frontiers in artificial intelligence. Mar. 2023, vol. 6, str. 1-13, DOI: 10.3389/frai.2023.932519. [COBISS.SI-ID 147683587]

KMECL, Tim, ROBNIK ŠIKONJA, Marko. Logično sklepanje v naravnem jeziku za slovenščino. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2024, letn. 12, št. 1, str. 1-53, DOI: 10.4312/slo2.0.2024.1.1-53. [COBISS.SI-ID 206551299]

LOGAR, Katja, ROBNIK ŠIKONJA, Marko. Unified question answering in Slovene. V: LUŠTREK, Mitja (ur.), GAMS, Matjaž (ur.), PILTAVER, Rok (ur.). Slovenska konferenca o umetni inteligenci = Slovenian Conference on Artificial Intelligence: Informacijska družba – IS 2022 = Information Society – IS 2022: zbornik 25. mednarodne multikonference = proceedings of the 25th international multiconference: zvezek A = volume A: 11. oktober 2022, 11 October 2022, Ljubljana, Slovenija. Ljubljana: Institut “Jožef Stefan”, 2022. Str. 23-26, https://doi.org/10.48550/arXiv.2211.09159. [COBISS.SI-ID 129718275]

KLEMEN, Matej, BOŽIČ, Martin, ARHAR HOLDT, Špela, ROBNIK ŠIKONJA, Marko. Neural spell-checker: beyond words with synthetic data generation. V: NÖTH, Elmar (ur.), HORÁK, Aleš (ur.), SOJKA, Petr (ur.). Text, speech, and dialogue. Part 1: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024: proceedings. Cham: Springer, cop. 2024. Str. 85-96. Lecture notes in computer science, SL7, Lecture notes in artificial intelligence, DOI: 10.1007/978-3-031-70563-2_7, dostopno na https://doi.org/10.48550/arXiv.2410.23514. [COBISS.SI-ID 213519107]

PETRIČ, Timotej, ARHAR HOLDT, Špela, ROBNIK ŠIKONJA, Marko. Pomembnost realistične evalvacije: primer popravkov sklona in števila v slovenščini z velikim jezikovnim modelom. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2024, letn. 12, št. 1, str. 106-130, DOI: 10.4312/slo2.0.2024.1.106-130. [COBISS.SI-ID 227633411]

KLEMEN, Matej, BOŽIČ, Martin, ARHAR HOLDT, Špela, ROBNIK ŠIKONJA, Marko. Grammatical error correction of Slovenian school essays using large language models. Sodobna pedagogika, okt. 2025, letn. 76, št. 3, str. 162–176, DOI: 10.63384/sptB53z793a. [COBISS.SI-ID 259208195]

We published the modules for different linguistic levels under an open license on the HuggingFace platform, enabling their further use and development.

Module that identifies orthographic and spelling errors.

Module that corrects orthographic and spelling errors.

Module that corrects morphosyntactic errors.

Module that corrects word order errors.

Testing the applicability of the methodology to other languages

Following the development of large language models and technologies such as ChatGPT, it became evident that there is a critical lack of well-designed and high-quality evaluation datasets that would enable reliable assessment of neural approaches for various tasks, including grammatical error correction (GEC). This is particularly true for Slovene and other low-resource languages, where the amount of available text and structured linguistic resources is limited. At the same time, it has been shown that cross-lingual approaches work well for less-resourced languages, as large language models transfer (linguistic) knowledge across all languages they have been trained on. For this reason, we joined the MultiGEC-2025 shared task, under which uniformly designed evaluation datasets for grammatical error correction were developed for 12 languages. Developers participating in the task competed in automatic grammatical correction across all included languages—including Slovene—and reported on the performance of various approaches. The dataset is publicly available for further use and will be integrated into future activities. Our participation in the shared task was presented in a technical report and a scientific article.

MASCIOLINI, Arianna, ARHAR HOLDT, Špela, ŽAGAR, Aleš, et al. An overview of grammatical error correction for the twelve MultiGEC-2025 languages. Göteborg: Faculty of Humanities, Department of Swedish, Multilingualism, Language Technology, 2025. GU-ISS Forskningsrapporter från Institutionen för svenska, flerspråkighet och språkteknologi (2011-), ISSN 1401-5919, https://gupea.ub.gu.se/bitstream/handle/2077/84800/2025_MultiGEC_GEC_overview.pdf?sequence=1&isAllowed=y. [COBISS.SI-ID 232510723]

MASCIOLINI, Arianna, CAINES, Andrew, ARHAR HOLDT, Špela, ŽAGAR, Aleš, et al. Towards better language representation in natural language processing: a multilingual dataset for text-level grammatical error correction. International journal of learner corpus research, 2025, vol. 11, iss. 2, pp. 309-335, ISSN 2215-1478, DOI: 10.1075/ijlcr.24033.mas. [COBISS.SI-ID 234594051]

In cooperation with other research projects, several articles were produced focusing on the development of language resources, the evaluation of large language models for less-resourced languages, the creation of specialized tools for processing Slovene texts, and the testing of cross-lingual approaches for various NLP tasks. These studies are indirectly connected to the development of grammar correction methods for Slovene, as they establish methodological and data-related frameworks essential for the effective use of large language models in processing and generating Slovene texts.

YADAV, Anjali, GARG, Tanya, KLEMEN, Matej, ULČAR, Matej, AGARWAL, Basant, ROBNIK ŠIKONJA, Marko. From translation to generative LLMs: classification of code-mixed affective tasks. IEEE transactions on affective computing. 2025, vol. 12, str. 2090-2101. DOI: 10.1109/TAFFC.2025.3553399. [COBISS.SI-ID 232748291]

ULČAR, Matej, ŽAGAR, Aleš, ARMENDARIZ, Carlos S., REPAR, Andraž, POLLAK, Senja, PURVER, Matthew, ROBNIK ŠIKONJA, Marko. Mono- and cross-lingual evaluation of representation language models on less-resourced languages. Computer speech & language. Jan. 2026, vol. 95, [article no.] 101852, 1-29, DiRROS – Digitalni repozitorij raziskovalnih organizacij Slovenije, https://doi.org/10.1016/j.csl.2025.101852. [COBISS.SI-ID 241622275]

ŽAGAR, Aleš, KLEMEN, Matej, KOSEM, Iztok, ROBNIK ŠIKONJA, Marko. SENTA: sentence simplification system for Slovene. V: CALZOLARI, Nicoletta (ur.). The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): main conference proceedings: 20-25 May, 2024, Torino, Italia. [Paris]: ELRA Language Resources Association (ELRA); [Stroudsburg]: International Committee on Computational Linguistics, cop. 2024. Str. 14687-14692, https://aclanthology.org/2024.lrec-main.1279.pdf. [COBISS.SI-ID 197916675]

MIOK, Kristian, HIDALGO TENORIO, Encarnación, OSENOVA, Petja, BENÍTEZ-CASTRO, Miguel-Ángel, ROBNIK ŠIKONJA, Marko. Multi-aspect multilingual and cross-lingual parliamentary speech analysis. Intelligent data analysis. [Print ed.]. Feb. 2024, vol. 28, no. 1, str. 239-260, https://doi.org/10.3233/IDA-227347. [COBISS.SI-ID 178091523]

KLEMEN, Matej, ŽAGAR, Aleš, ČIBEJ, Jaka, ROBNIK ŠIKONJA, Marko. SI-NLI: a Slovene natural language inference dataset and its evaluation. V: CALZOLARI, Nicoletta (ur.). The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): main conference proceedings: 20-25 May, 2024, Torino, Italia. [Paris]: ELRA Language Resources Association (ELRA); [Stroudsburg]: International Committee on Computational Linguistics, cop. 2024. Str. 14859-14870, https://aclanthology.org/2024.lrec-main.1294.pdf. [COBISS.SI-ID 197916931]

ĐOKOVIĆ, Lazar, ROBNIK ŠIKONJA, Marko. Sarcasm detection in a less-resourced language. V: LUŠTREK, Mitja (ur.), GAMS, Matjaž (ur.), PILTAVER, Rok (ur.). Slovenian Conference on Artificial Intelligence. Vol. A : proceedings of the 27th International Multiconference Information Society – IS 2024 : 10–11 October 2024, Ljubljana, Slovenia. Ljubljana: Institut “Jožef Stefan”, 2024. Str. 19-22. Informacijska družba. https://is.ijs.si/wp-content/uploads/2024/11/IS2024_Volume-A.pdf. [COBISS.SI-ID 216268291]

ŽAGAR, Aleš, ROBNIK ŠIKONJA, Marko. One model to rule them all: ranking Slovene summarizers. V: EKŠTEIN, Kamil (ur.), PÁRTL, František (ur.), KONOPÍK, Miloslav (ur.). Text, speech, and dialogue: 26th International Conference, TSD 2023, Pilsen, Czech Republic, September 4–6, 2023 : proceedings. Cham: Springer, cop. 2023. Str. 15-24. Lecture notes in computer science (Internet), Lecture notes in artificial intelligence, 14102. https://link.springer.com/chapter/10.1007/978-3-031-40498-6_2. [COBISS.SI-ID 165084419]

Linguistic and teacher evaluation of automatic annotation of texts

We developed a reference dataset for both quantitative and qualitative evaluation of automatic grammatical error correction. The dataset is based on the Šolar 3.0 corpus, but instead of using teacher corrections—which are often adapted to the learner’s developmental level and can vary in form—it contains consistently and systematically annotated corrections. The new dataset, named Šolar-Eval 1.0, includes 109 texts produced in Slovene primary and secondary schools. The texts were linguistically analyzed, and 9,808 language issues were manually annotated across multiple linguistic levels. The dataset has been published under an open license in the CLARIN.SI repository and described in a peer-reviewed article. Šolar-Eval 1.0 was used for both machine and linguistic evaluation of the models developed within the project, with results reported on the HuggingFace platform and in the articles listed above.
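To illustrate what a quantitative evaluation against consistently annotated reference corrections can look like, here is a minimal sketch of edit-level precision, recall, and F0.5, a measure commonly used in grammatical error correction research. It is an assumption that this measure is relevant here; the exact metrics used with Šolar-Eval 1.0 are reported in the cited publications. The function names and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(source_tokens, corrected_tokens):
    """Return the set of edits (start, end, replacement) turning source into corrected."""
    matcher = SequenceMatcher(a=source_tokens, b=corrected_tokens, autojunk=False)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, tuple(corrected_tokens[j1:j2])))
    return edits


def precision_recall_f(system_edits, reference_edits, beta=0.5):
    """Compare system edits with reference edits; beta < 1 weights precision higher."""
    true_positives = len(system_edits & reference_edits)
    precision = true_positives / len(system_edits) if system_edits else 1.0
    recall = true_positives / len(reference_edits) if reference_edits else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    source = "jaz sm šel v šolo vceraj".split()
    reference = "jaz sem šel v šolo včeraj".split()  # manually annotated correction
    system = "jaz sem šel v šolo vceraj".split()     # automatic correction
    p, r, f = precision_recall_f(
        extract_edits(source, system), extract_edits(source, reference)
    )
    print(f"P={p:.2f} R={r:.2f} F0.5={f:.2f}")  # P=1.00 R=0.50 F0.5=0.83
```

In the example the system fixes one of the two annotated problems and introduces no wrong corrections, so precision is perfect while recall is one half. Dedicated tools align and classify edits in a more linguistically informed way than the plain token diff used in this sketch.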

ARHAR HOLDT, Špela, GANTAR, Polona, BON, Mija, GAPSA, Magdalena, LAVRIČ, Polona, KLEMEN, Matej. Dataset for evaluation of Slovene spell- and grammar-checking tools Šolar-Eval 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1902. [COBISS.SI-ID 185626115]

GANTAR, Polona, BON, Mija, GAPSA, Magdalena, ARHAR HOLDT, Špela. Šolar-Eval: evalvacijska množica za strojno popravljanje jezikovnih napak v slovenskih besedilih. Jezik in slovstvo. [Tiskana izd.]. 2023, letn. 68, št. 4, str. 89-108, DOI: 10.4312/jis.68.4.89-108. [COBISS.SI-ID 187559683]

We conducted the teacher evaluation with the same team that participated in the work package Practice-based digitally-supported development of writing skills: 18 teachers of Slovene from different types of schools. In the study, participants examined two authentically produced school texts that had been automatically corrected by the system and presented in a simple interface. During the evaluation, we recorded their screen activity, a minimal view of their working environment (face), and their think-aloud commentary. The independent evaluation of the automatic corrections was followed by interviews focusing on the capabilities and limitations of the developed correction models, as well as participants’ wishes regarding additional functionalities. The results of the study were transcribed and, after appropriate anonymization, published in the RUL repository. The findings, including the evaluation report and specifications for further tool development, were published in a scientific journal. In addition, the tool for automatic comma placement in Slovene was evaluated separately with a group of students, and the results were presented at a conference.

Teacher evaluation questionnaire in Slovene and English.

We evaluated the comma correction tool CJVT Vejice among students and presented the results at a conference.

GODEC SORŠAK, Lara. Raba vejice v pisnih besedilih študentov in uporabnost spletnega orodja Vejice 1.0. V: ŠTUMBERGER, Saška (ur.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. Str. 103-111. Zbirka Obdobja, 43. DOI: 10.4312/Obdobja.43.103-111. [COBISS.SI-ID 215559939]

Providing feedback in digital environment

Developing a model combining formative assessment and crowdsourcing

In this project activity, we reviewed literature and tools related to digital monitoring of writing competence, with a particular focus on the effects of automated written corrective feedback provided by digital writing support tools. A systematic review of 22 studies showed that these tools are especially beneficial because their feedback is fast, accurate, and varied in format, with hybrid approaches that combine automated tools with teacher support proving most effective. We also addressed the challenges faced by students with specific learning difficulties in higher education, highlighting the importance of digitally supported learning and Universal Design for Learning (UDL) in making pedagogical approaches more accessible, equitable, and flexible.

PIŽORN, Karmen, LEMUT BAJEC, Melita. Systematic review of digital writing assistants in EFL writing instruction. Sodobna pedagogika, okt. 2025, letn. 76, št. 3, str. 141–161, DOI: 10.63384/sptB5_z796a. [COBISS.SI-ID 256918531]

KOŠAK BABUDER, Milena, POREDOŠ, Mojca, PIŽORN, Karmen. Digitalno podprto učenje študentov s specifičnimi učnimi težavami v visokošolskem izobraževanju. Sodobna pedagogika, okt. 2025, letn. 76, št. 3, str. 107–125, 177-197, DOI: 10.63384/sptB53s795as. [COBISS.SI-ID 256927747]

We explored new solutions in digital collaborative practices between teachers and learners. As a model for linking formative assessment with crowdsourcing, we implemented a gamified crowdsourcing approach for reviewing and validating pedagogically appropriate corpus examples, which serve as the foundation for learning materials, exercises, and assessments. The aim is to save time through collaborative content creation and to develop a large, openly accessible, and carefully reviewed collection of language examples. The game, named CrowLL (Crowdsourcing for Language Learning), supports Slovene and several other languages and was presented at conferences and in a peer-reviewed journal article.

ZINGANO KUHN, Tanara, ARHAR HOLDT, Špela, KOSEM, Iztok, TIBERIUS, Carole, KOPPEL, Kristina, ZVIEL-GIRSHIN, Rina. Data preparation in crowdsourcing for pedagogical purposes : the case of the CrowLL game. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2022, letn. 10, št. 2, str. 62-100, DOI: 10.4312/slo2.0.2022.2.62-100. [COBISS.SI-ID 146362883]

ZINGANO KUHN, Tanara, TIBERIUS, Carole, ARHAR HOLDT, Špela, KOPPEL, Kristina, KOSEM, Iztok, ZVIEL-GIRSHIN, Rina, LUÍS, Ana R. Developing manually annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene (the CrowLL Project). V: LINDÉN, Krister (ur.), NIEMI, Jyrki (ur.), KONTINO, Thalassia (ur.). CLARIN annual conference proceedings 2023: 16 – 18 October 2023 Leuven, Belgium. [S. l.: s. n.], 2023. Str. 173-177. CLARIN Annual Conference Proceedings. https://office.clarin.eu/v/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf. [COBISS.SI-ID 200002819]

ZINGANO KUHN, Tanara, KOPPEL, Kristina, ARHAR HOLDT, Špela, TIBERIUS, Carole, ZVIEL-GIRSHIN, Rina, KOSEM, Iztok. Annotating corpora for language learning and lexicography with the Crowdsourcing for Language Learning (CrowLL) game. V: MEDVEĎ, Marek (ur.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023): invisible lexicography: book of abstracts : Brno, 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 13-14. https://elex.link/elex2023/wp-content/uploads/elex2023_book_of_abstracts.pdf. [COBISS.SI-ID 184965379]

We also investigated the attitudes of the teacher community toward crowdsourcing, using the Thesaurus of Modern Slovene—the first Slovene thesaurus to include user-contributed synonym candidates—as a case study. Although the thesaurus has strong potential for use in education, no prior studies had examined how dictionary users—especially Slovene teachers—evaluate user participation compared to the lexicographers who designed the resource. Our results show that teachers consider user-contributed synonyms to be both relevant and useful. At the same time, the findings emphasize the importance of involving users not only as data contributors but also as evaluators in the development of language resources.

GAPSA, Magdalena, ARHAR HOLDT, Špela. How lexicographers evaluate user contributions in the Thesaurus of Modern Slovene in comparison to dictionary users. V: MEDVEĎ, Marek (ur.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 178-200. Electronic lexicography in the 21st century. Proceedings of eLex … conference. https://elex.link/elex2023/wp-content/uploads/47.pdf. [COBISS.SI-ID 162928387]

GAPSA, Magdalena. Učiteljske ocene uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine. Jezik in slovstvo. 2024, letn. 69, št. 4, str. 35-50, DOI: 10.4312/jis.69.4.35-50.

Corpus-based and scaffolded feedback scenarios

Text corpora and responsive lexical resources provide data on real-life contemporary language use across broader contexts and various communicative situations, making them a fundamental reference for literacy education. In the project, we analysed the challenges and limitations of two corpus-based responsive dictionaries, the Thesaurus of Modern Slovene and the Collocations Dictionary of Modern Slovene, for use in educational settings, and introduced key improvements. For the Šolar 3.0 corpus, which serves as the basis for corpus-based study of language problems and corrections in student writing, we developed a powerful, specialized concordancer interface. This significantly improved access to corpus data for a broader audience, for example teachers preparing teaching materials and exercises, as well as students training to become language educators. The new concordancer allows for targeted searches of specific linguistic corrections across various language levels and supports examination of how teacher feedback differs across educational stages, school types, and Slovene regions. These innovations were presented at academic conferences and in a journal article.

KOSEM, Iztok, ARHAR HOLDT, Špela, GANTAR, Polona, KREK, Simon. Collocations Dictionary of Modern Slovene 2.0. V: MEDVEĎ, Marek (ur.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023) : proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 491-507, ilustr. Electronic lexicography in the 21st century. Proceedings of eLex … conference. ISSN 2533-5626. https://elex.link/elex2023/wp-content/uploads/100.pdf. [COBISS.SI-ID 158852867]

ARHAR HOLDT, Špela, GANTAR, Polona, KOSEM, Iztok, PORI, Eva, ROBNIK ŠIKONJA, Marko, KREK, Simon. Thesaurus of Modern Slovene 2.0. V: MEDVEĎ, Marek (ur.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023) : proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 366-381, ilustr. Electronic lexicography in the 21st century. Proceedings of eLex … conference. ISSN 2533-5626. https://elex.link/elex2023/wp-content/uploads/82.pdf. [COBISS.SI-ID 158818819]

ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca. Developing a specialised concordancer for corpora with language corrections. V: TaLC 2024 : 16th Teaching and Language Corpora Conference : July 7th to 10th 2024, Manchester Metropolitan University, Manchester, UK : book of abstracts. [Manchester: Manchester Metropolitan University], 2024. Str. [77]. https://talc2024.co.uk/wp-content/uploads/2024/07/book-of-abstracts-talc-2024_final-4.pdf. [COBISS.SI-ID 204245507]

KOSEM, Iztok, STRITAR KUČUK, Mojca, ARHAR HOLDT, Špela. Corplus: a new concordancer for exploring authentic texts with language corrections. Journal of responsible technology. Mar. 2026, vol. 25, [article no.] 100144, str. 1-9, DOI: 10.1016/j.jrt.2025.100144. [COBISS.SI-ID 263757059]

Corpora and corpus-based resources for Slovene—such as Sloleks 2.0, Sopomenke 2.0, Kolokacije 2.0, the reference corpus Gigafida 2.0, and Šolar 3.0—can be used in various ways: by directing students to specific entries in a language resource; by integrating and displaying selected datasets within a digital tool; or by incorporating explanations, examples, and exercises based on linguistic analyses. With the emergence of generative artificial intelligence, it is now also important to consider machine-generated feedback, provided it supports teachers and aligns with their expectations and needs. We presented AI-supported selection of pedagogically appropriate corpus examples as a model for future applications at a recent international conference.

KOSEM, Iztok, ZINGANO KUHN, Tanara, ARHAR HOLDT, Špela, KOPPEL, Kristina, TIBERIUS, Carole, ZVIEL-GIRSHIN, Rina. Examining the potential of AI in the annotation of corpus examples for language learning. In: CILC2024: XV Congreso Internacional de Lingüística de Corpus, Las Palmas de Gran Canaria, España = 15th International Corpus Linguistics Conference, Las Palmas de Gran Canaria, Spain, 22–24 May 2024: [book of abstracts]. [S. l.]: Aelinco, 2024. pp. 93–95. https://drive.google.com/file/d/1rHS4OwztEPvYOPwHE5Mxn-lnK2ErbiHV/view. [COBISS.SI-ID 199984643]

Testing feedback with target user groups

Project participants from the teaching community expressed a desire for a professional conference where they could share their experiences, practices, and perspectives on providing feedback with a broader audience of educators. As the project’s first public event, we therefore organized a professional conference titled Correcting Language and Texts – Teacher Feedback in School Practice. The event was very well attended, and as a result, a conference volume was compiled, featuring 25 peer-reviewed contributions by teachers addressing topics such as language correction, feedback, formative assessment, and other themes relevant to the project.

PORI, Eva (urednik), ARHAR HOLDT, Špela (urednik). Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi: zbornik konference. 1. izd. Ljubljana: Založba Univerze, 2023. Spletni vir, 374 strani, DOI: 10.4312/9789612972394. [COBISS.SI-ID 178525187]

Questions regarding the preferred form of digital feedback were included in teacher interviews described in the activity Linguistic and Teacher Evaluation of Automatic Annotation of Texts. We explored preferences for the visualisation of feedback in digital tools, with results showing the importance of didactically appropriate presentation. Particularly valued were links to relevant language resources and the integration of statistics on textual features. However, an excessive number of corrections may demotivate students, while automated corrections may reduce their active engagement. Teachers thus emphasised the need for graded correction options. This concept of graded feedback was interpreted in different ways: as adjustable strictness of corrections for different user groups, the possibility of choosing between automatic corrections or simple colour highlighting of errors, or even staged correction across language levels—for instance, from structural issues to orthographic ones. The findings specifying user needs and preferences for future model upgrades were published in a peer-reviewed journal.

Project events and lectures

Teacher conference

On April 5, 2023, we organized the professional conference Correcting Language and Texts – Teacher Feedback in School Practice (Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi) at the Faculty of Public Administration, University of Ljubljana.

You can view the programme and event photos at this link.
The conference proceedings volume with peer-reviewed contributions is available here.

Teacher training

On November 24, 2023, a teacher training session was held at the Faculty of Arts, University of Ljubljana, organized by the Department of Slovene Studies. Among the topics presented was Preparing Teaching Materials Using the Šolar 3.0 Corpus of Student Writing (Priprava učnih gradiv s korpusom šolskih pisnih izdelkov Šolar 3.0). The following materials are available:

slides (PDF)
guidelines for annotating language corrections in the Šolar corpus (PDF)
frequency list of language problems from the Šolar corpus (XLSX)
CJVT concordancer demo (link)
noSketch Engine concordancer on CLARIN.SI (link)

Invited lectures

ARHAR HOLDT, Špela. Leveraging error-annotated corpora and the Svala Tool: the case of Slovene: guest talk at Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden, 20 June 2023. [COBISS.SI-ID 171445507]
ARHAR HOLDT, Špela. A specialised concordancer for corpora with annotated language corrections: invited presentation at Department of Swedish, Multilingualism, and Language Technology, University of Gothenburg, 23rd of April 2024, Gothenburg, Sweden. [COBISS.SI-ID 214135299]
ARHAR HOLDT, Špela. From developmental corpus to developed applications: the journey of Šolar 3.0: presentation at the Faculty of Arts and Humanities of the University of Coimbra, Coimbra, Portugal, 29 Nov. 2024. [COBISS.SI-ID 229553923]
KOSEM, Iztok. Corpus tools and language resources at the University of Ljubljana: purposes and people behind the development: presentation at the Faculty of Arts and Humanities of the University of Coimbra, Coimbra, Portugal, 29 Nov. 2024. [COBISS.SI-ID 229749251]

Final Project Event

The final event of the research project took place on 23 September 2025 in Zbornična dvorana of the University of Ljubljana. During the event, the project team presented an overview of key outcomes, selected studies, resources, and tools developed throughout the project. All lectures will be made available on the VideoLectures portal.

Programme:

10:30–10:40  Welcome and opening remarks by the Dean of the Faculty of Arts, University of Ljubljana, Prof. Dr. Mojca Schlamberger Brezar [VIDEO]
10:40–11:00  Špela Arhar Holdt: Results of the project “Empirical Foundations for Digitally-Supported Development of Writing Skills” [VIDEO]
11:00–11:15  Iztok Kosem: First steps towards a core vocabulary list for school use [VIDEO]
11:15–11:30  Tadeja Rozman: The student writing corpus KOŠ [VIDEO]
11:30–11:45  Matej Klemen: Automatic correction of language errors using language models: from data to solutions [VIDEO]
11:45–12:00  Break
12:00–12:15  Alenka Rot Vrhovec: Insights from the teacher survey on the correction of student writing [VIDEO]
12:15–12:30  Tina Munda: Language correction in the digital environment: a qualitative study with Slovene language teachers [VIDEO]
12:30–12:45  Karmen Pižorn: The role of digital tools in developing EFL writing skills: a systematic literature review [VIDEO]
12:45–13:00  Milena Košak Babuder: Digitally-supported learning among students with specific learning difficulties [VIDEO]
13:00–14:00  Audience discussion, refreshments, and networking
