Project scope
The aim of the project Empirical foundations for digitally-supported development of writing skills (PROP) is to support teachers who correct and grade student writing. Developing writing competency requires practising writing skills, but more writing also means more work for teachers. Research has shown that individualised, goal-oriented and formative feedback leads to the best literacy results. However, such feedback is time-consuming and requires support in the form of adequate descriptors, indicators and, not least, information about modern language use in various communication situations. In many languages, including Slovene, these conditions are not met, which is why there is too little school writing, and feedback often remains limited to surface corrections of errors, such as grammatical mistakes.
We believe the solution lies in digital support of teachers’ work. On the one hand, automatic identification and substantive categorisation of grammatical errors would free teachers from routine corrections and give them more time to pursue advanced teaching objectives. On the other, a digitally-supported model of providing feedback, based on empirically founded indicators and descriptors, would ease the preparation of corrective study materials and allow for peer assessment and long-term development monitoring. Advances in the field of natural language processing, machine learning, and corpus linguistics make this an attainable goal, which is attested by a variety of digital tools, prototypes, and portals currently being designed. The innovative aspect of the project, which necessitates interdisciplinary collaboration, lies in its proposal of solutions that are based on empirical data: authentic teacher practices coupled with data on real, modern language use.
Pertaining to the latter, Slovene may have a relative advantage over other languages, since it already possesses a corpus of student texts from Slovene primary and secondary schools, which also includes teacher corrections categorised by type of language problem. This corpus represents untapped potential for empirical analyses of authentic school written production and language corrections and for development of a tool that would automatically categorise language problems based on real-life principles of teacher correction.
During the course of the project, we will use automatic extraction and analyses of richly annotated corpora to collect the empirical data needed for specifying developmental indicators and descriptors for various educational stages. We will then use this information to design feedback scenarios for different language levels: spelling, morphology, vocabulary, and syntax. We will expand the corpus of school texts by including examples of student writing from the tertiary level and empirically research the specifics of providing feedback in higher education settings. Next, we will develop a tool that automatically identifies language problems in a given text, taking into account the level of the writer. Given the rarity of language resources such as the Šolar corpus, we will also test the performance of the tool on other comparable training datasets and adapt it to depend less on such resources, thus making the methodology applicable to other languages.
User research is a key part of our project: we will examine existing practices of providing teacher feedback in the development of writing skills, first by conducting a web survey and then by recording teachers’ screens while they correct texts in a digital environment. Teachers and students will also conduct user evaluations of the solutions developed in the project. Lastly, we intend to combine findings in formative assessment with crowdsourcing and apply them to language didactics in order to develop a strategy of digitally-supported development of writing skills, which will take into consideration all the relevant didactic and ethical issues.
Project goals
Corpora and corpus data: richly annotate a corpus of student writing, a corpus of school textbooks, and a corpus of literature aimed at youngsters and young adults; use data extraction and corpus analyses to facilitate empirical foundations for developmental indicators and descriptors for various educational stages; use these results to develop feedback scenarios on the levels of spelling, morphology, vocabulary, and syntax; build a pilot corpus of student academic writing and include it in all of the above steps.
Software module: develop a software module which automatically identifies and categorises language errors on different language levels; adapt the software to teachers’ needs and the specifics of providing feedback at different educational stages; create the foundation for applying the methodology to other languages.
User research: empirically research existing teaching practices of providing feedback for the development of writing skills with the aid of a) an online survey and b) screen recording of teacher corrections in a digital environment; create the basis for comparable research in other languages; include teachers/lecturers in the evaluation of the software module and teachers/lecturers and pupils/students in the evaluation of the corpus-based feedback scenarios.
Models and strategies: combine findings from the fields of formative assessment and crowdsourcing for the needs of language education and create a model for providing digitally-supported feedback to help the development of writing skills; form a strategy of digitally-supported development of writing skills.
Dissemination of research results: ensure that the results are published in keeping with the National strategy for open access to scientific publications and research data in Slovenia; inform the scientific and general public about project results (scientific publications, events, website) and encourage further exploitation of the results.
Project group
University of Ljubljana, Faculty of Arts
– Špela Arhar Holdt, 27674
– Iztok Kosem, 33796
– Polona Gantar, 16313
– Marko Stabej, 11651
– Teja Goli, 52176
– Magdalena Gapsa, 53628
– Mija Bon, 51891
University of Ljubljana, Faculty of Computer and Information Science
– Marko Robnik-Šikonja, 15295
– Simon Krek, 26166
– Matej Ulčar, 55173
– Aleš Žagar, 56007
– Matej Klemen, 55754
University of Ljubljana, Faculty of Public Administration
– Tadeja Rozman, 25578
University of Ljubljana, Faculty of Education
– Karmen Pižorn, 21612
– Alenka Rot Vrhovec, 34816
– Lara Godec Soršak, 25590
– Milena Košak Babuder, 26199
– Tomaž Petek, 32433
Project timeline
1. Corpus analysis of written production at various educational stages
– Corpus data preparation for linguistic and machine tasks [M1-6]
– Compiling a pilot corpus of student academic texts [M1-12]
– Quantitative and qualitative linguistic analyses of student writing [M1-18]
– Empirical data for developmental indicators on levels of vocabulary and syntax [M7-18]
2. Practice-based digitally-supported development of writing skills
– Questionnaire survey about teaching practices used to develop writing skills [M1-18]
– Recording teacher corrections of student writing and semi-structured interviews [M10-24]
– Designing a strategy for digitally-supported development of writing skills [M19-35]
3. Development and evaluation of automatic identification and categorisation of language problems
– Designing a model for automatic error annotation [M7-35]
– Testing the applicability of methodology to other languages [M25-35]
– Linguistic and teacher evaluation of automatic annotation of texts [M13-35]
4. Providing feedback in a digital environment
– Developing a model combining formative assessment and crowdsourcing [M7-18]
– Corpus-based and scaffolded feedback scenarios [M13-24]
– Testing feedback with target user groups [M22-35]
5. Coordination and dissemination
– Coordination, reporting and dissemination [M1-36]
– Scientific publications and research data [M1-36]
Corpus analysis of written production at various educational stages
Corpus data preparation for linguistic and machine tasks
As part of the project, we prepared three text corpora specialized for educational use: the corpus of student writing Šolar 3.0, the open-access school textbook corpus ccUčbeniki 1.0, and the youth literature corpus ccMaks 1.0. Prior to the project, Šolar was available in version 2.0, Maks was only accessible via concordancers (not as a full database), and the Učbeniki corpus was not publicly available. All three resources had previously been linguistically annotated, but using older automatic tagging tools. In this project, we uniformly re-annotated the corpora using the state-of-the-art CLASSLA v1.1.1 tagger, covering tokenization, sentence segmentation, lemmatization, and morphosyntactic tagging according to the MULTEXT-East v6 standard, as well as dependency syntax (JOS) and named entity recognition. In collaboration with other projects, we also ensured that two additional corpora for Slovene as a second language were prepared using the same standards: the KUUS textbook corpus and the KOST corpus of Slovene as a foreign language. This uniformity allows for more reliable data comparison, while higher-level linguistic annotations support more advanced linguistic and computational analyses, as well as better data usability for machine learning applications.
The corpora are openly available in the CLARIN.SI repository:
ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Developmental corpus Šolar 3.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1589. [COBISS.SI-ID 124160003]
KOSEM, Iztok, PORI, Eva, ŽAGAR, Aleš, ARHAR HOLDT, Špela. Corpus of Slovenian textbooks ccUčbeniki 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1693. [COBISS.SI-ID 129443843]
VERDONIK, Darinka, MAJNINGER, Sandi, DOBROVOLJC, Kaja, ANTLOGA, Špela, ZÖGLING MARKUŠ, Aleksandra, VORŠIČ, Ines, ZEMLJAK JONTES, Melita, KOLETNIK, Mihaela, VALH LOPERT, Alenka, ŠEK, Polonca, KOSEM, Iztok, MAJHENIČ, Simona, FERME, Marko, ŽAGAR, Aleš, ARHAR HOLDT, Špela. Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1692. [COBISS.SI-ID 129467395]
KLEMEN, Matej, KOSEM, Iztok, ARHAR HOLDT, Špela, POLLAK, Senja, HUBER, Damjan, LUTAR, Mateja. Corpus of textbooks for learning Slovenian as L2 KUUS 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1877. [COBISS.SI-ID 178103555]
STRITAR KUČUK, Mojca, ŠTER, Helena, PISEK, Staša, PETRIC LASNIK, Ivana, KETE MATIČIČ, Jana, PIRIH SVETINA, Nataša, PREGLAU, Daniela, ARHAR HOLDT, Špela, KRSNIK, Luka, ERJAVEC, Tomaž, PEGAN, Jasmina, HUBER, Damjan. Slovene learner corpus KOST 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1887. [COBISS.SI-ID 181138179]
The corpus preparation process—which is particularly demanding for corpora containing linguistic corrections, such as Šolar 3.0—was presented at conferences, in a monograph, and in the prestigious journal Language Resources and Evaluation.
ARHAR HOLDT, Špela, KOSEM, Iztok. Šolar, the developmental corpus of Slovene. Language resources and evaluation. 2024, str. 1-27. DOI: 10.1007/s10579-024-09758-4. [COBISS.SI-ID 204228867]
ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca. Metode in orodja za lažjo pripravo korpusov usvajanja jezika. V: PIRIH SVETINA, Nataša (ur.), FERBEŽAR, Ina (ur.). Na stičišču svetov : slovenščina kot drugi in tuji jezik. 1. natis. Ljubljana: Založba Univerze, 2022. Str. 23-30. Zbirka Obdobja, 41. DOI: 10.4312/Obdobja.41.23-30. [COBISS.SI-ID 129063939]
ARHAR HOLDT, Špela, PORI, Eva, KOSEM, Iztok. Prihodnost korpusa Šolar. V: ARHAR HOLDT, Špela (ur.), KREK, Simon (ur.). Razvoj slovenščine v digitalnem okolju. Ljubljana: Založba Univerze, 2023. Str. 61-91. Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/522/852/9442. [COBISS.SI-ID 185543683]
KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja. Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. V: PIRIH SVETINA, Nataša (ur.), FERBEŽAR, Ina (ur.). Na stičišču svetov : slovenščina kot drugi in tuji jezik. Ljubljana: Založba Univerze, 2022. Str. 165-174. Zbirka Obdobja, 41. DOI: 10.4312/Obdobja.41.165-174. [COBISS.SI-ID 129975811]
Compiling a pilot corpus of student academic texts
We developed a new pilot corpus of student writing, KOŠ, which includes written texts by students from the Faculty of Public Administration and the Faculty of Education at the University of Ljubljana. The corpus contains 293 texts (297,422 words of student writing and 1,091 words of teacher commentary). The texts were collected following the Šolar corpus preparation methodology, which involves recording all relevant metadata, including teacher-provided language corrections, ensuring legal compliance for open access, and storing the data in a compatible format.
Text collection agreement for students: digital signature / manual signature.
Text collection agreement for professors: digital signature / manual signature.
ROZMAN, Tadeja, ARHAR HOLDT, Špela. Gradnja Korpusa študentskih besedil KOŠ. V: FIŠER, Darja (ur.), ERJAVEC, Tomaž (ur.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 15.-16. september 2022, Ljubljana, Slovenija. Ljubljana: Inštitut za novejšo zgodovino, 2022. Str. 267-270. https://nl.ijs.si/jtdh22/pdf/JTDH2022_Rozman_ArharHoldt_Gradnja-Korpusa-studentskih-besedil-KOS.pdf. [COBISS.SI-ID 131012099] Recording of the presentation.
Since writing and the provision of feedback at the tertiary level differ somewhat from the secondary level, which is captured in the Šolar corpus, we upgraded the methodology accordingly. To further explore current feedback practices, we conducted a survey involving 459 educators teaching at Slovenian public universities and independent higher education institutions. The survey results were presented to the public, and the research data were published in open access.
ROZMAN, Tadeja, ARHAR HOLDT, Špela, STABEJ, Marko. Podajanje povratnih informacij o študentskih besedilih: raziskovalni podatki = Feedback on student writing: research data. Ljubljana: [s. n.], 2023. Repozitorij Univerze v Ljubljani – RUL, DOI: 20.500.12556/RUL-152767. [COBISS.SI-ID 176945923]
ROZMAN, Tadeja, STABEJ, Marko. Univerzitetno pisanje in popravljanje besedil: prakse in stališča. V: ŠTUMBERGER, Saška (ur.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. Str. 285-292. Zbirka Obdobja, 43. DOI: 10.4312/Obdobja.43.285-292. [COBISS.SI-ID 215458307]
Quantitative and qualitative linguistic analyses of student writing
For both quantitative and qualitative analyses of school writing, high-quality and reliable annotation of language corrections in student texts is essential. In this project, we improved the annotation methodology, particularly by upgrading the Šolar annotation scheme and the annotation tool CJVT Svala. These advancements were presented at the established national symposium Obdobja and at the international LREC conference.
ARHAR HOLDT, Špela, LAVRIČ, Polona, ROBLEK, Rebeka, GOLI, Teja, BON, Mija, 2023: Categorizing Teachers’ Corrections: Guidelines for Annotating the Šolar Corpus. Version 1.2. Prepared in the project Empirical foundations for digitally-supported development of writing skills. https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines.
ARHAR HOLDT, Špela, ERJAVEC, Tomaž, KOSEM, Iztok, VOLODINA, Elena. Towards an ideal tool for learner error annotation. V: CALZOLARI, Nicoletta (ur.). The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): main conference proceedings: 20-25 May, 2024, Torino, Italia. [Paris]: ELRA Language Resources Association (ELRA); [Stroudsburg]: International Committee on Computational Linguistics, cop. 2024. Str. 16392-16398. https://aclanthology.org/2024.lrec-main.1424.pdf. [COBISS.SI-ID 199958019].
ARHAR HOLDT, Špela, POPIČ, Damjan, STRITAR KUČUK, Mojca. Primerjava sistemov za označevanje jezikovnih popravkov v štirih slovenskih besedilnih korpusih. V: ŠTUMBERGER, Saška (ur.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. Str. 11-20. Zbirka Obdobja, 43. DOI: 10.4312/Obdobja.43.11-20. [COBISS.SI-ID 215306243]
Using advanced data extraction from the Šolar 3.0 corpus, we compiled a frequency list of language issues containing 36,570 sentences from student writing, each corrected by a teacher. The corrections were manually categorized into 180 distinct types based on their content. Each sentence is annotated with metadata such as the type of source text, the educational level of the author, and the type and region of the school where the text was produced. The dataset reveals which issues teachers at various educational levels focus on most, how they correct them, which problems are most frequent, and which are regionally conditioned. We conducted statistical analyses of the data and presented the most persistent language difficulties—those that remain present in student writing up to the end of secondary school—at the TALC conference. We also analyzed and presented issues related to comma usage in school essays.
ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, ŽAGAR, Aleš, KOSEM, Iztok. Frequency list of language problems from Šolar 3.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1716. [COBISS.SI-ID 130413571]
ARHAR HOLDT, Špela. Leveraging frequency list of language problems from Šolar 3.0. V: TaLC 2024: 16th Teaching and Language Corpora Conference: July 7th to 10th 2024, Manchester Metropolitan University, Manchester, UK: book of abstracts. [Manchester: Manchester Metropolitan University], 2024. Str. [121]. https://talc2024.co.uk/wp-content/uploads/2024/07/book-of-abstracts-talc-2024_final-4.pdf. [COBISS.SI-ID 204247555]
BON, Mija, GAPSA, Magdalena. Analiza napak pri rabi vejice v šolskih spisih. V: MARUŠIČ, Franc (ur.), et al. Škrabčevi dnevi 13: knjižica povzetkov simpozija 2023: 20. oktober 2023, Nova Gorica. Nova Gorica: Univerza v Novi Gorici, 2023. Str. 3. https://skrabcevi-dnevi.zrc-sazu.si/Portals/19/Povzetki/SD-13-Povzetki-simpozija-2023.pdf. [COBISS.SI-ID 207455235]
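The kind of aggregation described above, counting correction types per educational level, can be illustrated with a minimal sketch. The records, field order, and category labels below are hypothetical and do not reflect the actual schema of the published frequency list; the sketch only shows the counting step, not the statistical analyses carried out in the project.

```python
from collections import Counter, defaultdict

# Hypothetical records: (correction_type, educational_level) per corrected sentence.
# Labels are illustrative only, not the dataset's actual categories.
records = [
    ("comma", "primary"),
    ("comma", "secondary"),
    ("comma", "secondary"),
    ("capitalisation", "primary"),
    ("word_order", "secondary"),
]

def counts_by_level(rows):
    """Count how often each correction type occurs at each educational level."""
    table = defaultdict(Counter)
    for correction_type, level in rows:
        table[level][correction_type] += 1
    return table

def relative_frequencies(counter):
    """Convert raw counts to shares of all corrections at one level."""
    total = sum(counter.values())
    return {label: count / total for label, count in counter.items()}

table = counts_by_level(records)
for level in sorted(table):
    print(level, table[level].most_common())
```

Comparing relative rather than raw frequencies across levels is one simple way to see which issues persist, since the number of corrected sentences differs between levels.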
Empirical data for developmental indicators on levels of vocabulary and syntax
We established a methodology for extracting core vocabulary lists from pedagogical corpora, which included updating the corpus extraction tool LIST to version 1.3. We generated frequency lists of lemmas from the textbook corpus and compiled core vocabulary lists for levels A1, A2, and B1, based on the Common European Framework of Reference for Languages (CEFR). Special attention was given to vocabulary at the A1 level, for which we developed a lexical description concept that includes both authentic and pedagogically adapted corpus examples and collocations. All resources and tools were published openly on the CLARIN.SI repository, and the results were presented at both national and international conferences.
KRSNIK, Luka, ARHAR HOLDT, Špela, ČIBEJ, Jaka, DOBROVOLJC, Kaja, KLJUČEVŠEK, Aleksander, KREK, Simon, ROBNIK ŠIKONJA, Marko. Corpus extraction tool LIST 1.3. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1964. [COBISS.SI-ID 218014211]
KOSEM, Iztok, PORI, Eva, ARHAR HOLDT, Špela. Frequency list of textbook vocabulary by level of education in elementary and secondary schools. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1719. [COBISS.SI-ID 192040707]
KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja. Core vocabulary for Slovenian as L2 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1697. [COBISS.SI-ID 130844419]
PORI, Eva, KNEZ, Mihaela, KOSEM, Iztok, ARHAR HOLDT, Špela, KLEMEN, Matej, GANTAR, Polona, ZGAGA, Karolina, ROBLEK, Rebeka. A1 core vocabulary with lexical information for Slovenian 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1896. [COBISS.SI-ID 192040963]
KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, PORI, Eva, GANTAR, Polona, KNEZ, Mihaela. Building a CEFR-labeled core vocabulary and developing a lexical resource for Slovenian as a second and foreign language. V: MEDVEĎ, Marek (ur.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference: [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. Str. 654-668. Electronic lexicography in the 21st century, https://elex.link/elex2023/wp-content/uploads/118.pdf. [COBISS.SI-ID 158856451]
For syntactic research, we developed a methodology for extracting syntactic information from pedagogical corpora. We published openly accessible frequency lists of collocations from the Šolar 3.0 corpus and the Učbeniki 1.0 corpus in the CLARIN.SI repository, as well as frequency lists of syntactic structures from both corpora. These data can serve as a basis for developing empirically grounded developmental benchmarks, descriptors, and other materials for learning Slovene. The methodology for syntactic extraction was also presented at a conference.
MUNDA, Tina, ARHAR HOLDT, Špela. Na poti k skladenjskim analizam šolskega pisanja: skladenjski vzorci v korpusu Šolar 3.0. V: ARHAR HOLDT, Špela (ur.), ERJAVEC, Tomaž (ur.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 19.-20. september 2024, Ljubljana, Slovenija = Language technologies and digital humanities: proceedings of the conference: 19-20 September 2024, Ljubljana, Slovenia. Ljubljana: Inštitut za novejšo zgodovino: = Institute of Contemporary History, 2024. Str. 577-588. https://zenodo.org/records/13912515. [COBISS.SI-ID 212016387]
MUNDA, Tina, ARHAR HOLDT, Špela, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Frequency list of collocations from the Šolar 3.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2011. [COBISS.SI-ID 225465859]
MUNDA, Tina, ARHAR HOLDT, Špela, KOSEM, Iztok, PORI, Eva, KREK, Simon. Frequency list of collocations from the Učbeniki 1.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2012. [COBISS.SI-ID 225461251]
MUNDA, Tina, ARHAR HOLDT, Špela, DOBROVOLJC, Kaja, ROZMAN, Tadeja, STRITAR KUČUK, Mojca, KREK, Simon, KRAPŠ VODOPIVEC, Irena, STABEJ, Marko, PORI, Eva, GOLI, Teja, LAVRIČ, Polona, LASKOWSKI, Cyprian Adam, KOCJANČIČ, Polonca, KLEMENC, Bojan, KRSNIK, Luka, KOSEM, Iztok. Frequency lists of syntactic structures from the Šolar 3.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2009. [COBISS.SI-ID 225469443]
MUNDA, Tina, ARHAR HOLDT, Špela, DOBROVOLJC, Kaja, KOSEM, Iztok, PORI, Eva, KREK, Simon. Frequency lists of syntactic structures from the Učbeniki 1.0 corpus. Ljubljana: University of Ljubljana, Centre for Language Resources and Technologies: University of Ljubljana, Faculty of Arts, 2025. CLARIN.SI data & tools. ISSN 2820-4042. http://hdl.handle.net/11356/2010. [COBISS.SI-ID 225467395]
Practice-based digitally-supported development of writing skills
Questionnaire survey about teaching practices used to develop writing skills
We conducted a large-scale study among teachers who guide the production of written texts through language corrections and other forms of feedback. We investigated practices related to the correction of written texts, including: how much time participants devote to correction on a weekly or monthly basis; what types of feedback they provide; the format of texts and corrections (written or digital); which tools and resources they use for correction; and which aspects of their current practices they consider most problematic. The questionnaire was prepared in two language versions—Slovene and English—which will enable comparable studies in other countries. The project collected a total of 1,024 valid responses, including 609 fully completed questionnaires. The results were statistically analyzed, and the appropriately anonymized research data were published in open access. The findings were presented to both the research and teaching communities, and a journal article is in preparation.
ROT VRHOVEC, Alenka, ARHAR HOLDT, Špela, PIŽORN, Karmen, GODEC SORŠAK, Lara. Popravljanje pisnih besedil učencev/dijakov: raziskovalni podatki. Ljubljana: Pedagoška fakulteta: Filozofska fakulteta, 2024. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=153481. [COBISS.SI-ID 206148867]
ROT VRHOVEC, Alenka. Kako učitelji popravljajo besedila učencev/dijakov?: predavanje na [konferenci] Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi, Fakulteta za upravo, Univerza v Ljubljani, 5. 4. 2023. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/500/832/9274. [COBISS.SI-ID 149745667]
Recording teacher corrections of student writing and semi-structured interviews
This study focused on the use of existing digital tools for providing feedback to students. We recruited 18 Slovene language teachers from different types of schools (6 from primary schools, 6 from vocational and technical schools, and 6 from gymnasiums). As part of the study, they corrected two pre-selected authentic student texts while we recorded their screen, their face (as a minimal view of the work environment), and their think-aloud commentary. This independent work with texts was followed by interviews, in which we explored the characteristics of their work, the capabilities and limitations of correction tools, and the participants’ wishes regarding additional functionalities. The results were transcribed and thematically annotated, and the appropriately anonymized data will be published in open access. A journal article is also in preparation.
Designing a strategy for digitally-supported development of writing skills
An increasingly important part of contemporary language use takes place in digital environments. However, the integration of digital media into teaching must be age-appropriate, inclusive of all student groups, and effective in order to avoid unnecessary screen time. During the course of the project, generative artificial intelligence tools that co-create texts also became easily accessible, raising numerous new questions. The strategy we developed—aligned with the foundational principles for the renewal of Slovene language curricula in primary and secondary schools—emphasizes the need for the school system to be prepared for the impact of generative AI, and highlights the importance of open educational resources and thoughtful, problem-oriented digital solutions.
Teachers must be empowered not only to use but also to co-create digital linguistic resources, tools, and technologies. Changes in communication practices—such as increased informality and anonymity—require new understandings and approaches to language teaching, including the development of stylistic competence and critical evaluation skills. We also stress the need for a comprehensive reform of teacher education in Slovene, so that teachers can more effectively and inclusively implement new approaches to teaching and assessment in the future.
The strategy was published in a thematic issue of the scientific journal Jezik in slovstvo, dedicated to curriculum reform, and selected topics were also presented at a teacher conference on written language correction and feedback.
ARHAR HOLDT, Špela, FERBEŽAR, Ina, KALIN GOLOB, Monika, KREK, Simon, PAVLE, Andreja, ROZMAN, Tadeja, STABEJ, Marko. Nova slovenščina. Jezik in slovstvo. [Tiskana izd.]. 2024, letn. 69, št. 3, str. 117-138. DOI: 10.4312/jis.69.3.117-138. [COBISS.SI-ID 210323971]
STABEJ, Marko. Kdo ali kaj naj koga ali kaj, kako in zakaj?: (prosti spis o popravljanju in povratni informaciji). V: PORI, Eva (ur.), ARHAR HOLDT, Špela (ur.). Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi : zbornik konference. Ljubljana: Založba Univerze, 2023. Str. 22-24. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/500/832/9312. [COBISS.SI-ID 193563139]
Development and evaluation of automatic identification and categorisation of language problems
Designing a model for automatic error annotation
We explored the potential of new methodologies for the automatic correction of Slovene texts. The machine learning models are based on large pre-trained language models, which we adapted to the task of correcting student writing using selected authentic and synthetically prepared datasets. We first examined the applicability of the language models multilingual BERT, CroSloEngual BERT, and SloBERTa. For the task of machine question answering, we tested several pre-trained encoder-decoder models of the T5 type. T5-type models were also tested for the automatic generation of the correct form of misspelt words, and we investigated the effectiveness of procedures for machine-generated explanations. Using an optimized neural methodology, we addressed spelling, orthographic, morphological, and syntactic errors, achieving particularly strong results for the first two categories. We also developed a neural spellchecker, which currently provides state-of-the-art results for Slovene. Its key advantage is its ability to detect many cases where a spelling error results in a valid but unintended word form.
ULČAR, Matej, ROBNIK ŠIKONJA, Marko. Sequence-to-sequence pretraining for a less-resourced Slovenian language. Frontiers in artificial intelligence. Mar. 2023, vol. 6, pp. 1-13, DOI: 10.3389/frai.2023.932519. [COBISS.SI-ID 147683587]
KMECL, Tim, ROBNIK ŠIKONJA, Marko. Logično sklepanje v naravnem jeziku za slovenščino. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2024, vol. 12, no. 1, pp. 1-53, DOI: 10.4312/slo2.0.2024.1.1-53. [COBISS.SI-ID 206551299]
LOGAR, Katja, ROBNIK ŠIKONJA, Marko. Unified question answering in Slovene. In: LUŠTREK, Mitja (ed.), GAMS, Matjaž (ed.), PILTAVER, Rok (ed.). Slovenska konferenca o umetni inteligenci = Slovenian Conference on Artificial Intelligence: Informacijska družba – IS 2022 = Information Society – IS 2022: zbornik 25. mednarodne multikonference = proceedings of the 25th international multiconference: zvezek A = volume A: 11. oktober 2022, 11 October 2022, Ljubljana, Slovenija. Ljubljana: Institut “Jožef Stefan”, 2022. pp. 23-26, https://doi.org/10.48550/arXiv.2211.09159. [COBISS.SI-ID 129718275]
KLEMEN, Matej, BOŽIČ, Martin, ARHAR HOLDT, Špela, ROBNIK ŠIKONJA, Marko. Neural spell-checker: beyond words with synthetic data generation. In: NÖTH, Elmar (ed.), HORÁK, Aleš (ed.), SOJKA, Petr (ed.). Text, speech, and dialogue. Part 1: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024: proceedings. Cham: Springer, cop. 2024. pp. 85-96. Lecture notes in computer science, SL7, Lecture notes in artificial intelligence, DOI: 10.1007/978-3-031-70563-2_7, available at https://doi.org/10.48550/arXiv.2410.23514. [COBISS.SI-ID 213519107]
PETRIČ, Timotej, ARHAR HOLDT, Špela, ROBNIK ŠIKONJA, Marko. Pomembnost realistične evalvacije: primer popravkov sklona in števila v slovenščini z velikim jezikovnim modelom. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2024, vol. 12, no. 1, pp. 106-130, DOI: 10.4312/slo2.0.2024.1.106-130. [COBISS.SI-ID 227633411]
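The idea of synthetically prepared training data mentioned above can be illustrated with a minimal sketch. It is not the project's actual procedure: the function names (`corrupt_word`, `make_example`), the four edit operations, the alphabet, and the error rate are all assumptions chosen for illustration. The sketch injects character-level errors into clean text and records which tokens were changed, which yields labelled pairs that an error-detection or error-correction model could in principle be trained on.

```python
import random

# Hypothetical alphabet used for inserted/replaced characters (assumption).
ALPHABET = "abcčdefghijklmnoprsštuvzž"


def corrupt_word(word, rng):
    """Apply one random character-level edit: swap, delete, insert or replace."""
    if len(word) < 2:
        return word  # too short to corrupt safely
    op = rng.choice(["swap", "delete", "insert", "replace"])
    i = rng.randrange(len(word) - 1)
    if op == "swap":  # transpose two neighbouring characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":  # drop one character
        return word[:i] + word[i + 1:]
    if op == "insert":  # add one random character
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]  # replace


def make_example(sentence, error_rate=0.2, seed=0):
    """Return (noisy_tokens, labels, clean_tokens) for one clean sentence.

    A label is 1 when the noisy token differs from the clean one, else 0.
    """
    rng = random.Random(seed)
    clean = sentence.split()
    noisy, labels = [], []
    for token in clean:
        new = corrupt_word(token, rng) if rng.random() < error_rate else token
        noisy.append(new)
        # An edit can leave the word unchanged (e.g. swapping equal letters),
        # so the label is derived from the actual result, not from the intent.
        labels.append(int(new != token))
    return noisy, labels, clean


if __name__ == "__main__":
    noisy, labels, clean = make_example("Danes je lep dan in gremo v šolo", 0.5)
    for n, l, c in zip(noisy, labels, clean):
        print(f"{c}\t{n}\t{l}")
```

Random character edits of this kind mostly produce non-words. Errors that result in a valid but unintended word form would need a different generation strategy, for example substituting another existing form of the same word.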
We published the modules for different linguistic levels under an open license on the HuggingFace platform, enabling their further use and development.
Module that identifies orthographic and spelling errors.
Module that corrects orthographic and spelling errors.
Testing the applicability of methodology to other languages
The original plan of the project was to evaluate the performance of the above-mentioned models on selected error-annotated resources in the field of English as a foreign language. However, with the emergence of large language models such as ChatGPT, which are trained on vast amounts of English text, the focus shifted toward developing cross-lingual transfer methods between related languages or those spoken by smaller language communities, including Slovene. It became evident—not only for Slovene—that there is a critical lack of high-quality, consistently built evaluation datasets that would allow for reliable assessment of new approaches. We thus joined the MultiGEC-2025 shared task, which provided evaluation datasets for grammatical error correction in 12 languages in a unified format. Developers participating in the shared task built grammatical error correction systems for the included languages and reported on the effectiveness of different approaches. A research article describing the outcomes of this collaboration is currently in preparation.
Linguistic and teacher evaluation of automatic annotation of texts
We prepared a reference dataset for the quantitative and qualitative evaluation of automatic grammatical error correction. Šolar-Eval contains 109 texts produced in Slovene primary and secondary schools. The texts were linguistically analyzed, and 9,808 language issues across various linguistic levels were manually annotated. The dataset was published under an open license in the CLARIN.SI repository and described in a scientific article. Šolar-Eval was used for both computational and linguistic evaluation of the above-mentioned correction models, and the results were incorporated into the HuggingFace platform and the aforementioned publications.
ARHAR HOLDT, Špela, GANTAR, Polona, BON, Mija, GAPSA, Magdalena, LAVRIČ, Polona, KLEMEN, Matej. Dataset for evaluation of Slovene spell- and grammar-checking tools Šolar-Eval 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1902. [COBISS.SI-ID 185626115]
GANTAR, Polona, BON, Mija, GAPSA, Magdalena, ARHAR HOLDT, Špela. Šolar-Eval: evalvacijska množica za strojno popravljanje jezikovnih napak v slovenskih besedilih. Jezik in slovstvo. [Print ed.]. 2023, vol. 68, no. 4, pp. 89-108, DOI: 10.4312/jis.68.4.89-108. [COBISS.SI-ID 187559683]
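As an illustration of how a manually annotated reference set can support quantitative evaluation, the sketch below compares the edits proposed by a correction system with the reference edits and computes precision, recall, and F0.5. F0.5 weights precision more heavily than recall and is a common choice in grammatical error correction; the source does not state which metrics were used with Šolar-Eval. The edit representation (token start, token end, replacement) and the function name are assumptions made for this example and do not reflect the dataset's actual format.

```python
def precision_recall_fbeta(system_edits, gold_edits, beta=0.5):
    """Compare system edits with reference edits by exact match.

    Each edit is a hypothetical (start, end, replacement) tuple.
    By convention (an assumption here), precision is 1.0 when the system
    proposes nothing and recall is 1.0 when the reference has no edits.
    """
    system, gold = set(system_edits), set(gold_edits)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


if __name__ == "__main__":
    # Invented toy edits, not taken from any real dataset.
    gold = [(1, 2, "šola"), (4, 5, "vejica,"), (7, 8, "ker")]
    system = [(1, 2, "šola"), (4, 5, "vejica")]
    p, r, f = precision_recall_fbeta(system, gold)
    print(f"P={p:.3f} R={r:.3f} F0.5={f:.3f}")
```

Exact matching is strict: a system edit at the right position with a slightly different replacement counts as both a false positive and a miss, which is one reason quantitative scores are usefully complemented by qualitative linguistic evaluation.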
For the teacher evaluation, we recruited 18 Slovene language teachers from various types of schools. In the study, they reviewed two machine-corrected authentic student texts using a simple interface while we recorded their screen, their face (a minimal view of the work environment), and their think-aloud commentary. After this independent evaluation of the machine corrections, we conducted interviews focusing on the capabilities and limitations of the correction models and the participants’ expectations regarding additional functionalities. The recordings were transcribed and thematically annotated; the appropriately anonymized data will be published in open access.
We evaluated the comma correction tool CJVT Vejice with students and presented the results at a conference.
GODEC SORŠAK, Lara. Raba vejice v pisnih besedilih študentov in uporabnost spletnega orodja Vejice 1.0. In: ŠTUMBERGER, Saška (ed.). Predpis in norma v jeziku. Ljubljana: Založba Univerze, 2024. pp. 103-111. Obdobja series, 43. DOI: 10.4312/Obdobja.43.103-111. [COBISS.SI-ID 215559939]
Providing feedback in a digital environment
Developing a model combining formative assessment and crowdsourcing
In this project activity, we reviewed the literature and existing applied solutions in the field of formative assessment for the development of writing competence, with a particular focus on opportunities for digitally supported learning based on this model. A review article on the topic is currently in preparation. We explored new solutions in the area of crowdsourcing and collaborative engagement between teachers and students. As a research case, we examined the integration of gamified crowdsourcing in the creation of authentic language examples, specifically reviewed and approved for pedagogical use—for instance, in the preparation of learning materials, exercises, and assessments. The aim is to save time through collaborative material development and to build a large, openly accessible, and properly reviewed collection of language examples. The game CrowLL (Crowdsourcing for Language Learning), which includes Slovene as well as other languages, was presented at conferences and in a scientific article.
ZINGANO KUHN, Tanara, ARHAR HOLDT, Špela, KOSEM, Iztok, TIBERIUS, Carole, KOPPEL, Kristina, ZVIEL-GIRSHIN, Rina. Data preparation in crowdsourcing for pedagogical purposes : the case of the CrowLL game. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2022, vol. 10, no. 2, pp. 62-100, DOI: 10.4312/slo2.0.2022.2.62-100. [COBISS.SI-ID 146362883]
ZINGANO KUHN, Tanara, TIBERIUS, Carole, ARHAR HOLDT, Špela, KOPPEL, Kristina, KOSEM, Iztok, ZVIEL-GIRSHIN, Rina, LUÍS, Ana R. Developing manually annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene (the CrowLL Project). In: LINDÉN, Krister (ed.), NIEMI, Jyrki (ed.), KONTINO, Thalassia (ed.). CLARIN annual conference proceedings 2023: 16 – 18 October 2023 Leuven, Belgium. [S. l.: s. n.], 2023. pp. 173-177. CLARIN Annual Conference Proceedings. https://office.clarin.eu/v/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf. [COBISS.SI-ID 200002819]
ZINGANO KUHN, Tanara, VORŠIČ, Ines, KOPPEL, Kristina, ARHAR HOLDT, Špela, TIBERIUS, Carole, ZVIEL-GIRSHIN, Rina, KOSEM, Iztok. Annotating corpora for language learning and lexicography with the Crowdsourcing for Language Learning (CrowLL) game. In: MEDVEĎ, Marek (ed.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023): invisible lexicography: book of abstracts : Brno, 27–29 June 2023. Brno: Lexical Computing CZ, 2023. pp. 13-14. https://elex.link/elex2023/wp-content/uploads/elex2023_book_of_abstracts.pdf. [COBISS.SI-ID 184965379]
We also explored the attitudes of the teaching community toward crowdsourcing, using the Thesaurus of Modern Slovene, the first Slovene thesaurus to include data directly contributed by dictionary users. The thesaurus holds significant value for vocabulary acquisition, but until now no research had examined how this user participation is perceived by the users themselves (in comparison to the lexicographers who designed the thesaurus). The results show that dictionary users can provide useful and relevant synonym candidates, a view shared by Slovene language teachers. At the same time, it is crucial that users are also involved in the development of language resources as evaluators who can offer feedback and suggestions.
GAPSA, Magdalena, ARHAR HOLDT, Špela. How lexicographers evaluate user contributions in the Thesaurus of Modern Slovene in comparison to dictionary users. In: MEDVEĎ, Marek (ed.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. pp. 178-200. Electronic lexicography in the 21st century. Proceedings of eLex … conference. https://elex.link/elex2023/wp-content/uploads/47.pdf. [COBISS.SI-ID 162928387]
GAPSA, Magdalena. Učiteljske ocene uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine. Jezik in slovstvo. 2024, vol. 69, no. 4, pp. 35-50, DOI: 10.4312/jis.69.4.35-50.
Corpus-based and scaffolded feedback scenarios
Corpora and responsive language resources provide data on real, contemporary language use in broader contexts and diverse communicative situations, making them a fundamental reference for literacy development. In the project, we analyzed specific challenges related to corpus-based responsive dictionaries for school use and implemented significant improvements. The Šolar 3.0 corpus, which serves as a basis for corpus-driven research into language issues and corrections in student writing, was integrated into a powerful, specialized corpus interface. This significantly improved access to corpus material for a wider audience—for example, teachers preparing teaching materials, students training for the teaching profession, and others.
KOSEM, Iztok, ARHAR HOLDT, Špela, GANTAR, Polona, KREK, Simon. Collocations Dictionary of Modern Slovene 2.0. In: MEDVEĎ, Marek (ed.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023) : proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. pp. 491-507, illus. Electronic lexicography in the 21st century. Proceedings of eLex … conference. ISSN 2533-5626. https://elex.link/elex2023/wp-content/uploads/100.pdf. [COBISS.SI-ID 158852867]
ARHAR HOLDT, Špela, GANTAR, Polona, KOSEM, Iztok, PORI, Eva, ROBNIK ŠIKONJA, Marko, KREK, Simon. Thesaurus of Modern Slovene 2.0. In: MEDVEĎ, Marek (ed.), et al. eLex 2023 : electronic lexicography in the 21st century (eLex 2023) : proceedings of the eLex 2023 conference : [Brno], 27–29 June 2023. Brno: Lexical Computing CZ, 2023. pp. 366-381, illus. Electronic lexicography in the 21st century. Proceedings of eLex … conference. ISSN 2533-5626. https://elex.link/elex2023/wp-content/uploads/82.pdf. [COBISS.SI-ID 158818819]
ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca. Developing a specialised concordancer for corpora with language corrections. In: TaLC 2024 : 16th Teaching and Language Corpora Conference : July 7th to 10th 2024, Manchester Metropolitan University, Manchester, UK : book of abstracts. [Manchester: Manchester Metropolitan University], 2024. p. [77]. https://talc2024.co.uk/wp-content/uploads/2024/07/book-of-abstracts-talc-2024_final-4.pdf. [COBISS.SI-ID 204245507]
Corpora and corpus-based resources for Slovene—such as Sloleks 2.0, Sopomenke 2.0, Kolokacije 2.0, the reference corpus Gigafida 2.0, and Šolar 3.0—can be used in various ways: by directing students to specific entries in a language resource; by integrating and displaying selected datasets within a digital tool; or by incorporating explanations, examples, and exercises based on linguistic analyses. With the emergence of generative artificial intelligence, it is also important to consider the possibility of machine-generated feedback—designed to support the teaching community in ways it considers beneficial. We presented the topic of using AI for selecting pedagogically appropriate language examples at a conference. Questions related to machine support were also included in the interviews conducted with teachers. A research article reporting the findings is currently in preparation.
KOSEM, Iztok, ZINGANO KUHN, Tanara, ARHAR HOLDT, Špela, KOPPEL, Kristina, TIBERIUS, Carole, ZVIEL-GIRSHIN, Rina. Examining the potential of AI in the annotation of corpus examples for language learning. In: CILC2024: XV Congreso Internacional de Lingüística de Corpus, Las Palmas de Gran Canaria, España = 15th International Corpus Linguistics Conference, Las Palmas de Gran Canaria, Spain, 22–24 May 2024: [book of abstracts]. [S. l.]: Aelinco, 2024. pp. 93–95. https://drive.google.com/file/d/1rHS4OwztEPvYOPwHE5Mxn-lnK2ErbiHV/view. [COBISS.SI-ID 199984643]
Testing feedback with target user groups
In the project proposal, we planned to test feedback provision through focus groups involving teachers and students. During the project, however, we revised this plan because (a) the originally proposed methodology for preparing test materials became outdated and ineffective following the broader emergence of generative artificial intelligence, and (b) members of the teaching community involved in the project expressed interest in a professional conference where they could share their experiences, practices, and views on feedback with a broader audience. As a result, we organized the conference Correcting Language and Texts – Teacher Feedback in School Practice. In connection with the event, we prepared a conference proceedings volume containing 25 peer-reviewed teacher contributions addressing correction, feedback, formative assessment, and other topics related to the project.
PORI, Eva (editor), ARHAR HOLDT, Špela (editor). Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi: zbornik konference. 1st ed. Ljubljana: Založba Univerze, 2023. Online resource, 374 pages, DOI: 10.4312/9789612972394. [COBISS.SI-ID 178525187]
Project events and lectures
Teacher conference
On April 5, 2023, we organized the professional conference Correcting Language and Texts – Teacher Feedback in School Practice (Popravljanje jezika in besedil – učiteljska povratna informacija v šolski praksi) at the Faculty of Public Administration, University of Ljubljana.
- You can view the programme and event photos at this link.
- The conference proceedings volume with peer-reviewed contributions is available here.
Teacher training
On November 24, 2023, a teacher training session was held at the Faculty of Arts, University of Ljubljana, organized by the Department of Slovene Studies. Among the topics presented was Preparing Teaching Materials Using the Šolar 3.0 Corpus of Student Writing (Priprava učnih gradiv s korpusom šolskih pisnih izdelkov Šolar 3.0). The following materials are available:
- slides (PDF)
- guidelines for annotating language corrections in the Šolar corpus (PDF)
- frequency list of language problems from the Šolar corpus (XLSX)
- CJVT concordancer demo (link)
- noSketch Engine concordancer on CLARIN.SI (link)
Invited lectures
- ARHAR HOLDT, Špela. Leveraging error-annotated corpora and the Svala Tool: the case of Slovene: guest talk at Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Sweden, 20 June 2023. [COBISS.SI-ID 171445507]
- ARHAR HOLDT, Špela. A specialised concordancer for corpora with annotated language corrections: invited presentation at Department of Swedish, Multilingualism, and Language Technology, University of Gothenburg, Gothenburg, Sweden, 23 April 2024. [COBISS.SI-ID 214135299]
- ARHAR HOLDT, Špela. From developmental corpus to developed applications: the journey of Šolar 3.0: presentation at the Faculty of Arts and Humanities of the University of Coimbra, Coimbra, Portugal, 29 Nov. 2024. [COBISS.SI-ID 229553923]
- KOSEM, Iztok. Corpus tools and language resources at the University of Ljubljana: purposes and people behind the development: presentation at the Faculty of Arts and Humanities of the University of Coimbra, Coimbra, Portugal, 29 Nov. 2024. [COBISS.SI-ID 229749251]