The aim of the project Empirical foundations for digitally-supported development of writing skills (PROP) is to support teachers who correct and grade student writing. Developing writing competency requires practising writing skills; however, more writing also means more work for teachers. Research has shown that individualised, goal-oriented, formative feedback leads to the best literacy results. Such feedback, however, is time-consuming and requires support in the form of adequate descriptors, indicators and, not least, information about modern language use in various communication situations. In many languages, including Slovene, these conditions are not met, which is why there is (too) little school writing, while feedback often remains limited to surface corrections of errors, such as grammatical mistakes.

We believe the solution lies in digital support of teachers’ work. On the one hand, automatic identification and substantive categorisation of grammatical errors would free teachers from routine corrections and give them more time to pursue advanced teaching objectives. On the other, a digitally-supported model of providing feedback, based on empirically founded indicators and descriptors, would ease the preparation of corrective study materials and allow for peer assessment and long-term development monitoring. Advances in the field of natural language processing, machine learning, and corpus linguistics make this an attainable goal, which is attested by a variety of digital tools, prototypes, and portals currently being designed. The innovative aspect of the project, which necessitates interdisciplinary collaboration, lies in its proposal of solutions that are based on empirical data: authentic teacher practices coupled with data on real, modern language use.

Regarding the latter, Slovene may have a relative advantage over other languages, since it already has a corpus of student texts from Slovene primary and secondary schools, which also includes teacher corrections categorised by type of language problem. This corpus represents untapped potential for empirical analyses of authentic school written production and language corrections, and for the development of a tool that would automatically categorise language problems based on real-life principles of teacher correction.

During the project, we will use automatic extraction and analyses of richly annotated corpora to collect the empirical data needed for specifying developmental indicators and descriptors for various educational stages. We will then use this information to design feedback scenarios for different language levels: spelling, morphology, vocabulary, and syntax. We will expand the corpus of school texts by including examples of student writing from the tertiary level and empirically research the specifics of providing feedback in higher education settings. Next, we will develop a tool that automatically identifies language problems in a given text, taking into account the level of the writer. Given the rarity of language resources such as the Šolar corpus, we will also test the performance of the tool on other comparable training datasets and adapt it to be less dependent on such resources, thus making the methodology applicable to other languages.

User research is a key part of our project: we will examine existing practices of providing teacher feedback in the development of writing skills, first by conducting a web survey and then by recording teachers’ screens while they are correcting in a digital environment. Teachers and students will furthermore conduct user evaluations of the solutions developed in the project. Lastly, we intend to combine findings in formative assessment with crowdsourcing and apply this to language didactics in order to develop a strategy of digitally-supported development of writing skills, which will take into consideration all the necessary didactic and ethical issues related to the field.

Corpora and corpus data: richly annotate a corpus of student writing, a corpus of school textbooks, and a corpus of literature aimed at youngsters and young adults; use data extraction and corpus analyses to facilitate empirical foundations for developmental indicators and descriptors for various educational stages; use these results to develop feedback scenarios on the levels of spelling, morphology, vocabulary, and syntax; build a pilot corpus of student academic writing and include it in all of the above steps.

Software module: develop a software module which automatically identifies and categorises language errors on different language levels; adapt the software to teachers’ needs and the specifics of providing feedback at different educational stages; create the foundation for applying the methodology to other languages.

User research: empirically research existing teaching practices of providing feedback for the development of writing skills with the aid of a) an online survey and b) screen recording of teacher corrections in a digital environment; create the basis for comparable research in other languages; include teachers/lecturers in the evaluation of the software module and teachers/lecturers and pupils/students in the evaluation of the corpus-based feedback scenarios.

Models and strategies: combine findings from the fields of formative assessment and crowdsourcing for the needs of language education and create a model for providing digitally-supported feedback to help the development of writing skills; form a strategy of digitally-supported development of writing skills.

Dissemination of research results: ensure that the results are published in keeping with the National strategy for open access to scientific publications and research data in Slovenia; inform the scientific and general public about project results (scientific publications, events, website) and encourage further exploitation of the results.

1. Corpus analysis of written production at various educational stages
– Corpus data preparation for linguistic and machine tasks [M1-6]
– Compiling a pilot corpus of student academic texts [M1-12]
– Quantitative and qualitative linguistic analyses of student writing [M1-18]
– Empirical data for developmental indicators on levels of vocabulary and syntax [M7-18]
2. Practice-based digitally-supported development of writing skills
– Questionnaire survey about teaching practices used to develop writing skills [M1-18]
– Recording teacher corrections of student writing and semi-structured interviews [M10-24]
– Designing a strategy for digitally-supported development of writing skills [M19-35]
3. Development and evaluation of automatic identification and categorisation of language problems
– Designing a model for automatic error annotation [M7-35]
– Testing the applicability of methodology to other languages [M25-35]
– Linguistic and teacher evaluation of automatic annotation of texts [M13-35]
4. Providing feedback in digital environment
– Developing a model combining formative assessment and crowdsourcing [M7-18]
– Corpus-based and scaffolded feedback scenarios [M13-24]
– Testing feedback with target user groups [M22-35]
5. Coordination and dissemination
– Coordination, reporting and dissemination [M1-36]
– Scientific publications and research data [M1-36]

University of Ljubljana, Faculty of Arts
– Špela Arhar Holdt, 27674
– Iztok Kosem, 33796
– Polona Gantar, 16313
– Marko Stabej, 11651
– Teja Goli, 52176
– Magdalena Gapsa, 53628
– Mija Bon, 51891

University of Ljubljana, Faculty of Computer and Information Science
– Marko Robnik-Šikonja, 15295
– Simon Krek, 26166
– Matej Ulčar, 55173
– Aleš Žagar, 56007
– Matej Klemen, 55754

University of Ljubljana, Faculty of Public Administration
– Tadeja Rozman, 25578

University of Ljubljana, Faculty of Education
– Karmen Pižorn, 21612
– Alenka Rot Vrhovec, 34816
– Lara Godec Soršak, 25590
– Milena Košak Babuder, 26199
– Tomaž Petek, 32433

Corpus data preparation for linguistic and machine tasks

Three text corpora specialised for pedagogical use have been prepared for project research and the general public: the corpus of student texts Šolar 3.0, the corpus of textbooks ccUčbeniki 1.0, and the corpus of youth literature ccMaks 1.0.

Prior to the start of the project, the Šolar corpus was available in version 2.0, the Maks corpus was available through concordancers (but not as a database), and the Učbeniki corpus was not available to the general public. All three resources were linguistically tagged, but with a variety of tools.

In the project, all three corpora were uniformly tagged with state-of-the-art tools for Slovene, which makes comparisons between them more reliable. They were also annotated at higher linguistic levels, which enables advanced linguistic and automated analysis of the texts and makes the data more useful for machine learning.

The corpora were linguistically annotated with the CLASSLA v1.1.1 pipeline at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags, JOS dependency syntax, and named entities. The resulting databases are openly available in the CLARIN.SI repository:

– Špela Arhar Holdt, Tadeja Rozman, Mojca Stritar Kučuk, Simon Krek, Irena Krapš Vodopivec, Marko Stabej, Eva Pori, Teja Goli, Polona Lavrič, Cyprian Laskowski, Polonca Kocjančič, Bojan Klemenc, Luka Krsnik, Iztok Kosem (2022). Developmental corpus Šolar 3.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1589.

– Iztok Kosem, Eva Pori, Aleš Žagar, Špela Arhar Holdt (2022). Corpus of Slovenian textbooks ccUčbeniki 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1693.

– Darinka Verdonik, Sandi Majninger, Kaja Dobrovoljc, Špela Antloga, Aleksandra Zögling Markuš, Ines Voršič, Melita Zemljak Jontes, Mihaela Koletnik, Alenka Valh Lopert, Polonca Šek Martük, Iztok Kosem, Simona Majhenič, Marko Ferme, Aleš Žagar, Špela Arhar Holdt (2022). Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1692.
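For readers who want to explore annotations of this kind, the following minimal Python sketch shows how token-level annotation could be read and summarised. It assumes the data are available in the tab-separated CoNLL-U layout (ten columns per token); this is an assumption, and the repository entries should be checked for the formats actually offered. The sample sentences, the helper names `read_conllu` and `lemma_counts`, and all annotation values are hypothetical and not taken from the corpora.

```python
from collections import Counter

# Hypothetical two-sentence sample in CoNLL-U layout (10 tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
# Values are illustrative only; unused columns are left as "_".
SAMPLE = (
    "# sent_id = demo.1\n"
    "1\tUčenci\tučenec\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tpišejo\tpisati\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\tspise\tspis\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = demo.2\n"
    "1\tUčitelj\tučitelj\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tpopravi\tpopraviti\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\tspis\tspis\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)


def read_conllu(text):
    """Yield sentences as lists of token dicts (form, lemma, upos)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment/metadata line
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    if sentence:
        yield sentence


def lemma_counts(sentences, skip_upos=("PUNCT",)):
    """Count lemmas over all sentences, ignoring the given UPOS tags."""
    counts = Counter()
    for sent in sentences:
        for tok in sent:
            if tok["upos"] not in skip_upos:
                counts[tok["lemma"]] += 1
    return counts


sentences = list(read_conllu(SAMPLE))
counts = lemma_counts(sentences)
print(len(sentences), counts["spis"])  # 2 2
```

Counting by lemma rather than by word form groups inflected variants such as "spise" and "spis", which matters for a morphologically rich language like Slovene.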

New developments in the methodology of creating learner and developmental corpora, such as Šolar, were presented at the Obdobja 2022 conference:

– Špela Arhar Holdt, Iztok Kosem, Mojca Stritar Kučuk (2022). Metode in orodja za lažjo pripravo korpusov usvajanja jezika. Pirih Svetina, Nataša (ed.), Ferbežar, Ina (ed.). Na stičišču svetov: slovenščina kot drugi in tuji jezik. Ljubljana: Založba Univerze, 23-30.

Compiling a pilot corpus of student academic texts

We are working on a pilot corpus containing texts from students at the tertiary level, together with professors’ feedback. The corpus is used to develop methodology for building such corpora, as we have presented at the JT-DH 2022 conference:

– Tadeja Rozman and Špela Arhar Holdt (2022). Gradnja Korpusa študentskih besedil KOŠ. Fišer, Darja (ed.), Erjavec, Tomaž (ed.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 15.-16. september 2022, Ljubljana, Slovenija. Ljubljana: Inštitut za novejšo zgodovino, 267-270. Video of the presentation.

By the end of 2022, we had collected 293 texts (297,422 words) from the Faculty of Public Administration and the Faculty of Education at the University of Ljubljana. To facilitate the structuring of the corpus, we have prepared a survey to explore the practices of text correction at the tertiary level. We have also addressed the legal issues pertaining to the collection of the material so that it can be made openly available to the public.

– Contract for students (in Slovene): digital signature / manual signature.

– Contract for professors (in Slovene): digital signature / manual signature.

Quantitative and qualitative linguistic analyses of student writing

Using advanced data extraction from the Šolar 3.0 corpus, we have produced a frequency list of language problems containing 36,570 sentences of school writing with teacher error correction. The corrections are manually categorised according to their content into 180 different types. Each sentence is accompanied by metadata such as the type of source text, the educational level of the author, and the type and region of the school where the text was produced.

The purpose of the database is to facilitate access to corpus information for didactic purposes, for statistical analysis of language problems in Slovenian schools, and for machine learning.

– Špela Arhar Holdt, Tadeja Rozman, Mojca Stritar Kučuk, Simon Krek, Irena Krapš Vodopivec, Marko Stabej, Eva Pori, Teja Goli, Polona Lavrič, Cyprian Laskowski, Polonca Kocjančič, Bojan Klemenc, Luka Krsnik, Aleš Žagar, Iztok Kosem (2022). Frequency list of language problems from Šolar 3.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1716.
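As an illustration of the kind of statistical analysis such a database supports, the sketch below counts correction categories and compares their relative share across educational levels. It is a minimal Python example under stated assumptions: the record layout, the field names `category` and `level`, and the category labels are all hypothetical and do not reflect the actual structure or category scheme of the published resource.

```python
from collections import Counter, defaultdict

# Hypothetical records: one per corrected sentence, with an invented
# category label and the educational level of the author.
RECORDS = [
    {"category": "comma", "level": "primary"},
    {"category": "comma", "level": "primary"},
    {"category": "capitalisation", "level": "primary"},
    {"category": "word order", "level": "primary"},
    {"category": "comma", "level": "secondary"},
    {"category": "vocabulary choice", "level": "secondary"},
]


def category_frequencies(records):
    """Absolute frequency of each correction category."""
    return Counter(r["category"] for r in records)


def relative_by_level(records):
    """Share of each category among all corrections at a given level."""
    per_level = defaultdict(Counter)
    for r in records:
        per_level[r["level"]][r["category"]] += 1
    shares = {}
    for level, counts in per_level.items():
        total = sum(counts.values())
        shares[level] = {cat: n / total for cat, n in counts.items()}
    return shares


freqs = category_frequencies(RECORDS)
shares = relative_by_level(RECORDS)
print(freqs["comma"])              # 3
print(shares["primary"]["comma"])  # 0.5
```

Relative shares are used for the comparison because the number of corrected sentences is likely to differ between levels, so raw counts alone would not be comparable.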

Empirical data for developmental indicators on levels of vocabulary and syntax

We have established a methodology for creating core vocabulary lists from pedagogical corpora. In inter-project cooperation, we tested it in the field of Slovene as a second/foreign language. The result is a corpus-based vocabulary list for levels A1, A2, and B1 according to the Common European Framework of Reference for Languages (CEFR), which is presented in the paper below.

– Matej Klemen, Špela Arhar Holdt, Senja Pollak, Iztok Kosem, Damjan Huber, Mateja Lutar (2022). Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Pirih Svetina, Nataša (ed.), Ferbežar, Ina (ed.). Na stičišču svetov: slovenščina kot drugi in tuji jezik. Ljubljana: Založba Univerze, 165-174.
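The methodology itself is described in the paper above. Purely as a generic illustration of how a core vocabulary list can be derived from a lemmatised corpus, the Python sketch below selects lemmas that are both frequent and spread across several source texts. The two criteria, the thresholds `min_freq` and `min_range`, the function name `core_vocabulary`, and the sample data are assumptions made for this example and should not be read as the project's actual procedure.

```python
from collections import Counter


def core_vocabulary(docs, min_freq=2, min_range=2):
    """Return lemmas that occur at least `min_freq` times overall and in at
    least `min_range` different documents, most frequent first.

    `docs` maps a document name to its list of lemmas.
    """
    freq = Counter()       # total occurrences of each lemma
    doc_range = Counter()  # number of documents containing each lemma
    for lemmas in docs.values():
        freq.update(lemmas)
        doc_range.update(set(lemmas))
    core = [
        lemma for lemma in freq
        if freq[lemma] >= min_freq and doc_range[lemma] >= min_range
    ]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(core, key=lambda lemma: (-freq[lemma], lemma))


# Hypothetical lemmatised textbooks.
DOCS = {
    "textbook_a": ["hiša", "biti", "velik", "biti"],
    "textbook_b": ["biti", "šola", "hiša"],
    "textbook_c": ["šola", "šola", "šola", "biti"],
}

core = core_vocabulary(DOCS)
print(core)  # ['biti', 'šola', 'hiša']
```

Requiring a minimum document range in addition to raw frequency keeps out words that are frequent only because a single textbook repeats them.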

Questionnaire survey about teaching practices used to develop writing skills

We have developed a questionnaire for primary and secondary school teachers who guide the production of written texts. It contains 50 questions and is divided into a demographic section and content sections covering the revision of texts; the resources, language aids, and tools used by teachers; the feedback given to learners; the monitoring of learners’ progress in writing skills development; and the revision of texts written by learners with disabilities. The Slovene survey is ongoing, while an English version of the questionnaire has been prepared for international replicability.

– Validated survey questionnaire in Slovene and English.

Teachers are invited to participate in the conference Correcting Language and Texts – Teacher Feedback in School Practice, which will take place on the 5th of April 2023. For more information and the registration form, please visit https://www.cjvt.si/blog/konferenca-prop/.