Šolar corpus
The Šolar developmental corpus contains texts produced independently by pupils in various Slovenian primary and secondary schools. A large part of the texts also contains teachers' corrections (linguistic and contextual).
The Šolar corpus is modelled on language acquisition corpora, but differs from them in that (a) the texts were not written for a research project but represent actual school production by students, and (b) the linguistic corrections highlighted in the corpus are real and were made by teachers, not researchers. These features make Šolar a valuable and unique resource not only in Slovenia but also internationally.
LINKS AND CONTACTS
dr. Špela Arhar Holdt
Faculty of Computer and Information Science UL
1000 Ljubljana
- E-mail: spela.arharholdt@ff-uni-lj.si
The current version, Šolar 3.0, contains 5,485 texts written by Slovenian secondary school students (15-19 years old) and primary school students in grades 7-9, with a small percentage from grade 6. For each text, information is given on the school (primary or secondary), the subject, the level (grade or year), the type of text, the region and the year of production. The majority of the corpus is made up of essays, but there are also other texts produced in the classroom, such as summaries or descriptions of texts, examples of formal applications, etc. More information about the Šolar 3.0 corpus can be found in this scientific paper.
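The metadata described above can be pictured as one small record per text. The following is only a minimal sketch in Python, not the corpus format itself: the class name, field names, `select` helper and all sample values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical record for the per-text information listed above."""
    school: str      # "primary" or "secondary"
    subject: str     # school subject the text was written for
    level: int       # grade (primary) or year (secondary)
    text_type: str   # e.g. "essay", "summary"
    region: str      # region of the school
    year: int        # year of production


def select(texts, **criteria):
    """Return the records whose fields equal all given field=value criteria."""
    return [
        t for t in texts
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented placeholder records, not actual corpus data.
sample = [
    TextMetadata("primary", "subject-1", 8, "essay", "region-A", 2001),
    TextMetadata("primary", "subject-1", 9, "summary", "region-B", 2002),
    TextMetadata("secondary", "subject-2", 2, "essay", "region-A", 2002),
]

primary_texts = select(sample, school="primary")
essays_in_region_a = select(sample, text_type="essay", region="region-A")
```

Combining several criteria in one call, as in the last line, mirrors the kind of metadata filtering a corpus user might want, for example restricting a query to essays from one region.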
Part of the corpus (2,094 texts) contains teacher corrections, which are classified according to the classification system described in the annotation guidelines (the guidelines are in Slovene). The annotation of the corrections (more than 35,000 in the corpus) is more detailed than in other similar projects, which makes it useful for preparing teaching materials, developing tools for the automatic correction of Slovene texts, and similar purposes. As authentic examples of feedback aimed at developing writing skills, the corrections are also valuable for training future teachers and for linguistic and didactic research.
The different versions of the corpus are available in the CLARIN.SI repository under an open licence, which means that they can be used for different research and development purposes. Links can be found in the Databases section of our website. In addition, pre-prepared exports of corpus data, such as a frequency list of linguistic corrections, a list of collocations and a list of syntactic structures, are also available on CLARIN.SI.
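To illustrate what a frequency list of corrections amounts to, here is a minimal Python sketch that counts how often each correction category occurs. It does not reproduce the actual exports or their format: the tuple layout, the `correction_frequencies` function and the category labels are assumptions made up for this example.

```python
from collections import Counter


def correction_frequencies(corrections):
    """Count corrections per category, most frequent category first.

    `corrections` is assumed to be an iterable of
    (category, original, corrected) tuples; this layout is hypothetical.
    """
    counts = Counter(category for category, _original, _corrected in corrections)
    return counts.most_common()


# Invented placeholder data, not taken from the corpus.
sample_corrections = [
    ("CATEGORY_A", "original 1", "corrected 1"),
    ("CATEGORY_B", "original 2", "corrected 2"),
    ("CATEGORY_A", "original 3", "corrected 3"),
    ("CATEGORY_A", "original 4", "corrected 4"),
]

frequency_list = correction_frequencies(sample_corrections)
for category, count in frequency_list:
    print(f"{category}\t{count}")
```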
As the preparation of corpora with tagged corrections is extremely time-consuming, it is crucial to make sure that the data is easily accessible for different types of use. The Centre for Language Resources and Technologies (CJVT) has therefore developed a new format, as well as a completely new corpus concordancer for such corpora. Unlike previous similar tools, this one allows powerful searching and a clear presentation of the results, especially when it comes to rich corpus metadata and language corrections.
Šolar 3.0 is available in the CJVT concordancer below.