Compiling keyword lists and n-grams from a textbook corpus for different school years and subjects

In this project, we built a corpus of primary and secondary school textbooks and extracted lists of words, n-grams, and keywords from it. The source material was converted from PDF and HTML into text, which was then corrected and annotated for document structure. The corpus contains 5 million tokens from 127 textbooks covering 16 subjects. The lists were then extracted according to several criteria, and all of them were checked manually. The following lists were compiled:

A general word list of words that appear in at least 8 of the 16 subjects. It contains the lemma, word type, frequency (overall and per school subject), and subject count.

A general word list by school year (class), containing the lemma, word type, frequency (overall and per school year), and number of subjects (out of 16).

A list of 2- to 5-grams, containing the n-gram type, lemma, word type, morphological annotations, subject count, and frequency.
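
The selection criteria behind these lists can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the input format (tuples of subject, lemma, and word type), the function names `general_words` and `ngrams`, and the output field names are all assumptions made for illustration. The sketch only shows how a subject-count threshold such as "at least 8 of 16 subjects" and a 2- to 5-gram count could be computed from an already lemmatised corpus.

```python
from collections import Counter, defaultdict


def general_words(tokens, min_subjects=8):
    """Return lemmas that occur in at least `min_subjects` subjects.

    `tokens` is an iterable of (subject, lemma, word_type) tuples.
    This input format is hypothetical, chosen only for the example.
    """
    # Per-subject frequency for every (lemma, word_type) pair.
    freq_by_subject = defaultdict(Counter)
    for subject, lemma, word_type in tokens:
        freq_by_subject[(lemma, word_type)][subject] += 1

    rows = []
    for (lemma, word_type), per_subject in freq_by_subject.items():
        subject_count = len(per_subject)
        if subject_count >= min_subjects:
            rows.append({
                "lemma": lemma,
                "word_type": word_type,
                "frequency": sum(per_subject.values()),
                "frequency_by_subject": dict(per_subject),
                "subject_count": subject_count,
            })
    # Most frequent first; ties are broken alphabetically by lemma.
    rows.sort(key=lambda row: (-row["frequency"], row["lemma"]))
    return rows


def ngrams(lemmas, n_min=2, n_max=5):
    """Count all n-grams of length n_min..n_max in a list of lemmas."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(lemmas) - n + 1):
            counts[tuple(lemmas[start:start + n])] += 1
    return counts
```

With the real corpus, `min_subjects=8` would correspond to the threshold used for the general word list. A smaller value is convenient for trying the functions on toy data, for example `general_words(sample_tokens, min_subjects=2)`, where `sample_tokens` is a hypothetical list of tuples in the format described above.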

The lists are available under the CC BY licence at:

Kosem, Iztok; Pori, Eva and Arhar Holdt, Špela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1215.

LINKS AND CONTACT

Dr. Iztok Kosem
Centre for Language Resources and Technologies at the University of Ljubljana
Faculty of Computer and Information Science
Večna pot 113, SI-1000 Ljubljana