In this project, a clear and understandable user interface for the corpusStatistics tool (renamed to LIST) was developed. It enables its users to easily access language statistics in Slovene and other corpora. The tool was adapted to several formats and tested on large Slovene and other corpora.

Metadata was added to the output, which makes the results repeatable. The interface elements now have short explanations that appear when the user hovers over them.

An option to show different collocation measures (e.g. Dice, t-score, MI, MI3) for the extracted word sets was added. The tool now also reports the processing time and warns the user about options that affect it.
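The collocation measures named above can be illustrated with a minimal sketch. It uses the commonly cited textbook definitions of these measures and is not taken from the LIST source code, so the tool's exact variants may differ. The function name `collocation_scores` and its parameters are hypothetical: `f_xy` is the co-occurrence frequency of the word pair, `f_x` and `f_y` are the frequencies of the individual words, and `n` is the corpus size in tokens.

```python
import math


def collocation_scores(f_xy, f_x, f_y, n):
    """Return common association scores for one word pair.

    Illustrative only: standard textbook formulas, not LIST's own code.
    All frequencies must be positive.
    """
    # Co-occurrence frequency expected if the two words were independent.
    expected = f_x * f_y / n
    return {
        # Dice coefficient: overlap relative to the two word frequencies.
        "dice": 2 * f_xy / (f_x + f_y),
        # t-score: observed minus expected, scaled by sqrt(observed).
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        # Mutual information: log ratio of observed to expected.
        "mi": math.log2(f_xy / expected),
        # MI3: cubes the observed frequency to favour frequent pairs.
        "mi3": math.log2(f_xy ** 3 / expected),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to demonstrate the call.
    print(collocation_scores(f_xy=8, f_x=16, f_y=16, n=256))
```

MI tends to favour rare pairs, while t-score and MI3 give more weight to frequent ones, which is one reason for offering several measures side by side.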

The interface can be switched between Slovene and English, and the tool can process non-Latin alphabets.

The program was upgraded to support the TEI P5 format, which is used for new corpora in the CLARIN.SI repository, as well as the Vert format used in Sketch Engine.

The LIST tool is available under the open Apache 2.0 license at:

Krsnik, Luka; et al., 2019, Corpus extraction tool LIST 1.0, Slovenian language resource repository CLARIN.SI,


Jaka Čibej
Centre for Language Resources and Technologies at the University of Ljubljana
Faculty of Computer and Information Science UL
Večna pot 113, SI-1000 Ljubljana