Korpusnik – corpus summarizer

The Korpusnik tool gives a quick, basic overview of how a word is used.

The Korpusnik tool summarises statistical and textual data from five corpora of the Slovenian language:

  • Standard Slovene: the Gigafida 2.0 reference corpus of written standard Slovene (1.3 billion words, texts from 1990-2018).
  • Current Slovene: the Trendi monitor corpus, updated monthly, which draws texts from online media portals. It contains texts from 2019 to the present.
  • Academic Slovene: the OSS 1.0 academic Slovene corpus (3.2 billion words) contains more than 150,000 scientific texts from the Open Science Slovenia portal.
  • Online Slovene: the JANES 1.0 corpus of online Slovene (more than 252 million words) contains texts from Slovene social media (blogs, comments on news, tweets).
  • Spoken Slovene: the Gos 2.0 spoken Slovene reference corpus (2.5 million words) contains approximately 300 hours of speech.

LINKS AND CONTACT

Iztok Kosem

Faculty of Computer and Information Science

Večna pot 113

1000 Ljubljana

  • E-mail: iztok.kosem@fri.uni-lj.si

In the Highlights tab, Korpusnik presents the main features of the search word across all five corpora. Two special features of the interface are the automatically generated Main Points, which summarise the relevant information on the page as text, and the descriptions that accompany all diagrams.

In Korpusnik, you can track the usage of words from 1991 to the present and discover new words and meanings. You can also find out when a word first appeared in Slovene, how much its usage has grown since then, which words it most often appears with (its collocations), and in which Slovene corpus it is most common.

It is an innovative and internationally unique tool, designed to bring the rich data on words found in corpora to the general public. Particular attention was paid to accessibility, especially for people with disabilities. To this end, representatives of the Slovenian Association of Disabled Students were involved in the design and testing of the interface.

The Korpusnik tool was developed within the SLOKIT project (Upgrading CLARIN.SI: Corpus Informer and Text Analyser), which ran from 2022 to 2023 and was funded by the Ministry of Culture of the Republic of Slovenia. The project was led by Dr Iztok Kosem and co-financed by the Jožef Stefan Institute (lead partner) and the Slovenian Association of Disabled Students. Infrastructure support outside the project funding (hosting and maintenance of tools) is provided by the Centre for Language Resources and Technologies of the University of Ljubljana, which is also a member of the CLARIN.SI consortium.