Language technologies include software, online services and databases.


Statistical tagger for Slovene

A tagger is a computer program that segments a text into units and assigns specific information to individual words, such as part of speech and grammatical properties (gender, case, number, etc.). It also assigns a word its basic form when the word has several inflected forms. The tagger can be tested here.
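The steps described above (segmentation, assigning a part of speech and grammatical properties, and finding the basic form) can be illustrated with a minimal sketch. This is not the statistical tagger itself: the two-word lexicon, the property names and the functions tokenize and tag are hypothetical and exist only for illustration. A real statistical tagger chooses between competing analyses from context, which this lookup does not attempt.

```python
import re

# Hypothetical toy lexicon: surface form -> (basic form, part of speech, properties).
# A real tagger learns such information from an annotated corpus.
TOY_LEXICON = {
    "mačke": ("mačka", "noun",
              {"gender": "feminine", "number": "plural", "case": "nominative"}),
    "spijo": ("spati", "verb", {"person": "third", "number": "plural"}),
}


def tokenize(text):
    """Segment text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(text):
    """Attach a basic form, part of speech and properties to every token."""
    tagged = []
    for token in tokenize(text):
        key = token.lower()
        if key in TOY_LEXICON:
            lemma, pos, props = TOY_LEXICON[key]
        elif re.fullmatch(r"[^\w\s]", token):
            lemma, pos, props = token, "punctuation", {}
        else:
            # Unknown word: keep the form as its own basic form.
            lemma, pos, props = token, "unknown", {}
        tagged.append({"form": token, "lemma": lemma, "pos": pos, "props": props})
    return tagged


for entry in tag("Mačke spijo."):
    print(entry["form"], entry["lemma"], entry["pos"], entry["props"])
```

The lookup deliberately ignores ambiguity: the form "mačke" can also be read as other cases, and resolving such choices in context is the part a statistical model would handle.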

Statistical syntactic parser for Slovene

The MSTParser is a computer program that automatically determines the grammatical structure of a sentence. This allows us to identify predicates, subjects, objects, etc. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization and question answering.
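To show what a parser's output makes possible, here is a minimal sketch of reading the predicate, subject and object from a parsed sentence. The Token class, the relation labels ("root", "subject", "object") and the hand-written analysis of the sentence are assumptions made for this example; they are not the output format or label set of the MSTParser.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position in the sentence
    form: str      # the word as written
    head: int      # index of the governing token, 0 for the sentence root
    relation: str  # hypothetical label for the link to the head


# Hand-written analysis of "Ana bere knjigo" ("Ana reads a book").
SENTENCE = [
    Token(1, "Ana", 2, "subject"),
    Token(2, "bere", 0, "root"),
    Token(3, "knjigo", 2, "object"),
]


def predicate(sentence):
    """Return the token that has no head, taken here as the predicate."""
    return next(token for token in sentence if token.head == 0)


def dependents(sentence, head, relation):
    """Return the forms linked to the given head by the given relation."""
    return [token.form for token in sentence
            if token.head == head.index and token.relation == relation]


verb = predicate(SENTENCE)
print("predicate:", verb.form)
print("subject:", dependents(SENTENCE, verb, "subject"))
print("object:", dependents(SENTENCE, verb, "object"))
```

Storing one head index per word is a common way to represent such a structure, and downstream applications like information extraction can query it in the same way.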


A manually annotated training corpus

The ssj500k is a training corpus containing manually annotated grammatical information. The data is used to train programs for automatic text analysis that build a statistical model, or to evaluate rule-based analysis programs.
It contains manually validated annotations for segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing and named entity recognition.

ccGigafida and ccKres

Open-access corpora

ccGigafida and ccKres are sampled subcorpora of the Gigafida corpus and of its balanced version, the Kres corpus, respectively. The ccGigafida corpus contains approximately 9% of the Gigafida corpus, or 100 million words. The ccKres corpus contains approximately 9% of the Kres corpus, or 10 million words. The structure of each sample corpus is the same as that of its parent corpus. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.