Language technologies include software, online services and databases.


Web tool for automatic comma placement

With the Vejice (English: Commas) tool, you paste a text of up to 3,000 characters into a box and press the red arrow. The tool then marks missing commas in grey and redundant commas in blue. It is designed to help with comma placement and is not a substitute for proofreading. According to tests, the software currently gives correct solutions in 94% of cases.


Application for assessing the readability of texts in Slovene

The Quality of Slovene Textbooks (KaUč) project has created the first application for assessing the readability of Slovene texts. It lets you check texts of up to 5,000 characters. It alerts you to word-level problems: long, rare and repeated words, abbreviations and vocabulary not found in textbooks. It also marks long sentences and sentences with many or no verbs, and provides several statistics and readability measures.
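
The kinds of checks described above can be illustrated with a minimal Python sketch. It is not the KaUč application's actual method: the function name `flag_readability_issues` and all thresholds are assumptions chosen for illustration, and checks that need linguistic resources (rare words, textbook vocabulary, verb counts) are left out.

```python
import re
from collections import Counter

def flag_readability_issues(text, max_word_len=12, max_sentence_words=25, max_repeats=3):
    """Flag long words, repeated words and long sentences (illustrative thresholds)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # \w is Unicode-aware in Python 3, so letters such as č, š and ž are kept.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return {
        "long_words": sorted({w for w in words if len(w) > max_word_len}),
        "repeated_words": sorted(w for w, n in counts.items() if n > max_repeats),
        "long_sentences": [
            s for s in sentences if len(re.findall(r"\w+", s)) > max_sentence_words
        ],
    }
```

Calling the function on a short text returns a dictionary with one list per problem type, which a user interface could then use to highlight the affected words and sentences.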


Statistical tagger for Slovene

A tagger is a computer program that segments a text into units and assigns specific information to individual words: the part of speech, grammatical properties (gender, case, number, etc.) and the base form when a word has several inflected forms. The tagger can be tested online.
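
To show what such output might look like, here is a small Python sketch. It only illustrates the shape of the result (form, base form, part of speech, grammatical properties). The `Token` class, the `TOY_LEXICON` table and the `tag` function are hypothetical, and a fixed lookup table stands in for the statistical model that a real tagger would use.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str                      # the word as it appears in the text
    lemma: str = "?"               # base form
    pos: str = "?"                 # part of speech
    feats: dict = field(default_factory=dict)  # gender, case, number, etc.

# Hypothetical two-entry lexicon; a real tagger predicts this from a model.
TOY_LEXICON = {
    "berem": ("brati", "verb", {"person": "first", "number": "singular"}),
    # One possible reading of "knjigo"; the form is ambiguous out of context.
    "knjigo": ("knjiga", "noun",
               {"gender": "feminine", "case": "accusative", "number": "singular"}),
}

def tag(text):
    """Segment text into tokens and attach lexicon information where known."""
    tokens = []
    for form in re.findall(r"\w+|[^\w\s]", text):
        if not form[0].isalnum():
            tokens.append(Token(form, lemma=form, pos="punctuation"))
        elif form.lower() in TOY_LEXICON:
            lemma, pos, feats = TOY_LEXICON[form.lower()]
            tokens.append(Token(form, lemma, pos, dict(feats)))
        else:
            tokens.append(Token(form))  # unknown word: fields stay "?"
    return tokens
```

For the sentence "Berem knjigo." this yields three tokens: a verb with the base form "brati", a noun with the base form "knjiga", and a punctuation mark.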


A tool for the creation of corpora containing linguistic corrections

The CJVT Svala tool has been developed as a localised and adapted version of the open-access Svala tool. It is used for building text corpora with linguistic corrections and similar resources where the alignment of two different versions of a text is of interest. The tool is useful for transcription, pseudonymization and alignment, as well as for marking up language corrections in texts. It currently supports tagging according to the systems used by two Slovenian corpora with language corrections: Šolar and KOST.
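
The idea of aligning two versions of a text can be sketched with Python's standard `difflib` module. This is only an illustration of word-level alignment, not how Svala itself works; the function name `align_corrections` is an assumption.

```python
from difflib import SequenceMatcher

def align_corrections(original, corrected):
    """Return the word-level differences between two versions of a text."""
    src, tgt = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    # Keep only the spans that differ: 'replace', 'delete' or 'insert'.
    return [
        (op, src[i1:i2], tgt[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# A missing comma shows up as a replacement of "Mislim" by "Mislim,".
print(align_corrections("Mislim da pride jutri", "Mislim, da pride jutri"))
```

Each returned tuple pairs the words from the original with their counterpart in the corrected version, which is the point where a correction label could be attached.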

Slovenian Training Corpus SUK

Corpus for training statistical analysers

SUK is a training corpus in which manually reviewed linguistic information has been added to the source text. The data is used to train machine learning algorithms, which build statistical models from it, or to check the correctness of analyses produced by rule-based programs. A statistical program then uses the model trained on the corpus to analyse new, unseen texts.
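
The train-then-analyse cycle can be illustrated with a deliberately simple Python sketch: a most-frequent-tag baseline. It is far simpler than the models actually trained on such corpora, and the function names, tag labels and the tiny corpus are assumptions made up for the example.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Build a model that maps each word to its most frequent tag in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def predict(model, words, default="UNKNOWN"):
    """Analyse new text: look up each word, fall back to a default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Tiny hand-made "training corpus" of (word, tag) pairs.
corpus = [
    [("pes", "NOUN"), ("laja", "VERB")],
    [("pes", "NOUN"), ("spi", "VERB")],
]
model = train(corpus)
print(predict(model, ["Pes", "teče"]))
```

The word "teče" does not occur in the training data, so the baseline cannot tag it; handling unseen words well is one reason real systems rely on richer statistical models and larger corpora.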

Statistical syntactic parser for Slovene

The MSTParser is a computer program that automatically determines the grammatical structure of a sentence. This allows us to identify predicates, subjects, objects, etc. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization and question answering.


A manually annotated training corpus

The ssj500k is a training corpus containing manually annotated grammatical information. This data is used to train programs for automatic text analysis, which build a statistical model from it, or to evaluate rule-based analysis programs.
It contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing and named entity recognition.

ccGigafida and ccKres

Open-access corpora

ccGigafida and ccKres are two sampled subcorpora of the Gigafida corpus and its balanced version, the Kres corpus. The ccGigafida corpus contains approximately 9% of the Gigafida corpus, or 100 million words. The ccKres corpus contains approximately 9% of the Kres corpus, or 10 million words. The structure of the sample corpora is the same as that of their parent corpora. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.