Language technologies

TOOLS AND RESOURCES

Language technologies include software, online services and databases.

Senta

Online tool for sentence simplification and analysis

Senta assesses the complexity of each sentence and simplifies the complex ones, leaving the simple ones unchanged. For both the original and the simplified text, the main features are analysed, including the proportion of words that are not in the reference list or textbook. In Senta, the simplified text can be edited in either display mode and can be copied for use outside the application.
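One of the features mentioned above, the proportion of words missing from a reference list, can be illustrated with a minimal Python sketch. The function name `oov_proportion`, the tokenisation rule and the tiny word list are assumptions made for illustration; they do not describe how Senta itself is implemented.

```python
import re

def oov_proportion(text, reference_words):
    """Return the share of word tokens in `text` not found in `reference_words`."""
    # Simplistic tokenisation (an assumption): runs of letters, lower-cased.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 0.0
    reference = {w.lower() for w in reference_words}
    missing = [t for t in tokens if t not in reference]
    return len(missing) / len(tokens)

# Hypothetical three-word reference list, for illustration only.
reference_list = {"pes", "je", "velik"}
share = oov_proportion("Pes je ogromen.", reference_list)
```

Here one of the three words ("ogromen") is absent from the list, so `share` is one third. Computing the same figure for the original and the simplified text allows the two to be compared.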

Vejice

Web tool for automatic comma placement

The Vejice (Eng. "Commas") tool lets you paste text of up to 3,000 characters into a box and press the red arrow. The tool then marks missing commas in grey and redundant commas in blue. It is designed to help with comma placement and is not a substitute for proofreading. According to tests, the software currently gives correct solutions in 94% of cases.
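The two kinds of marking can be thought of as the difference between the pasted text and a comma-corrected version of it. The Python sketch below illustrates only that comparison; it does not predict where commas belong and is not the method Vejice uses. The function `comma_differences` and the example sentences are hypothetical; the 3,000-character limit is the one stated above.

```python
MAX_CHARS = 3000  # input limit stated for the tool

def comma_differences(original, corrected):
    """Return (missing, redundant): word positions where a comma should be added or removed."""
    if len(original) > MAX_CHARS:
        raise ValueError("text exceeds the 3,000-character limit")
    orig_words = original.split()
    corr_words = corrected.split()
    # The two versions may differ only in commas attached to the ends of words.
    if [w.rstrip(",") for w in orig_words] != [w.rstrip(",") for w in corr_words]:
        raise ValueError("texts differ in more than commas")
    missing, redundant = [], []
    for i, (o, c) in enumerate(zip(orig_words, corr_words)):
        if c.endswith(",") and not o.endswith(","):
            missing.append(i)      # comma absent in the original
        elif o.endswith(",") and not c.endswith(","):
            redundant.append(i)    # comma present in the original but not needed
    return missing, redundant

missing, redundant = comma_differences(
    "Rekel je da pride, jutri.",
    "Rekel je, da pride jutri.",
)
```

In this example the comma after the second word is missing and the comma after the fourth word is redundant, so the function returns the zero-based positions 1 and 3.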

Berljivost

Application for assessing the readability of texts in Slovene

The Quality of Slovene Textbooks (KaUč) project has created the first application for assessing the readability of Slovene texts. It lets you check texts of up to 5,000 characters. It flags word-level problems: long, rare and repetitive words, abbreviations and vocabulary not found in textbooks. It also marks long sentences and sentences with many or no verbs, and provides several statistics and readability measures.
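A few of the surface checks described above (long words, repeated words, long sentences) can be sketched in Python as follows. The thresholds, the function name `readability_flags` and the sentence-splitting rule are assumptions chosen for illustration, not the application's actual settings; checks that need linguistic analysis, such as counting verbs or detecting rare words, are left out.

```python
import re
from collections import Counter

MAX_CHARS = 5000      # input limit stated for the application
LONG_WORD = 12        # assumed threshold, in characters
LONG_SENTENCE = 25    # assumed threshold, in words
REPEAT_LIMIT = 3      # assumed number of occurrences that counts as repetitive

def readability_flags(text):
    """Collect simple surface-level readability warnings for `text`."""
    if len(text) > MAX_CHARS:
        raise ValueError("text exceeds the 5,000-character limit")
    # Naive sentence split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return {
        "long_words": sorted({w for w in words if len(w) >= LONG_WORD}),
        "repeated_words": sorted(w for w, n in counts.items() if n >= REPEAT_LIMIT),
        "long_sentences": [s for s in sentences if len(s.split()) >= LONG_SENTENCE],
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

flags = readability_flags("To je test. To je nepredstavljivo dolg test. To je to.")
```

For this short sample the sketch reports one long word ("nepredstavljivo"), two repeated words ("je" and "to") and no long sentences.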

Obeliks

Statistical tagger for Slovene

A tagger is a computer program that segments a text into units and assigns specific information to individual words, such as part of speech and grammatical properties (gender, case, number, etc.), or assigns a base form to a word that has several inflected forms. The tagger can be tested here.
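The kind of information a tagger attaches to each unit can be pictured as a small record per token. The Python sketch below is a hand-written illustration: the `TaggedToken` class, its field names and the plain-English labels are assumptions and do not reproduce Obeliks's actual output format or tagset.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedToken:
    form: str                 # the word as it appears in the text
    lemma: str                # its base (dictionary) form
    pos: str                  # part of speech
    features: dict = field(default_factory=dict)  # gender, case, number, ...

# Hand-annotated example sentence "Videl sem hiše." ("I saw houses.")
sentence = [
    TaggedToken("Videl", "videti", "verb", {"gender": "masculine", "number": "singular"}),
    TaggedToken("sem", "biti", "auxiliary", {"person": "first", "number": "singular"}),
    TaggedToken("hiše", "hiša", "noun",
                {"gender": "feminine", "case": "accusative", "number": "plural"}),
    TaggedToken(".", ".", "punctuation"),
]

lemmas = [token.lemma for token in sentence]
nouns = [token.form for token in sentence if token.pos == "noun"]
```

Once every token carries a lemma and grammatical properties, later steps can search for all inflected forms of a word or select tokens by part of speech.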

CJVT SVALA

A tool for the creation of corpora containing linguistic corrections

The CJVT Svala tool has been developed as a localised and adapted version of the open access Svala tool. It is used for building text corpora with linguistic corrections and similar resources where the alignment of two different versions of a text is of interest. The tool is useful for transcription, pseudonymization, alignment, as well as for marking up language corrections in texts. It currently supports tagging according to the systems used by two Slovenian corpora with language corrections: Šolar and KOST.
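Aligning two versions of a text, for instance a learner's original and a teacher's correction, can be sketched with Python's standard `difflib` module. This is a generic token-level diff under simple assumptions (whitespace tokenisation, a hypothetical `align_tokens` helper and an invented example sentence); it is not the alignment procedure implemented in CJVT Svala.

```python
from difflib import SequenceMatcher

def align_tokens(source, target):
    """Align two whitespace-tokenised versions of a text.

    Returns a list of (operation, source_tokens, target_tokens) triples,
    where operation is 'equal', 'replace', 'delete' or 'insert'.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]

links = align_tokens("Jaz grem v šola jutri", "Jaz grem v šolo jutri")
# Keep only the spans where the two versions differ.
corrections = [(s, t) for tag, s, t in links if tag != "equal"]
```

In the example the only differing span pairs "šola" with "šolo"; in a corrections corpus, each such pair is what an annotator would then label with a correction category.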

Slovenian Training Corpus SUK

Corpus for training statistical analysers

SUK is a training corpus that contains manually reviewed linguistic information, which has been added to the source text. This data is used for training machine learning algorithms, which build statistical models from it, or to check the correctness of the analysis by rule-based programs. In statistical programs, such a model trained on a corpus is used to analyse new, unknown texts.

Statistical syntactic parser for Slovene

The MSTParser is a computer program that automatically determines the grammatical structure of a sentence. This allows us to identify predicates, subjects, objects, etc. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization and question answering.

ssj500k

A manually annotated training corpus

ssj500k is a training corpus containing manually annotated grammatical information. This data is used to train programs for automatic text analysis, which build a statistical model from it, or to evaluate rule-based analysis programs.
It contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing and named entity recognition.

ccGigafida and ccKres

Open-access corpora

ccGigafida and ccKres are two sampled subcorpora of the Gigafida corpus and its balanced version, the Kres corpus. The ccGigafida corpus contains approximately 100 million words, or about 9% of the Gigafida corpus. The ccKres corpus contains approximately 10 million words, or about 9% of the Kres corpus. The structure of the sample corpora is the same as that of their parent corpora. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.