Resources and tools

CJVT maintains resources and tools developed in the "Communication in Slovene" project and in other projects with similar results.


Gigafida, corpus of modern written Slovene

Gigafida is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, all kinds of books (fiction, non-fiction, textbooks), web pages, transcriptions of parliamentary debates and similar. It contains almost 1.2 billion words (1,187,002,502 exactly).


Kres, balanced corpus of modern written Slovene

Kres was sampled from the Gigafida corpus and is balanced, in particular by text type or genre. The basic sampling units were random paragraphs rather than entire corpus documents. It contains almost 100 million words (99,831,145 exactly).
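The idea of sampling paragraphs rather than whole documents, with a fixed share per genre, can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_balanced`, the genre labels and the quotas are hypothetical, and the actual Kres sampling procedure is not described here in enough detail to reproduce.

```python
import random

def sample_balanced(documents, quotas, seed=0):
    """Draw random paragraphs per genre.

    documents: list of (genre, [paragraphs]) pairs (hypothetical input format)
    quotas:    dict mapping genre -> number of paragraphs to draw (assumed)
    """
    rng = random.Random(seed)
    # Pool paragraphs by genre, so the sampling unit is the paragraph,
    # not the document it came from.
    pools = {}
    for genre, paragraphs in documents:
        pools.setdefault(genre, []).extend(paragraphs)
    sample = {}
    for genre, quota in quotas.items():
        pool = pools.get(genre, [])
        # Never ask for more paragraphs than the genre actually has.
        sample[genre] = rng.sample(pool, min(quota, len(pool)))
    return sample

docs = [
    ("news", ["n1", "n2", "n3"]),
    ("fiction", ["f1", "f2"]),
    ("news", ["n4"]),
]
picked = sample_balanced(docs, {"news": 2, "fiction": 5})
```

Here `picked["news"]` holds two of the four news paragraphs, while `picked["fiction"]` holds both fiction paragraphs because the quota exceeds what is available.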


Gos, corpus of spoken Slovene

Gos is a corpus of transcripts of approximately 120 hours of speech as it is used daily in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc.


Šolar, developmental corpus of Slovene

Šolar is a corpus of authentic texts written by pupils and students in Slovene primary and secondary schools during school classes. It contains nearly 1 million words (967,477 exactly). It also incorporates teachers’ comments and corrections, which support both the analysis of students' written production and an insight into what is actually corrected (and not corrected) in the teaching process.


Lektor, corpus of copy-edited texts

Lektor contains nearly 1 million words and consists of proofread documents corrected by "lektors", normatively oriented copy editors who are part of the usual text production process in Slovenia. It is intended for researchers interested in the process of "lektoriranje", the normative copy editing of Slovene.


Specialised corpora

Specialised corpora from different fields of science:

  • KoRP (Faculty of Social Sciences): Corpus KoRP is a synchronic monolingual corpus of written texts on public relations. It contains 1.8 million words from texts published between 1994 and 2007.
  • Corpus of specialised texts from the field of education (Faculty of Education).

Online concordancer for written language

A written corpus concordancer is a computer program which enables searching in large collections of texts (written corpora) on the web. The interface is simple and was designed to be user-friendly, including for use in schools. Through the interface, users can analyse and monitor how real modern Slovene is used, primarily in the two written corpora Gigafida and Kres.
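As a rough illustration of what a concordancer does, the sketch below lists every occurrence of a search word together with a few words of context on each side (a "keyword in context" view). It is a minimal sketch, not the code behind the online concordancer: the function name `kwic`, the `width` parameter and the sample sentences are assumptions made for this example.

```python
import re

def kwic(texts, keyword, width=3):
    """Return (left context, match, right context) for each hit of `keyword`."""
    lines = []
    for text in texts:
        # Simple word tokenization; punctuation is dropped.
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines

sample = ["Danes je lep dan.", "Vsak dan berem časopis."]
hits = kwic(sample, "dan", width=2)
for left, match, right in hits:
    print(f"{left:>15} [{match}] {right}")
```

This toy version matches exact word forms only. Searching by lemma, so that all inflected forms of a word are found together, requires a lemmatized corpus.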


Online concordancer for spoken language

A spoken corpus concordancer is a computer program which enables searching in large collections of spoken data (speech corpora) on the web. The interface is simple and was designed to be user-friendly, including for use in schools. Through the interface, users can analyse and monitor how real modern Slovene is spoken, primarily in the corpus Gos.


Collocations Dictionary of Modern Slovene

The Collocations Dictionary of Modern Slovene contains 35,989 headwords and 7,338,801 collocations.


Sloleks, morphological lexicon

Sloleks is a lexicon of Slovene word forms. It contains, in an XML database, basic information on 100,000 Slovene words, especially their word class and related features. For each word, all its word forms are provided. Since Slovene is a morphologically very rich language, each word has many word forms. Inflection is typical of nouns, adjectives, pronouns, numerals, verbs and adverbs.


Thesaurus of Modern Slovene

The Thesaurus of Modern Slovene contains 105,473 keywords and 368,117 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms. Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida reference corpus of modern Slovene.


Slovene lexical database

The lexical database is structured as a network of interrelated semantic and syntactic information about a particular word. The Slovene Lexical Database was created primarily to fill the existing gap in the comprehensive lexical description of Slovene, both by detecting changes in the modern vocabulary of Slovene and by introducing modern lexicographic procedures into Slovene lexicography. It contains 2,500 lemmas with detailed descriptions of semantics, syntactic patterns, collocations, examples, multi-word expressions and phraseology.

In addition, CJVT maintains specialised and other dictionaries created in various projects. These dictionaries are available on the Termania portal.


Online orthography guide

The main goal of the Online Orthography Guide is to show difficult spots in the Slovene written and spoken standard and to offer explanations and solutions to these problems in an interesting and easy-to-understand manner. Explanations are devised on the basis of data from different corpus collections, and as such they represent a modern view of language and depict real, live Slovene as it is used here and now.


Pedagogical grammar portal

The pedagogical grammar portal is a compilation of recurrent language problems organised by chapters. Each chapter focuses on one concrete language problem. As a result, each chapter is self-contained and knowledge of other grammatical problems is not needed. A lot of effort was put into developing comprehensible explanations: terminology is kept to a minimum, more demanding terms are explained, and the language of definitions is very simple. This concept enables teachers to combine school lessons with use of the portal at home.


Obeliks, statistical tagger for Slovene

The Obeliks tagger consists of three components: a segmentation and tokenization module, which segments the text into sentences and words; the part-of-speech tagging module itself, which assigns a part of speech and its properties to each identified word; and a lemmatization module, which assigns a lemma to each word.
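The flow of data through the three components described above can be sketched with a toy pipeline. This is not the Obeliks code or its interface: Obeliks is a statistical tagger, whereas the sketch below uses a tiny hand-written lookup table (`TOY_LEXICON`, hypothetical) purely to show how the output of one stage feeds the next. All function names are assumptions.

```python
import re

# Hypothetical toy lexicon: word form -> (part-of-speech label, lemma).
TOY_LEXICON = {
    "berem": ("verb", "brati"),
    "knjigo": ("noun", "knjiga"),
    "spim": ("verb", "spati"),
}

def segment(text):
    """Stage 1: split text into sentences, then each sentence into words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+", s) for s in sentences]

def tag(tokens):
    """Stage 2: assign a part-of-speech label to each word (lookup only)."""
    return [(t, TOY_LEXICON.get(t.lower(), ("unknown", None))[0]) for t in tokens]

def lemmatize(tagged):
    """Stage 3: assign a lemma; unknown words fall back to their lowercase form."""
    return [
        (t, pos, TOY_LEXICON.get(t.lower(), (None, t.lower()))[1])
        for t, pos in tagged
    ]

def annotate(text):
    """Run all three stages in order, sentence by sentence."""
    return [lemmatize(tag(tokens)) for tokens in segment(text)]

result = annotate("Berem knjigo. Spim.")
```

Each sentence comes back as a list of (word, part of speech, lemma) triples, for example `("knjigo", "noun", "knjiga")`. A real tagger would also keep punctuation tokens and predict labels for words outside any lexicon.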


Statistical syntactic parser for Slovene

MSTParser is a computer program for automatically determining the grammatical structure of a sentence. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization, question answering, etc.


ssj500k, manually annotated training corpus

The ssj500k training corpus contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing (11,411 sentences) and named entity recognition (personal name, place name, proper name, organisation). In total, it contains 500,295 words, 27,829 sentences and 4,398 named entities.


ccGigafida and ccKres, written corpora with open access

ccGigafida and ccKres are sampled subcorpora of the Gigafida corpus and of its balanced version, the Kres corpus, respectively. ccGigafida contains approximately 9% of the Gigafida corpus, or 100 million words; ccKres contains approximately 9% of the Kres corpus, or 10 million words. The structure of the sample corpora is the same as that of their parent corpora. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.

