Gigafida is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, books of all kinds (fiction, non-fiction, textbooks), web pages, transcripts of parliamentary debates and similar. It contains almost 1.2 billion words (1,187,002,502 to be exact).
Kres is a corpus sampled from Gigafida and balanced in particular by text type or genre. The basic sampling units were random paragraphs rather than entire corpus documents. It contains almost 100 million words (99,831,145 to be exact).
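The idea of sampling paragraphs rather than whole documents can be illustrated with a minimal sketch. It assumes a toy in-memory corpus; the function name `sample_paragraphs`, the genre labels and the per-genre word quota are hypothetical and are not taken from the actual procedure used to build Kres.

```python
import random
from collections import defaultdict

def sample_paragraphs(documents, quota_per_genre, seed=0):
    """Draw random paragraphs (not whole documents) up to a word quota per genre.

    `documents` is a list of dicts with a "genre" label and a list of
    "paragraphs" (hypothetical structure, for illustration only).
    """
    rng = random.Random(seed)

    # Pool paragraphs by genre, ignoring document boundaries.
    pool = defaultdict(list)
    for doc in documents:
        for paragraph in doc["paragraphs"]:
            pool[doc["genre"]].append(paragraph)

    sample = {}
    for genre, paragraphs in pool.items():
        rng.shuffle(paragraphs)
        chosen, words = [], 0
        for paragraph in paragraphs:
            if words >= quota_per_genre:
                break  # quota reached for this genre
            chosen.append(paragraph)
            words += len(paragraph.split())
        sample[genre] = chosen
    return sample
```

Using the same quota for every genre is only one possible way of balancing; a real design would set genre proportions deliberately.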
GOS is a corpus of transcripts of approximately 120 hours of speech as it is used daily in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc.
Šolar is a corpus of authentic texts written by pupils and students in Slovene primary and secondary schools during school classes. It contains almost 1 million words (967,477 to be exact). It also incorporates teachers' comments and corrections, which allow both an analysis of students' written production and an insight into what is actually corrected (and not corrected) in the teaching process.
Lektor contains nearly 1 million words and consists of proofread documents corrected by "lektors", normatively oriented copy editors who are part of the usual text production process in Slovenia. It is intended for researchers interested in the process of "lektoriranje", the normative copy editing of Slovene.
Specialised corpora from different fields of science:
A written corpus concordancer is a computer program which enables searching in large collections of texts (written corpora) on the web. The interface is simple and was designed to be user-friendly, including for use in schools. Through the interface, users can analyse and monitor how real modern Slovene is used, primarily in the two written corpora Gigafida and Kres.
A spoken corpus concordancer is a computer program which enables searching in large collections of spoken data (speech corpora) on the web. The interface is simple and was designed to be user-friendly, including for use in schools. Through the interface, users can analyse and monitor how real modern Slovene is spoken, primarily in the GOS corpus.
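The core operation of a concordancer, showing every occurrence of a search word together with its immediate context (keyword in context, KWIC), can be sketched as follows. This is a minimal illustration over a plain string, not the implementation behind the web concordancers described here; the function name `kwic` and the `width` parameter are hypothetical.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of `keyword`.

    Matching is case-insensitive on whole tokens; `width` is the number of
    tokens kept on either side of the hit.
    """
    # \w matches Unicode letters, so Slovene characters such as č, š, ž are kept.
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

for left, hit, right in kwic("Jezik je živ. Slovenski jezik se spreminja.", "jezik", width=2):
    print(f"{left:>15} [{hit}] {right}")
```

A production concordancer would query a pre-built index and usually supports searching by lemma or grammatical tag; scanning raw text as above only works for small samples.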
Sloleks is a lexicon of Slovene word forms. It contains, in an XML database, basic information on 100,000 Slovene words, especially their word class and related features. For each word, all its word forms are provided. Since Slovene is a morphologically very rich language, each word has many word forms. Inflection is typical of nouns, adjectives, pronouns, numerals, verbs and adverbs.
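To illustrate how a lexicon of word forms can be used, the sketch below builds a lookup table from word form to lemma and grammatical features. The XML layout is a hypothetical, heavily simplified stand-in; the element and attribute names (`entry`, `form`, `lemma`, `pos`, `case`, `number`) are assumptions and do not reproduce the actual Sloleks schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified entry for the noun "miza" (table);
# the real Sloleks XML is structured differently.
SAMPLE = """
<lexicon>
  <entry lemma="miza" pos="noun">
    <form case="nominative" number="singular">miza</form>
    <form case="genitive" number="singular">mize</form>
    <form case="accusative" number="singular">mizo</form>
  </entry>
</lexicon>
"""

def build_form_index(xml_text):
    """Map each word form to the (lemma, part of speech, features) it can realise."""
    index = {}
    root = ET.fromstring(xml_text.strip())
    for entry in root.iter("entry"):
        for form in entry.iter("form"):
            index.setdefault(form.text, []).append(
                (entry.get("lemma"), entry.get("pos"), dict(form.attrib))
            )
    return index

index = build_form_index(SAMPLE)
print(index["mizo"])
```

Each form maps to a list because in a morphologically rich language the same surface form can belong to several lemmas or feature combinations.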
The Thesaurus of Modern Slovene contains 105,473 keywords and 368,117 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms. Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida reference corpus of modern Slovene.
The Slovene Lexical Database is structured as a network of interrelated semantic and syntactic information about a particular word. Its primary purpose is to fill the existing gap in the comprehensive lexical description of Slovene, both by detecting changes in the modern vocabulary of Slovene and by introducing modern lexicographic procedures into Slovene lexicography. It contains 2,500 lemmas with detailed descriptions of semantics, syntactic patterns, collocations, examples, multi-word expressions and phraseology.
In addition, CJVT maintains specialised and other dictionaries created in various projects:
Dictionaries are available on the Termania portal.
The main goal of the Online Orthography Guide is to show difficult spots in the Slovene written and spoken standard and to offer explanations and solutions to these problems in an interesting and easy-to-understand manner. Explanations are based on data from different corpus collections, so they represent a modern view of language and depict real, live Slovene as it is used here and now.
The pedagogical grammar portal is a compilation of recurrent language problems organised by chapters. Each chapter focuses on one concrete language problem. As a result, each chapter is self-contained and requires no knowledge of other grammatical problems. A lot of effort was put into developing comprehensible explanations: terminology is kept to a minimum, more demanding terms are explained, and the language of the definitions is very simple. This concept enables teachers to combine school lessons with use of the portal at home.
The Obeliks tagger consists of three components: a segmentation and tokenization module, which segments the text into sentences and words; the part-of-speech module itself, which assigns a part of speech and its properties to each identified word; and a lemmatization module, which assigns a lemma to each word.
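The three-stage structure described above can be sketched as a small pipeline. This is a toy illustration only: the lookup table `LEXICON`, the function names and the regular expressions are hypothetical, and the sketch makes no claim about how the Obeliks modules are actually implemented.

```python
import re

# Hypothetical toy lexicon: word form -> (part of speech, lemma).
LEXICON = {
    "mize": ("noun", "miza"),
    "so": ("verb", "biti"),
    "nove": ("adjective", "nov"),
}

def segment(text):
    """Stage 1: split text into sentences, then each sentence into tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def tag(tokens):
    """Stage 2: assign a part-of-speech label to every token."""
    tagged = []
    for token in tokens:
        if not token.isalnum():
            pos = "punct"
        else:
            pos = LEXICON.get(token.lower(), ("unknown", None))[0]
        tagged.append((token, pos))
    return tagged

def lemmatize(tagged):
    """Stage 3: assign a lemma; unknown tokens fall back to their lowercased form."""
    result = []
    for token, pos in tagged:
        lemma = LEXICON.get(token.lower(), (None, token.lower()))[1]
        result.append((token, pos, lemma))
    return result

def analyse(text):
    """Run the three stages in order on every sentence."""
    return [lemmatize(tag(tokens)) for tokens in segment(text)]

print(analyse("Mize so nove."))
```

Keeping the stages as separate functions mirrors the modular description: each stage consumes the output of the previous one and can be replaced independently.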
MSTParser is a computer program for automatically determining the grammatical structure of a sentence. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization, question answering, etc.
The ssj500k training corpus contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing (11,411 sentences) and named entity recognition (personal name, place name, proper name, organisation). In total, it contains 500,295 words, 27,829 sentences and 4,398 named entities.
ccGigafida and ccKres are sampled subcorpora of the Gigafida corpus and of its balanced version, the Kres corpus, respectively. ccGigafida contains approximately 9% of the Gigafida corpus, or 100 million words, and ccKres contains approximately 9% of the Kres corpus, or 10 million words. The structure of the sample corpora is the same as that of their parent corpora. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.