Gigafida, corpus of modern written Slovene
Gigafida is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, all kinds of books (fiction, non-fiction, textbooks), web pages, transcriptions of parliamentary debates and similar. It contains almost 1.2 billion words (1,187,002,502 to be exact).
Kres, balanced corpus of modern written Slovene
Kres was sampled from the Gigafida corpus and is balanced, particularly with respect to text types or genres. The basic sampling units were random paragraphs rather than entire corpus documents. It contains almost 100 million words (99,831,145 to be exact).
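The paragraph-level sampling described above can be illustrated with a minimal sketch. It is not the procedure actually used to build Kres: the genre labels, the per-genre word targets and the function name `sample_balanced` are all assumptions made for illustration.

```python
import random

def sample_balanced(paragraphs, targets, seed=0):
    """Draw random paragraphs per genre until each genre reaches its word target.

    paragraphs: list of (genre, text) pairs (hypothetical input format)
    targets:    dict mapping genre -> target number of words (hypothetical values)
    """
    rng = random.Random(seed)

    # Group paragraph texts by genre.
    by_genre = {}
    for genre, text in paragraphs:
        by_genre.setdefault(genre, []).append(text)

    sample = []
    for genre, target in targets.items():
        pool = list(by_genre.get(genre, []))
        rng.shuffle(pool)  # random paragraphs, not whole documents
        words = 0
        for text in pool:
            if words >= target:
                break
            sample.append((genre, text))
            words += len(text.split())
    return sample

# Toy data: ten five-word paragraphs per genre.
paragraphs = [("news", f"novica {i} a b c") for i in range(10)]
paragraphs += [("fiction", f"zgodba {i} a b c") for i in range(10)]
balanced = sample_balanced(paragraphs, {"news": 10, "fiction": 20})
```

Because the unit of sampling is the paragraph, a single long document cannot dominate a genre's share of the sample.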
Gos, corpus of spoken Slovene
Gos is a corpus of transcripts of approximately 120 hours of speech of the kind used on a daily basis in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc.
Šolar, developmental corpus of Slovene
Šolar is a corpus of authentic texts written by pupils and students in Slovene primary and secondary schools during school classes. It contains almost 1 million words (967,477 to be exact). It also incorporates teachers’ comments and corrections, which allow both an analysis of students' written production and an insight into what is actually corrected (and not corrected) in the teaching process.
Lektor, corpus of copy-edited texts
Lektor contains nearly 1 million words and consists of proofread documents corrected by "lektors", normatively oriented copy editors who are part of the usual text production process in Slovenia. It is intended for researchers interested in the process of "lektoriranje", the normative copy editing of Slovene.
Specialised corpora from different fields of science:
- KoRP (Faculty of Social Sciences): Corpus KoRP is a synchronic monolingual corpus of written texts on public relations. It contains 1.8 million words from texts published between 1994 and 2007.
- Corpus of specialised texts from the field of education (Faculty of Education).
Online concordancer for written language
A written corpus concordancer is a computer program which enables searching in large collections of texts (written corpora) on the web. The interface is simple and was designed to offer a user-friendly experience, including in schools. Through the interface, users can analyse and monitor how real modern Slovene is used, primarily in the two written corpora Gigafida and Kres.
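To illustrate what a concordancer returns, here is a minimal keyword-in-context (KWIC) sketch. It only shows the general technique on an in-memory string; the function name `kwic` and the sample sentence are assumptions, and the online concordancer itself works on indexed, annotated corpora rather than on raw text.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

sample = "Jezik je živ. Slovenski jezik se spreminja, jezik pa ostaja."
for line in kwic(sample, "jezik", width=20):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot by eye.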
Online concordancer for spoken language
A spoken corpus concordancer is a computer program which enables searching in large collections of spoken data (speech corpora) on the web. The interface is simple and was designed to offer a user-friendly experience, including in schools. Through the interface, users can analyse and monitor how real modern Slovene is spoken, primarily in the corpus Gos.
Collocations Dictionary of Modern Slovene
The Collocations Dictionary of Modern Slovene contains 35,989 headwords and 7,338,801 collocations.
Sloleks, morphological lexicon
Sloleks is a lexicon of Slovene word forms. It contains, in an XML database, basic information on 100,000 Slovene words, especially their word class and related features. For each word, all its word forms are provided. Since Slovene is a morphologically very rich language, each word has many word forms. Inflection is typical of nouns, adjectives, pronouns, numerals, verbs and adverbs.
Thesaurus of Modern Slovene
The Thesaurus of Modern Slovene contains 105,473 keywords and 368,117 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms. Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida reference corpus of modern Slovene.
Slovene lexical database
The lexical database is structured as a network of interrelated semantic and syntactic information about a particular word. The purpose of the Slovene Lexical Database is to fill the existing gap in the comprehensive lexical description of Slovene, both by detecting changes in the modern vocabulary of Slovene and by introducing modern lexicographic procedures into Slovene lexicography. It contains 2,500 lemmas with detailed descriptions of semantics, syntactic patterns, collocations, examples, multi-word expressions and phraseology.
In addition, CJVT maintains specialised and other dictionaries created in various projects:
- Dictionary of education terminology
- Terminological database for the field of public relations
- Dictionary of tourism
- Dictionaries of Prekmurje and Dolenjsko Romani
- Language technology dictionary
- and others.
The dictionaries are available on the Termania portal.
Online orthography guide
The main goal of the Online Orthography Guide is to show difficult spots in the Slovene written and spoken standard and to offer explanations and solutions to these problems in an interesting and easy-to-understand manner. Explanations are devised on the basis of data from different corpus collections, and as such they represent a modern view of language and depict real, live Slovene as it is used here and now.
Pedagogical grammar portal
The Pedagogical grammar portal is a compilation of recurrent language problems organised by chapters. Each chapter focuses on one concrete language problem and is self-contained, so knowledge of other grammatical problems is not needed. A lot of effort was put into developing comprehensible explanations: terminology is kept to a minimum, more demanding terms are explained, and the language of definitions is very simple. This concept enables teachers to combine school lessons with the use of the portal at home.
Obeliks, statistical tagger for Slovene
The Obeliks tagger consists of three components: a segmentation and tokenization module, which segments the text into sentences and words; the part-of-speech tagging module itself, which assigns the part of speech and its properties to each identified word; and a lemmatization module, which assigns a lemma to each word.
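The three-stage design described above can be sketched as a small pipeline. This is only an illustration of how the stages feed into each other: it is not the Obeliks API, it uses a hand-written lookup table (`TOY_LEXICON`, a hypothetical name with hypothetical entries) instead of a statistical model, and the tag labels are placeholders rather than the morphosyntactic tags Obeliks assigns.

```python
import re

# Hypothetical toy lexicon: lowercase word form -> (part-of-speech tag, lemma).
TOY_LEXICON = {
    "mačka": ("NOUN", "mačka"),
    "mačke": ("NOUN", "mačka"),
    "spi": ("VERB", "spati"),
    "spijo": ("VERB", "spati"),
}

def segment(text):
    """Stage 1: split text into sentences, then each sentence into tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def tag(tokens):
    """Stage 2: assign a tag to each token (lexicon lookup, 'X' if unknown)."""
    tagged = []
    for token in tokens:
        if not token[0].isalnum():
            tagged.append((token, "PUNCT"))
        else:
            pos, _ = TOY_LEXICON.get(token.lower(), ("X", None))
            tagged.append((token, pos))
    return tagged

def lemmatize(tagged):
    """Stage 3: assign a lemma to each token (lowercased form if unknown)."""
    result = []
    for token, pos in tagged:
        _, lemma = TOY_LEXICON.get(token.lower(), (None, None))
        result.append((token, pos, lemma if lemma is not None else token.lower()))
    return result

def annotate(text):
    """Run the three stages in order on raw text."""
    return [lemmatize(tag(tokens)) for tokens in segment(text)]

annotated = annotate("Mačka spi. Mačke spijo.")
```

Keeping the stages as separate functions mirrors the modular structure: each stage consumes the output of the previous one and can be replaced independently.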
Statistical syntactic parser for Slovene
MSTParser is a computer program for automatically determining the grammatical structure of a sentence. Syntactic parsing is also one of the basic natural language processing procedures that support more complex language technologies such as machine translation, information extraction, speech technologies, automatic summarization, question answering, etc.
ssj500k, manually annotated training corpus
The ssj500k training corpus contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing (11,411 sentences) and named entity recognition (personal name, place name, proper name, organisation). In total, it contains 500,295 words, 27,829 sentences and 4,398 named entities.
ccGigafida and ccKres, written corpora with open access
ccGigafida and ccKres are two sampled subcorpora of the Gigafida corpus and of its balanced version, the Kres corpus. ccGigafida contains approximately 9% of the Gigafida corpus, or 100 million words, and ccKres contains approximately 9% of the Kres corpus, or 10 million words. The structure of the sampled corpora is the same as that of their parent corpora. The ccGigafida and ccKres corpora enable in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.