TOOLS AND RESOURCES
Corpora are electronic collections of authentic texts, structured according to predefined standards and goals. They come with tools for multilayered searching of language data.
Gigafida 2.0
Reference corpus of written standard Slovene
Gigafida 2.0 is an extensive and thoughtfully composed reference corpus containing 1,134,693,933 words from 38,310 texts composed between 1990 and 2018. The Gigafida 2.0 corpus is a fundamental data source for modern Slovene, used for linguistic research, describing the language (dictionaries, grammars), preparing learning materials, and developing a variety of language resources and processes. Unlike the previous editions, the 2.0 version is a corpus of standard Slovene, which means it mainly contains texts written in the standard language.

Šolar 2.0
Corpus of school texts

Gigafida 1.0
The reference corpus of written standard Slovene
Gigafida is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, all kinds of books (fiction, non-fiction, textbooks), web pages, transcripts of parliamentary debates and similar. It contains almost 1.2 billion words, or exactly 1,187,002,502 words. The corpus contains texts written between 1990 and 2011. The first version of the corpus was built during the project Communication in Slovene.

Kres
Balanced corpus of modern written Slovene
Kres was sampled from the Gigafida corpus and is a balanced corpus that contains almost 100 million words, or exactly 99,831,145 words. The basic sampling units were not entire corpus documents but random paragraphs, so that individual works are better represented. Compared to Gigafida, the Kres corpus is intended for linguistic inquiries that rely on the reference value of a corpus sample with a well-thought-out, known and balanced structure.

Gos
Corpus of spoken Slovene
Gos includes the transcripts of approximately 120 hours of speech that we are exposed to on a daily basis in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc. All speech is transcribed in two versions, one with pronunciation-based spelling and one with standardized spelling, and the corpus comprises over one million words. The corpus can be searched by means of a web concordancer; for every concordance, it is also possible to listen to the corresponding recording.

Šolar 1.0
Corpus of school texts
The Šolar corpus includes authentic texts written by Slovene primary and secondary school pupils. It contains almost one million words, or exactly 967,477 words. Based on the concept of foreign language learners’ corpora, it is the first corpus of this type in Slovenia. It was compiled to enable research into the written language competence of the school population and has already been used to create language resources, such as the Pedagogical Grammar Portal.

Lektor
Corpus of copy-edited texts
Lektor is an extensive collection of copy-edited original texts and translations and is intended for anyone who is interested in the process of copy-editing. This type of corpus makes it possible to see the most frequent language errors in Slovene (excluding preferential and stylistic corrections). It includes modern non-literary, mostly technical and popular-science texts, all written by different authors and corrected by different copy-editors. It contains 30,258 copy-edits, which are divided into 5 main categories (style, morphology, orthography, syntax and pragmatics) and 50 subcategories.

KoRP
Corpus of written texts on public relations
KoRP is a synchronic monolingual corpus of written texts on public relations. It was compiled at the Faculty of Social Sciences at the University of Ljubljana. The corpus contains 1.8 million words from texts published between 1994 and 2007. During the Termis project, it served as a basis for the terminology data banks for public relations.