Text corpora

TOOLS AND RESOURCES

Corpora are electronic collections of authentic texts, structured according to predefined standards and goals. They come with tools for searching multilayered language data.

Gigafida 2.0

Reference corpus of written standard Slovene

Gigafida 2.0 is an extensive, carefully compiled reference corpus containing 1,134,693,933 words from 38,310 texts written between 1990 and 2018. It is a fundamental data source on modern Slovene, used for linguistic research, language description (dictionaries, grammars), the preparation of learning materials, and the development of various language resources and processes. Unlike previous editions, version 2.0 is a corpus of standard Slovene, which means it mainly contains texts written in the standard language.

Šolar 3.0

Corpus of school written products

Šolar 3.0 contains the same texts as Šolar 2.0, but the format has been significantly improved and is now a specialised TEI XML format for corpora with language corrections. Error annotations in some 350 texts have been manually corrected, and newer versions of the tools for the different levels of linguistic annotation have been used. As a result, the morphosyntactic tags are more reliable and new annotation levels are available, e.g. dependency syntax and named entities. The corpus is available in the CLARIN.SI concordancers, with the students’ source texts and the teacher-corrected texts offered separately.

Šolar 2.0

Corpus of school texts

Šolar 2.0 is the predecessor of Šolar 3.0 and contains the same texts; the 3.0 release improves the format and the linguistic annotation.

Gigafida 1.0

The reference corpus of written standard Slovene

Gigafida is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, all kinds of books (fiction, non-fiction, textbooks), web pages, transcriptions of parliamentary debates and similar. It contains almost 1.2 billion words (1,187,002,502 to be exact), from texts written between 1990 and 2011. The first version of the corpus was built during the project Communication in Slovene.

Kres

Balanced corpus of modern written Slovene

Kres is a balanced corpus sampled from the Gigafida corpus. It contains almost 100 million words (99,831,145 to be exact). The basic sampling units were random paragraphs rather than entire corpus documents, which means individual works are better represented. Compared with Gigafida, Kres is intended for linguistic inquiries that need a reference sample with a well-thought-out, known and balanced structure.
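The idea of using paragraphs rather than whole documents as sampling units can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build Kres: the function name `sample_paragraphs`, the toy documents and the word target are all hypothetical, and the sketch does not balance the sample by genre or source.

```python
import random

def sample_paragraphs(documents, target_words, seed=0):
    """Draw random paragraphs until at least target_words words are collected.

    documents maps a (hypothetical) document id to its list of paragraphs.
    """
    rng = random.Random(seed)
    # Flatten to (doc_id, paragraph) pairs: the sampling unit is the paragraph,
    # so a long document cannot dominate the sample as a single unit.
    pool = [(doc_id, para) for doc_id, paras in documents.items() for para in paras]
    rng.shuffle(pool)
    sample, total = [], 0
    for doc_id, para in pool:
        if total >= target_words:
            break
        sample.append((doc_id, para))
        total += len(para.split())  # crude whitespace word count
    return sample, total

# Invented toy data, for illustration only.
docs = {
    "newspaper-1": ["Prvi odstavek časopisa .", "Drugi odstavek ."],
    "novel-1": ["Dolg odstavek iz romana , ki se nadaljuje .", "Še en odstavek ."],
}
sample, n_words = sample_paragraphs(docs, target_words=8)
```

A real balanced sample would additionally fix the share of each text type in advance and draw paragraphs separately within each type.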

Gos

Corpus of spoken Slovene

Gos includes the transcripts of approximately 120 hours of speech that we are exposed to on a daily basis in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc. All speech is transcribed in two versions, one with pronunciation-based spelling and one with standardized spelling, and the corpus comprises over one million words. The corpus can be searched by means of a web concordancer; furthermore, for all concordances it is possible to listen to the corresponding recordings.
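To show what two parallel transcription levels make possible, the following Python sketch searches by standardized spelling and displays the hit in pronunciation-based spelling, in a simple keyword-in-context layout. It is a hypothetical illustration, not the Gos concordancer: the `kwic` function, the token format and the example utterance are assumptions made for this sketch.

```python
from typing import List, Tuple

# Each token is a pair: (pronunciation-based spelling, standardized spelling).
Token = Tuple[str, str]

def kwic(tokens: List[Token], query: str, window: int = 2) -> List[str]:
    """Match the query on the standardized level, show the pronunciation-based level."""
    lines = []
    for i, (spoken, standard) in enumerate(tokens):
        if standard.lower() == query.lower():
            left = " ".join(p for p, _ in tokens[max(0, i - window):i])
            right = " ".join(p for p, _ in tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{spoken}] {right}".strip())
    return lines

# Invented example utterance, for illustration only.
utterance = [("zdej", "zdaj"), ("pa", "pa"), ("tko", "tako"), ("naredimo", "naredimo")]
hits = kwic(utterance, "tako")
print(hits)
```

Searching on the standardized level finds all pronunciation variants of a word with one query, while the displayed context keeps the form that was actually spoken.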

Šolar 1.0

Corpus of school texts

The Šolar corpus includes authentic texts written by Slovene primary and secondary school pupils. It contains about one million words (967,477 to be exact). Based on the concept of foreign language learners’ corpora, it is the first corpus of this type in Slovenia. It was compiled to enable research into the written language competence of the school population and has already been used to create language resources, such as The pedagogical grammar portal.

Lektor

Corpus of copy-edited texts

Lektor is an extensive collection of copy-edited texts and translations and is intended for anyone who is interested in the process of copyediting. This type of corpus shows the most frequent language errors in Slovene (excluding preferential and stylistic corrections). It includes modern non-literary, mostly technical and popular-science texts, which were written by different authors and corrected by different copyeditors. It contains 30,258 copyedits, which are divided into 5 main categories (style, morphology, orthography, syntax and pragmatics) and 50 subcategories.

KoRP

Corpus of written texts on public relations

KoRP is a synchronic monolingual corpus of written texts on public relations. It was compiled at the Faculty of Social Sciences at the University of Ljubljana. The corpus contains 1.8 million words from texts published between 1994 and 2007. During the Termis project, it served as a basis for the terminology data banks for public relations.
