About the KOST corpus

KOST (Korpus slovenščine kot tujega jezika), the corpus of Slovene as a foreign language, is a digital collection of texts written by adult speakers whose first language is not Slovene. The corpus offers insight into Slovene as produced by people who are still learning it as a second or foreign language, and in particular into the most common errors that occur in this process. KOST is therefore intended for everyone who works with Slovene as a second or foreign language.

What does KOST consist of?

The current version, KOST 2.0, was published in November 2023. It consists of 1,514,476 tokens in 8,347 texts. The texts were mainly written at lectorates and in courses of Slovene as a second or foreign language (L2/FL). Most of the authors speak Serbian, Bosnian or Macedonian as their first language, but texts by speakers of other languages are also included.

The authors are at different proficiency levels in Slovene, from beginners to advanced.

First languages

Albanian, Bosnian, Bulgarian, Chinese, Croatian, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Igbo, Indonesian, Italian, Japanese, Kiruna, Korean, Kyrgyz, Macedonian, Montenegrin, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian


Mojca Stritar Kučuk