PROJECT DESCRIPTION

The Gigafida, Kres, ccGigafida and ccKres corpora form the basis for the development of modern language handbooks and language technologies for Slovene. Gigafida and Kres have user-friendly interfaces and are frequently used by linguists, translators, editors, proofreaders, teachers and similar user groups. These corpora are essential for language research and development; however, they can only serve their purpose if they are continually updated and upgraded.

The project Upgrade of Gigafida, Kres, ccGigafida and ccKres is financed by the Ministry of Culture under contract no. 33400-15-141007 between the Ministry and the University of Ljubljana for the period 2015–2018. It is run by the Centre for Language Resources and Technologies and has three objectives: targeted acquisition of new materials, machine processing of new and existing materials, and public availability and dissemination of the upgraded corpora.

We will focus on types of texts that are currently underrepresented in Gigafida and Kres, mainly school reading materials and other popular literature. In addition, we will add texts from selected news websites, which will make the corpus data more up to date. The new materials will enlarge the existing corpora by about a quarter, which in the case of Gigafida means growth from 1.2 to around 1.5 billion words. The technical aspects will be updated as well: we will develop tools for removing surplus copies of texts, improve the accuracy of linguistic annotation, and separate standard-language texts from texts that deviate from the linguistic standard, placing the two groups in distinct subcorpora.
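
To illustrate what removing surplus copies of texts can involve, here is a minimal sketch of one common approach: comparing texts by their overlapping word sequences (shingles) and dropping any text that is too similar to one already kept. This is an assumption made for illustration only; it is not the method used in the project, and the function names, the shingle length `k` and the similarity `threshold` are all hypothetical.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words are represented by a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, k=5, threshold=0.8):
    """Keep the first copy of each text; skip later texts that are near-duplicates.

    The values of k and threshold are illustrative assumptions, not project settings.
    """
    kept = []
    kept_shingles = []
    for text in texts:
        current = shingles(text, k)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # too similar to a text we already kept
        kept.append(text)
        kept_shingles.append(current)
    return kept


docs = [
    "The corpus is updated every year with new texts.",
    "The corpus is updated every year with new texts!",
    "School reading materials are underrepresented in the corpus.",
]
unique_docs = drop_near_duplicates(docs, k=2, threshold=0.8)
```

This pairwise comparison grows quadratically with the number of texts, so a corpus of more than a billion words would need a scalable variant (for example, hashing-based candidate selection) rather than this direct form.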