{"id":956,"date":"2020-03-31T22:48:06","date_gmt":"2020-03-31T20:48:06","guid":{"rendered":"https:\/\/www.cjvt.starkmat.si\/template-projekt\/work-packages\/work-package-1\/"},"modified":"2025-05-14T12:28:37","modified_gmt":"2025-05-14T10:28:37","slug":"work-package-1","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/work-packages\/work-package-2\/","title":{"rendered":"Challenge 2: LLMs for Linguistics and Knowledge Management"},"content":{"rendered":"

Challenge <\/strong>2: LLMs for Linguistics and Knowledge Management<\/strong><\/h1>\n<\/div><\/section><\/div>\n

Slovenian, with a community of two million speakers, is a good example of a less-resourced language, as the comparison provided by the European Language Equality project also shows (Rehm & Way, 2023). The research challenge in this task is to enable the generation of quality semantic descriptions for (severely) less-resourced languages and to enable large-scale language comparisons that provide new insights into the grammar of the world\u2019s languages and to facilitate.<\/p>\n<\/div><\/section><\/div>\n<\/div><\/div><\/div><\/div><\/div>

Task 2.1<\/span><\/span><\/span><\/span><\/a>Task 2.2<\/span><\/span><\/span><\/span><\/a>Task 2.3<\/span><\/span><\/span><\/span><\/a>Yearly reports<\/span><\/span><\/span><\/span><\/a><\/div>
<\/span><\/span>\n

T2.1 LLMs for effective lexicography <\/em><\/strong><\/h3>\n<\/div><\/section>
\n
\n
\n

For the last thirty years, lexicographers describing the lexicon have attempted to use various automated procedures to analyze language data and generate lexicographic descriptions (Atkins and Rundell 2008; Gantar et al. 2016). The latest developments in generative AI have triggered attempts to use the new tools for lexicographic purposes (Lew 2023; Rees et al. 2023; Jakub\u00ed\u010dek & Rundell 2023). These first attempts, however, revealed a significant difference between the ability of LLMs to produce quality lexicographic content for English and for other languages, in particular less-resourced languages or those that are under-represented in LLMs (de Schryver, 2024).<\/p>\n<\/div>\n

The Digital Dictionary Database for Slovene (DDDS) will be improved at various levels of linguistic description using the models produced in T1.1. We will generate morphological and semantic data, focusing on 1) morphological paradigm generation, 2) word-sense discrimination, 3) generation of various types of definitions (semantic indicators, simplified, terminological, etc.), 4) improvement of collocations and examples of use, 5) attribution of labels (stylistic, normative, domain, genre, etc.), 6) description of idiomatic, figurative and metaphorical language, etc. The result will be a significantly improved DDDS, which will in turn be used to improve the models in T1.1. All versions of DDDS will be released as public datasets and via an open-access API.<\/p>\n

<\/div>\n<\/section>\n
\n
<\/div>\n<\/section>\n<\/div><\/section>
\n