{"id":1700,"date":"2025-07-16T09:51:11","date_gmt":"2025-07-16T07:51:11","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?p=1700"},"modified":"2025-07-16T15:01:14","modified_gmt":"2025-07-16T13:01:14","slug":"1700","status":"publish","type":"post","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/mproving-linguistic-data-with-llms\/","title":{"rendered":"Improving Linguistic Data with LLMs"},"content":{"rendered":"
Large language models (LLMs) are revolutionising the way we access information, communicate and work. In addition to everyday applications, LLMs are also reshaping scientific fields such as language studies, humanities and social sciences. However, although their capabilities are diverse, LLMs still have their limitations: they can provide inconsistent or incorrect answers, require significant computational resources, perform poorly on less-resourced languages and struggle with tasks involving social understanding, ethics and human needs.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n A promising way to improve LLM performance is to use high-quality lexicographic data. Such data can support LLM pre-training by providing both raw text and structured information, including synonymy, antonymy, hyponymy, hypernymy, meronymy, holonymy, sense distributions, idiomatic expressions and cross-linguistic distributions. Despite its potential, this rich linguistic knowledge is not yet fully utilised in existing LLMs.<\/span><\/p>\n By integrating this type of data into the development of LLMs, we can reduce hallucinations, improve language proficiency in complex contexts, and strengthen fine-tuning for tasks such as commonsense reasoning and natural language inference. Our project focuses on applying this approach to Slovenian \u2014 a less-resourced, morphologically rich language that lacks the digital, educational, and institutional support that global languages such as English enjoy.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\n We have developed a novel methodology for extracting knowledge graphs from digital linguistic databases that is tailored to morphologically complex languages. 
Specifically, we applied this methodology to the Digital Dictionary Database for Slovene (DDD), the largest freely accessible lexical-lexicographical resource for Slovene, and several other structured Slovene lexicographic resources.<\/span><\/p>\n The resulting corpus comprises 356,294 words and was built from single-lexeme entries sourced from the DDD. Only individual words (no multi-word expressions) were included to ensure a clear lexical focus. For each word, all morphological forms were listed using data from DDD. Definitions of word senses were collected from multiple sources\u2014SSKJ, sloWnet, and the Bridge Dictionary\u2014while semantic indicators from DDD were used when full definitions weren\u2019t available. Usage examples were included where present (from SSKJ and DDD). Common collocations were added based on DDD data. Synonyms were sourced from a dedicated synonyms dictionary, often grouped by word sense and labelled with semantic indicators where possible (see Picture 1).<\/span><\/p>\n The corpus is saved as a structured markdown file, designed for both human readability and machine parsing, and is freely available for everyone to use.<\/span><\/p>\n<\/div><\/section><\/p><\/div>\nAddressing the Research Challenge<\/h3>\n<\/div><\/section>
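The exact schema of the corpus's markdown file is not shown in this post. As an illustration only, here is a minimal sketch of how a single-lexeme entry (morphological forms, sense definitions, usage examples, collocations, synonyms) might be serialised into such a file; the field names and the sample entry are hypothetical, not the actual DDD schema:

```python
# Hypothetical sketch: serialise one lexeme entry into a markdown block.
# Field names and the sample data are illustrative, not the real corpus format.

def entry_to_markdown(entry):
    lines = [f"# {entry['lemma']}", "## Forms", ", ".join(entry["forms"])]
    for i, sense in enumerate(entry["senses"], start=1):
        lines.append(f"## Sense {i}")
        lines.append(sense["definition"])
        if sense.get("examples"):
            lines.append("### Examples")
            lines.extend(f"- {ex}" for ex in sense["examples"])
        if sense.get("synonyms"):
            lines.append("### Synonyms")
            lines.append(", ".join(sense["synonyms"]))
    if entry.get("collocations"):
        lines.append("## Collocations")
        lines.extend(f"- {c}" for c in entry["collocations"])
    return "\n".join(lines)

# Illustrative entry for the Slovene noun "miza" ("table").
entry = {
    "lemma": "miza",
    "forms": ["miza", "mize", "mizi", "mizo"],
    "senses": [{
        "definition": "a piece of furniture with a flat top",
        "examples": ["sedeti za mizo"],
        "synonyms": [],
    }],
    "collocations": ["pisalna miza"],
}
print(entry_to_markdown(entry))
```

A flat markdown layout like this keeps every field human-readable while remaining trivial to split back into structured records by heading level.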
\nOur Approach: Extracting Knowledge Graphs from Lexical Resources<\/b><\/h3>\n<\/div><\/section>
\n
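The extraction pipeline itself is not detailed in this post. As a minimal, hypothetical sketch of the core idea behind turning lexical resources into knowledge graphs, the relations listed earlier (synonymy, hypernymy, hyponymy, and so on) can be represented as subject-relation-object triples; the relation names and the tiny sample lexicon below are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch: converting dictionary-style lexical relations into
# knowledge-graph triples (subject, relation, object). Relation names and
# the sample lexicon are illustrative only.

SAMPLE_LEXICON = {
    "pes": {                      # "dog"
        "synonyms": ["kuža"],
        "hypernyms": ["žival"],   # "animal"
    },
    "žival": {
        "hyponyms": ["pes", "mačka"],
    },
}

# Directed relation labels, so each triple reads subject -> relation -> object.
RELATION_MAP = {
    "synonyms": "has_synonym",
    "hypernyms": "has_hypernym",
    "hyponyms": "has_hyponym",
}

def extract_triples(lexicon):
    """Flatten a lemma -> relations mapping into a set of triples."""
    triples = set()
    for lemma, relations in lexicon.items():
        for rel_key, targets in relations.items():
            for target in targets:
                triples.add((lemma, RELATION_MAP[rel_key], target))
    return triples

for triple in sorted(extract_triples(SAMPLE_LEXICON)):
    print(triple)
```

Representing the data as explicit triples is what makes it usable as a graph: the same set can be loaded into a triple store, queried for paths (e.g. chains of hypernyms), or linearised back into text for LLM pre-training.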