{"id":1137,"date":"2024-12-24T10:59:51","date_gmt":"2024-12-24T09:59:51","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?page_id=1137"},"modified":"2025-05-14T12:28:17","modified_gmt":"2025-05-14T10:28:17","slug":"challenge-1","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/work-packages\/work-package-1\/","title":{"rendered":"Challenge 1: Improving LLMs with linguistic resources and development of vision-language models"},"content":{"rendered":"

Challenge 1: Improving LLMs with linguistic resources and development of vision-language models<\/strong><\/h1>\n<\/div><\/section><\/div>\n

LLMs require large amounts of high-quality textual data, both for training and for fine-tuning on specific tasks. High-quality lexicographic data can support LLM pretraining because it can be turned into different types of training data, in particular knowledge graphs and raw text. The information available in lexicographic resources includes relations, information on sense distribution together with definitions of word senses, cross-lingual connections, and the identification and description of idiomatic or figurative expressions, among others.<\/p>\n

This trove of information is not yet adequately exploited by LLMs, but it could reduce their hallucinations, improve their language proficiency in complex situations and in less-resourced languages, and improve the fine-tuning of LLMs for specific important tasks, such as commonsense reasoning and natural language inference.<\/p>\n<\/div><\/section><\/div>\n<\/div><\/div><\/div><\/div><\/div>

Task 1.1<\/span><\/span><\/span><\/span><\/a>Task 1.2<\/span><\/span><\/span><\/span><\/a>Task 1.3<\/span><\/span><\/span><\/span><\/a>Yearly reports<\/span><\/span><\/span><\/span><\/a><\/div>
<\/span><\/span>\n

Task 1.1:\u00a0Improving LLMs with linguistic data<\/em><\/strong><\/span><\/h3>\n<\/div><\/section>
\n
\n
\n

The first challenge will address LLM improvements with monolingual lexicographic data in the form of knowledge graphs and raw texts.<\/p>\n<\/div>\n<\/section>\n

\n
\n

This task will first develop a novel methodology, suitable for morphologically rich languages, for extracting knowledge graphs from monolingual linguistic knowledge sources such as digital dictionary databases. Further, the task will develop methods for generating large quantities of raw text from these sources. The methodology will be applied to the Digital Dictionary Database for Slovene (DDDS), the largest open-access lexical\/lexicographic resource for the Slovene language, and to the Open Slovene WordNet (\u010cibej et al. 2023), which is already linked with the DDDS.<\/p>\n<\/div>\n<\/section>\n<\/div><\/section>
\n