Challenge 1: Improving LLMs with linguistic resources and developing vision-language models

LLMs require large amounts of high-quality textual data, both for pretraining and for fine-tuning on specific tasks. High-quality lexicographic data can support pretraining by yielding several types of training data, most notably knowledge graphs and raw text. The information available in lexicographic resources includes lexical-semantic relations, sense inventories with definitions and information on sense distribution, cross-lingual connections, and the identification and description of idiomatic or figurative expressions.
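As a concrete illustration, the sketch below converts entries from NLTK's WordNet, used here as a stand-in for a richer lexicographic resource, into the two kinds of training data mentioned above: knowledge-graph triples and verbalized raw text. The relation names and verbalization templates are illustrative assumptions, not a fixed standard.

```python
# A minimal sketch: turning lexicographic entries into LLM training data.
# WordNet stands in for a richer dictionary; assumes nltk is installed and
# the corpus has been fetched via nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def entry_to_triples(synset):
    """Emit knowledge-graph triples (head, relation, tail) for one sense."""
    name = synset.lemma_names()[0]
    for hyper in synset.hypernyms():
        yield (name, "is_a", hyper.lemma_names()[0])  # relation name is an illustrative choice
    for lemma in synset.lemmas():
        for ant in lemma.antonyms():
            yield (lemma.name(), "antonym_of", ant.name())

def entry_to_text(synset):
    """Verbalize a sense as raw pretraining text: definition plus usage examples."""
    name = synset.lemma_names()[0].replace("_", " ")
    sentences = [f"{name}: {synset.definition()}."]
    sentences += [f'Example: "{ex}"' for ex in synset.examples()]
    return " ".join(sentences)

for synset in wn.synsets("bank"):
    for triple in entry_to_triples(synset):
        print(triple)
    print(entry_to_text(synset))
```

The same transformation carries over to richer resources: sense-distribution information could weight how often each verbalization is sampled into the pretraining mix, and cross-lingual connections could produce parallel variants for less-resourced languages.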

This trove of information is not yet adequately exploited by LLMs. Incorporating it could reduce hallucinations, improve language proficiency in complex contexts and in less-resourced languages, and strengthen fine-tuning for important tasks such as commonsense reasoning and natural language inference.
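One concrete route to such fine-tuning data is to derive labelled examples directly from lexical relations: in an upward-monotone context, hypernymy licenses an entailment label ("I saw a sparrow" entails "I saw a bird"). The sketch below, again using WordNet with illustrative sentence templates and label names, generates natural language inference pairs along these lines.

```python
# A hedged sketch of deriving NLI fine-tuning pairs from lexical relations.
# Templates, labels, and the neutral reverse direction are illustrative
# assumptions; a production pipeline would filter senses and vary contexts.
from nltk.corpus import wordnet as wn

def nli_pairs_from_hypernymy(word, max_pairs=6):
    """Build (premise, hypothesis, label) tuples from WordNet hypernymy."""
    pairs = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        term = synset.lemma_names()[0].replace("_", " ")
        for hyper in synset.hypernyms():
            parent = hyper.lemma_names()[0].replace("_", " ")
            # Hypernymy licenses entailment from specific to general ...
            pairs.append((f"I saw a {term}.", f"I saw a {parent}.", "entailment"))
            # ... but not the reverse, which we conservatively label neutral.
            pairs.append((f"I saw a {parent}.", f"I saw a {term}.", "neutral"))
            if len(pairs) >= max_pairs:
                return pairs
    return pairs

for premise, hypothesis, label in nli_pairs_from_hypernymy("sparrow"):
    print(f"{premise} -> {hypothesis} [{label}]")
```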