Challenge 1: Improving LLMs with linguistic resources and development of vision-language models

Task 1.1: Improving LLMs with linguistic data
The first challenge will address LLM improvements with monolingual lexicographic data in the form of knowledge graphs and raw texts.

This task will first develop a novel methodology for extracting knowledge graphs from monolingual linguistic knowledge sources such as digital dictionary databases, suitable for morphologically rich languages. Further, the task will develop methods to generate large quantities of raw text from these sources. The methodology will be applied to the Digital Dictionary Database for Slovene (DDDS), the largest open-access lexical/lexicographic resource for the Slovene language, and to Open Slovene WordNet (OSWN; Čibej et al., 2023), which is already linked with the DDDS.
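To make the pipeline concrete, the sketch below shows one plausible shape for it: dictionary senses are mapped to subject-relation-object triples, which are then verbalized into raw training text. The entry schema, relation names, and templates are illustrative assumptions, not the actual DDDS or OSWN data model.

```python
# Illustrative sketch: turning dictionary entries into KG triples and then
# into raw text. The entry schema (headword, pos, definition, synonyms,
# hypernym) is a hypothetical simplification, not the real DDDS/OSWN schema.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Entry:
    headword: str
    pos: str
    definition: str
    synonyms: list = field(default_factory=list)
    hypernym: Optional[str] = None

def entry_to_triples(entry: Entry) -> list:
    """Map one dictionary entry to (subject, relation, object) triples."""
    triples = [(entry.headword, "has_pos", entry.pos),
               (entry.headword, "has_definition", entry.definition)]
    triples += [(entry.headword, "synonym_of", s) for s in entry.synonyms]
    if entry.hypernym:
        triples.append((entry.headword, "is_a", entry.hypernym))
    return triples

def verbalize(triples) -> list:
    """Render triples as plain sentences usable as raw pretraining text."""
    templates = {
        "has_pos": "'{s}' is a {o}.",
        "has_definition": "'{s}' means: {o}.",
        "synonym_of": "'{s}' is a synonym of '{o}'.",
        "is_a": "A '{s}' is a kind of '{o}'.",
    }
    return [templates[r].format(s=s, o=o) for s, r, o in triples]

# Toy Slovene entry: 'pes' (dog), hypernym 'žival' (animal), synonym 'kuža'.
entry = Entry("pes", "noun", "a domesticated canine kept as a pet or for work",
              synonyms=["kuža"], hypernym="žival")
for sentence in verbalize(entry_to_triples(entry)):
    print(sentence)
```

In practice the verbalization templates would be written in Slovene and would have to handle the rich inflectional morphology that motivates the methodology in the first place.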
Deliverables 1.1: DDDS and OSWN datasets ready for training (M6). Initial improved LLM (M12). Final improved LLM (M24).

Task 1.2: Improving LLMs with cross-lingual data
Large quantities of multilingual and cross-lingual linguistic data offer an opportunity to improve LLMs through further pretraining and fine-tuning for specific tasks. The knowledge-enhanced multilingual LLMs will improve both in a general sense and particularly for the languages whose resources were injected. Such knowledge-enhanced LLMs will improve cross-lingual transfer via fine-tuning and prompt engineering, including multilingual prompt engineering. Such transfer is highly relevant for less-resourced languages such as Slovene.

We will develop a novel methodology for extracting knowledge graphs from multilingual digital dictionary databases such as Wiktionary and BabelNet, and from other cross-lingual resources such as DBpedia and linked WordNets. The methodology will be suitable for morphologically rich languages. Further, the task will develop methods to generate large quantities of raw text from multilingual digital dictionary databases. The extracted KGs and raw texts will be used for further pretraining of LLMs for general use. The instruction-following datasets produced by the tasks in WP2 and WP5 will be used to adapt the LLMs to specific cross-lingual and multilingual linguistic tasks.
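As a hedged illustration of how the extracted material could be consumed, the sketch below runs continued (further) pretraining on verbalized cross-lingual triples with the Hugging Face transformers API. The base model, the example sentences, and the hyperparameters are stand-ins for illustration, not project decisions.

```python
# Minimal sketch of continued pretraining on verbalized KG text with
# Hugging Face transformers. The base model is a stand-in: any multilingual
# causal LM could take its place.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloom-560m"  # placeholder multilingual causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Invented examples of verbalized cross-lingual triples: mixing languages in
# one context exposes the model to explicit cross-lingual links.
texts = [
    "'pes' (sl) and 'dog' (en) name the same concept: a domesticated canine.",
    "A 'pes' is a kind of 'žival'; a 'dog' is a kind of 'animal'.",
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kg-pretrained", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),
)
trainer.train()
```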
Deliverables 1.2: KG and raw-text datasets (M18). Initial improved LLMs (M24). Final improved LLMs (M30).

Task 1.3: Improving multimodal models
Multimodal language models, such as vision-language models (VLMs), are rare and difficult to produce for less-resourced languages, as the resources for matching different modalities are very scarce, and obtaining them with machine translation is mostly inadequate. The challenge is to effectively create VLMs supporting less-resourced languages and domains, such as Slovene, and to advance methodologies for effective VLM training with fewer resources via different methods for aligning modalities, such as alternatives to contrastive learning (Radford et al., 2021), relative representations (Norelli et al., 2023), and novel methods for background knowledge grounding.

We will build a dataset for VLM construction, namely an image-text dataset containing images with Slovenian captions. We expect this open-source dataset to contain between 100 thousand and 1 million image-text pairs, the minimum required for successful model training. Next, we will construct a VLM for Slovenian, focusing on effective VLM training methods for aligning modalities, such as contrastive learning (Radford et al., 2021), relative representations (Norelli et al., 2023), and novel methods for background knowledge grounding.
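For concreteness, the snippet below sketches the symmetric contrastive objective from CLIP (Radford et al., 2021), the baseline alignment method the task starts from. The embedding dimension, batch size, and temperature are illustrative defaults, and the image and text encoders are stubbed with random tensors.

```python
# Sketch of the CLIP-style contrastive objective (Radford et al., 2021) used
# as a baseline for aligning image and text modalities. Encoders are stubbed;
# any image/text backbone could stand in for them.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0))           # i-th image <-> i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 4 images paired with 4 Slovene captions, embedded into 512 dims.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```

With only 100 thousand to 1 million pairs available, alternatives such as relative representations can reuse pretrained unimodal encoders rather than training both towers from scratch, which is one reason the task explores them.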
Deliverables 1.3: Slovene dataset for training VLMs (M12). Slovene VLM (M24).