Challenge 2: LLMs for Linguistics and Knowledge Management

Slovenian, with a community of about two million speakers, is a good example of a less-resourced language, as the comparison provided by the European Language Equality project also showed (Rehm & Way, 2023). The research challenge is to enable the generation of quality semantic descriptions for (severely) less-resourced languages, and to enable large-scale language comparisons that provide new insights into the grammar of the world’s languages and facilitate the linguistic analysis of language corpora in general.

T2.1 LLMs for effective lexicography

For the last thirty years, lexicographers working on the description of the lexicon have attempted to use various automated procedures to analyze language data and generate lexicographic descriptions (Atkins and Rundell 2008; Gantar et al. 2016). The latest developments in generative AI have triggered attempts to use the new tools for lexicographic purposes (Lew 2023; Rees et al. 2023; Jakubíček & Rundell 2023). These first attempts, however, revealed a significant difference between the ability of LLMs to produce quality lexicographic content for English and for other languages, in particular for less-resourced languages and those that are under-represented in LLMs (de Schryver, 2024).

The Digital Dictionary Database for Slovene (DDDS) will be improved at various levels of linguistic description using the models produced in T1.1. We will generate morphological and semantic data, focusing on: 1) morphological paradigm generation; 2) word-sense discrimination; 3) generation of various types of definitions (semantic indicators, simplified, terminological, etc.); 4) improvement of collocations and examples of use; 5) attribution of labels (stylistic, normative, domain, genre, etc.); and 6) description of idiomatic, figurative and metaphorical language. The result will be a significantly improved DDDS, which will in turn be used to improve the models in T1.1. All versions of DDDS will be released as public datasets and through an open-access API.

Deliverables 2.1: DDDS with generated lexicographic data – first version (M24). DDDS with generated lexicographic data – final version (M36).

T2.2 Neural spell- and grammar checking

The development of neural grammar correction, particularly for less-resourced languages, often relies on synthetic data, such as generated examples of erroneous language use. While useful for addressing data sparsity, this approach lacks authenticity and contextual richness, which leads to suboptimal performance in practical applications. The issue is especially problematic in educational settings, where accurate and contextually relevant corrections are essential for effective learning and user trust. To address this challenge, we propose methodologies that combine the strengths of synthetic and authentic language data for grammar checking.

We will use data from the error-annotated Lektor, KOST, and Šolar corpora (the latter detailing 180 different types of errors) for advanced LLM-based synthesis of examples with linguistic errors. For each error type, we will test different parameters, such as the type of input and the wording of the prompt, and experiment with different methods of error insertion. We will produce synthetic data iteratively, refining our approach through linguistic evaluation and the fine-tuning of Slovene grammatical error detectors to determine which configurations yield the most realistic outcomes. Next, we will create high-quality reference evaluation datasets covering various types of Slovene texts. Besides school essays, we will cover other text genres to produce authentic open-source datasets with texts by adult L1 and L2 writers.
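The per-error-type parameter testing described above could be organised as a grid of prompt configurations. The following is only a minimal sketch under stated assumptions: the error-type labels, input kinds, prompt wordings, and the names `PromptConfig`, `build_prompt_grid`, and `render` are all hypothetical illustrations, not categories or tooling from the Lektor, KOST, or Šolar corpora, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PromptConfig:
    """One experimental configuration for synthesising an erroneous example."""
    error_type: str   # illustrative label, not an actual corpus category
    input_kind: str   # e.g. "sentence" or "paragraph"
    wording: str      # template with {error_type}, {input_kind}, {text}


def build_prompt_grid(error_types, input_kinds, wordings):
    """Return one PromptConfig per combination of the three parameters."""
    return [
        PromptConfig(error_type, input_kind, wording)
        for error_type, input_kind, wording in product(
            error_types, input_kinds, wordings
        )
    ]


def render(config, text):
    """Fill the template; the result would then be sent to an LLM."""
    return config.wording.format(
        error_type=config.error_type,
        input_kind=config.input_kind,
        text=text,
    )


# Hypothetical values, chosen only to make the sketch runnable.
ERROR_TYPES = ["missing comma", "wrong case ending"]
INPUT_KINDS = ["sentence", "paragraph"]
WORDINGS = [
    "Insert one '{error_type}' error into this {input_kind}: {text}",
    "Rewrite this {input_kind} as a learner who makes a '{error_type}' error: {text}",
]

grid = build_prompt_grid(ERROR_TYPES, INPUT_KINDS, WORDINGS)
print(len(grid))
print(render(grid[0], "Danes je lep dan."))
```

Keeping each configuration as an immutable record makes it straightforward to log which combination produced which synthetic example, so that later linguistic evaluation can be traced back to a specific setting.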

Deliverables 2.2: Synthetic language error datasets (M12). Grammar checking LLMs (M18). Authentic grammar checking evaluation datasets (M24).

T2.3 Advanced grammatical analysis of multilingual corpora

In recent decades, linguistics has seen a revolutionary transition from intuition-based research to data-driven approaches, fueled by the advent of large-scale corpora and advanced computational tools. This shift has led to significant new discoveries about language structure and use, particularly in the field of descriptive and comparative grammar analysis. However, traditional corpus-based methods remain labor-intensive and implicitly rely on pre-existing linguistic assumptions guiding the extraction of relevant patterns from corpora and their subsequent analysis. The emergence of LLMs with sophisticated reasoning capabilities offers a groundbreaking opportunity to enhance and expand these methods by streamlining and accelerating corpus linguistic analysis, as well as potentially uncovering previously unidentified patterns of language use. We will develop a novel approach to grammatical analysis of multilingual corpora by fine-tuning state-of-the-art LLMs on the Universal Dependencies (UD) dataset, which provides large-scale, reliable morphosyntactic annotations for numerous world languages. We will systematically evaluate the potential of such LLMs enhanced with explicit grammatical knowledge to provide new insights into the grammar of the world’s languages and to facilitate the linguistic analysis of language corpora in general.

We will develop and evaluate a new method for LLM-based grammatical analysis of multilingual corpora. First, we will fine-tune a massively multilingual LLM, such as Llama 3 or T5-XXL, on the massively multilingual UD dataset. Second, we will construct a multi-layered dataset of selected state-of-the-art linguistic findings for three typical corpus linguistic tasks: data annotation, pattern extraction, and data summarization. Third, we will quantitatively and qualitatively evaluate the capabilities and limitations of the new multilingual linguistic LLM on these tasks, also accounting for different prompting strategies. The new model will provide novel linguistic insights into the world languages covered by the UD dataset and support grammatical analysis of language corpora in general.
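As an illustration of the first step, UD treebanks are distributed in the ten-column, tab-separated CoNLL-U format, which has to be turned into text pairs before an LLM can be fine-tuned on it. The sketch below shows one possible conversion under stated assumptions: the prompt wording, the target layout, and the helper names `parse_conllu` and `to_training_pair` are hypothetical and are not the project's actual method. Multiword-token ranges and empty nodes are skipped for simplicity.

```python
def parse_conllu(block):
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in block.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        cols = line.split("\t")
        # Keep plain tokens only; IDs such as "1-2" or "1.1" are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append({
            "id": cols[0], "form": cols[1], "upos": cols[3],
            "head": cols[6], "deprel": cols[7],
        })
    if tokens:
        sentences.append(tokens)
    return sentences


def to_training_pair(tokens):
    """Build a hypothetical prompt/target pair from one parsed sentence."""
    prompt = (
        "Annotate each token with UPOS, head index and dependency relation:\n"
        + " ".join(t["form"] for t in tokens)
    )
    target = "\n".join(
        "\t".join([t["id"], t["form"], t["upos"], t["head"], t["deprel"]])
        for t in tokens
    )
    return {"prompt": prompt, "target": target}


# A tiny hand-written example sentence in CoNLL-U layout.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

pairs = [to_training_pair(s) for s in parse_conllu(SAMPLE)]
print(pairs[0]["prompt"])
print(pairs[0]["target"])
```

The same pairs could be reused with different prompt wordings at evaluation time, which fits the plan to compare prompting strategies.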

Deliverables 2.3: LLM with improved grammatical knowledge (M12). Dataset for evaluating grammatical knowledge of LLMs (M18). Multilingual and cross-lingual grammatical analyses (M36).
