Advanced grammatical analysis of multilingual corpora
By Matej Klemen
In recent times, linguistics has seen a transition from intuition-based research to data-driven approaches, fueled by the advent of large-scale corpora and advanced computational tools. This shift has led to significant new discoveries about language structure and use, particularly in the field of descriptive and comparative grammar analysis.
However, traditional corpus-based methods remain labour-intensive and often involve the use of a specialised tool or the creation of a custom programming script, adding complexity and requiring additional non-linguistic expertise from the end user. The emergence of large language models (LLMs) with reasoning capabilities offers an opportunity to simplify this workflow and enhance linguistic analysis, potentially uncovering previously unidentified patterns of language use.
As part of the LLM4DH project, we will develop a novel approach to grammatical analysis of multilingual corpora by augmenting state-of-the-art LLMs with Universal Dependencies (UD) data. The UD initiative (de Marneffe et al. 2021) has created the largest grammatically annotated dataset to date, encompassing treebanks for more than 160 languages worldwide, and has been instrumental in advancing research in linguistic typology (e.g., Levshina 2022) and other grammar-related disciplines. UD focuses on syntactic structure: how words in a sentence are connected and what grammatical roles they play, such as subject, object, or modifier. The goal is to enable analysis and comparison of sentence structure across many different languages using the same system of labels and rules. To illustrate the idea, consider the question of the (dominant) word order in a language: which ordering of the Subject (e.g., I), Verb (e.g., ate), and Object (e.g., a pie) occurs most frequently. Figure 1 illustrates the contrast between a traditional approach to answering this question and our proposed approach.
Figure 1: A schema illustrating the contrast in answering the dominant word order query using a traditional approach (top) and our proposed approach (bottom). In our system, the data selection and intermediate analysis are done "under the hood", saving the user the time spent setting up a workflow. The intermediate steps are nevertheless available to the user, enabling their inspection and potential correction.
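Before turning to the system itself, it helps to see how such a word order question can be read directly off UD annotations. The sketch below assumes the conllu Python package and a tiny hand-written CoNLL-U example; the word_order helper is ours, purely for illustration.

```python
from conllu import parse

# A tiny hand-annotated sentence in CoNLL-U format ("I ate a pie").
CONLLU_EXAMPLE = """\
# text = I ate a pie
1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tate\teat\tVERB\t_\t_\t0\troot\t_\t_
3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_
4\tpie\tpie\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def word_order(sentence):
    """Return e.g. 'SVO', based on the linear positions of the nsubj and obj
    dependents of the root verb; None if the clause is not transitive."""
    root = next(tok for tok in sentence if tok["deprel"] == "root")
    subj = next((tok for tok in sentence
                 if tok["head"] == root["id"] and tok["deprel"].startswith("nsubj")), None)
    obj = next((tok for tok in sentence
                if tok["head"] == root["id"] and tok["deprel"] == "obj"), None)
    if subj is None or obj is None:
        return None
    roles = sorted([("S", subj["id"]), ("V", root["id"]), ("O", obj["id"])],
                   key=lambda pair: pair[1])
    return "".join(label for label, _ in roles)

sentence = parse(CONLLU_EXAMPLE)[0]
print(word_order(sentence))  # -> SVO
```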
What we aim to do
We will develop an LLM-based system for grammatical analysis of multilingual corpora. Given a user query, the system will automatically determine the steps required to answer it and execute them. This involves tasks such as retrieving relevant examples, analysing them to extract patterns, and summarising the results.
In the dominant word order example for Slovene (see the code sketch after this list):
- the retrieval step would select only examples in Slovene containing a subject, verb, and an object;
- the pattern extraction step would analyse the retrieved examples and count the occurrences for each of the possible word orders (SVO, SOV, VOS, …);
- the summarisation step would provide a short answer, note any deviations, and optionally display illustrative examples alongside.
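As a rough indication of how these three steps could be wired together, the toy sketch below runs the dominant word order query over a handful of made-up examples. All names and data are illustrative placeholders, not part of the actual system, and the heavy lifting of the real pipeline (UD parsing, LLM summarisation) is stubbed out.

```python
from collections import Counter

# Toy stand-in for UD sentences: each entry records the language and the
# S/V/O order of its main clause, as extracted in the previous sketch.
TOY_TREEBANK = [
    {"lang": "sl", "text": "Ana je pojedla pito.", "order": "SVO"},
    {"lang": "sl", "text": "Pito je pojedla Ana.", "order": "OVS"},
    {"lang": "sl", "text": "Maja je spekla torto.", "order": "SVO"},
    {"lang": "en", "text": "Ana ate a pie.", "order": "SVO"},
]

def retrieve_examples(treebank, language):
    """Step 1: keep only transitive clauses in the target language."""
    return [s for s in treebank if s["lang"] == language and s["order"]]

def extract_patterns(examples):
    """Step 2: count how often each word order occurs."""
    return Counter(s["order"] for s in examples)

def summarise(counts):
    """Step 3: a very simple textual summary; the real system would delegate
    this step to an LLM and attach illustrative examples."""
    dominant, freq = counts.most_common(1)[0]
    total = sum(counts.values())
    return f"Dominant order: {dominant} ({freq}/{total} clauses); all counts: {dict(counts)}"

examples = retrieve_examples(TOY_TREEBANK, "sl")
print(summarise(extract_patterns(examples)))
```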
Our goal is to create a system capable of answering diverse queries of varying complexity about languages contained in the massively multilingual UD dataset. Using the system, we will conduct linguistic research on corpus data and evaluate its potential for advancing our knowledge of the grammar of the world's languages.
How?
Figure 2 illustrates the initial system design on the dominant word order example. We will build our system using two key techniques:
- Retrieval-augmented generation (RAG). Although LLMs already contain some knowledge of linguistic concepts, this knowledge is derived from unconstrained and potentially unreliable resources. RAG enables the LLMs to perform analysis on a data subset retrieved from a known and trustworthy resource. In Figure 2, the system retrieves a subset of UD data containing only examples relevant to the dominant word order query.
- Function (tool) calling*. Most off-the-shelf LLMs are convenient generalist question answerers: they can address a diverse set of questions, but often with limited accuracy and reliability. Specialised analysis tools, on the other hand, are reliable but require dedicated technical skills to use. Function calling enables the LLMs to learn when and how to call specialised tools, fusing the best of both worlds: the convenience of querying LLMs and the reliability of specialised analysis tools. In Figure 2, the system calls the specialised STARK analysis tool instead of performing the analysis with an LLM (see the sketch below).
*The implementation of function calling is planned in the next iteration of our system.
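To give an idea of what the planned function calling could look like, the sketch below registers a single tool with an OpenAI-style chat API. The tool schema is a placeholder of our own; it is not the actual STARK interface, nor the final design of our system.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder schema for a STARK-like treebank analysis tool; the real STARK
# options, and our wrapper around them, may look quite different.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "count_word_orders",
        "description": "Count S/V/O word order patterns in a UD treebank.",
        "parameters": {
            "type": "object",
            "properties": {
                "language": {"type": "string", "description": "ISO 639-1 code, e.g. 'sl'"}
            },
            "required": ["language"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the dominant word order in Slovene?"}],
    tools=TOOLS,
)

# If the model decides the tool is needed, it returns the tool name and arguments
# instead of a final answer; the system would then run the analysis (e.g. via STARK)
# and pass the result back to the model for summarisation.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
```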
Figure 2: Initial design of the system for advanced grammatical analysis of multilingual corpora.
Initial challenges
Retrieval of relevant examples. Relevance is a general and sometimes vague notion, and an LLM's interpretation of it may not align with the user's. The system will therefore derive an explicit definition of the relevance criteria from the user query. This will improve both the retrieval accuracy and the transparency of the process for the user.
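One way to make the relevance criteria explicit is to have the LLM first translate the user query into a structured filter that can be shown to the user and applied deterministically. A minimal sketch, again assuming an OpenAI-style API and an illustrative schema of our own:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative prompt: ask the model for a structured retrieval filter rather
# than an answer, so the relevance criteria can be inspected and corrected.
PROMPT = """Turn the user's linguistic query into a JSON retrieval filter with the
fields "language" (ISO code), "required_deprels" (UD relations that must be present)
and "clause_type". Reply with JSON only.

Query: What is the dominant word order in Slovene?"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},
)

criteria = json.loads(response.choices[0].message.content)
print(criteria)  # e.g. {"language": "sl", "required_deprels": ["nsubj", "obj"], ...}
# The filter is shown to the user for inspection before any examples are retrieved.
```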
Accuracy of off-the-shelf LLMs for linguistic tasks. In our initial experiments with linguistic queries from the World Atlas of Language Structures (WALS), we have observed that LLMs struggle with grammatical tasks. To address this, we will work on improved methods for incorporating external knowledge from UD, and continue testing new, constantly improving models.
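As an indication of how such checks can be run, the sketch below compares an LLM's answers against gold values for one WALS feature (81A, order of subject, object and verb). The file and column names are placeholders for a local export of the WALS data, and the prompt format is purely illustrative.

```python
import csv
from openai import OpenAI

client = OpenAI()

def ask_llm(question):
    """Query the model under evaluation; we expect a bare label such as 'SVO'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question + " Answer with one label, e.g. SVO."}],
    )
    return response.choices[0].message.content.strip()

# Assumed local export of WALS feature 81A ("Order of Subject, Object and Verb");
# the file and column names depend on how the data was downloaded.
correct = total = 0
with open("wals_81A.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        total += 1
        gold = row["value"].strip().upper()
        pred = ask_llm(f"What is the dominant word order in {row['language']}?").upper()
        correct += pred == gold

print(f"Accuracy on WALS 81A: {correct}/{total}")
```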
Citation:
Klemen, M. (2025). Advanced grammatical analysis of multilingual corpora. Zenodo. https://doi.org/10.5281/zenodo.15646857
References:
De Marneffe, M. C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255-308.
Levshina, N. (2022). Corpus-based typology: Applications, challenges and some solutions. Linguistic Typology, 26(1), 129-160.