Challenge 5: Selected DH Challenges
Digital humanities is a broad research area that incorporates many humanities and social sciences disciplines. In this challenge, we address three selected, highly impactful challenges.
DH research on discourse in large language corpora has traditionally relied on unsupervised text classification techniques such as topic modeling. However, many widely used techniques are unstable and prone to overfitting. This is especially problematic for uncovering latent features of discourse, such as its ideological underpinnings, which require complex linguistic evidence. To address this challenge, we aim to combine LLM-generated knowledge with named-entity graphs. Because such graphs consist of interconnected sets of dynamic relations between (named) entities (Hogan et al. 2021), they can be used effectively to integrate and conceptualize underlying discursive phenomena such as ideologies (van Dijk 2017). We aim to build LLM-driven knowledge graphs for critical discourse analysis and to reconstruct historical identities from Slovenian historical newspapers, which served as key instruments of political, social, and institutional power (van Dijk 2013). This will enable diachronic analysis of ideological changes and the attendant lexical-semantic shifts in historical newspaper discourse.
First, we will use named-entity graphs from T4.1 to explore the relationships between people, places, and organizations in sPeriodika 1.0, a corpus of Slovenian historical periodicals (1771–1914). We will apply a mixed-methods approach to analyze the named-entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation will focus on the emergence and development of intertwined historical identities: national, language, political, socio-economic, and religious. Second, as relations between historical identities can undergo semantic shifts through time, we will use diachronic analysis from T4.2 to study the attitudes, prejudices, and ideologies of social elites in different time periods. We will use the graphs as proxies for dynamically changing identities to investigate the perpetuation of language ideologies and how such ideologies were historically tied to other identity-making aspects of individuals and places. We will also investigate diachronic semantic shifts of the lexical inventory related to the rise and fall of historical nationalisms, focusing on concepts of nationhood. Specifically, we will examine 1) oppositions between Pan-Slavic, Yugoslavian, and Slovenian identities, 2) how such notions were related to the major centers of power of the time (e.g., the Habsburg Empire until the early 20th century), and 3) how such identities incorporate smaller regional ones.
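The quantitative side of this approach can be illustrated with a minimal sketch. It is not the project's pipeline: it assumes that named entities have already been extracted per article, treats co-occurrence within an article as an edge, and compares weighted degree across periods. The function names, the period labels, and the placeholder entities (`Person_A`, `Place_X`, and so on) are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Return {(entity_a, entity_b): weight} counting articles in which both occur."""
    edges = Counter()
    for entities in documents:
        # Deduplicate and sort so each undirected pair has one canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the weights of all edges touching each entity."""
    degree = defaultdict(int)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


# Hypothetical toy input: entity mentions per article, grouped by period.
articles_by_period = {
    "period_1": [["Place_X", "Place_Y", "Person_A"], ["Place_X", "Person_A"]],
    "period_2": [["Place_X", "Place_Z"], ["Place_X", "Place_Z", "Place_Y"]],
}

degree_by_period = {
    period: weighted_degree(build_cooccurrence_graph(docs))
    for period, docs in articles_by_period.items()
}

for period, degrees in degree_by_period.items():
    print(period, sorted(degrees.items()))
```

Comparing such per-period measures is only a starting point; in a mixed-methods setting, the entities whose position changes most would then be handed over for close reading.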
Deliverables 5.1: Novel methodological approaches to historical and ideological analysis using LLMs (M18). Novel analyses of ideological concepts through history (M36).
Fairy tales, folktales, and other ethnographic narrations, captivating narratives woven through time, are not merely timeless stories but reflections of the societies that produced and distributed them. The challenge is to automate the analysis of narrative structure together with the examination of different versions of stories across time and cultures. By developing an LLM-based methodology for computational folkloristics, we aim to enhance the efficiency and accuracy of comparative studies in folkloristics, enabling large-scale diachronic, cross-lingual, and cross-cultural comparisons.
We will design historical corpora of manuscripts, treatises, and ethnographic texts (memoirs, diaries, journals, travelogues, etc.) of different European origins (Spanish, French, German, English, Italian, Slovene, and Croatian) from the 15th to the 19th century, as well as a digitized national archive of Slovene folktales. On these corpora, we will leverage LLMs for a large-scale diachronic, synchronic, and anachronic investigation into changes in the discourse on conflict resolution practices. We will apply LLMs to already digitized old prints and typewritten material to produce good-quality transcriptions. Previously collected illustrations (about 500) of the conflict resolution ritual will be analyzed with the vision-language model from T4.3. Using LLMs adapted for detecting specific motifs and myths, we will conduct a large-scale analysis of vendetta and peace-making rituals as they appear in ethnographic texts and folktales, also covering the motifs of banishment and the character of the outlaw hero. The approach will enable comparative analysis of Slovene and other European tales and myths of conflict resolution.
Deliverables 5.2: Novel methodology for digital folkloristics (M18). Novel analyses of conflict resolution rituals (M36).
Legal processes require searching and processing large amounts of documents. These tasks can be made more efficient and less error-prone by employing LLMs that support legal decision-making and information processing. An important challenge in the legal domain is contradiction detection, as it requires automatic recognition of problematic sections of legal documents. Our objective is to design a solution for legal contradiction detection using a RAG system that will automatically retrieve potentially contradicting documents from a knowledge base and pass them to a contradiction detection model.
LLMs and RAG for contradiction detection from T4.4 will be adapted to the Slovenian legal domain. The system will be capable of managing Slovenian legal texts and extracting semantically and legally relevant information. We will apply the model to the task of recognizing contradictions in legal documents. First, we will gather raw legal data and create a legal database of Slovenian texts. Next, we will conduct a legal evaluation of the essential regulations, legal precedents, and literature that will be used by the RAG system. We will then build the legal retrieval components and vectorize the texts. Finally, we will integrate the gathered knowledge into a RAG system. The comprehensive legal domain solution will support legal research, document analysis, identification of potential risks, legal case support, regulatory compliance, and a Q&A system.
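The retrieve-then-detect flow described above can be sketched as follows. This is an illustration under strong simplifying assumptions, not the planned system: texts are vectorized as plain bag-of-words counts instead of learned embeddings, the contradiction detector is a toy stand-in for an LLM or NLI model, and the example sentences are invented rather than taken from Slovenian law. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Toy bag-of-words vector; a real system would likely use learned embeddings."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, knowledge_base, k=2):
    """Return the k knowledge-base texts most similar to the query."""
    query_vector = vectorize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: cosine(query_vector, vectorize(doc)),
        reverse=True,
    )
    return ranked[:k]


def find_contradictions(query, knowledge_base, detector, k=2):
    """Pass each retrieved candidate to the detector and keep the flagged ones."""
    return [doc for doc in retrieve(query, knowledge_base, k) if detector(query, doc)]


def toy_detector(a, b):
    """Stand-in for a contradiction model: flags texts whose numbers differ."""
    return re.findall(r"\d+", a) != re.findall(r"\d+", b)


# Invented example sentences, for illustration only.
knowledge_base = [
    "The notice period for termination is 30 days.",
    "The notice period for termination is 60 days.",
    "Parking permits are issued by the municipality.",
]
query = "The notice period for termination is 30 days."

flagged = find_contradictions(query, knowledge_base, toy_detector)
print(flagged)
```

Keeping retrieval and detection as separate functions mirrors the two-stage design: the retriever narrows the knowledge base to a few candidates, so the more expensive detection model only sees plausible pairs.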
Deliverables 5.3: Database of Slovene legal texts (M18). A RAG-based system for Slovene legal support (M36).
Content:
marko.robniksikonja@fri.uni-lj.si
Duration of the project:
September 2024 – September 2027
Faculty of Computer and Information Science
Večna pot 113, SI-1000 Ljubljana, Slovenia
Room: R2.06 (2nd floor)