Publications
This list includes all publications produced within the LLM4DH project.
Book chapters
- Čibej, J. (2025). Kvantitativna strojna analiza razporeditve čustev na primeru Visoške kronike. In Zupan Sosič, A. (Ed.), Obdobja (pp. 59-66). Obdobja, 44.
Journal articles
- Petrič, T., Arhar Holdt, Š., and Robnik-Šikonja, M. (2024). Pomembnost realistične evalvacije: Primer popravkov sklona in števila v slovenščini z velikim jezikovnim modelom. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 12(1), 106-130. https://doi.org/10.4312/slo2.0.2024.1.106-130
- Kuzman, T. and Ljubešić, N. (2025). LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification. IEEE Access, 13, 35621-35633. https://doi.org/10.1109/ACCESS.2025.3544814
- Mochtak, M., Rupnik, P., Kuzman, T. and Ljubešić, N. (2025). Parlasent: mapping sentiment in political discourse with large language models. Political Research Exchange, 7(1). https://doi.org/10.1080/2474736X.2025.2508377
- Yadav, A., Garg, T., Klemen, M., Ulčar, M., Agarwal, B., and Robnik Šikonja, M. (2025). From translation to generative LLMs: classification of code-mixed affective tasks. IEEE Transactions on Affective Computing, 1949-3045. https://doi.org/10.1109/TAFFC.2025.3553399
- Hostnik, M. and Robnik Šikonja, M. (2025). Retrieval-augmented code completion for local projects using large language models. Expert Systems with Applications, 292, 128596. https://doi.org/10.1016/j.eswa.2025.128596
- Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Muñoz Sánchez, R., … and Zesch, T. (2025). Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction. International Journal of Learner Corpus Research, 11(2), 309-335. https://doi.org/10.1075/ijlcr.24033.mas
- Ulčar, M., Žagar, A., Armendariz, C. S., Repar, A., Pollak, S., Purver, M., and Robnik Šikonja, M. (2026). Mono- and cross-lingual evaluation of representation language models on less-resourced languages. Computer Speech & Language, 95, 101852. https://doi.org/10.1016/j.csl.2025.101852
- Dobrovoljc, K. (2025). Treebanking Spoken Slovenian: New Data, Models, and Lessons Learned. Contributions to the Contemporary History, 65(3), 14-41. https://doi.org/10.51663/pnz.65.3.01
- Terčon, L., Dobrovoljc, K., and Ljubešić, N. (2025). CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages. Contributions to the Contemporary History, 65(3), 109-134. https://doi.org/10.51663/pnz.65.3.05
- Brglez, M., Bajt, V., Pollak, S., Rot, Š., and Martinc, M. (2025). A System for Word Usage Change Detection: Its Use in Linguistic and Sociolinguistic Studies. Contributions to the Contemporary History, 65(3), 160-188. https://doi.org/10.51663/pnz.65.3.07
- Arhar Holdt, Š., Gapsa, M., Gantar, P., and Kosem, I. (2025). The Potential of ChatGPT in the Development of the Thesaurus of Modern Slovene. Contributions to the Contemporary History, 65(3), 189-217. https://doi.org/10.51663/pnz.65.3.08
Conference papers
- Arhar Holdt, Š., Antloga, Š., Munda, T., Pori, E., and Krek, S. (2025). From Words to Action: A National Initiative to Overcome Data Scarcity for the Slovene LLM [Conference presentation]. The Third Workshop on Resources and Representations for Under-Resourced Languages and Domains, Tallinn, Estonia. https://aclanthology.org/2025.resourceful-1.27
- Krek, S. (2025, March 24-29). GRAVITACIJA – Veliki jezikovni modeli za digitalno humanistiko [Conference presentation]. Jožef Stefan days, Ljubljana, Slovenia. https://dnevi.ijs.si/#DOV
- Robnik Šikonja, M. (2025, April 16-17). Veliki jezikovni modeli za slovenščino in prevajanje [Conference presentation]. Proofreading and Translation Conference 2025: The Impact of Digital Transformation on Translation, Ljubljana, Slovenia. https://lektornica.si/delavnice/jezikovne/translation-conference-2025-the-impact-of-digital-transformation-in-translation/
- Arhar Holdt, Š. (2025, April 16-17). Lektoriranje v času umetne inteligence: Kdo bo postavljal piko na UI? [Conference presentation]. Proofreading and Translation Conference 2025: The Impact of Digital Transformation on Translation, Ljubljana, Slovenia. https://lektornica.si/delavnice/jezikovne/translation-conference-2025-the-impact-of-digital-transformation-in-translation/
- Kuzman, T. (2025, April 16-17). Prednosti in tveganja uporabe ChatGPTja za prevajalce [Conference presentation]. Proofreading and Translation Conference 2025: The Impact of Digital Transformation on Translation, Ljubljana, Slovenia. https://lektornica.si/delavnice/jezikovne/translation-conference-2025-the-impact-of-digital-transformation-in-translation/
- Arčon, T. (2025, June 4). Lost in instructions? Developing a systematic approach to instruction tuning datasets for LLMs [Conference presentation]. Picacsa 2025. Ljubljana, Slovenia.
- Jelovčan, G. (2025, June 4). Grammar Error Correction Dataset for Slovene Language [Conference presentation]. Picacsa 2025. Ljubljana, Slovenia.
- Vreš, D. (2025, June 4). GaMS-9B: Pushing the Boundaries of Slovenian Large Language Models [Conference presentation]. Picacsa 2025. Ljubljana, Slovenia.
- Robnik Šikonja, M. (2025, June 10). Projekt PoVeJMo, Gravitacija in ERA Chair projekt AI4DH [Conference presentation]. 4. Nacionalna konferenca Umetna inteligenca – nove smeri razvoja in izzivi za Slovenijo. Mengeš, Slovenia. https://dogodki.vlada.si/umetna-inteligenca-digitalna-preobrazba-prijava
- Robnik Šikonja, M. (2025, June 13). Large Language Models for Analysis of Complex Phenomena [Conference presentation]. AI Methods for Research of Folkloristic Narratives, Ljubljana, Slovenia. https://cjvt.si/llm4dh/en/blog/workshop-ai-methods-for-research-of-folkloristic-narratives/
- Arčon, T., Robnik Šikonja, M. and Tratnik, P. (2025, June 13). Motif Detection Using Large Language Models: The Cinderella Case Study [Conference presentation]. AI Methods for Research of Folkloristic Narratives, Ljubljana, Slovenia. https://cjvt.si/llm4dh/en/blog/workshop-ai-methods-for-research-of-folkloristic-narratives/
- Horvat, M., Koražija, J. and Tratnik, P. (2025, June 13). Modeling Deliberative Values in Narrative Culture Using LLMs [Conference presentation]. AI Methods for Research of Folkloristic Narratives, Ljubljana, Slovenia. https://cjvt.si/llm4dh/en/blog/workshop-ai-methods-for-research-of-folkloristic-narratives/
- Babnik, J. and Tratnik, P. (2025, June 13). The Dragon-Slayer’s Narrative: Structural Kinship and Discursive Divergence [Conference presentation]. AI Methods for Research of Folkloristic Narratives, Ljubljana, Slovenia. https://cjvt.si/llm4dh/en/blog/workshop-ai-methods-for-research-of-folkloristic-narratives/
- Babnik, J. and Martinc, M. (2025, June 13). Considering Modes: Semiotics and Multimodal AI [Conference presentation]. AI Methods for Research of Folkloristic Narratives, Ljubljana, Slovenia. https://cjvt.si/llm4dh/en/blog/workshop-ai-methods-for-research-of-folkloristic-narratives/
- Robnik Šikonja, M. (2025, June 17). The importance of language data for the development of LT solutions – future steps [Conference presentation]. EU LDS Country Workshop. Ljubljana, Slovenia. https://language-data-space.ec.europa.eu/events/lds-country-workshop-slovenia-2025-06-17_en
- Kosem, I. (2025, July 2-5). Implementing AI in lexicographic workflow: challenges and opportunities [Conference presentation]. 29th International Conference of the African Association for Lexicography. https://www.afrilex.co.za/conferences
- Hüll, N. and Dobrovoljc, K. (2025). Word Order Variation in Spoken and Written Corpora: A Cross-Linguistic Study of SVO and Alternative Orders. Proceedings of the Eighth International Conference on Dependency Linguistics (Depling, SyntaxFest 2025). Ljubljana, Slovenia.
- Terčon, L. and Dobrovoljc, K. (2025). ComparaTree: A Multi-Level Comparative Treebank Analysis Tool. Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025). Ljubljana, Slovenia.
- Krsnik, L. and Dobrovoljc, K. (2025). STARK: A Toolkit for Dependency (Sub)Tree Extraction and Analysis. Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025). Ljubljana, Slovenia.
- Munda, T. and Arhar Holdt, Š. (2025). First Insights into the Syntax of Slovene Student Writing: A Statistical Analysis of Šolar 3.0 vs. Učbeniki 1.0. Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025). Ljubljana, Slovenia.
- Vintar, Š. and Javoršek, J. J. (2025). The truth is no diaper: Human and AI-generated associations to emotional words. 16th International Conference on Computational Creativity, ICCC’25.
- Gorjanc, V., Pretnar Žagar, A., Dobranić, F., and Fišer, D. (2025). Accessing Historical Periodicals: Newspaper Discourse on Slovene Language. ADHO Digital Humanities Conference 2025, DH2025. https://doi.org/10.5281/zenodo.16087978
- Ljubešić, N., Porupski, I., Rupnik, P., and Kuzman, T. (2025). Identifying Filled Pauses in Speech Across South and West Slavic Languages [Conference presentation]. The 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025). Vienna, Austria. https://aclanthology.org/2025.bsnlp-1.1/
- Ljubešić, N., Porupski, I., and Rupnik, P. (2025). Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models [Conference presentation]. InterSpeech 2025, Rotterdam, Netherlands. https://arxiv.org/abs/2505.24571
- Arčon, T., Kosem, I. and Arhar Holdt, Š. (2025, September 24). Using large language models to generate distractors for language games [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Brglez, M. and Vintar, Š. (2025, September 24). SloPragEval: Creating the First Pragmatics Understanding Benchmark for Slovene [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Klemen, M., Dobrovoljc, K., Terčon, L., Hüll, N., Arčon, T., and Robnik Šikonja, M. (2025, September 24). Agentic Large Language Models for Grammatical Analysis of Multilingual Corpora [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Jelovčan, G., Robnik Šikonja, M., Arhar Holdt, Š., and Vreš, D. (2025, September 24). Attempt to Create Synthetic Dataset for Grammar Error Correction in Slovenian Language [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Koloski, B., Žejn, A., and Pollak, S. (2025, September 24). SloLitAA: Slovenian Literary Authorship Attribution via AutoML [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Arčon, T., Robnik Šikonja, M., and Tratnik, P. (2025, September 24). Automatic detection of folkloristic motifs with large language models: the Cinderella tale [Conference presentation]. 28th International Conference, Discovery Science AI 4 Science Conference, Ljubljana, Slovenia. https://ds2025.ijs.si/assets/files/978-3-032-05461-6_Book_OnlinePDF.pdf
- Machidon, O. M., and Machidon, A. L. (2025, September 8). Comparing OCR Pipelines for Folkloristic Text Digitization [Conference presentation]. Digital Heritage 2025 International Congress, Siena, Italy. https://arxiv.org/abs/2507.19092
- Verdonik, D., Bizjak, A., and Donaj, G. (2025). Govorjena slovenščina: Structured Conversational Data Collection through Online User Interface [Conference presentation]. CLARIN Annual Conference, Vienna, Austria. https://www.clarin.eu/content/programme-clarin-annual-conference-2025
- Robnik Šikonja, M. (2025, September 17). Trends and challenges in artificial intelligence [Conference presentation]. SNC’25 Sinapsa neuroscience conference 2025, Ljubljana, Slovenia. https://www.sinapsa.org/SNC25/programme
- Konovšek, T., Pahor de Maiti Tekavčič, and Gorjanc, V. (2025). Mapping Common Sense: A Digital Humanities Lens on Conceptual History [Conference presentation]. Images of Historical Times. Concepts, Metaphors, and Arts. 26th International Conference of the History of Concepts Group, Bologna, Italy. https://eventi.unibo.it/historical-times-bologna/abstracts
- Kosem, I. (2025, September 29). Common Sense(s) in Slovene Lexicography: Building the Digital Dictionary Database [Invited keynote speech at the conference]. 1st International Conference on Lexicology and Lexicography, Budapest, Hungary.
- Robnik Šikonja, M. (2025, November 18). Large language models for lexicography [Invited keynote speech at the conference]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/keynote-speakers/
- Kallas, J., Koppel, K., Heylen, K., Kernerman, I., Ostroški Anić, A., Vezzani, F., Asadpour, H., Freixa, J., Božović, P. and Arhar Holdt, Š. (2025). Neology in Practice: Lexicographic and Terminological Approaches to Lexical Innovation [Conference presentation]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/wp-content/uploads/elex2025_book_of_abstracts.pdf
- Kosem, I. and Arhar Holdt, Š. (2025). Using Large Language Models to Generate Distractors for Language Games [Conference presentation]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/wp-content/uploads/elex2025_book_of_abstracts.pdf
- Gantar, P., Laskowski, C., and Krek, S. (2025). The lemma dilemma, Slovene version [Conference presentation]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/wp-content/uploads/elex2025_book_of_abstracts.pdf
- Čibej, J. (2025). Up to No Good: Exploiting Word Embeddings for an Automatic Extraction of Candidates for a Lexicon of Slovene Taboo Language [Poster presentation]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/wp-content/uploads/elex2025_book_of_abstracts.pdf
- Krek, S., Ponikvar, P., Repar, A., Kosem, I., and Lindemann, D. (2025). DMLEX on Wikibase: Legacy dictionaries as collaboratively editable dataset [Conference presentation]. eLex 2025: Electronic lexicography in the 21st century: Intelligent Lexicography, Bled, Slovenia. https://elex.link/elex2025/wp-content/uploads/elex2025_book_of_abstracts.pdf
- Šmajdek, U. and Bohak, C. (2025). NERVIS: An Interactive System for Graph-Based Exploration and Editing of Named Entities [Conference presentation]. 10th Human-Computer Interaction Slovenia (HCI SI) conference, Koper, Slovenia. https://arxiv.org/abs/2510.04971
- Robnik Šikonja, M. (2025). Kaj so odprti LLMs in kako jih gradimo? [Conference presentation]. ERA Knowledge Rights 21 Conference, Ljubljana, Slovenia. https://www.odipi.si/era-kr21-konferenca-slovenija-2025/program-era-kr21-konference-2025/
Datasets
- Žitnik, S. and Knez, T. (2025). Lexical LLM Pretraining Corpus [Data set]. LLM4DH. D1.1.1 – pretraining corpus
- Škvorc, T. and Robnik Šikonja, M. (2025). Word-sense disambiguation corpus SloDicWSD 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2008
- Žagar, A., Dobrovoljc, K., Munda, T., Brglez, M., and Robnik Šikonja, M. (2024). Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/1988
- Vreš, D., Arčon, T., Čibej, J., Robnik Šikonja, M., Krek, S., Gabrovšek, D., Ježovnik, J., Kastelic, M., Kevina, D., Ledinek, N., Michelizza, M., Perdih, A., Petric, Š., and Trojar, M. Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0. CLARIN.SI. http://hdl.handle.net/11356/1971
- Large Language Models for Digital Humanities (2025). Initial Improved LLM [Data set]. LLM4DH. https://huggingface.co/tknez/GaMS-9B-Instruct-Lex
- Martinc, M. (2025). Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2050
- Arhar Holdt, Š., Antloga, Š., Gantar, P., Munda, T., Robida, N., and Zgonc, M. (2025). Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2052
- Large Language Models for Digital Humanities (2025). LLM with improved grammatical knowledge. LLM4DH. https://github.com/matejklemen/ud_llm
- Large Language Models for Digital Humanities (2025). Interaction graphs of historical named entities [Data set]. LLM4DH. https://github.com/UL-FRI-LGM/kranjska-annotated
- Large Language Models for Digital Humanities (2025). Metaphor, irony, and sarcasm benchmark in Slovene Sloprag eval [Data set]. LLM4DH. https://slobench.cjvt.si/leaderboard/view/15
- Large Language Models for Digital Humanities (2025). Metaphor, irony, and sarcasm benchmark in Slovene Sloprag mega [Data set]. LLM4DH. https://slobench.cjvt.si/leaderboard/view/16
- Verdonik, D., Rupnik, P., Vidinić, J., and Ljubešić, N. (2025). Corpus of spoken Slovenian ROG-Dialog 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2073
Other
- Klemen, M. (2025). Advanced grammatical analysis of multilingual corpora. Zenodo. https://doi.org/10.5281/zenodo.15646857
- Žitnik, S. and Knez, T. (2025). Improving Linguistic Data with LLMs. Zenodo. https://doi.org/10.5281/zenodo.15878672
- Arhar Holdt, Š., and Jelovčan, G. (2025). Behind the scenes of developing spell and grammar correction LLM for Slovenian: combining authentic and synthetic data. Zenodo. https://doi.org/10.5281/zenodo.15282208
- Ferbežar, I., Sobočan, A. M., Stabej, M., Robnik Šikonja, M., and Mišič, L. (2025, September 12). Prav(o) razumeti, prav uveljavljati: slovenščina kot pot do pravic socialnega varstva [Roundtable discussion]. Razumeti pravo: jezikovna dostopnost kot temelj enakopravnosti, Ljubljana, Slovenia. https://www.pf.uni-lj.si/novice/2025-08-21-1018-vabilo-na-interdisciplinarno-konferenco-razumeti-pravo-jezikovna-dostopnost-kot-temelj-enakopravnosti-12-septembra-2025
- Dobrovoljc, K., and Terčon, L. (Eds.). (2025). SyntaxFest: 5 Events for 1 Fest of Empirical Syntax. University of Ljubljana Press. https://doi.org/10.4312/9789612976361
- Dobranić, F., Papič, I., Robnik Šikonja, M., Jančič Bogataj, M., Tomšič, A., and Štefančič, M. (2025). Generativna umetna inteligenca v vsako vas: Komu koristi javno naročilo platforme z velikimi jezikovnimi modeli? https://danesjenovdan.si/generativna-umetna-inteligenca-v-vsako-vas/

