Publications 2025

This list includes all publications produced within the LLM4DH project in 2025.

Book chapters

  • Čibej, J. (2025). Kvantitativna strojna analiza razporeditve čustev na primeru Visoške kronike. In Zupan Sosič, A. (Ed.), Emotions and Slovenian Literature (pp. 59-66). University of Ljubljana Press. https://doi.org/10.4312/Obdobja.44.59-66
  • Miok, K., Klemen, M., Škrlj, B., and Šikonja, M. R. (2025). Challenges in explaining pretrained clinical text classifiers. In Kopinska et al. (Eds.) Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 314–322). Springer Nature. https://doi.org/10.1007/978-3-032-19105-2_22
  • Miok, K., Škrlj, B., Zaharie, D., and Šikonja, M. R. (2025). TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning. In Kopinska et al. (Eds.) Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 499–513). Springer Nature. https://doi.org/10.1007/978-3-032-19105-2_35

Journal articles

  • Al Sahili, Z., Patras, I., and Purver, M. (2025). Data matters most: auditing social bias in contrastive vision–language models. Journal of machine learning research, 5126, 1–21. https://doi.org/10.48550/arXiv.2501.13223
  • Brglez, M., Bajt, V., Pollak, S., Rot, Š., and Martinc, M. (2025). A System for Word Usage Change Detection: Its Use in Linguistic and Sociolinguistic Studies. Contributions to Contemporary History, 65(3), 160-188. https://doi.org/10.51663/pnz.65.3.07
  • Brglez, M., Bajt, V., Pollak, S., Rot, Š., and Martinc, M. (2025). Od kamnitega do spletnega portala: samodejno zaznavanje sprememb v rabi besed. Prispevki za novejšo zgodovino, 65(3), 160–188. https://ojs.inz.si/pnz/en/article/view/4495/5886
  • Dobrovoljc, K. (2025). Treebanking Spoken Slovenian: New Data, Models, and Lessons Learned. Contributions to Contemporary History, 65(3), 14-41. https://doi.org/10.51663/pnz.65.3.01
  • Arhar Holdt, Š., and Munda, T. (2025). Jezikovno popravljanje v digitalnem okolju: kvalitativna študija z učiteljicami in učitelji slovenščine. Sodobna pedagogika, 76(142), 86-106. https://doi.org/10.63384/sptB53s791s
  • Arhar Holdt, Š., Gapsa, M., Gantar, P., and Kosem, I. (2025). The Potential of ChatGPT in the Development of the Thesaurus of Modern Slovene. Contributions to Contemporary History, 65(3), 189-217. https://doi.org/10.51663/pnz.65.3.08
  • Hostnik, M. and Robnik Šikonja, M. (2025). Retrieval-augmented code completion for local projects using large language models. Expert Systems with Applications, 292. https://doi.org/10.1016/j.eswa.2025.128596
  • Klemen, M., Arčon, T., Terčon, L., Robnik Šikonja, M., and Dobrovoljc, K. (2025). Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis. arXiv preprint arXiv:2512.00214
  • Klemen, M., Božič, M., Arhar Holdt, Š., and Robnik Šikonja, M. (2025). Grammatical error correction of Slovenian school essays using large language models. Sodobna pedagogika, 76(3), 162–176. https://doi.org/10.63384/sptB53z793a
  • Kuzman, T. and Ljubešić, N. (2025). LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification. IEEE Access, 13, 35621-35633. https://doi.org/10.1109/ACCESS.2025.3544814
  • Malenšek, M., Završnik, A., Krajnc, S., Križnar, P., Bajec, M., and Žitnik, S. (2025). Towards contradiction detection in legal texts: corpus preparation and contradiction extraction. Slovenščina 2.0, 13(2), 179–209. https://journals.uni-lj.si/slovenscina2/article/view/22317
  • Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Muñoz Sánchez, R., … and Zesch, T. (2025). Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction. International Journal of Learner Corpus Research, 11(2), 309-335. https://doi.org/10.1075/ijlcr.24033.mas
  • Mochtak, M., Rupnik, P., Kuzman, T. and Ljubešić, N. (2025). Parlasent: mapping sentiment in political discourse with large language models. Political Research Exchange, 7(1). https://doi.org/10.1080/2474736X.2025.2508377
  • Petrič, T., Arhar Holdt, Š., and Robnik-Šikonja, M. (2024). Pomembnost realistične evalvacije: Primer popravkov sklona in števila v slovenščini z velikim jezikovnim modelom. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, 12(1), 106-130. https://doi.org/10.4312/slo2.0.2024.1.106-130
  • Terčon, L., Dobrovoljc, K., and Ljubešić, N. (2025). CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages. Contributions to Contemporary History, 65(3), 109-134. https://doi.org/10.51663/pnz.65.3.05
  • Yadav, A., Garg, T., Klemen, M., Ulčar, M., Agarwal, B., and Robnik Šikonja, M. (2025). From translation to generative LLMs: classification of code-mixed affective tasks. IEEE Transactions on Affective Computing, 1949-3045. https://doi.org/10.1109/TAFFC.2025.3553399

Conference papers

Datasets

  • Arhar Holdt, Š., Antloga, Š., Gantar, P., Munda, T., Robida, N., and Zgonc, M. (2025). Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2052
  • Kosem, I., Arhar Holdt, Š., Zgaga, K., and Arčon, T. (2025). Dataset of annotated collocation-distractor pairs COLLDIST. University of Ljubljana, Faculty of Computer and Information Science; University of Ljubljana, Centre for Language Resources and Technologies. https://www.clarin.si/repository/xmlui/handle/11356/2076
  • Kosem, I., Arhar Holdt, Š., Zgaga, K., Šešet, J., Kamenšek, U., Zaranšek, P., Ponikvar, P., and Arčon, T. (2025). Dataset of annotated headword-synonym-distractor triplets SYNDIST. Jožef Stefan Institute; University of Ljubljana, Centre for Language Resources and Technologies. http://hdl.handle.net/11356/2056
  • Large Language Models for Digital Humanities (2025). Initial Improved LLM [Data set]. LLM4DH. https://huggingface.co/tknez/GaMS-9B-Instruct-Lex
  • Large Language Models for Digital Humanities (2025). Interaction graphs of historical named entities [Data set]. LLM4DH. https://github.com/UL-FRI-LGM/kranjska-annotated
  • Large Language Models for Digital Humanities (2025). LLM with improved grammatical knowledge. LLM4DH. https://github.com/matejklemen/ud_llm
  • Large Language Models for Digital Humanities (2025). Metaphor, irony, and sarcasm benchmark in Slovene Sloprag eval [Data set]. LLM4DH. https://slobench.cjvt.si/leaderboard/view/15
  • Large Language Models for Digital Humanities (2025). Metaphor, irony, and sarcasm benchmark in Slovene Sloprag mega [Data set]. LLM4DH. https://slobench.cjvt.si/leaderboard/view/16
  • Malenšek, M., et al. (2025). Collection of Slovenian legal texts COLESLAW 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/2095
  • Martinc, M. (2025). Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2050
  • Škvorc, T. and Robnik Šikonja, M. (2025). Word-sense disambiguation corpus SloDicWSD 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2008
  • Verdonik, D., Rupnik, P., Vidinić, J., and Ljubešić, N. (2025). Corpus of spoken Slovenian ROG-Dialog 1.0 [Data set]. CLARIN.SI. http://hdl.handle.net/11356/2073
  • Vreš, D., Arčon, T., Čibej, J., Robnik Šikonja, M., Krek, S., Gabrovšek, D., Ježovnik, J., Kastelic, M., Kevina, D., Ledinek, N., Michelizza, M., Perdih, A., Petric, Š., and Trojar, M. Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0. CLARIN.SI. http://hdl.handle.net/11356/1971
  • Žagar, A., Dobrovoljc, K., Munda, T., Brglez, M., and Robnik Šikonja, M. (2024). Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1988
  • Žitnik, S. and Knez, T. (2025). Lexical LLM Pretraining Corpus [Data set]. LLM4DH. D1.1.1 – pretraining corpus

Other