Challenge 6: Evaluation and Understanding of LLMs
Benchmarks are a crucial instrument for measuring and tracking the performance of LLMs, new versions of which are published at an unprecedented pace (Zhao et al. 2023). Static benchmarks do have limitations, the most prominent being the contamination of LLMs with benchmark data (Zhou et al. 2023), but they remain the most feasible option for tracking performance in languages with smaller speaker populations, to which dynamic alternatives such as LLM arenas (Chiang et al. 2024) do not scale. To better understand LLMs, we need benchmarks that probe higher levels of understanding, in particular of complex expression, figurative language, and spoken language. Equally important are the detection of biases in LLMs and direct explanations of LLM behavior on important tasks.
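To make the contamination issue concrete, the sketch below is a minimal, assumption-laden illustration (not a method proposed in this paper) of a common heuristic: flagging benchmark items whose long word n-grams heavily overlap a sample of the training corpus. The n-gram size, the overlap threshold, and the toy data are arbitrary illustrative choices.

```python
# Illustrative n-gram overlap check for benchmark contamination.
# Assumes access to the benchmark items and to (a sample of) the training
# corpus as plain text; all thresholds are hypothetical, not established values.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark: list[str], corpus: list[str],
                       n: int = 8, overlap_threshold: float = 0.5) -> float:
    """Fraction of benchmark items whose n-grams heavily overlap the corpus."""
    corpus_ngrams = set().union(*(ngrams(doc, n) for doc in corpus))
    flagged = 0
    for item in benchmark:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue  # item shorter than n words: no evidence either way
        overlap = len(item_ngrams & corpus_ngrams) / len(item_ngrams)
        if overlap >= overlap_threshold:
            flagged += 1
    return flagged / len(benchmark)

if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    benchmark = ["the quick brown fox jumps over the lazy dog near the river",
                 "an entirely unrelated question about figurative language use"]
    print(f"Estimated contamination rate: {contamination_rate(benchmark, corpus):.2f}")
```

Such surface-level checks only detect near-verbatim leakage; paraphrased or translated benchmark data would require stronger detection methods.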