{"id":1184,"date":"2024-12-24T10:53:19","date_gmt":"2024-12-24T09:53:19","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?page_id=1184"},"modified":"2025-05-14T12:30:02","modified_gmt":"2025-05-14T10:30:02","slug":"work-package-6","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/work-packages\/work-package-6\/","title":{"rendered":"Challenge 6: Evaluation and Understanding of LLMs"},"content":{"rendered":"

Challenge 6: Evaluation and Understanding of LLMs

Benchmarks are a crucial instrument in measuring and tracking the performance of LLMs, new versions of which are being published at an unprecedented frequency (Zhao et al. 2023). While static benchmarks do carry limitations, the most prominent being the contamination of LLMs with benchmark data (Zhou et al. 2023), they are still the most feasible solution to performance tracking for languages with a smaller number of speakers, to which dynamic alternatives such as LLM arenas (Chiang et al. 2024) do not scale. To better understand LLMs, we require benchmarks that measure higher levels of understanding, in particular those related to complex expression, figurative language, and spoken language. Equally important are the detection of biases in LLMs and direct explanations of LLM behavior for important tasks.


T6.1 Figurative language and pragmatics benchmarks

Figurative language, including metaphor, irony, and sarcasm, is a prominent feature of human communication. Although LLMs show remarkable capabilities in adapting to a particular style or creating innovative metaphors, their comprehension of sarcasm and irony remains weak (Yakura 2024), and they struggle with understanding or detecting indirect requests and faux pas (Strachan et al. 2024). Along similar lines, LLMs significantly lag behind humans in their pragmatic capabilities, such as considering extra-linguistic context, speaker intentions, presuppositions, and implied meanings (Sravanthi et al. 2024). A major obstacle in assessing the capabilities of LLMs in nuanced language understanding and generation is the lack of evaluation datasets and benchmarks. We aim to create an evaluation pipeline that will facilitate the evaluation and comparison of models with regard to figurative language and pragmatics.

We will construct and adapt several datasets and create a benchmarking pipeline for figurative language understanding, conversation handling, pragmatic reasoning, and the associative behavior of LLMs. First, we will tackle metaphor identification and explanation and construct a benchmark validated and augmented with human annotations. The dataset will include valid and invalid paraphrases and explanations, allowing for evaluation in different setups, e.g., textual entailment, figurative language identification and interpretation, and question answering. For irony and sarcasm understanding, we will adapt and translate the Metaphor and Sarcasm Scenario Test (Adachi et al. 2004). In the pragmatic understanding benchmark, we will include implicature, presupposition, and conversation handling according to Grice’s maxims, and test them with conversational AI/chatbots (Sravanthi et al. 2024; Miehling et al. 2024). Finally, the associative behavior and association explanation benchmark will adapt WOW and the WAX dataset of association explanations (Liu et al. 2022). A minimal sketch of how items in such a benchmark could be scored is given below.
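To illustrate the textual-entailment-style setup mentioned above, the following Python sketch scores a model on metaphor-paraphrase items. It is a hypothetical, minimal example rather than the project's actual pipeline: the JSONL file format, the field names (sentence, paraphrase, label), and the query_model callable are assumptions introduced only for illustration.

```python
import json
from typing import Callable


def evaluate_metaphor_entailment(items_path: str,
                                 query_model: Callable[[str], str]) -> float:
    """Score a model on metaphor-paraphrase items in an entailment-style setup.

    Each line of the (hypothetical) JSONL file pairs a metaphorical sentence
    with a candidate paraphrase and a gold label, "valid" or "invalid".
    """
    with open(items_path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]

    correct = 0
    for item in items:
        # Ask the model whether the paraphrase preserves the metaphor's meaning.
        prompt = (
            f"Sentence: {item['sentence']}\n"
            f"Paraphrase: {item['paraphrase']}\n"
            "Does the paraphrase preserve the meaning of the sentence? "
            "Answer with 'valid' or 'invalid'."
        )
        answer = query_model(prompt).strip().lower()
        if answer.startswith(item["label"]):
            correct += 1

    # Accuracy over all items; 0.0 if the file was empty.
    return correct / len(items) if items else 0.0
```

Here query_model stands for any function that sends a prompt to an LLM and returns its textual answer, so the same harness can be reused to compare different models on identical benchmark items.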
