{"id":1184,"date":"2024-12-24T10:53:19","date_gmt":"2024-12-24T09:53:19","guid":{"rendered":"https:\/\/www.cjvt.si\/llm4dh\/?page_id=1184"},"modified":"2025-05-14T12:30:02","modified_gmt":"2025-05-14T10:30:02","slug":"work-package-6","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/llm4dh\/en\/work-packages\/work-package-6\/","title":{"rendered":"Challenge 6: Evaluation and Understanding of LLMs"},"content":{"rendered":"

Challenge 6: Evaluation and Understanding of LLMs

Benchmarks are a crucial instrument in measuring and tracking the performance of LLMs, new versions of which are being published at an unprecedented frequency (Zhao et al. 2023). While static benchmarks do carry limitations, the most prominent being the contamination of LLMs with benchmark data (Zhou et al. 2023), they are still the most feasible solution to performance tracking for languages with a smaller number of speakers, to which dynamic alternatives such as LLM arenas (Chiang et al. 2024) do not scale. To better understand LLMs, we require benchmarks that measure higher levels of understanding, in particular those related to complex expression, figurative language, and spoken language. Equally important are the detection of biases in LLMs and direct explanations of LLM behavior for important tasks.


T6.1 Figurative language and pragmatics benchmarks

Figurative language, including metaphor, irony, and sarcasm, is a prominent feature of human communication. Although LLMs show remarkable capabilities in adapting to a particular style or creating innovative metaphors, their comprehension of sarcasm and irony remains weak (Yakura 2024), and they struggle with understanding or detecting indirect requests and faux pas (Strachan et al. 2024). Along similar lines, LLMs significantly lag behind humans in their pragmatic capabilities, such as considering extra-linguistic context, speaker intentions, presuppositions, and implied meanings (Sravanthi et al. 2024). A major obstacle in assessing the capabilities of LLMs in nuanced language understanding and generation is the lack of evaluation datasets and benchmarks. We aim to create an evaluation pipeline that will facilitate the evaluation and comparison of models with regard to figurative language and pragmatics.

We will construct and adapt several datasets and create a benchmarking pipeline for figurative language understanding, conversation handling, pragmatic reasoning, and the associative behavior of LLMs. First, we will tackle metaphor identification and explanation and construct a benchmark validated and augmented with human annotations. The dataset will include valid and invalid paraphrases and explanations, allowing for evaluation in different setups, e.g., textual entailment, figurative language identification and interpretation, and question answering. For irony and sarcasm understanding, we will adapt and translate the Metaphor and Sarcasm Scenario Test (Adachi et al. 2004). In the pragmatic understanding benchmark, we will include implicature, presupposition, and conversation handling according to Grice’s maxims, and test them with conversational AI/chatbots (Sravanthi et al. 2024; Miehling et al. 2024). Finally, the associative behavior and association explanation benchmark will adapt WOW and the WAX dataset of association explanations (Liu et al. 2022). A minimal sketch of how items in such a benchmark could be scored is given below.
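To illustrate the textual-entailment-style setup mentioned above, the following Python sketch scores a model on metaphor-paraphrase items. It is a hypothetical, minimal example rather than the project's actual pipeline: the JSONL file format, the field names (sentence, paraphrase, label), and the query_model callable are assumptions introduced only for illustration.

```python
import json
from typing import Callable


def evaluate_metaphor_entailment(items_path: str,
                                 query_model: Callable[[str], str]) -> float:
    """Score a model on metaphor-paraphrase items in an entailment-style setup.

    Each line of the (hypothetical) JSONL file pairs a metaphorical sentence
    with a candidate paraphrase and a gold label, "valid" or "invalid".
    """
    with open(items_path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]

    correct = 0
    for item in items:
        # Ask the model whether the paraphrase preserves the metaphor's meaning.
        prompt = (
            f"Sentence: {item['sentence']}\n"
            f"Paraphrase: {item['paraphrase']}\n"
            "Does the paraphrase preserve the meaning of the sentence? "
            "Answer with 'valid' or 'invalid'."
        )
        answer = query_model(prompt).strip().lower()
        if answer.startswith(item["label"]):
            correct += 1

    # Accuracy over all items; 0.0 if the file was empty.
    return correct / len(items) if items else 0.0
```

Here query_model stands for any function that sends a prompt to an LLM and returns its textual answer, so the same harness can be reused to compare different models on identical benchmark items.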
