Challenge 6: Evaluation and Understanding of LLMs
Benchmarks are a crucial instrument for measuring and tracking the performance of LLMs, new versions of which are published at an unprecedented pace (Zhao et al. 2023). Static benchmarks do have limitations, the most prominent being the contamination of LLMs with benchmark data (Zhou et al. 2023), but they remain the most feasible option for tracking performance in languages with smaller speaker populations, to which dynamic alternatives such as LLM arenas (Chiang et al. 2024) do not scale. To better understand LLMs, we need benchmarks that probe higher levels of understanding, in particular of complex expression, figurative language, and spoken language. Equally important are the detection of biases in LLMs and direct explanations of LLM behavior on important tasks.
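To make the contamination issue concrete, the sketch below is a minimal, assumption-laden illustration (not a method proposed in this paper) of a common heuristic: flagging benchmark items whose long word n-grams heavily overlap a sample of the training corpus. The n-gram size, the overlap threshold, and the toy data are arbitrary illustrative choices.

```python
# Illustrative n-gram overlap check for benchmark contamination.
# Assumes access to the benchmark items and to (a sample of) the training
# corpus as plain text; all thresholds are hypothetical, not established values.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark: list[str], corpus: list[str],
                       n: int = 8, overlap_threshold: float = 0.5) -> float:
    """Fraction of benchmark items whose n-grams heavily overlap the corpus."""
    corpus_ngrams = set().union(*(ngrams(doc, n) for doc in corpus))
    flagged = 0
    for item in benchmark:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue  # item shorter than n words: no evidence either way
        overlap = len(item_ngrams & corpus_ngrams) / len(item_ngrams)
        if overlap >= overlap_threshold:
            flagged += 1
    return flagged / len(benchmark)

if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    benchmark = ["the quick brown fox jumps over the lazy dog near the river",
                 "an entirely unrelated question about figurative language use"]
    print(f"Estimated contamination rate: {contamination_rate(benchmark, corpus):.2f}")
```

Such surface-level checks only detect near-verbatim leakage; paraphrased or translated benchmark data would require stronger detection methods.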