Challenge 3: LLMs for Spoken Language
LLMs are fundamentally dependent on data, and language use varies substantially across types of discourse. Most available linguistic data derive from written sources. Large spoken-language resources are typically owned by companies or institutions and are rarely made available to the research community. Even when such data are publicly accessible, they differ significantly from the speech of everyday conversation. Features such as looser sentence structures, interruptions, cross-talk, silences, repairs, repetitions, misarticulations, clarifications, backchannels, conflicts, slurs, and jargon (Yeomans et al. 2023) exemplify the surface-level specifics of conversation. They indicate that investigating such data has great potential impact on linguistic research (Love et al. 2014) and poses significant challenges for artificial intelligence (Wahlster 2023). In this challenge, we aim to advance speech technologies and research through the newest methodologies for collecting, filtering, automatically transcribing, and pragmatically processing speech data with the help of LLMs.