EuroSpeech: A Multilingual Speech Corpus
NeurIPS 2025 · Datasets & Benchmarks · Spotlight
S. Pfisterer, F. Grötschla, L.A. Lanzendörfer, F. Yan, R. Wattenhofer
A scalable pipeline for constructing multilingual speech datasets from parliamentary recordings. Applied to recordings from 22 European parliaments, it extracts over 61,000 hours of aligned speech in 22 languages. Fine-tuning existing ASR models on the resulting data reduces word error rates by 41.8% on average.