If you've ever tried to move a voice prototype into production, you've met the usual suspects: the model suddenly struggles with certain accents (WER spiking 15-35% for non-native speakers), transcripts are messier on live calls than in test data (over 40% WER historically in noise), and nobody is entirely sure where some of the training audio came from. None of this is an accident; it all comes back to how your speech and audio datasets were designed and managed.
External guides on audio datasets list many of the same problems we see every day at Andovar—bias and under-representation, poor audio quality, scalability bottlenecks, inconsistent annotation, and fuzzy ethics or provenance (like $25M fines for mishandled voice data). In this article, we'll unpack each challenge, share what we've learned from multilingual projects in finance, customer service, and emerging markets, and show how we tackle these issues in our own multilingual voice data collection services and custom speech data work.
This article is part of our broader speech data strategy playbook.
Bias is rarely intentional, but it’s very real. Research on speech‑recognition bias has repeatedly shown that models trained on skewed datasets can perform significantly worse for certain accents or demographic groups. One open corpus, for example, found much higher accuracy for US English than for Indian English before fine‑tuning on a more balanced evaluation set.
Common bias sources include:
We mitigate this by designing datasets with explicit targets for languages, accents, demographics, and environments, then checking performance per group—similar to the bias‑assessment approaches described in recent speech‑assistant evaluation work.
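Checking performance per group boils down to computing an error metric separately for each demographic slice rather than one global number. A minimal sketch, assuming word error rate (WER) as the metric and an accent tag per sample (the data and function names are illustrative; production pipelines typically use a library such as `jiwer`):

```python
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Toy evaluation set: (accent tag, reference transcript, ASR output).
samples = [
    ("en-US", "turn on the lights", "turn on the lights"),
    ("en-US", "set a timer", "set a timer"),
    ("en-IN", "turn on the lights", "turn of the light"),
    ("en-IN", "set a timer", "set the timer"),
]

by_group = defaultdict(list)
for accent, ref, hyp in samples:
    by_group[accent].append(word_error_rate(ref, hyp))

# A large, consistent gap between groups is a red flag for dataset bias.
for accent in sorted(by_group):
    wers = by_group[accent]
    print(f"{accent}: mean WER {sum(wers) / len(wers):.2f}")
```

The point is the grouping, not the metric: the same per-slice breakdown works for intent accuracy, latency, or any other measure you care about.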
Worried your speech data might be biased?
We can audit your existing speech datasets for accent, demographic, and environment coverage, then design targeted collection and annotation projects to close the gaps.
Ask us for a dataset review
Even the best model can’t fix unusable audio. External audio‑dataset guides consistently flag noisy, inconsistent recordings as one of the top reasons ASR systems underperform outside the lab.
Typical problems include:
These issues don’t just lower accuracy—they can create uneven performance across users if some groups tend to be recorded in noisier environments. That’s both a quality and fairness problem.
How we handle it at Andovar:
This aligns with best‑practice advice to combine clean, high‑quality reference data with representative noisy conditions, not to rely solely on uncontrolled real‑world recordings.
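That clean-plus-noisy recipe can be made concrete with controlled noise mixing: instead of relying only on whatever background conditions you happen to record, you scale a noise signal so the mixture hits a chosen signal-to-noise ratio. A pure-Python sketch on toy sample lists (in practice you would do this with numpy or torchaudio on real waveforms; the function name is ours):

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so mixing it with `clean` yields the target SNR in dB."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Choose a gain so that 10 * log10(p_clean / p_scaled_noise) == snr_db.
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

# Toy signals: one second of a 440 Hz tone plus uniform noise, at two SNRs.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
noise = [random.uniform(-1, 1) for _ in range(sr)]
for snr in (20, 5):
    mixed = mix_at_snr(tone, noise, snr)
    print(f"{snr} dB mixture, first sample: {mixed[0]:.4f}")
```

Sweeping the SNR lets you build evaluation sets at known difficulty levels, which is how you detect that "works in the lab, fails in the car" gap before users do.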
Creating high‑quality multilingual audio datasets is logistically complex. Commentaries on multilingual dataset building list language complexity, data‑collection scale, and cultural sensitivity as recurring issues that can derail projects.
The main scaling pain points we see:
Our approach:
This mirrors external guidance that recommends a structured, step‑by‑step process for multilingual audio datasets—from defining goals to data collection, metadata design, and validation.
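The metadata-design step is worth pinning down early, because it is what later lets you audit coverage and trace consent. A minimal sketch of a per-recording schema (every field name here is illustrative, not a standard; adapt it to your own project):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RecordingMeta:
    """One row of metadata per recorded clip; field names are examples only."""
    file_id: str
    language: str          # e.g. a BCP-47 tag such as "th-TH"
    accent: str
    speaker_age_band: str  # banded, not exact, for privacy ("18-25", "26-40", ...)
    speaker_gender: str
    environment: str       # "studio", "car", "street", ...
    device: str
    consent_ref: str       # pointer to the signed consent record
    sample_rate_hz: int

meta = RecordingMeta("rec_0001", "th-TH", "central-thai", "18-25",
                     "female", "car", "smartphone", "consent_7781", 16000)
print(json.dumps(asdict(meta), indent=2))
```

With a schema like this in place, coverage checks (how many hours per accent per environment?) and provenance questions (which consent form covers this clip?) become simple queries instead of forensic exercises.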
Need to scale speech data collection without losing control?
We’ve run multilingual voice data projects across dozens of languages and markets, combining local recruiting with centralised quality control.
Annotation and metadata are where raw audio becomes usable training signal. When they’re inconsistent, you effectively teach your model conflicting lessons. Analyses of speech‑dataset quality point out that noisy labels, inconsistent guidelines, and weak dataset documentation can dramatically reduce model reliability.
Common problems:
We tackle this by:
This echoes external advice that stresses tight guidelines and validation rounds to keep multilingual audio datasets consistent and usable over time.
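One simple way to monitor annotation consistency during those validation rounds is to double-label a sample of clips and compute chance-corrected agreement. A minimal sketch of Cohen's kappa for two annotators (the labels and data are toy examples; projects with more than two annotators often use Krippendorff's alpha instead):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled at random
    # with their own observed label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Two annotators labelling the same 8 clips as speech / noise / music.
ann_a = ["speech", "speech", "noise", "music", "speech", "noise", "music", "speech"]
ann_b = ["speech", "speech", "noise", "music", "noise", "noise", "music", "speech"]
print(f"kappa = {cohens_kappa(ann_a, ann_b):.2f}")
```

Tracking kappa per language and per annotation round makes guideline drift visible early, instead of surfacing months later as noisy training labels.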
Finally, you can’t ignore the ethical and legal side. Voice is personal, and more work is now focusing on how to audit audio datasets for consent, licensing, and fairness. If you don’t know where data came from or under what terms, you’re carrying a hidden liability.
External guidance on multilingual audio datasets and speech data ethics highlights a few key expectations:
This is why we design custom speech data projects with consent, licensing, and metadata baked in, so you’re not relying on opaque legacy corpora that may be hard to defend later.
Check out our broader speech data strategy playbook.
Project Overview
Andovar collaborated with a major Asian automotive manufacturer to build a 100,000+ hour speech dataset for in-car voice controls across 15 languages, including Japanese, Mandarin, Thai, and European dialects. The project tackled real-world hurdles head-on, delivering a robust, ethical dataset for hands-free navigation, climate control, and entertainment systems.
Challenges Addressed
Results and Impact
The voice AI achieved 92% recognition accuracy in noisy cabins (vs. 70% baseline), reduced user frustration by 40%, and passed automotive safety certifications. Production rollout spanned 5 million vehicles without compliance issues.
The most common mistake is assuming that “more hours” automatically means “better.” In reality, misaligned or biased speech data—for example, over‑representing a single accent—can lead to models that perform up to tens of percentage points worse for under‑represented users, as bias analyses in speech systems have shown.
Start by looking at demographics and environments: languages, accents, age, gender, and recording conditions. Then evaluate your model separately on each group, following approaches recommended in bias‑assessment work for voice assistants and speech recognition. If performance gaps are large and consistent, you likely have bias in your data or system design.
Augmentation (adding noise, changing speed, etc.) is useful, but it’s not a substitute for collecting diverse, high‑quality custom speech data in the languages, accents, and environments that matter to you. It can’t fully simulate the range of human variation or fix missing demographics.
Follow a structured process: define goals, design your data structure and metadata, recruit local speakers, and run centralised QC. Providers like Andovar can handle contributor sourcing, studio and in‑the‑wild recording, and QC across languages through our multilingual voice data collection services.
You don’t have to throw it away. We can audit what you have, improve annotation and metadata via our multilingual data annotation services, and then design custom speech data projects to fill coverage gaps. This matches the data‑centric AI trend: improving existing datasets can be as impactful as collecting new ones.
About the Author: Steven Bussey
A Fusion of Expertise and Passion: Born and raised in the UK, Steven has spent the past 24 years immersing himself in the vibrant culture of Bangkok. As a marketing specialist with a focus on language services, translation, localization, and multilingual AI data training, Steven brings a unique blend of skills and insights to the table. His expertise extends to marketing tech stacks, digital marketing strategy, and email marketing, positioning him as a versatile and forward-thinking professional in his field.