NVIDIA is breaking down language barriers by releasing a powerful new set of open-source tools designed to democratize speech AI for 25 European languages. The world of AI language models is vast, yet it operates in a surprisingly small echo chamber. Of the world’s approximately 7,000 languages, only a tiny fraction are supported, leaving billions of people on the sidelines of the digital revolution.

This is a foundational leap that will enable developers to build high-quality, production-ready speech recognition and translation AI for languages with limited available data, such as Croatian, Estonian, and Maltese. These tools support scalable applications, from multilingual chatbots and customer-service voice agents to near-real-time translation services.
The heart of this new initiative is a massive, open-source dataset called Granary. This corpus contains around a million hours of audio, with nearly 650,000 hours dedicated to speech recognition and over 350,000 hours for speech translation.
But a dataset is only as good as the models it trains. That’s why NVIDIA has also released two groundbreaking new AI models:
- NVIDIA Canary-1b-v2: A billion-parameter model optimized for high-quality transcription and translation. Trained on Granary, it supports two dozen European languages in addition to English.
- NVIDIA Parakeet-tdt-0.6b-v3: A more streamlined 600-million-parameter model designed for high-speed, real-time applications.
Both the Granary dataset and the two models are available on Hugging Face.
The paper behind Granary will be presented at the prestigious Interspeech conference in the Netherlands, providing a deep dive into the research behind the dataset.
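For developers eager to start building, the checkpoints can be pulled straight from Hugging Face. The snippet below is a minimal sketch, assuming both models follow NeMo's usual `from_pretrained` pattern and that the repository IDs match the model names above; the translation keyword arguments are likewise assumptions carried over from earlier Canary releases, so check the model cards before relying on them.

```python
# Minimal sketch: load the released checkpoints through the NeMo toolkit.
# Model IDs and translation kwargs are assumptions; confirm them on the model cards.
import nemo.collections.asr as nemo_asr

# Parakeet: the streamlined 600M-parameter model for fast transcription.
parakeet = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
print(parakeet.transcribe(["meeting.wav"]))

# Canary: the 1B-parameter multitask model for transcription and translation.
canary = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")
print(canary.transcribe(["speech_hr.wav"], source_lang="hr", target_lang="en"))
```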
But what makes this release truly remarkable is the process behind it. Training speech AI requires immense amounts of labeled data, and producing it is typically slow, expensive, and tedious. To overcome this, NVIDIA’s speech AI team, collaborating with researchers from Carnegie Mellon University and Fondazione Bruno Kessler, developed an automated pipeline. This innovative process uses their NeMo toolkit to transform raw, unlabelled audio into high-quality, structured data that an AI can learn from, without the need for extensive human annotation.
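The details of that pipeline are in the Granary paper; the sketch below only illustrates the general idea in pseudo-labeling terms: transcribe unlabeled audio with an existing model, filter out low-quality output, and write NeMo-style JSON-line manifest entries that downstream training can consume. The seed model, file paths, and crude length filter are placeholders, not the team's actual recipe.

```python
# Illustrative pseudo-labeling loop, not the Granary pipeline itself: raw,
# unlabeled audio in, structured training records out, with no human annotation.
import glob
import json

import librosa
import nemo.collections.asr as nemo_asr

# Placeholder seed model used to generate pseudo-labels.
seed_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

with open("pseudo_labeled_manifest.jsonl", "w", encoding="utf-8") as manifest:
    for path in sorted(glob.glob("raw_audio/*.wav")):
        duration = librosa.get_duration(path=path)
        hyp = seed_model.transcribe([path])[0]
        text = str(getattr(hyp, "text", hyp))  # newer NeMo returns Hypothesis objects

        # Crude quality gate: skip clips whose transcript looks implausibly short.
        if len(text.split()) < 3:
            continue

        # One NeMo-style manifest entry (JSON object) per line.
        record = {"audio_filepath": path, "duration": duration, "text": text}
        manifest.write(json.dumps(record, ensure_ascii=False) + "\n")
```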
This is more than just a technical leap; it’s a huge step toward digital inclusivity. A developer in Riga or Zagreb can now build effective voice-powered AI tools for their local languages more efficiently than ever before. In fact, the research team found that Granary data is so effective that models need roughly half as much of it to reach a target accuracy level as they do with other popular datasets.
The new models prove this power. Canary rivals the quality of models three times its size while running up to ten times faster. Parakeet, meanwhile, can process a 24-minute meeting in a single pass, automatically identifying the languages spoken. Both models handle punctuation and capitalization and provide word-level timestamps, making them ready for professional-grade applications.
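Those timestamps are exposed directly at inference time. The call below is a hedged sketch based on how earlier Parakeet-TDT model cards surface timestamps through a `timestamps=True` flag; the exact output layout for this release should be verified against its model card.

```python
# Hedged sketch: request word-level timestamps alongside the punctuated transcript.
# The flag and the hypothesis layout mirror earlier Parakeet-TDT releases.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
hyp = model.transcribe(["meeting_24min.wav"], timestamps=True)[0]

print(hyp.text)  # punctuated, capitalized transcript
for word in hyp.timestamp["word"]:  # per-word start/end offsets in seconds
    print(word["word"], word["start"], word["end"])
```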
By open-sourcing the data, the models, and the methodology behind them, NVIDIA isn’t just releasing a product. They are empowering the global developer community to accelerate innovation, paving the way for a truly multilingual digital world where AI can speak your language, no matter where you’re from.