NVIDIA Debuts Nemotron 3 Nano Omni, An Open Multimodal Model for Vision, Audio and Language

New 30B-A3B hybrid model combines perception encoders to accelerate agent workflows and is released with open weights and training artifacts

By Jordan Park

NVIDIA introduced Nemotron 3 Nano Omni on Tuesday, a unified multimodal model that places vision and audio encoders inside a 30B-A3B hybrid mixture-of-experts architecture. The model accepts text, images, audio, video, documents, charts and graphical interfaces as inputs and produces text outputs. NVIDIA is highlighting its 256K context window, its use of Conv3D and EVS technologies, top leaderboard placements, and early adoption and evaluation by enterprise customers.

Key Points

Nemotron 3 Nano Omni is an open multimodal model that combines vision, audio and language in a single 30B-A3B hybrid mixture-of-experts architecture, removing the need for separate perception models.
NVIDIA claims the model achieves up to 9x higher throughput than comparable open omni models; it supports a 256K context window and uses Conv3D and EVS technologies.
The model is already being adopted or evaluated by a range of companies across AI, enterprise software and hardware sectors, and it is available via Hugging Face, OpenRouter, build.nvidia.com, NVIDIA Cloud Partners and cloud providers.

NVIDIA (NASDAQ:NVDA) on Tuesday unveiled Nemotron 3 Nano Omni, an open multimodal AI model that integrates vision, audio and language processing into a single system intended to power interactive AI agents.

Rather than relying on separate perception models, Nemotron 3 Nano Omni incorporates both vision and audio encoders within a 30B-A3B hybrid mixture-of-experts architecture. NVIDIA says this combined design can deliver up to 9x greater throughput compared with other open omni models that offer similar levels of interactivity.
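
For readers unfamiliar with the "30B-A3B" label, the short Python sketch below shows one way to interpret it. It assumes the common reading of such labels, roughly 30 billion total parameters with about 3 billion active per token; the announcement as summarized here does not spell that out, so treat the reading and the helper names (parse_size_label, active_fraction) as illustrative assumptions.

```python
from typing import Tuple


def parse_size_label(label: str) -> Tuple[float, float]:
    """Split a label such as '30B-A3B' into (total, active) billions.

    Assumes the '<total>B-A<active>B' naming pattern; this is an
    illustrative convention, not a format documented in the article.
    """
    parts = label.upper().split("-A")
    if len(parts) != 2:
        raise ValueError(f"unexpected label format: {label!r}")
    total = float(parts[0].rstrip("B"))
    active = float(parts[1].rstrip("B"))
    return total, active


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a sparse mixture-of-experts model."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


total, active = parse_size_label("30B-A3B")
print(f"{active_fraction(total, active):.0%} of parameters active per token")
```

Under that assumption, only about a tenth of the weights participate in each token's computation, which is the general mechanism by which mixture-of-experts designs trade total capacity against per-token cost.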

The model accepts a wide range of inputs - including text, still images, audio clips, video, documents, charts and graphical user interfaces - and produces text as its output. Nemotron 3 Nano Omni supports a 256K context window and implements Conv3D and EVS technologies as part of its architecture.

According to NVIDIA, the model has reached the top positions on six leaderboards that measure document intelligence as well as video and audio understanding.

Adoption and evaluation

NVIDIA listed a number of companies that have begun adopting the model, including Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir and Pyler. Several other firms are evaluating the model, among them Dell Technologies, DocuSign, Infosys, K-Dense, Lila, Oracle and Zefr.

Gautier Cloix, CEO of H Company, said the model allows agents to quickly interpret full HD screen recordings. In preliminary tests on the OSWorld benchmark, H Company’s computer-use agent powered by Nemotron 3 Nano Omni performed visual reasoning at a native input resolution of 1920×1080 pixels.

Workflows, compatibility and openness

NVIDIA designed Nemotron 3 Nano Omni to function alongside other models in the Nemotron 3 family, such as Nemotron 3 Super and Nemotron 3 Ultra, and to interoperate with proprietary models from other providers. The company highlights the model’s suitability for agentic workflows that include computer use automation, document intelligence tasks and audio-video reasoning.

The release includes open weights, datasets and descriptions of training techniques. Organizations that need to customize the model can do so with NVIDIA NeMo and then deploy the result in environments that satisfy regulatory or data localization requirements, NVIDIA said.

Availability and distribution

Nemotron 3 Nano Omni was made available on Tuesday through several channels: Hugging Face, OpenRouter and build.nvidia.com as an NVIDIA NIM microservice. It is also accessible via NVIDIA Cloud Partners, a range of inference platforms and cloud service providers.

NVIDIA also reported that the broader Nemotron 3 family has surpassed 50 million downloads over the past year.

This article presents the technical and market details released by NVIDIA on the Nemotron 3 Nano Omni model, including architecture, supported input types, partner adoption, benchmark outcomes and distribution methods.

Risks

Integration uncertainty - Organizations evaluating or adopting the model may face implementation and integration challenges within enterprise IT and inference platforms, affecting software and cloud service providers.
Regulatory and data localization constraints - Deployments requiring compliance with regulatory or localization mandates may limit where and how the open model can be used, impacting industries with strict data rules.
Performance claims versus alternatives - NVIDIA’s throughput and leaderboard assertions reflect comparative performance; customers will need to validate those claims in their own benchmarks and workloads, particularly in document, audio and video reasoning use cases.
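
As a rough illustration of the validation step in the last risk item, a team might time its own workload against each model and compare tokens per second. The Python sketch below is a minimal harness under stated assumptions: generate is a hypothetical callable that wraps whatever inference endpoint is being tested and returns the output tokens for one prompt. It is not an NVIDIA API, and the stub used here only splits the prompt into words.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    generate: Callable[[str], Sequence[str]],
    prompts: Sequence[str],
) -> float:
    """Return output tokens per second across all prompts.

    `generate` is a hypothetical wrapper around the model under test.
    """
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens / elapsed


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


if __name__ == "__main__":
    # Stand-in for a real model call: echoes the prompt's words as "tokens".
    stub = lambda prompt: prompt.split()
    tps = measure_throughput(stub, ["summarize this chart", "transcribe this clip"])
    print(f"stub throughput: {tps:.0f} tokens/s")
```

A real comparison would also hold batch size, hardware, input modality and output length constant across models, since a headline multiple such as 9x depends heavily on those conditions.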
