NVIDIA’s Nemotron 3 Nano Omni Trades Accuracy for Multimodal Throughput
Károly Zsolnai-Fehér’s account of NVIDIA’s Nemotron 3 Nano Omni argues that the 30-billion-parameter open multimodal model is notable less for leading general intelligence benchmarks than for processing long video, audio, images, and documents quickly and cheaply. The reported advantage comes from compression across the system (Mamba layers, audio tokenization, aspect-ratio-preserving vision handling, distilled encoders, and efficient video sampling), which reduces the amount of material sent into the language-model backbone.

Nemotron 3 Nano Omni is optimized for multimodal throughput, not general supremacy
Károly Zsolnai-Fehér frames NVIDIA’s Nemotron 3 Nano Omni as an open, free 30-billion-parameter multimodal model whose point is not that it beats every other open model on intelligence benchmarks. The value claim is narrower and more operational: it processes images, video, audio, and documents unusually quickly and cheaply.
The headline number is video throughput. The model can process almost 10 hours of video per hour — “nearly 10 times real-time,” in Zsolnai-Fehér’s phrasing. NVIDIA’s benchmark highlights the FP8 version of Nemotron 3 Nano Omni at 9.91 hours of video processed per hour, $0.88 per hour of video, and an F1 macro score of 0.310. The narration says this is almost three times faster than Qwen Omni; the benchmark table lists Qwen3 Omni at 3.91 hours of video per hour and $1.13 per hour of video. Document processing, Zsolnai-Fehér adds, can be up to seven times faster.
The comparison places Nemotron’s advantage in speed and cost rather than top quality. GPT 4.1 has a higher F1 macro score, 0.395, but lower throughput and higher cost per hour of video. Qwen3 Omni has the same F1 macro score as Nemotron, 0.310, but lower throughput and higher cost per hour.
| Model | Open/proprietary | F1 macro | Throughput (hours of video per hour) | Cost per hour of video |
|---|---|---|---|---|
| Amazon Nova 2 | Proprietary | 0.377 | 0.68 | $5.88 |
| Gemini 2.5 Pro | Proprietary | 0.345 | 2.45 | $2.13 |
| Gemini 3.0 Pro | Proprietary | 0.340 | 1.52 | $3.00 |
| GPT 4.1 | Proprietary | 0.395 | 1.92 | $2.04 |
| NVIDIA Nemotron 3 Nano Omni FP8 | Open | 0.310 | 9.91 | $0.88 |
| Qwen3 Omni | Open | 0.310 | 3.91 | $1.13 |
The quality-versus-cost chart makes the same tradeoff visible: bubble size represents throughput, while the highlighted Nemotron 3 Nano Omni FP8 point sits at 0.310 F1 macro, $0.88 per hour of video, and 9.91 hours processed per hour. The practical claim is not that it dominates every model on accuracy. It is that it occupies a useful position for high-volume multimodal work.
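To make that position concrete, here is a back-of-the-envelope sketch using only the figures from the table above; the 100-hour workload is a hypothetical, and the comparison models are a subset chosen for illustration.

```python
# Back-of-the-envelope comparison using the reported benchmark figures.
# throughput = hours of video processed per wall-clock hour,
# cost = dollars per hour of input video.
models = {
    "GPT 4.1":                  {"throughput": 1.92, "cost": 2.04},
    "Qwen3 Omni":               {"throughput": 3.91, "cost": 1.13},
    "Nemotron 3 Nano Omni FP8": {"throughput": 9.91, "cost": 0.88},
}

video_hours = 100  # hypothetical workload: 100 hours of footage

for name, m in models.items():
    wall_clock = video_hours / m["throughput"]   # hours to finish the batch
    dollars = video_hours * m["cost"]            # total processing cost
    print(f"{name:26s} {wall_clock:6.1f} h  ${dollars:7.2f}")

# Nemotron 3 Nano Omni FP8: ~10.1 h and $88.00 for the batch, versus
# ~52.1 h and $204.00 for GPT 4.1 under the same assumptions.
```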
That speed does not make the model a phone-class system. Running it locally requires “something like” a strong desktop GPU, with about 25 GB of video memory. A technical excerpt from NVIDIA lists a model-weight footprint of 20.9 GB for a quantized version, versus a 61.5 GB BF16 reference, with additional headroom needed for the KV cache. The practical conclusion is that this is a local or cloud GPU model, not something to expect on a handset.
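A rough sizing sketch makes the hardware point explicit. The weight footprints below are from the NVIDIA excerpt; the 4 GB of KV-cache headroom is an illustrative assumption, since the excerpt gives no formula for it.

```python
# Rough local-deployment sizing sketch. Only the weight footprints are
# from the source; the KV-cache headroom is an illustrative assumption.
WEIGHTS_QUANTIZED_GB = 20.9   # quantized checkpoint (from NVIDIA excerpt)
WEIGHTS_BF16_GB = 61.5        # BF16 reference (from NVIDIA excerpt)

def fits_on_gpu(vram_gb: float, weights_gb: float, kv_cache_gb: float) -> bool:
    """Crude check: weights plus KV cache must fit in available VRAM."""
    return weights_gb + kv_cache_gb <= vram_gb

# With ~4 GB of assumed KV-cache headroom, the quantized model lands
# right around the ~25 GB figure quoted for a strong desktop GPU.
print(fits_on_gpu(25.0, WEIGHTS_QUANTIZED_GB, 4.0))   # True  (24.9 <= 25.0)
print(fits_on_gpu(25.0, WEIGHTS_BF16_GB, 4.0))        # False (BF16 needs far more)
```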
The unusual speed comes from compression at every multimodal boundary
The performance explanation has five parts: Mamba layers, audio tokenization without a separate speech recognizer, aspect-ratio-preserving image handling, a distilled vision encoder, and efficient video sampling. The common theme is that Nemotron cuts redundant or expensive representations before they become large language model context.
The first point is architectural. Mamba layers scale linearly with context length rather than quadratically. In Károly Zsolnai-Fehér’s explanation, the advantage grows as inputs get longer: more documents, longer videos, and longer audio all make the scaling difference more valuable. For online systems processing large multimodal inputs at scale, that is where the model becomes “incredible.”
> “The more documents you have, the longer video or audio you have, the bigger the advantage this one has.”
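A minimal sketch of the scaling argument, using arbitrary unit costs rather than Mamba’s actual kernels: attention-style layers pay a cost that grows with the square of the context length, while state-space layers like Mamba pay a cost proportional to it.

```python
# Asymptotic-cost sketch: how per-layer work grows with context length.
# This illustrates scaling behavior only; the unit costs are arbitrary.
def attention_cost(n_tokens: int) -> int:
    return n_tokens ** 2        # quadratic: every token attends to every token

def mamba_cost(n_tokens: int) -> int:
    return n_tokens             # linear: a recurrent state scan over the sequence

for n in (10_000, 100_000, 1_000_000):
    ratio = attention_cost(n) / mamba_cost(n)
    print(f"{n:>9,} tokens -> attention/Mamba cost ratio: {ratio:>11,.0f}x")

# The gap widens with input length, which is exactly the long-video,
# long-audio, many-document regime the model targets.
```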
The second point is audio. In many multimodal systems, spoken input passes through a separate speech-recognition model before reaching the language model. Zsolnai-Fehér says that path is often large and expensive, and can strip away emotion and tone. Nemotron instead uses an audio adaptor that converts audio directly into tokens; in his telling, the design keeps more of the relevant audio signal while avoiding the cost of running a whole separate model such as Whisper on top. An NVIDIA technical excerpt says the system yields approximately 12.5 tokens per second of audio, or about 80 milliseconds per token, and segments audio streams into 30-second clips corresponding to roughly 375 tokens per clip.
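Those rates are internally consistent, and a small sketch makes the token budget for a given recording explicit; only the rates are taken from the excerpt, and the helper function itself is illustrative.

```python
import math

# Audio token budget, using the rates from the NVIDIA excerpt:
# ~12.5 tokens per second of audio (~80 ms per token), 30-second clips.
TOKENS_PER_SECOND = 12.5
CLIP_SECONDS = 30
TOKENS_PER_CLIP = TOKENS_PER_SECOND * CLIP_SECONDS   # 375 tokens per clip

def audio_token_budget(duration_seconds: float) -> tuple[int, int]:
    """Illustrative helper: clip count and total tokens for a recording."""
    clips = math.ceil(duration_seconds / CLIP_SECONDS)
    tokens = round(duration_seconds * TOKENS_PER_SECOND)
    return clips, tokens

# A one-hour recording: 120 clips and 45,000 tokens of audio context.
print(audio_token_budget(3600))  # (120, 45000)
```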
The third point is video representation. Zsolnai-Fehér contrasts older techniques that force images into a fixed aspect ratio with Nemotron’s approach, which preserves the aspect ratio. The system diagram labels the visual pipeline with dynamic resolution and full-image handling. For video, he highlights 3D convolution: instead of processing each frame in isolation, the model looks at packages of frames together. That lets it compress temporal information before sending it onward.
The architecture diagram for Nemotron 3 Nano Omni makes these compression points concrete. Audio enters through a Parakeet audio encoder and audio adaptor; video frames and images pass through efficient video sampling, a C-RadioV4-H vision encoder, 3D convolution, and a vision adaptor; text enters through a tokenizer. All of those streams feed the Nemotron 3 Nano 30B-A3B LLM backbone. The relevant design pattern is not simply that the model accepts multiple modalities, but that each modality is encoded, sampled, or adapted before it reaches the backbone.
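A minimal PyTorch sketch shows the shape of the 3D-convolution idea; the kernel size, stride, and channel counts here are illustrative assumptions, not NVIDIA’s published configuration. A temporal stride of two merges pairs of frames into each output step before the tokens reach the adaptor.

```python
import torch
import torch.nn as nn

# Illustrative Conv3D temporal compression. Kernel/stride/channel choices
# are assumptions for this sketch, not Nemotron's published configuration.
# Input: (batch, channels, frames, height, width) feature maps from the
# vision encoder; a temporal stride of 2 halves the frame axis.
conv3d = nn.Conv3d(
    in_channels=256, out_channels=256,
    kernel_size=(2, 3, 3),      # look at 2 frames x 3x3 spatial window
    stride=(2, 1, 1),           # step 2 frames at a time: temporal downsample
    padding=(0, 1, 1),
)

features = torch.randn(1, 256, 512, 16, 16)   # 512 frames of 16x16 features
compressed = conv3d(features)
print(features.shape, "->", compressed.shape)
# torch.Size([1, 256, 512, 16, 16]) -> torch.Size([1, 256, 256, 16, 16])
# Half as many temporal steps means roughly half as many visual tokens
# reach the LLM backbone, in line with the ~47% reduction reported below.
```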
A technical excerpt gives the scale of the video-token reduction: a 512-frame video produces about 141,000 input tokens without either mechanism; with Conv3D enabled, it drops to about 75,000 tokens, a 47% reduction; with Conv3D combined with efficient video sampling at q = 0.5, it drops further to about 42,000 tokens, a 70% reduction versus baseline.
| 512-frame video processing setup | Approximate LLM input tokens | Reduction versus baseline |
|---|---|---|
| Without Conv3D or efficient video sampling | ~141k | Baseline |
| With Conv3D | ~75k | −47% |
| With Conv3D and efficient video sampling at q = 0.5 | ~42k | −70% |
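A short check reproduces the reported reductions from the excerpt’s raw token counts:

```python
# Checking the reported reductions for a 512-frame video.
baseline = 141_000
with_conv3d = 75_000
with_conv3d_and_evs = 42_000

for label, tokens in [("Conv3D", with_conv3d),
                      ("Conv3D + EVS (q = 0.5)", with_conv3d_and_evs)]:
    reduction = 1 - tokens / baseline
    print(f"{label:24s} {tokens:>7,} tokens  ({reduction:.0%} reduction)")
# Conv3D                    75,000 tokens  (47% reduction)
# Conv3D + EVS (q = 0.5)    42,000 tokens  (70% reduction)
```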
The fourth point is the vision encoder. Zsolnai-Fehér says one might expect a large standalone CLIP-style model for matching images and text. Instead, NVIDIA distills three models into one smaller encoder: one for matching images to text, one for fine details, and one for object segmentation. The RADIO diagram describes a vision foundation model distilled from DINOv2, SAM, and CLIP and producing general-purpose feature maps for pixel-level tasks, text grounding, and semantic segmentation.
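In outline, multi-teacher feature distillation is simple to state. The sketch below shows the general pattern, not RADIO’s actual training recipe; the per-teacher projection heads and plain mean-squared-error matching are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Multi-teacher feature distillation in outline: one student backbone,
# one projection head per teacher, and a summed feature-matching loss.
# This is the general pattern, not RADIO's published training recipe.
class DistilledEncoder(nn.Module):
    def __init__(self, student: nn.Module, feat_dim: int,
                 teacher_dims: dict[str, int]):
        super().__init__()
        self.student = student
        # One linear head per teacher (e.g. CLIP, DINOv2, SAM) maps the
        # shared student features into each teacher's feature space.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, dim) for name, dim in teacher_dims.items()}
        )

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        feats = self.student(images)
        return {name: head(feats) for name, head in self.heads.items()}

def distillation_loss(student_out: dict[str, torch.Tensor],
                      teacher_out: dict[str, torch.Tensor]) -> torch.Tensor:
    # Matching every teacher at once trains one encoder to stand in
    # for all three specialists.
    return sum(nn.functional.mse_loss(student_out[k], teacher_out[k])
               for k in teacher_out)
```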
The fifth point is efficient video sampling. Even after compressing blocks of frames, a long video still contains many frames that are largely redundant. Zsolnai-Fehér’s example is background information shared across frames. The described sampling step discards duplicate information so the model spends less context and compute on visual material that adds little.
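The source does not spell out the sampling rule, only the effect: discard near-duplicate visual information while keeping a fraction q of tokens. One way to sketch that effect, with the cosine-similarity criterion and frame-to-frame comparison as assumptions for illustration:

```python
import torch

# Illustrative redundant-token pruning in the spirit of efficient video
# sampling. The cosine-similarity rule and frame-to-frame comparison are
# assumptions; the source only describes discarding duplicate information.
def prune_redundant_tokens(frames: torch.Tensor, q: float = 0.5) -> list[torch.Tensor]:
    """frames: (num_frames, num_tokens, dim). Keep all of frame 0; for each
    later frame keep the fraction q of tokens least similar to the matching
    token in the previous frame (static background scores as redundant)."""
    kept = [frames[0]]
    keep_n = max(1, int(frames.shape[1] * q))
    for prev, cur in zip(frames[:-1], frames[1:]):
        sim = torch.nn.functional.cosine_similarity(cur, prev, dim=-1)
        novel = sim.argsort()[:keep_n]     # lowest similarity = most novel
        kept.append(cur[novel])
    return kept

video = torch.randn(512, 256, 768)          # 512 frames x 256 tokens each
pruned = prune_redundant_tokens(video, q=0.5)
total = sum(t.shape[0] for t in pruned)
print(f"{512 * 256:,} tokens -> {total:,} tokens")  # 131,072 -> 65,664
```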
The license is usable, but not Apache 2.0
Károly Zsolnai-Fehér explicitly separates openness from permissiveness. His reference point is Apache 2.0, which he calls highly permissive. NVIDIA’s model does not use Apache 2.0; the terms say use is governed by the NVIDIA Open Model Agreement. His initial reaction is cautious: “That’s usually not great news.”
After looking at the terms, however, his assessment is more favorable than that first reaction. The license text says the works are commercially usable, users are free to create and distribute derivative works, and NVIDIA does not claim ownership of outputs generated using the works or derivative works. Commercial use and derivative works, in his reading, are fine.
The limits are that the license requires some attribution and is stricter about patent grants than Apache 2.0. He rates it “7 out of 10” if Apache 2.0 is a 10.
> “If Apache 2.0 were a 10 out of 10, this is a 7 out of 10 in my opinion.”
That matters because the model is presented as open and runnable by users, not merely accessible through a vendor API. But Zsolnai-Fehér does not treat “open” as a binary property. The licensing terms are part of the practical evaluation, alongside performance and hardware requirements.
For text-only reasoning and coding, he would look elsewhere
The model’s weakness is the mirror image of its specialization. Károly Zsolnai-Fehér says that if the task is pure text reasoning or pure coding, he would “probably look elsewhere.” Nemotron 3 Nano Omni is not, in his wording, “the number one smartest open model.”
The text-only benchmark table supports that caution. It compares Nemotron 3 Nano Omni with the text-only Nemotron 3 Nano 30B-A3B and Qwen3-Omni across selected benchmarks; the two Nemotron variants are reproduced below. The multimodal model is close to the text-only Nemotron on some measures, but lower on several important ones: MMLU-Pro is 77.3 versus 78.3; GPQA without tools is 72.2 versus 73.0; LiveCodeBench is 63.2 versus 68.3; AIME25 without tools is 82.1 versus 89.1; SciCode is 32.0 versus 33.3. It is higher on IFBench prompt, AA-LCR, and TauBench V2 Telecom.
| Benchmark | Nemotron 3 Nano Omni | Nemotron 3 Nano 30B-A3B text-only |
|---|---|---|
| MMLU-Pro | 77.3 | 78.3 |
| GPQA (no tools) | 72.2 | 73.0 |
| LiveCodeBench | 63.2 | 68.3 |
| AIME25 (no tools) | 82.1 | 89.1 |
| IFBench (prompt) | 74.2 | 71.5 |
| AA-LCR | 41.0 | 35.9 |
| TauBench V2 (Telecom) | 42.7 | 42.2 |
| SciCode | 32.0 | 33.3 |
A document excerpt states the design goal directly: the Omni model aims to maintain the text benchmarks of the LLM while adding vision and audio understanding capabilities. Zsolnai-Fehér’s evaluation is that the result should be chosen for multimodal workloads, especially audio and video that must be processed quickly and cheaply, not for cases where the best available text-only or coding model is the right tool.
His broader point is that open models are beginning to specialize. He links Nemotron to a pattern in which free and open systems can be owned and run by users, while becoming strong in different directions rather than converging on one universal ranking. For Nemotron 3 Nano Omni, the direction is clear: high-throughput multimodal processing under practical cost constraints.


