Perplexity AI has moved part of its model deployment onto Nvidia's latest GB200 NVL72 hardware. The transition brings a marked improvement in measured performance, especially for the company's post-trained Qwen3 235B mixture-of-experts (MoE) models. These gains improve operational efficiency and raise expectations for how fully AI hardware can be utilized.
Performance Enhancements with GB200
Each GB200 NVL72 rack houses 72 GPUs equipped with high-bandwidth memory. Each GPU has 180 GB of memory, and all 72 are interconnected in a single NVLink domain that provides 1,800 GB/s of bandwidth per GPU. This setup supports high-throughput inference on large models, not just training.
Perplexity's technical research reports a significant reduction in latency for key operations. Latency for NVLink all-reduce operations fell from 586.1 microseconds on the previous Hopper generation to 313.3 microseconds on the GB200, a reduction of about 47%. The time required for MoE prefill combine operations dropped from 730.1 microseconds to 438.5 microseconds, an improvement of about 40%. These figures show how the new hardware handles the communication-heavy steps of large-model inference.
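The percentages above follow directly from the reported latencies. As a minimal sketch of that arithmetic, the snippet below recomputes the relative reduction and the equivalent speedup factor from the four figures in the article. The function names and the dictionary layout are illustrative choices, not anything from Perplexity's report.

```python
def percent_reduction(before_us: float, after_us: float) -> float:
    """Latency reduction expressed as a percentage of the original latency."""
    return (before_us - after_us) / before_us * 100.0


def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the new measurement is than the old one."""
    return before_us / after_us


# Latencies in microseconds as reported in the article: (Hopper, GB200).
measurements = {
    "NVLink all-reduce": (586.1, 313.3),
    "MoE prefill combine": (730.1, 438.5),
}

for name, (hopper_us, gb200_us) in measurements.items():
    reduction = percent_reduction(hopper_us, gb200_us)
    factor = speedup(hopper_us, gb200_us)
    print(f"{name}: {reduction:.1f}% lower latency ({factor:.2f}x faster)")
```

Run as written, this gives reductions of roughly 46.5% and 39.9%, in line with the rounded figures quoted above. Note that a 46.5% drop in latency corresponds to a speedup of a little under 1.9x, not 2x.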

Cost Efficiency and Real-Time Inference
Even more striking is the reported ability to run real-time inference up to 30 times faster than H100 baselines in certain configurations. A gain of that size lowers the cost of serving each request, making it economically viable to deploy large MoE models at scale.
These gains come from combining the new hardware with software optimizations. By implementing Blackwell-native quantization, Perplexity reduces model weight precision without compromising output quality. Prefill/decode disaggregation, which separates prompt processing from token generation, lets each phase run on resources suited to it. Custom kernels written for the 235-billion-parameter MoE model tune the remaining steps of the pipeline.
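To see why lower weight precision matters for a model of this size, here is a rough back-of-the-envelope sketch. It uses the 235-billion-parameter count and the 180 GB per-GPU memory from the article. The bit widths are assumptions chosen for illustration, since the article does not say which numeric format Perplexity uses, and the estimate ignores quantization scale factors, the KV cache, and activations.

```python
import math

PARAMS = 235e9        # total parameter count, from the article
GPU_MEMORY_GB = 180   # memory per GPU, from the article

# Assumed storage widths in bits per weight (illustrative only).
BITS_PER_WEIGHT = {"16-bit": 16, "8-bit": 8, "4-bit": 4}


def weight_footprint_gb(params: float, bits: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


def min_gpus_for_weights(footprint_gb: float,
                         gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest GPU count whose combined memory can hold the weights."""
    return math.ceil(footprint_gb / gpu_memory_gb)


for label, bits in BITS_PER_WEIGHT.items():
    footprint = weight_footprint_gb(PARAMS, bits)
    gpus = min_gpus_for_weights(footprint)
    print(f"{label}: {footprint:.1f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions the weights alone take 470 GB at 16 bits, 235 GB at 8 bits, and 117.5 GB at 4 bits, so halving the precision halves the memory that must be stored and moved. Real deployments need considerably more headroom than this lower bound suggests.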
Implications for the AI Hardware Landscape
The effects of these advancements extend beyond Perplexity AI. As the AI hardware race heats up, the GB200 NVL72 setup points to a shift in how large-scale AI models will be deployed. The ability to sharply lower inference costs while preserving output quality sets a new benchmark for both hardware manufacturers and AI developers. With these developments, Nvidia strengthens its position as a key player in the AI infrastructure market, as companies look for better ways to serve complex models.
As AI applications continue to expand, the demand for powerful, efficient hardware solutions like the GB200 is expected to rise. Perplexity's successful integration of this technology not only demonstrates the potential of Nvidia's advancements but also marks a crucial moment in the ongoing evolution of AI infrastructure.