AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation

Lowering the cost of inference typically takes a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token.

The dramatic cost reductions were achieved using Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users.

The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and a switch from proprietary models to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed-source APIs that charge premium rates.

The economics are counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure, because higher throughput spreads the same hardware cost across more tokens and so translates directly into lower per-token costs.

“Performance is what drives down the cost of inference,” Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. “What we’re seeing in inference is that throughput literally translates into real dollar value and driving down the cost.”
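The relationship Harris describes can be sketched with simple arithmetic. The following is a minimal illustration, not a figure from the analysis: the function name, the hourly GPU prices and the throughput numbers are all hypothetical, chosen only to show how a more expensive but faster system can still produce cheaper tokens.

```python
def cost_per_million_tokens(gpu_hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million tokens on one GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs (not from the analysis): the second GPU costs twice
# as much per hour but serves four times as many tokens per second.
baseline = cost_per_million_tokens(gpu_hourly_cost_usd=2.00, tokens_per_second=1_000)
faster = cost_per_million_tokens(gpu_hourly_cost_usd=4.00, tokens_per_second=4_000)

print(f"baseline: ${baseline:.3f} per million tokens")
print(f"faster:   ${faster:.3f} per million tokens")
print(f"cost reduction factor: {baseline / faster:.1f}x")
```

Under these assumed inputs, doubling the hourly price while quadrupling throughput halves the cost per token, which is the sense in which performance drives cost down.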

Production deployments show 4x to 10x cost reductions

Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.

Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times 65% by switching from proprietary models to open-source models running on Baseten’s Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.

Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra’s Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia’s previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell’s native NVFP4 low-precision format. Hardware alone delivered a 2x improvement, but reaching 4x required the precision format change.
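The two ways these results are quoted, as a multiple ("4x") or as a percentage ("90%"), are interchangeable. The short sketch below converts between them using the Latitude prices reported above; the helper names are illustrative only.

```python
def reduction_factor(old_cost: float, new_cost: float) -> float:
    """How many times cheaper the new cost is (2.0 means '2x')."""
    return old_cost / new_cost


def percent_saved(old_cost: float, new_cost: float) -> float:
    """Share of the original cost that was eliminated, in percent."""
    return (1 - new_cost / old_cost) * 100


def factor_from_percent(percent: float) -> float:
    """Convert a percentage saving into an 'Nx' multiple."""
    return 1 / (1 - percent / 100)


# Cents per million tokens, as reported for Latitude.
hopper, blackwell, blackwell_nvfp4 = 20, 10, 5

print("hardware step:", reduction_factor(hopper, blackwell))
print("NVFP4 step:   ", reduction_factor(blackwell, blackwell_nvfp4))
print("combined:     ", reduction_factor(hopper, blackwell_nvfp4))
print("combined saving in percent:", percent_saved(hopper, blackwell_nvfp4))
print("a 90% saving as a multiple:", round(factor_from_percent(90), 2))
```

The two steps multiply rather than add: a 2x hardware gain followed by a 2x precision gain gives 4x overall, which is a 75% saving.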

Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI’s Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.

Nvidia said Decagon saw a 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI’s Blackwell infrastructure. Response times stayed under 400 milliseconds even when processing thousands of tokens per query, a threshold that is critical for voice interactions, where delays cause users to hang up or lose trust.

Technical factors driving 4x versus 10x improvements

The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices, and software stack integration.

Precision formats show the clearest impact, and Latitude’s case demonstrates this directly. Moving from Hopper to Blackwell delivered a 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell’s native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models, where only a subset of the model activates for each inference request.
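To give a rough sense of what fewer bits per value means for memory, the sketch below compares weight storage at 16 bits against a 4-bit format. It rests on assumptions the analysis does not spell out: that each value takes 4 bits plus one shared 8-bit scale per block of 16 values, and that the model has 70 billion parameters. Both are illustrative, and the function is hypothetical.

```python
def weight_bytes(num_params: int, bits_per_value: float,
                 block_size: int = 0, scale_bits: int = 0) -> float:
    """Approximate bytes needed to store model weights.

    block_size and scale_bits model a block-scaled format in which every
    block of values shares one extra scale factor. Zero disables them.
    """
    total_bits = num_params * bits_per_value
    if block_size:
        total_bits += (num_params / block_size) * scale_bits
    return total_bits / 8


params = 70_000_000_000  # hypothetical model size

sixteen_bit = weight_bytes(params, bits_per_value=16)
# Assumed layout: 4-bit values, one 8-bit scale per 16 values.
four_bit = weight_bytes(params, bits_per_value=4, block_size=16, scale_bits=8)

print(f"16-bit weights: {sixteen_bit / 1e9:.1f} GB")
print(f"4-bit weights:  {four_bit / 1e9:.1f} GB")
print(f"ratio: {sixteen_bit / four_bit:.2f}x smaller")
```

Smaller weights mean less data moved per token, which is one reason a lower-precision format can raise throughput. The sketch says nothing about accuracy, which has to be measured per model.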

Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell’s NVLink fabric that enables rapid communication between experts. “Having those experts communicate across that NVLink fabric allows you to reason very quickly,” Harris said. Dense models that activate all parameters for every inference don’t leverage this architecture as effectively.
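The idea that an MoE model runs only some of its experts for each input can be illustrated with a generic top-k router. This is a textbook-style sketch, not the routing used by any model in the case studies; the expert count, the scores and the choice of k are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


# Hypothetical router scores for one token across 8 experts.
logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
routes = top_k_route(logits, k=2)

print("experts that run for this token:", sorted(routes))
print("experts skipped:", len(logits) - len(routes))
```

When the chosen experts sit on different GPUs, each token's activations have to travel to them and back, which is why interconnect speed matters more for MoE models than for dense ones.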

Software stack integration creates additional performance deltas. Harris said that Nvidia’s co-design approach — where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together — also makes a difference. Baseten’s deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.

Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform’s ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.

Teams evaluating potential cost reductions should examine their workload profiles against these factors. Based on the reported deployments, high token generation workloads using mixture-of-experts models with the integrated Blackwell software stack are the most likely to approach the 10x range. Lower token volumes using dense models on alternative frameworks are likely to land closer to 4x.

What teams should test before migrating

While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD’s MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn’t whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.

Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes.

“Enterprises need to work back from their workloads and use case and cost constraints,” Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat.

The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.

Testing matters more than provider specifications. Koparkar emphasizes that providers publish throughput and latency metrics, but these represent ideal conditions.

“If it’s a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down,” she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.
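A provider comparison of this kind can start with a small harness that replays real prompts and records latency percentiles. The sketch below is one possible shape for it, not a recommended tool: `benchmark` and `fake_provider` are hypothetical names, and the stand-in provider only sleeps, so it would need to be replaced with an actual client call for each provider under test.

```python
import math
import statistics
import time
from typing import Callable


def benchmark(send_request: Callable[[str], str], prompts: list[str]) -> dict[str, float]:
    """Send each prompt once and summarize wall-clock latency in milliseconds."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        send_request(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies_ms)) - 1)
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


def fake_provider(prompt: str) -> str:
    """Hypothetical stand-in; swap in a real client call to the provider being tested."""
    time.sleep(0.002)
    return prompt.upper()


stats = benchmark(fake_provider, ["sample production prompt"] * 20)
budget_ms = 400  # the voice latency ceiling reported for Decagon
print(stats)
print("p95 within budget:", stats["p95_ms"] <= budget_ms)
```

Running the same prompt set against each candidate, ideally at production concurrency and during peak hours, gives numbers that can be compared directly with the latency minimum Koparkar describes.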

The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured a 2x improvement, then adopted the NVFP4 format to reach a 4x total reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open-source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.

Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia’s integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledges performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.

The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
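The total-cost comparison can be made concrete with a few lines of arithmetic. Every number below is a hypothetical placeholder rather than a price from any provider, and the function name is illustrative; the point is only that the cheaper option can flip as volume grows.

```python
def monthly_total_cost(tokens_per_month: float, price_per_million_tokens: float,
                       ops_hours_per_month: float, ops_hourly_rate: float) -> float:
    """Inference spend plus the staff time needed to operate the integration."""
    inference = tokens_per_month / 1_000_000 * price_per_million_tokens
    overhead = ops_hours_per_month * ops_hourly_rate
    return inference + overhead


# Hypothetical scenario: the specialized provider is cheaper per token
# but takes more engineering time to manage.
for tokens in (5_000_000_000, 50_000_000_000):
    specialized = monthly_total_cost(tokens, price_per_million_tokens=0.05,
                                     ops_hours_per_month=40, ops_hourly_rate=100)
    managed = monthly_total_cost(tokens, price_per_million_tokens=0.20,
                                 ops_hours_per_month=10, ops_hourly_rate=100)
    cheaper = "specialized" if specialized < managed else "managed"
    print(f"{tokens:>14,} tokens: specialized ${specialized:,.0f}, "
          f"managed ${managed:,.0f} -> {cheaper}")
```

With these placeholder inputs the managed service wins at the lower volume and the specialized provider wins at the higher one, which is why volume belongs in the calculation alongside per-token price.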
