    AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation

    By admin | February 15, 2026

    Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token.

    The dramatic cost reductions were achieved using Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users.

    The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and a switch from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed-source APIs that charge premium rates.

    The economics prove counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs.
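
The link between throughput and per-token cost can be sketched with simple arithmetic. The snippet below is a minimal illustration, not a figure from the Nvidia analysis: the hourly prices and throughput numbers are hypothetical, chosen only to show how pricier hardware can still lower the cost per token.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical figures (assumptions for illustration only).
older = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1_000)
newer = cost_per_million_tokens(gpu_hourly_usd=3.0, tokens_per_second=4_000)

print(f"older GPU: ${older:.3f} per million tokens")
print(f"newer GPU: ${newer:.3f} per million tokens")
print(f"cost reduction: {older / newer:.2f}x")
```

In this made-up case the newer GPU costs 50% more per hour but serves four times the tokens, so each token costs well under half as much.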

    “Performance is what drives down the cost of inference,” Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. “What we’re seeing in inference is that throughput literally translates into real dollar value and driving down the cost.”

    Production deployments show 4x to 10x cost reductions

    Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.

    Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times by 65% after switching from proprietary models to open-source models running on Baseten’s Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.

    Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra’s Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia’s previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell’s native NVFP4 low-precision format. Hardware alone delivered a 2x improvement, but reaching 4x required the precision format change.
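
The Latitude figures show how stage-by-stage gains multiply into a cumulative factor. The sketch below uses only the per-million-token costs quoted above. The helper names are illustrative, and the second helper converts a percentage cut into an "Nx" factor, which is how a 90% cut corresponds to 10x.

```python
# Cost per million tokens in US cents, as reported for Latitude.
stages = [("Hopper", 20.0), ("Blackwell", 10.0), ("Blackwell + NVFP4", 5.0)]


def reduction_factor(before: float, after: float) -> float:
    """How many times cheaper `after` is than `before`."""
    return before / after


def percent_cut_to_factor(percent_cut: float) -> float:
    """Convert a percentage cost cut (for example 90) into an 'Nx' factor."""
    return 100.0 / (100.0 - percent_cut)


baseline = stages[0][1]
report = []
for (prev_name, prev_cost), (name, cost) in zip(stages, stages[1:]):
    step = reduction_factor(prev_cost, cost)    # gain from this stage alone
    total = reduction_factor(baseline, cost)    # gain relative to Hopper
    report.append((name, step, total))
    print(f"{name}: {step:.0f}x over {prev_name}, {total:.0f}x cumulative")

print(f"A 90% cut equals a {percent_cut_to_factor(90):.0f}x reduction")
```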

    Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI’s Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.

    Nvidia said Decagon saw a 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI’s Blackwell infrastructure. Response times stayed under 400 milliseconds even when processing thousands of tokens per query, which is critical for voice interactions where delays cause users to hang up or lose trust.

    Technical factors driving 4x versus 10x improvements

    The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices, and software stack integration.

    Precision formats show the clearest impact. Latitude’s case demonstrates this directly. Moving from Hopper to Blackwell delivered a 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell’s native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models where only a subset of the model activates for each inference request.
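
A rough sense of what fewer bits per value means for memory can be had from a back-of-the-envelope calculation. This sketch rests on several assumptions not stated in the analysis: a hypothetical 70-billion-parameter model, a 16-bit baseline, and 4 bits per value for NVFP4 with the overhead of scaling metadata ignored.

```python
def weight_bytes(num_params: int, bits_per_value: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_value / 8


params = 70_000_000_000  # hypothetical model size (assumption)

gb_16bit = weight_bytes(params, 16) / 1e9
gb_4bit = weight_bytes(params, 4) / 1e9  # scaling metadata overhead ignored

print(f"16-bit weights: {gb_16bit:.0f} GB")
print(f"4-bit weights:  {gb_4bit:.0f} GB")
print(f"footprint ratio: {gb_16bit / gb_4bit:.0f}x")
```

A smaller footprint does not translate one-to-one into cost savings. The Latitude deployment reported a 2x gain from the format change, not 4x.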

    Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell’s NVLink fabric that enables rapid communication between experts. “Having those experts communicate across that NVLink fabric allows you to reason very quickly,” Harris said. Dense models that activate all parameters for every inference don’t leverage this architecture as effectively.

    Software stack integration creates additional performance deltas. Harris said that Nvidia’s co-design approach — where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together — also makes a difference. Baseten’s deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.

    Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform’s ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.

    Teams evaluating potential cost reductions should examine their workload profiles against these factors. High token generation workloads using mixture-of-experts models with the integrated Blackwell software stack will approach the 10x range. Lower token volumes using dense models on alternative frameworks will land closer to 4x.

    What teams should test before migrating

    While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD’s MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn’t whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.

    Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes. 

    “Enterprises need to work back from their workloads and use case and cost constraints,” Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat.

    The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.

    Testing matters more than provider specifications. Koparkar emphasized that providers publish throughput and latency metrics, but these represent ideal conditions.

    “If it’s a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down,” she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.
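
One way to structure such a comparison is sketched below: time real prompts against each provider, then pick the cheapest one whose tail latency fits the budget. Everything here is an assumption for illustration. `send` stands in for whatever client call a team already uses, the provider names and prices are made up, and the 400 ms budget simply echoes the Decagon figure.

```python
import math
import time
from typing import Callable, Dict, List, Optional, Tuple


def time_requests(send: Callable[[str], object], prompts: List[str]) -> List[float]:
    """Return wall-clock latency in milliseconds for each prompt."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        send(prompt)  # hypothetical client call supplied by the caller
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cheapest_within_budget(
    results: Dict[str, Tuple[List[float], float]], p95_budget_ms: float
) -> Optional[str]:
    """Pick the lowest-cost provider whose p95 latency meets the budget."""
    eligible = [
        (cost, name)
        for name, (latencies, cost) in results.items()
        if percentile(latencies, 95) <= p95_budget_ms
    ]
    return min(eligible)[1] if eligible else None


# Made-up measurements: (latencies in ms, USD per million tokens).
measured = {
    "provider_a": ([100.0, 120.0, 500.0], 0.05),
    "provider_b": ([200.0, 250.0, 300.0], 0.10),
}
choice = cheapest_within_budget(measured, p95_budget_ms=400.0)
print(choice)
```

In this example the cheaper provider loses because one slow response pushes its p95 latency past the budget, which is the kind of result published averages can hide.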

    The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured a 2x improvement, then adopted the NVFP4 format to reach a 4x total reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open-source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.

    Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia’s integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledged that performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.

    The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
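
The total-cost comparison can be sketched as a small calculation. All numbers below are hypothetical, including the per-token prices and the monthly operational overhead assigned to each option. The point is only that the cheaper per-token option wins above a break-even volume, not at every volume.

```python
def monthly_total_usd(
    tokens_per_month: float, usd_per_million_tokens: float, ops_overhead_usd: float
) -> float:
    """Inference spend plus fixed operational overhead for one month."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens + ops_overhead_usd


def break_even_tokens(
    low_price: float, low_overhead: float, high_price: float, high_overhead: float
) -> float:
    """Monthly token volume at which both options cost the same."""
    return (low_overhead - high_overhead) / (high_price - low_price) * 1_000_000


# Hypothetical inputs (assumptions for illustration only).
volume = 2_000_000_000  # tokens per month
specialized = monthly_total_usd(volume, usd_per_million_tokens=0.05, ops_overhead_usd=1_500.0)
managed = monthly_total_usd(volume, usd_per_million_tokens=0.20, ops_overhead_usd=200.0)
threshold = break_even_tokens(0.05, 1_500.0, 0.20, 200.0)

print(f"specialized provider: ${specialized:,.0f} per month")
print(f"managed service:      ${managed:,.0f} per month")
print(f"break-even volume:    {threshold / 1e9:.2f} billion tokens per month")
```

With these made-up inputs the managed service is cheaper at 2 billion tokens a month, and the specialized provider only pulls ahead beyond roughly 8.7 billion.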
