    When Claude changed, everything changed: Managing AI blast radius in production

    By admin, June 6, 2026 (8 min read)

    Our system did one thing, and it did it well: It turned natural-language questions into API calls.

    The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:

    ```json
    {
      "description": "User requested sales volume for the given date range, here is the API call to get the response",
      "api_call": "/api/sales_volume",
      "post_body": {
        "start_date": "2026-01-01",
        "end_date": "2026-03-31",
        "region": "northeast"
      }
    }
    ```

    The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a JSON query generated by a large language model (LLM) to filter and shape the response, and delivered the result by email, as a Drive document, or as a chart rendered in the browser.

    By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.

    The contract between the LLM and the rest of the system was a structured JSON object with the same three fields as the example above:

    ```json
    {
      "description": "User requested sales volume for the given date range, here is the API call to get the response",
      "api_call": "/api/sales_volume",
      "post_body": {
        "start_date": "2026-01-01",
        "end_date": "2026-03-31",
        "region": "northeast"
      }
    }
    ```
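
One way to make this contract explicit in code is a strict parser that rejects anything other than the three expected fields. The sketch below is illustrative only and is not the system described here: the names `ApiRequest`, `ContractError`, and `parse_model_output` are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiRequest:
    """The three-field contract between the model and the dispatcher."""
    description: str
    api_call: str
    post_body: dict


class ContractError(ValueError):
    """Raised when model output does not match the expected contract."""


def parse_model_output(raw: str) -> ApiRequest:
    """Parse raw model text into an ApiRequest, rejecting anything off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("top-level value must be a JSON object")

    expected = {"description": str, "api_call": str, "post_body": dict}
    missing = expected.keys() - data.keys()
    extra = data.keys() - expected.keys()
    if missing or extra:
        raise ContractError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ContractError(f"{name} must be {typ.__name__}")
    return ApiRequest(**data)
```

Failing loudly at this boundary keeps an off-contract response from reaching the dispatcher, although it says nothing about whether the field values are sensible.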

    We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

    Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

    First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the API being called, the backend either returned unfiltered data (sales volume for all time or for all regions) or failed with a 500 error.
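
A guard in front of the dispatcher could refuse to send a request whose required filters are missing, rather than letting the backend default to everything. This is a minimal sketch under assumptions: the `REQUIRED_FILTERS` table and both function names are hypothetical, and the real set of required filters would depend on each endpoint.

```python
# Hypothetical map of endpoint -> filters that must be present in post_body.
REQUIRED_FILTERS = {
    "/api/sales_volume": {"start_date", "end_date", "region"},
}


def missing_filters(api_call: str, post_body: dict) -> set:
    """Return required filter names that are absent or empty in post_body."""
    required = REQUIRED_FILTERS.get(api_call, set())
    return {name for name in required if post_body.get(name) in (None, "", [], {})}


def safe_to_dispatch(api_call: str, post_body: dict) -> bool:
    """Refuse to call the backend when a required filter would silently default."""
    return not missing_filters(api_call, post_body)
```

With a check like this, an empty post_body becomes a rejected request instead of an all-time, all-region report.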

    Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. Every component downstream of the model expected an API call, so a question in its place broke them in multiple ways.
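
A system that wants an explicit path for this case could first sort each response into a small set of outcomes before doing anything else. The sketch below is hypothetical: `OutputKind` and `classify_output` are illustrative names, and treating non-JSON text ending in a question mark as a clarifying question is a crude heuristic chosen only for the example.

```python
import json
from enum import Enum


class OutputKind(Enum):
    API_CALL = "api_call"
    CLARIFYING_QUESTION = "clarifying_question"
    MALFORMED = "malformed"


def classify_output(raw: str) -> OutputKind:
    """Sort a model response into one of three explicit handling paths."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Heuristic (an assumption): non-JSON text ending in "?" is a question.
        if raw.strip().endswith("?"):
            return OutputKind.CLARIFYING_QUESTION
        return OutputKind.MALFORMED
    if isinstance(data, dict) and isinstance(data.get("post_body"), dict):
        return OutputKind.API_CALL
    return OutputKind.MALFORMED
```

Only the `API_CALL` branch would proceed to dispatch; the other two could be routed to a retry, a human, or an error message to the user.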

    We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.
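
One way to see this cost before an incident is to record which model versions each integration has been qualified against. The sketch below is an illustration, not what we ran: the integration names and the model labels in `QUALIFIED_ON` are made up for the example.

```python
# Hypothetical qualification record: integration name -> model versions it has passed on.
QUALIFIED_ON = {
    "sales_volume": {"sonnet-4.0", "sonnet-4.5"},
    "salesforce_pipeline": {"sonnet-4.0", "sonnet-4.5"},
    "inventory_forecast": {"sonnet-4.5"},  # added after the upgrade
}


def rollback_blockers(target_model: str, qualified_on: dict) -> list:
    """List integrations that have never been qualified on the rollback target."""
    return sorted(
        name for name, models in qualified_on.items()
        if target_model not in models
    )
```

A non-empty result for the previous model version means a rollback is no longer a quick revert.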

    Why traditional engineering discipline fails here

    Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. All of this relies on one property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

    LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

    This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

    Anatomy of the failure

    The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

    Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being “helpful” in its formatting choices, decided that asking for clarification or putting the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. It nonetheless violated the assumptions under which our system was built.

    The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.

    Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren’t using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn’t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
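
The difference between the two halves can be shown with a response that has the right shape and the wrong meaning. In this sketch, `passes_schema` stands in for a type-level schema check and `semantic_violations` for the rules that live outside it. Both functions and their rules are assumptions for illustration, and some of them (such as required fields) could be pushed into a schema on a per-endpoint basis.

```python
from datetime import date


def passes_schema(resp: dict) -> bool:
    """Syntax-level check: field names and types only."""
    return (
        isinstance(resp.get("description"), str)
        and isinstance(resp.get("api_call"), str)
        and isinstance(resp.get("post_body"), dict)
    )


def semantic_violations(resp: dict) -> list:
    """Meaning-level checks layered on top of the type check."""
    problems = []
    if resp["description"].strip().endswith("?"):
        problems.append("description is a clarifying question")
    body = resp["post_body"]
    if "start_date" not in body or "end_date" not in body:
        problems.append("date range missing; backend may default to all-time")
    else:
        start = date.fromisoformat(body["start_date"])
        end = date.fromisoformat(body["end_date"])
        if start > end:
            problems.append("start_date is after end_date")
    return problems
```

A response with a question in the description and an empty post_body passes the first function and fails the second.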

    The evals-first architecture

    The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.

    In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

    ```python
    def test_description_contains_no_serialized_payload(response):
        # Lowercase so the token match is case-insensitive.
        desc = response["description"].lower()
        # Tokens that suggest a payload, command, or URL leaked into the prose field.
        forbidden = ["curl", "post_body", "{", "https://"]
        assert not any(token in desc for token in forbidden), \
            f"description leaked structured content: {response['description']}"
    ```
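
The triple itself can be written down as a small data structure. The following is a sketch under assumptions rather than our actual harness: `Eval`, `pass_rate`, and `stub_model` are hypothetical names, and the stub returns a fixed response where a real harness would call the model.

```python
from dataclasses import dataclass
from typing import Callable, List


def pass_rate(results: List[bool]) -> float:
    """Scoring function: fraction of sampled outputs that satisfied the property."""
    return sum(results) / len(results)


def no_serialized_payload(response: dict) -> bool:
    """Property: the description holds prose, not a payload, command, or URL."""
    desc = response["description"].lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{", "https://"))


@dataclass
class Eval:
    prompt: str                                        # the input
    prop: Callable[[dict], bool]                       # property each output must satisfy
    scorer: Callable[[List[bool]], float] = pass_rate  # the scoring function

    def run(self, model: Callable[[str], dict], samples: int = 5) -> float:
        """Call the model several times on the same input and score the results."""
        return self.scorer([self.prop(model(self.prompt)) for _ in range(samples)])


def stub_model(prompt: str) -> dict:
    """Stand-in for a real model call; always returns a well-formed response."""
    return {
        "description": "Sales volume for the requested range",
        "api_call": "/api/sales_volume",
        "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31"},
    }
```

Sampling the same input more than once matters because the model is not deterministic; a property that holds once may not hold every time.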

    A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
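
As a rough sketch of such a gate, a CI step could compare each eval's score to a required minimum and return a failing exit code when any falls short. The threshold values, the eval names, and the `failing_evals` and `main` functions are assumptions for illustration.

```python
import sys

# Hypothetical per-eval minimum scores; anything unlisted must pass every time.
THRESHOLDS = {"tone_judge": 0.9}


def failing_evals(scores: dict, thresholds: dict, default: float = 1.0) -> list:
    """Return the names of evals whose score is below the required minimum."""
    return sorted(
        name for name, value in scores.items()
        if value < thresholds.get(name, default)
    )


def main(scores: dict) -> int:
    """Return a process exit code: 0 lets the change merge, 1 blocks it."""
    failures = failing_evals(scores, THRESHOLDS)
    for name in failures:
        print(f"FAIL {name}: {scores[name]:.2f}", file=sys.stderr)
    return 1 if failures else 0
```

Hand-written invariants default to a required score of 1.0 here, while a judge-scored eval gets a looser, explicitly chosen threshold.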

    Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said “the description field should not contain a curl command,” because nobody had thought the model would put one there.

    Evals are not a silver bullet. They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.
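
Detecting that behavior has moved can be as simple as running the same suite against the current and candidate models and comparing scores per eval. This is a minimal sketch assuming each run produces a mapping from eval name to score; the `regressions` function and its default tolerance are illustrative choices.

```python
def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Map eval name -> score drop for every eval that moved beyond tolerance.

    An eval missing from the candidate run is treated as a score of 0.0.
    """
    drops = {}
    for name, old in baseline.items():
        new = candidate.get(name, 0.0)
        if old - new > tolerance:
            drops[name] = round(old - new, 4)
    return drops
```

An empty result means the candidate stayed within tolerance on everything the suite samples, which is a narrower claim than saying the candidate is safe.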

    The roadmap

    The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next several years.
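
One possible way to gate a probabilistic outcome is to require that a confidence bound on the pass rate, rather than the raw pass rate, clears the threshold. The sketch below uses the lower end of the Wilson score interval, a standard statistical technique swapped in here for illustration; the function names and the choice of a 95% interval are assumptions.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


def gate_passes(passes: int, trials: int, required: float) -> bool:
    """Pass only if we are reasonably confident the true rate meets the bar."""
    return wilson_lower_bound(passes, trials) >= required
```

Under this rule, 20 passes out of 20 gives a lower bound of roughly 0.84, so a 0.95 requirement is not met until the sample is considerably larger.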

    The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.

    Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.

    Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
