
Our system did one thing, and it did it well: It turned natural-language questions into API calls.
The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:
json
{
“description”: “User requested sales volume for the given date range, here is the API call to get the response”,
“api_call”: “/api/sales_volume”,
“post_body”: {
“start_date”: “2026-01-01”,
“end_date”: “2026-03-31”,
“region”: “northeast”
}
}
The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend — we had integrations with internal reporting portals, Salesforce, and several homegrown services — applied a large language model (LLM)(-generated JSON query to filter and shape the response, and delivered it via email, as a Drive document, or rendered as a chart in the browser.
By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.
The contract between the LLM and the rest of the system was a structured JSON object as described in the above example.
json
{
“description”: “User requested sales volume for the given date range, here is the API call to get the response”,
“api_call”: “/api/sales_volume”,
“post_body”: {
“start_date”: “2026-01-01”,
“end_date”: “2026-03-31”,
“region”: “northeast”
}
}
We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.
Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.
First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned sales volume for all time or all regions or returned a 500 error.
Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.
We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.
Why traditional engineering discipline fails here
Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.
LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.
This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.
Anatomy of the failure
The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.
Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being “helpful” in its formatting choices, decided that inquiring for clarification or providing the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. However, this violated the assumptions under which our system was built.
The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.
Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren’t using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn’t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
The evals-first architecture
The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.
In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:
python
def test_description_contains_no_serialized_payload(response):
desc = response[“description”].lower()
forbidden = [“curl”, “post_body”, “{“, ” “https://”]
assert not any(token in desc for token in forbidden), \
f”description leaked structured content: {response[‘description’]}”
A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said “the description field should not contain a curl command,” because nobody had thought the model would put one there.
Evals are not a silver bullet. They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output response you actually care about, and refusing to deploy when that behavior moves.
The roadmap
The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next several years.
The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.
Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.
Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
Welcome to the VentureBeat community!
Our guest posting program is where technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.
Read more from our guest post program — and check out our guidelines if you’re interested in contributing an article of your own!

