You Can't Build Good LLM Systems on Vague Requirements

Unclear performance boundaries can create a false sense of success with LLMs

March 13, 2026

TL;DR

Many LLM teams move too quickly into prompting, agent behavior, or service design before they have defined the basics: what the system should do, what it should not do, how edge cases should be handled, and how success will be evaluated.

I have seen this firsthand while building chatbots where requirements were vague, product feedback was mostly reactive, and evaluation discipline came too late. In one case, later analysis showed that the questions product thought were "in scope" covered only about 30% of the actual in-scope user utterances in transcript data. By then, our prompts, policies, evals, and performance metrics had all been shaped around an incomplete picture.

The lesson is simple: with LLM systems, unclear requirements do not create flexibility. They create rework. If you do not define boundaries, test sets, and expectations early, you are not iterating efficiently. You are tuning in the dark.

Before you optimize an LLM system, you need to understand the problem well enough to evaluate it offline, not just react to outputs you happen to like in a demo.

When "just show us something" becomes the requirement

There is a specific kind of meeting that now feels familiar to anyone building with LLMs.

A chatbot is taking shape. The product team wants something useful, fast, and impressive. The direction sounds open-minded at first, even empowering: build something, show us how it responds, and we'll tell you whether it looks good or bad. We can tune it from there.

On paper, this sounds collaborative. In practice, it often means the team is being asked to build before anyone has defined the boundary conditions of the system. What should it answer? What should it refuse? What counts as a good answer? What kinds of ambiguity are acceptable? Which failures are tolerable, and which ones are not?

Those questions are not secondary. They are the work.

One of the more frustrating patterns I have seen in LLM product work is the assumption that response quality can be shaped quickly even when the problem itself has not been specified clearly.

We were building a chatbot, but the requirements were loose in the most important way. There were no meaningful boundary conditions and no clear description of expected outputs. The instruction was effectively: show us what you build with LLM responses, and then we will tell you if we like it. Once we do, we can tune the responses to our liking.

That sounds manageable until you live inside it for a while.

Because once the system starts producing outputs, every reaction becomes a requirement in disguise. A stakeholder likes one phrasing and dislikes another. One answer feels too broad, another feels too cautious. Something looks good in a demo but fails under variation. And yet the expectation remains that these are all just small adjustments, maybe a few hours of tuning here and there, as if the model is a UI layer with slightly imperfect copy.

But LLM systems do not work that way. You are not only editing text. You are shaping behavior under uncertainty. And behavior is influenced by scope, prompts, policy rules, retrieval design, context windows, examples, edge cases, and the distribution of real user inputs. If none of those have been clarified, "tuning" becomes a polite word for guessing.

This is where I think many teams lose time without realizing it. They think they are iterating quickly because they are changing outputs quickly. But changing outputs is not the same as making the system more reliable.

The architecture instinct is not always the right instinct

I have also seen another pattern that is subtler, but just as costly.

My engineering lead had strong technical foundations. He was good, and his instincts made sense in the worlds he came from, especially microservices and mobile app development. In those environments, structure, interfaces, modularity, and production readiness matter early. Packaging capabilities cleanly is often the right move.

But not every part of an LLM system should be treated that way from the beginning.

Some of the work we needed to do should not have started as a service at all. Some of it should have stayed offline longer. We needed evaluation loops, transcript analysis, test set development, and behavioral inspection before productionizing every decision. We needed to understand the system's expected behavior before wrapping it in clean interfaces and shipping it through the stack.

That distinction matters because LLM systems are partly software, but they are also partly experimental systems. They contain behavior that is not fully deterministic, and they fail in distribution-dependent ways. If you productionize too early without understanding the task deeply enough, you can end up formalizing confusion. You get a clean architecture around a messy problem definition.

And once that happens, rework becomes more expensive. Not because the team is bad, but because ambiguity has now been embedded in the system design.

The moment I started distrusting "in scope" and "out of scope"

Another experience made this even clearer for me.

We were building a chatbot, and product initially described the problem in a way that sounded fairly straightforward: the bot should answer certain questions and should not answer others. That immediately suggested a sensible design direction. We needed some form of intent recognition or gating layer, because the task was fundamentally about deciding what kind of question had been asked and whether it belonged inside the bot's allowed scope.

Then the requirements shifted.

We were told that the bot could not answer a question if it appeared in a certain phrasing, but we were not given clear scenario boundaries for when that phrasing should matter and when it should not. This is where LLM work starts to become dangerous in a very ordinary way. The requirement sounds precise linguistically, but it is not behaviorally precise. It leaves too much unresolved about the actual decision rule.

By that point, we had already tuned the agent with strict policy guidelines. So we passed a new set of in-scope and out-of-scope questions into it and reviewed the results. Some of the supposedly out-of-scope questions were generic enough that the bot answered them anyway. We showed this to product, expecting a serious discussion about policy consistency and failure modes.

Instead, the response was basically: this is good. Let's do this.

At one level, I understood it. The outputs looked acceptable. Nothing had failed dramatically in the room. The system was good enough to move forward.

But that moment stayed with me because it exposed a deeper issue: we were treating stakeholder reaction as evaluation. We still did not have a real test set, a robust boundary definition, or a disciplined method for deciding what behavior was correct. We were accepting behavior because it felt fine in a narrow review context, not because we had validated it against a meaningful distribution of user inputs.

That is a dangerous substitute. Especially with chatbots.

The dataset told a different story

The clearest lesson came later, when I ran my own analysis on a full transcript dataset.

What I found was hard to ignore. The questions product had told us were in scope represented only about 30% of the utterances from real users that were actually relevant to our system.
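The check itself is not complicated. A minimal sketch of the kind of coverage analysis that surfaced this, assuming transcript utterances have already been labeled with intent categories (the labels and declared scope below are hypothetical, not from a real product):

```python
from collections import Counter

# Hypothetical intent labels assigned to transcript utterances
# (in practice these come from manual review or a classifier).
labeled_utterances = [
    "billing_question", "billing_question", "refund_policy",
    "account_access", "shipping_status", "shipping_status",
    "refund_policy", "product_compare", "account_access",
    "warranty_claim",
]

# What product originally declared "in scope".
declared_in_scope = {"billing_question", "refund_policy"}

counts = Counter(labeled_utterances)
total = sum(counts.values())
covered = sum(n for intent, n in counts.items() if intent in declared_in_scope)

coverage = covered / total
print(f"Declared scope covers {coverage:.0%} of relevant utterances")
```

The hard part is not the arithmetic; it is doing the labeling pass at all, early enough for the number to change decisions.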

That one number changed the meaning of everything that came before it.

It meant our requirement discussions had been shaped around a partial view of the problem. It meant our evals were incomplete. It meant the performance metrics we had designed were measuring success against an artificially narrow task definition. It meant we had been tuning for a world that did not really exist in production.

This is the part I think many teams miss. In LLM product work, incomplete scoping does not just create a few missed edge cases. It distorts the entire system development process.

It affects which prompts you write. It affects which examples you include. It affects which refusals you design. It affects what your eval suite measures. It affects what product believes is improving. It affects what engineering believes is stable.

And by the time you discover the mismatch, you are no longer just fixing a prompt. You are undoing assumptions that have spread across design, policy, architecture, and reporting.

If we had paused earlier and scoped the requirements more honestly, we would have saved ourselves a lot of rework and a lot of false confidence.

Why this keeps happening

I do not think this happens because people are careless. I think it happens because LLM systems create a misleading sense of progress.

You can get outputs very quickly. You can demo behavior very early. You can make something that looks surprisingly capable before you understand its real operating boundaries. That creates the impression that the problem is becoming concrete faster than it actually is.

In traditional software, missing requirements often show up as missing functionality. In LLM systems, missing requirements can still produce fluent answers. The system continues to talk. It continues to look useful. That makes ambiguity easier to tolerate than it should be.

It also encourages teams to postpone the unglamorous work: transcript review, taxonomy design, adversarial examples, offline evals, edge case grouping, intent coverage analysis, and refusal testing. Those tasks can feel slower than shipping a prototype. But without them, the prototype quietly becomes the product strategy.

And that is when teams start tuning to taste instead of designing to a clear task.

What I would do differently now

The most practical lesson I have taken from these experiences is that LLM teams need to separate three things much more clearly than they usually do.

1. Product preference is not the same as product requirement

"It looks good" is not a spec.

Stakeholder reaction matters, but it cannot be the main mechanism for defining system behavior. Teams need explicit decisions about what the bot should answer, what it should refuse, how exceptions work, and how phrasing interacts with policy. If those rules are not written down, reviewed, and tested, they are not real requirements yet.

2. Offline evaluation should come earlier than production shaping

Not everything needs to become a service immediately. Some of the highest-value work happens before productionization: building datasets, reviewing transcripts, clustering utterances, defining intent boundaries, and testing behavior offline.

This is especially true when the system's core challenge is behavioral reliability rather than infrastructure complexity. You do not want elegant architecture around an unstable task definition.

3. Coverage matters more than people think

The transcript analysis was probably the most sobering part of the experience for me. If your understanding of user needs covers only a narrow slice of actual in-scope behavior, then every downstream metric is at risk of becoming misleading.

A chatbot can appear to perform well because it is being tested against a world that is smaller and cleaner than the one users actually live in.

4. Boundary conditions should be treated as first-class work

Teams often want to move quickly to prompts, tools, and orchestration. But one of the hardest and most valuable parts of chatbot design is boundary setting.

  • What should count as in scope?
  • What near-miss questions should be refused?
  • What ambiguous formulations should be clarified?
  • What kind of help is acceptable when a question is adjacent but not directly allowed?
  • What degree of paraphrase should still count as the same intent?

These are not minor details. They determine the shape of the system.

5. A changing requirement without test scenarios is not really a requirement

When someone says, "the bot should not answer this phrasing," the next question should be: under what examples, under what variants, and against what contrast set?

Without that, the team is left to infer behavioral rules from language that sounds specific but is actually under-defined. That usually leads to overfitting, inconsistent behavior, and arguments later about whether the model is wrong or the requirement was vague.

Actionable takeaways for teams building LLM products

If I were guiding a team through this earlier now, I would insist on a few practical steps before heavy tuning begins.

Start with a behavior map, not just a prompt. Write down the categories of questions the system should answer, refuse, redirect, or clarify. Include near-boundary examples, not just ideal cases.
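One lightweight way to make such a map concrete is to keep it as explicit data rather than implicit prompt wording. A sketch of what that could look like, with the answer/refuse/redirect/clarify split from above and purely illustrative example utterances:

```python
# A behavior map kept as reviewable data, not buried in a prompt.
# Category descriptions and examples are hypothetical placeholders.
BEHAVIOR_MAP = {
    "answer": {
        "description": "Directly in scope; respond with grounded information",
        "examples": ["What is your refund window?"],
    },
    "refuse": {
        "description": "Explicitly out of scope; decline politely",
        "examples": ["Can you give me legal advice about my refund?"],
    },
    "redirect": {
        "description": "Handled elsewhere; point to the right channel",
        "examples": ["I want to speak to a human agent"],
    },
    "clarify": {
        "description": "Ambiguous or under-specified; ask a narrowing question",
        "examples": ["It's not working"],  # near-boundary, needs context
    },
}
```

The value of this form is that product can review and argue about a table, which is much harder to do with behavior that only exists implicitly in prompt phrasing.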

Build a small but real eval set before polishing outputs. Even 100 to 200 carefully chosen examples can teach you more than hours of reactive prompt tweaking. Include paraphrases, edge cases, and ambiguous phrasing.
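Even the harness around such an eval set can stay tiny. A sketch, assuming each case is labeled with the expected action and the predictor is whatever gating layer the bot uses (the `naive_predict` stub below stands in for it):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    utterance: str
    expected_action: str  # "answer", "refuse", "clarify", ...

# Hypothetical eval cases: include paraphrases and near-boundary inputs,
# not just the ideal phrasing of each in-scope question.
eval_set = [
    EvalCase("How do I reset my password?", "answer"),
    EvalCase("how 2 reset pw??", "answer"),              # paraphrase
    EvalCase("Reset my coworker's password", "refuse"),  # near-boundary
    EvalCase("Password thing is broken", "clarify"),     # ambiguous
]

def run_eval(predict, cases):
    """Score a predictor (utterance -> action) against labeled cases."""
    results = [(c, predict(c.utterance)) for c in cases]
    failures = [(c, got) for c, got in results if got != c.expected_action]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Stub predictor standing in for the real system's gating layer.
def naive_predict(utterance):
    return "answer"

accuracy, failures = run_eval(naive_predict, eval_set)
```

A harness like this makes regressions visible the moment a prompt change fixes one case and silently breaks two others.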

Use transcript data early. Do not rely only on what product believes users ask. Look at actual utterances as early as possible. Scope decisions made without distribution data are often fragile.

Keep some evaluation work offline. Before investing in full service design, confirm that the behavioral problem is defined well enough to measure. Architecture should support the task, not substitute for task clarity.

Treat requirement changes as test design events. Any new rule should immediately generate examples: what should happen, what should not happen, and what similar cases need to be distinguished.
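One way to enforce this habit is to make rule expansion mechanical: every new rule must arrive with its blocked phrasing, its paraphrase variants, and a contrast set of adjacent questions that must keep working. A hypothetical sketch (rule names and utterances are invented for illustration):

```python
def rule_to_cases(rule_id, blocked, variants, contrast):
    """Expand one policy rule into labeled test cases: the blocked
    phrasing, its paraphrases (all expected refusals), and nearby
    allowed questions that must not regress."""
    cases = [{"rule": rule_id, "utterance": blocked, "expected": "refuse"}]
    cases += [{"rule": rule_id, "utterance": v, "expected": "refuse"}
              for v in variants]
    cases += [{"rule": rule_id, "utterance": c, "expected": "answer"}
              for c in contrast]
    return cases

cases = rule_to_cases(
    "no-account-deletion",
    blocked="Delete my account for me",
    variants=["Can you remove my account?", "close my acct pls"],
    contrast=["How do I delete my account myself?"],  # adjacent but allowed
)
```

If a stakeholder cannot supply the variants and the contrast set, that is usually a sign the rule is still a preference, not a requirement.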

Separate "demo acceptable" from "production reliable." A response that sounds good in a meeting is not evidence that the system is robust across real usage.

What I've learnt from this experience

The longer I work around LLM systems, the less impressed I am by early fluency and the more interested I am in requirement discipline.

A chatbot can say many plausible things before a team has earned the right to trust it. That is part of what makes this work exciting, but also what makes it easy to do badly in polished ways.

I have come to think that a lot of churn in AI teams is not really model churn. It is requirement churn wearing a technical costume. Product is still discovering what it wants. Engineering is trying to make that legible in systems. Everyone is moving, but not always with enough shared clarity to know whether movement is actually progress.

That does not mean teams should slow down endlessly. It means they should pause at the right moments. Especially before they start treating vague expectations as if they were already product truth.

Because once you start tuning, evaluating, and productionizing around an incomplete understanding of the problem, the work becomes more expensive to correct. And what looked like fast iteration starts to look more like drift.

The most useful thing a team can do early is often the least glamorous: define the boundaries, inspect the data, build the evals, and force the ambiguity into the open.

Everything gets easier after that.

Not simple. But at least honest.