Tool Calling as a Trust Boundary

We did not set out to build an "agent." We were building a customer-facing GenAI assistant for a banking product — something that could answer questions, guide users, and eventually take actions on their behalf. Pretty standard on paper.

Then we introduced tool calling.

That was the moment the problem changed shape.

The model was not just generating text anymore — it could now trigger real systems. Pull data, run workflows, potentially touch sensitive information. In a regulated environment, that is not a small step. That is a different category of risk entirely.

Reading through how teams like Anthropic and OpenAI think about tool use helped crystallize something for me: tool calling is not just a capability. It is a trust boundary. Once you cross it, you are no longer evaluating language — you are evaluating decisions with consequences.

So we had to rethink how we evaluate the system from the ground up.

What broke (and what we learned)

Our first instinct was honestly pretty naive. We treated tool calling like a classification problem: given a query, did the model pick the right tool?

That framing lasted maybe a week.

Because in practice, nothing about this is a single decision. Even a simple interaction unfolds into a chain of smaller, fragile decisions: should I call a tool at all? Which one? What arguments? Are those arguments even valid? Am I allowed to do this for this user?

The tricky part is that the model will often look "correct" at a glance while being subtly wrong in ways that matter. It might pick the right tool but construct slightly off parameters. Or call a tool when it should not. Or worse — skip a tool entirely and fabricate an answer.

We also realized quickly that language variation breaks everything. The same intent phrased differently could lead to different tools, different parameters, or no tool at all. So any evaluation that relies on a single phrasing is basically lying to you.

What helped us get unstuck was shifting how we think about the system. Instead of asking whether it got the right answer, we started asking:

Did it make a justifiable decision?
Did it stay within its permission boundaries?
Can we explain and audit what happened?

That led us to treat the model as an untrusted decision-maker — not malicious, just unreliable in subtle ways. The actual safety boundary lives outside the model.

So we built our evaluation around that idea.

We started logging everything about tool calls — not just which tool was called, but the parameters, the permission context, how long it took, and whether anything looked off. Not because we love logs, but because in a banking environment, you do not get to say "the model seemed fine." You need to reconstruct what happened after the fact.

Permissions became another big piece. One of the easiest mistakes to make is letting the model implicitly decide what it is allowed to do. We forced a separation: the model can suggest actions, but it never owns permissions. Those come from the user, the session, and the system. Our evaluation then checks whether the model is trying to step outside those bounds — or failing to use them when it should.

The dataset itself became a project. We did not just write a handful of test queries. We took real intents, expanded them into variations, combined tools in different ways, and then stretched them into multi-turn conversations. Some formal, some messy, some ambiguous, some intentionally risky. Every example is grounded enough to feel real, but varied enough to break brittle behavior.

And importantly, all of it is human-reviewed. Not because that scales (it does not), but because early on, you need to understand failure modes deeply before you automate anything.

Where we have landed (Phase 0)

What I would call Phase 0 is not sophisticated, but it is honest. We can see where the system fails. We can categorize those failures. We can trace them back to prompting issues, model limitations, or gaps in our tool design.

Where this needs to go

If I am being blunt, what we have today is still pretty simple.

It works as a lens, not as a guarantee.

A lot of our process depends on humans curating datasets and reviewing outputs. That is fine at the beginning, but it will not hold as the system grows. We will need more programmatic checks — especially around parameter validation and permission enforcement — and probably simulated environments where tools can fail safely.

We also treat correctness a bit too rigidly right now. The model sometimes takes a different but reasonable path, and we mark it as wrong. That is not quite right. Real systems have multiple valid solutions, and our evaluation needs to reflect that without becoming vague.

Multi-turn behavior is another area where we are underestimating complexity. It is easy to test a single interaction. It is much harder to test how decisions evolve over five or ten turns, especially when context starts drifting or compounding.

And then there is the stuff we have not fully tackled yet: adversarial behavior, prompt injection, attempts to misuse tools. In a banking setting, those are not edge cases — they are expectations.

What we are actually evaluating

If I zoom out, though, I think the biggest shift for me has been this:

We are not evaluating intelligence. We are evaluating controlled decision-making under constraints.

That changes how you design everything.

A good agent is not the one that knows the most. It is the one that consistently makes safe, explainable, permission-aware decisions — even when the input is messy and the path is not obvious.

If you are building something similar, the bar to aim for — at least initially — is not perfection. It is clarity. Can you see what the system is doing? Can you catch when it goes off the rails?

If you can do that, you are in a much better place than chasing a single accuracy number and hoping it means something.