Here’s a pattern that quietly kills agent projects. The agent handles simple things by stringing a few tools together. Then a complex or bulk request arrives — “re-tag these two hundred records,” something needing filter-then-combine-then-aggregate — and it can’t. The reflex is to write another tool. But the missing capabilities combine combinatorially; hand-writing tools is O(number of requests), and you lose the race.
The wall isn’t the framework. It’s the action model.
Tool-calling can’t compose
Standard tool-calling emits one call (or a few parallel ones) per round, sees the result, emits the next. Five hundred records becomes five hundred round-trips: slow, expensive, context-blowing, and prone to losing state halfway. You’re forcing the model to simulate a for loop in generated tokens — the thing it’s worst at. No amount of extra tools fixes that; they just hard-code one more special case.
Write the code instead
Don’t have the model pick a tool. Have it write code that calls the tools.
Expose the backend primitives not as tool schemas to choose from, but as a set of functions in a sandbox. The model’s action becomes write a short script — with loops, filters, conditionals, feeding A into B — and run it once.
# "tag the 200 active candidates as reviewed"
cands = client.list_candidates(status="active", page_size=200)["items"]
done = [client.tag_candidate(c["id"], "reviewed") for c in cands]
print(f"handled {len(done)}")
One script, one execution, only the aggregate comes back — not two hundred intermediate results clogging the context.
| Collecting tools | Code-action | |
|---|---|---|
| New request | you write a tool | nothing |
| Where composition lives | hard-coded in tools | model writes it at runtime |
| 500 records | 500 round-trips | one script |
| Your effort | O(requests) | O(primitives) |
The safety story stays simple for internal, read-mostly work: the sandbox holds only the current user’s token, and a gateway passes that identity through with rate-limiting and audit. The agent can write anything, but it can’t step past the user’s own permissions.
You maintain a stable layer of primitives. The combinatorial explosion of combinations moves to where it belongs — the model’s runtime, not your backlog. It’s the same idea showing up as code execution with MCP and “code mode” elsewhere: the most leveraged move in a maturing agent isn’t more tools, it’s letting the model write code against the ones you already have.