Code-action: write code, don't collect tools

Standard tool-calling can't express composition. So stop picking tools — let the model write the code that calls them.

Here’s a pattern that quietly kills agent projects. The agent handles simple things by stringing a few tools together. Then a complex or bulk request arrives — “re-tag these two hundred records,” something needing filter-then-combine-then-aggregate — and it can’t. The reflex is to write another tool. But the missing capabilities combine combinatorially; hand-writing tools is O(number of requests), and you lose the race.

The wall isn’t the framework. It’s the action model.

Tool-calling can’t compose

Standard tool-calling emits one call (or a few parallel ones) per round, sees the result, emits the next. Five hundred records becomes five hundred round-trips: slow, expensive, context-blowing, and prone to losing state halfway. You’re forcing the model to simulate a for loop in generated tokens — the thing it’s worst at. No amount of extra tools fixes that; they just hard-code one more special case.

Write the code instead

Don’t have the model pick a tool. Have it write code that calls the tools.

Expose the backend primitives not as tool schemas to choose from, but as a set of functions in a sandbox. The model’s action becomes write a short script — with loops, filters, conditionals, feeding A into B — and run it once.

# "tag the 200 active candidates as reviewed"
cands = client.list_candidates(status="active", page_size=200)["items"]
done = [client.tag_candidate(c["id"], "reviewed") for c in cands]
print(f"handled {len(done)}")

One script, one execution, only the aggregate comes back — not two hundred intermediate results clogging the context.

	Collecting tools	Code-action
New request	you write a tool	nothing
Where composition lives	hard-coded in tools	model writes it at runtime
500 records	500 round-trips	one script
Your effort	O(requests)	O(primitives)

The safety story stays simple for internal, read-mostly work: the sandbox holds only the current user’s token, and a gateway passes that identity through with rate-limiting and audit. The agent can write anything, but it can’t step past the user’s own permissions.

You maintain a stable layer of primitives. The combinatorial explosion of combinations moves to where it belongs — the model’s runtime, not your backlog. It’s the same idea showing up as code execution with MCP and “code mode” elsewhere: the most leveraged move in a maturing agent isn’t more tools, it’s letting the model write code against the ones you already have.