Inside Kimi Code: a teardown of its agent engine

I cloned a shipping coding agent and read it the way I read a paper — to see which design decisions survive contact with production. Here are the ones worth stealing.

I learn more from reading one production agent than from ten framework tutorials. So when Kimi Code’s source went up, I cloned it and read it the way I read a paper — not to run it, but to find the decisions that only show up once an agent has to survive real users.

It’s a TypeScript monorepo, and the interesting parts are three packages stacked on top of each other. The names tell you the intent: kosong (Malay for empty) is a deliberately hollow LLM shell, kaos abstracts the execution environment, and agent-core — “the unified agent engine” — is where the actual agent lives.

Figure: The three packages. Everything novel is in agent-core; the layers below exist so it never knows which model or machine it's on.

The first decision that shaped everything else is also the one I keep coming back to.

The loop has no state. The host does.

The core loop in loop/run-turn.ts is a plain while (true), and it holds nothing — no session, no permission manager, no compaction logic, no UI handle. All of that arrives through hooks the host injects. The host (agent/turn/index.ts) is where state lives: it owns the session, drives autonomous goals, and retries on context overflow.

Keeping the loop stateless is what lets one loop run three jobs — an ordinary chat turn, an autonomous goal continuation, and a sub-agent — without a branch for each. The “modes” live in the host; the loop just turns the crank.

It also makes the loop testable with nothing but hooks, which is why their loop test suite can cover behavior the rest of us only catch in production.
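To make the shape concrete, here is a minimal sketch of a loop that owns nothing and gets everything through hooks. It is my own reconstruction, not the code in loop/run-turn.ts: every type and hook name below (LoopHooks, buildMessages, callModel, runTool, record) is hypothetical, and it is synchronous only to stay short, where a real loop would be async.

```typescript
// Hypothetical shapes for illustration; none of these names come from the Kimi Code source.
type LoopMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };
type LoopToolCall = { id: string; name: string; args: string };
type LoopStep = { text: string; toolCalls: LoopToolCall[] };

interface LoopHooks {
  // Called before every model call, so the loop never works from a snapshot.
  buildMessages(): LoopMessage[];
  callModel(messages: LoopMessage[]): LoopStep;
  runTool(call: LoopToolCall): string;
  // The host records everything; the loop keeps no history of its own.
  record(message: LoopMessage): void;
}

function runTurn(hooks: LoopHooks, maxSteps = 50): "end_turn" | "max_steps" {
  let steps = 0;
  while (true) {
    if (steps++ >= maxSteps) return "max_steps";
    const step = hooks.callModel(hooks.buildMessages());
    hooks.record({ role: "assistant", content: step.text });
    // No tool calls means the turn is over.
    if (step.toolCalls.length === 0) return "end_turn";
    for (const call of step.toolCalls) {
      hooks.record({ role: "tool", content: hooks.runTool(call) });
    }
  }
}
```

A test needs only a scripted callModel and an array for the host's history, which is the point: there is nothing else to mock.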

The loop itself is ReAct, but with one twist I didn’t expect: the model’s finish_reason is treated as a diagnostic, not a control signal. When a step ends, the engine decides whether to continue by checking whether tool calls came back. If the provider reports completed but the response contains tool calls, the stop reason is forced to tool_use and the loop continues; no tool calls means end_turn. That single inversion absorbs every provider’s idiosyncratic way of saying “I’m done.”
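The inversion fits in a few lines. This is a sketch under my own naming (deriveStopReason and its return shape are hypothetical); the only behavior taken from the source is that tool calls decide the outcome and the provider's value is kept purely for diagnostics.

```typescript
type DerivedStop = "tool_use" | "end_turn";

// providerFinish is whatever string the vendor sent back; it is logged, never branched on.
function deriveStopReason(
  toolCallCount: number,
  providerFinish: string | null,
): { stopReason: DerivedStop; providerFinish: string | null } {
  const stopReason: DerivedStop = toolCallCount > 0 ? "tool_use" : "end_turn";
  return { stopReason, providerFinish };
}
```

A provider that says it is finished while emitting tool calls and a provider that claims tool use while emitting none both resolve cleanly, because the string is never consulted.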

Figure: One step. The messages are rebuilt on every pass, so a mid-turn interjection, an injected reminder, or a compaction all take effect before the next model call.

Rebuilding the message list on every step (rather than passing a frozen history into a run) is the other quiet decision here. It’s what makes steering, injection, and compaction possible mid-turn — the loop always reads the latest context, never a snapshot.

Tools run out of order. The transcript never does.

Each tool declares what it touches — reads /a/b, writes /a recursively, has global side effects. A scheduler reads those declarations and runs non-conflicting tools in parallel while serializing the ones that collide. Two reads of different files go at once; two writes to the same path queue; a shell command, which claims everything, runs alone.

The part I’d have gotten wrong: execution is concurrent, but the tool.call and tool.result events are written to the transcript strictly in the model’s original order. All the messiness is absorbed inside the scheduler, so the recorded conversation is always linear and replayable.

Push the concurrency complexity into the scheduler; hand the rest of the system a tidy, ordered record.
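Here is a small sketch of that idea, with heavy simplifications that are mine rather than the engine's: the access declarations are reduced to three kinds, conflicts are resolved by grouping calls into successive "waves" instead of a true dependency scheduler, and each wave runs sequentially where a real implementation would run it concurrently. All names (ToolAccess, planWaves, runAll) are hypothetical.

```typescript
type ToolAccess =
  | { kind: "read"; path: string }
  | { kind: "write"; path: string; recursive?: boolean }
  | { kind: "global" };

type ToolCallDecl = { id: string; access: ToolAccess };

// True when `inner` equals `outer` or lies beneath it.
function isWithin(inner: string, outer: string): boolean {
  const prefix = outer.endsWith("/") ? outer : outer + "/";
  return inner === outer || inner.startsWith(prefix);
}

// Does a write declaration touch the given path?
function covers(w: { path: string; recursive?: boolean }, path: string): boolean {
  return w.path === path || (w.recursive === true && isWithin(path, w.path));
}

function conflicts(a: ToolAccess, b: ToolAccess): boolean {
  if (a.kind === "global" || b.kind === "global") return true; // claims everything
  if (a.kind === "read" && b.kind === "read") return false;    // reads never collide
  if (a.kind === "write" && covers(a, b.path)) return true;
  if (b.kind === "write" && covers(b, a.path)) return true;
  return false;
}

// Place each call in the earliest wave that comes after every earlier conflicting call.
function planWaves(calls: ToolCallDecl[]): ToolCallDecl[][] {
  const waves: ToolCallDecl[][] = [];
  for (const call of calls) {
    let target = 0;
    for (let i = waves.length - 1; i >= 0; i--) {
      if (waves[i].some((c) => conflicts(c.access, call.access))) {
        target = i + 1;
        break;
      }
    }
    if (target === waves.length) waves.push([]);
    waves[target].push(call);
  }
  return waves;
}

// Execute wave by wave, then emit results strictly in the model's original order.
function runAll(calls: ToolCallDecl[], exec: (c: ToolCallDecl) => string): string[] {
  const results = new Map<string, string>();
  for (const wave of planWaves(calls)) {
    // A real scheduler would run the members of a wave concurrently.
    for (const c of wave) results.set(c.id, exec(c));
  }
  return calls.map((c) => `${c.id}: ${results.get(c.id)}`);
}
```

Given two reads under /a, a recursive write to /a, and a write to /b/z, the last call executes in the first wave, ahead of the recursive write, yet the transcript still lists it last.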

Loops within the loop get the same treatment. Repeated identical tool calls are tracked with a streak counter, and the response escalates: a gentle nudge at three in a row, a specific report at five, a deadlock warning at eight, and at twelve the engine just stops the turn. Getting stuck is treated as a first-class failure mode, not something to hope the model notices.
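The escalation ladder is easy to sketch. The thresholds (3, 5, 8, 12) are the ones described above; everything else is my assumption, including the names, the choice to treat each threshold as the start of a band, and the use of JSON.stringify to decide whether two calls are "identical" (which is sensitive to key order).

```typescript
type Escalation = "none" | "nudge" | "report" | "deadlock_warning" | "stop_turn";

function escalationFor(streak: number): Escalation {
  if (streak >= 12) return "stop_turn";
  if (streak >= 8) return "deadlock_warning";
  if (streak >= 5) return "report";
  if (streak >= 3) return "nudge";
  return "none";
}

class RepeatTracker {
  private lastKey: string | null = null;
  private streak = 0;

  // Returns the response level for this call, given how many identical calls preceded it.
  observe(toolName: string, args: unknown): Escalation {
    const key = `${toolName}:${JSON.stringify(args)}`;
    this.streak = key === this.lastKey ? this.streak + 1 : 1;
    this.lastKey = key;
    return escalationFor(this.streak);
  }
}
```

Any different call resets the streak to one, so only a genuinely stuck model climbs the ladder.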

Context is managed with two kinds of compaction

This is the part I came for, and it’s the most carefully built. There are two compactors, and they’re complements rather than alternatives.

Full compaction is the LLM-summarize-the-prefix move, but with the sharp edges filed off:

- It triggers at ~85% of the window and always reserves ~50k tokens for the model’s own output.
- It only cuts at a safe split point: never midway through a tool-call/result pair, never just before a user message.
- The summarization call reuses the same model, the same system prompt, and the projected message history, so the compaction request itself hits the KV cache.
- The current TODO list is appended to every summary, so a plan never gets summarized away.
- It guards against its own overflow: if the prefix is too big to summarize, it shrinks the range and retries, requiring at least a 5% reduction to count as progress.
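The trigger and the safe-split rule can be sketched as follows. Treat it as my reading, not the engine's code: I am assuming the 85% threshold and the 50k reserve are combined with a logical OR, and I am interpreting "never just before a user message" to mean the first kept message must be an assistant message. All names are hypothetical.

```typescript
type CompactRole = "user" | "assistant" | "tool";
type CompactMsg = { role: CompactRole };

const TRIGGER_RATIO = 0.85;           // ~85% of the window
const OUTPUT_RESERVE_TOKENS = 50_000; // ~50k kept free for the model's output

// Assumption: either condition alone is enough to trigger compaction.
function shouldCompact(usedTokens: number, windowTokens: number): boolean {
  const overThreshold = usedTokens >= windowTokens * TRIGGER_RATIO;
  const noRoomForOutput = usedTokens + OUTPUT_RESERVE_TOKENS > windowTokens;
  return overThreshold || noRoomForOutput;
}

// A split at `index` summarizes messages[0..index) and keeps messages[index..].
function isSafeSplit(messages: readonly CompactMsg[], index: number): boolean {
  if (index <= 0 || index >= messages.length) return false;
  const firstKept = messages[index].role;
  // A kept tool result would be orphaned from its call; a kept user message is disallowed.
  return firstKept !== "tool" && firstKept !== "user";
}

// Walk backward from the desired cut until a safe one is found.
function findSafeSplit(messages: readonly CompactMsg[], desired: number): number | null {
  for (let i = Math.min(desired, messages.length - 1); i > 0; i--) {
    if (isSafeSplit(messages, i)) return i;
  }
  return null;
}
```

Walking backward means the summarized range can only shrink, never grow past what was asked for, which keeps the split conservative.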

Micro compaction is the cleverer, quieter one. Instead of summarizing, it replaces the content of old, large tool results with a one-line marker — and it only does this when the conversation has been idle long enough (~1h) that the KV cache has almost certainly gone cold. The logic is beautiful: you’re about to pay to recompute the prefix anyway, so reclaim the dead weight on the way. And it never mutates history — the trimming happens at projection time, so undo and replay still see the originals.
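A projection-time trim along those lines might look like the sketch below. Only the ~1h idle gate and the "never mutate history" rule come from the description above; the size threshold, the number of recent entries left untouched, the marker text, and all names are assumptions of mine.

```typescript
type ProjEntry = { role: "user" | "assistant" | "tool"; content: string };

const COLD_AFTER_MS = 60 * 60 * 1000; // ~1h idle: the cache is assumed cold
const LARGE_CHARS = 2_000;            // assumed threshold for a "large" tool result
const KEEP_RECENT = 4;                // assumed: leave the newest entries untouched

// Builds the message list for the next request. The stored history is never modified.
function projectForRequest(history: readonly ProjEntry[], idleMs: number): ProjEntry[] {
  if (idleMs < COLD_AFTER_MS) return [...history];
  const cutoff = history.length - KEEP_RECENT;
  return history.map((entry, i) =>
    entry.role === "tool" && i < cutoff && entry.content.length > LARGE_CHARS
      ? { ...entry, content: `[tool output trimmed: ${entry.content.length} chars]` }
      : entry,
  );
}
```

Because the trim produces a new array and new objects, undo and replay keep reading the untouched originals.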

Figure: Two compactors on one window. Micro reclaims dead tool output when the cache is already lost; full summarizes the prefix when the window fills.

Underneath both sits one rule that governs the whole context: append-only. Dynamic context — environment notes, reminders, goal status — is never spliced into the prefix; it’s wrapped in <system-reminder> and appended to the tail as a user-role message (so it doesn’t perturb the system prompt and break the cache). A comment in the goal injector spells out the lesson the hard way: it used to re-inject per step, which made context grow O(n²), and was rewritten to inject only at turn boundaries. That’s a scar I’d rather learn from than earn.
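The append-only rule is almost trivially small in code, which is part of its appeal. A sketch, with a hypothetical helper name and message shape; the only details taken from the source are the <system-reminder> wrapper, the user role, and the tail position.

```typescript
type ChatMsg = { role: "system" | "user" | "assistant"; content: string };

// Returns a new list with the reminder on the tail; every earlier message is left untouched,
// so the cached prefix stays byte-for-byte identical.
function appendReminder(messages: readonly ChatMsg[], note: string): ChatMsg[] {
  return [
    ...messages,
    { role: "user", content: `<system-reminder>\n${note}\n</system-reminder>` },
  ];
}
```

Calling this once per turn boundary rather than once per step is what keeps the reminders from piling up.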

There is, notably, no vector store and no long-term memory. The only thing that persists across sessions is a chain of AGENTS.md files (user-level down to the working directory) — the same idea as Claude Code’s CLAUDE.md. Everything else is reconstructed by replay.

Resume is just replay — through the same code path

State isn’t serialized and reloaded. Every action that changes state is written as a record to a wire.jsonl, and resuming means replaying those records. The trick that makes it trustworthy: replay runs the exact same code as live execution — a flag just turns the outward-facing emits into no-ops. So the “restore” path can’t drift from the “execute” path, because there is no separate restore path. Large binary blobs are externalized by hash (with dedup) so the log stays small and readable, and the wire format is versioned with per-version migrations.
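A toy version of that single code path, under assumptions I should flag: the two record types, the WireSession class, and the in-memory log standing in for wire.jsonl are all hypothetical, and I am assuming that writing to the log is suppressed during replay along with the emits. Blob externalization and version migrations are left out.

```typescript
type WireRecord =
  | { v: 1; type: "user_message"; text: string }
  | { v: 1; type: "todo_set"; items: string[] };

class WireSession {
  messages: string[] = [];
  todos: string[] = [];
  log: string[]; // stands in for the lines of wire.jsonl
  private emit: (event: string) => void;

  constructor(emit: (event: string) => void, log: string[] = []) {
    this.emit = emit;
    this.log = log;
  }

  // The only place state changes. Live execution and replay both go through here.
  apply(record: WireRecord, replaying = false): void {
    switch (record.type) {
      case "user_message":
        this.messages.push(record.text);
        break;
      case "todo_set":
        this.todos = [...record.items];
        break;
    }
    if (!replaying) {
      this.log.push(JSON.stringify(record)); // persist the record
      this.emit(record.type);                // outward-facing side effect
    }
  }

  static resume(lines: string[], emit: (event: string) => void): WireSession {
    const session = new WireSession(emit, [...lines]);
    for (const line of lines) session.apply(JSON.parse(line) as WireRecord, true);
    return session;
  }
}
```

There is no restore function to keep in sync: resume is a loop over apply with one flag set.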

Sub-agents are full agents, not virtual sessions

When the agent delegates, the child isn’t a lightweight context fork. It’s a complete Agent instance with its own wire.jsonl, its own profile and tool subset, and the same turn loop. That makes each sub-agent independently resumable, and in a swarm of up to 128 of them, any one can be retried by id. Context is strongly isolated: the parent’s history isn’t copied in, and only the child’s final summary comes back. If that summary is under ~200 characters, the engine automatically asks the child to expand it rather than accept “done.”

The batch scheduler behind a swarm is a tiny rate-limit controller: launch five, add one every 700ms, and on a provider 429 the whole batch shifts into exponential backoff (3s, 6s, 12s…), shrinks capacity, and then recovers one slot for every few minutes that pass without another rate limit.
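The arithmetic of that controller, sketched with my own names. The burst of five, the 700ms cadence, and the 3s doubling backoff come from the description above. How much capacity shrinks on a 429 is not stated, so halving is purely my assumption, and the recovery interval is left to the caller.

```typescript
const INITIAL_BURST = 5;
const LAUNCH_INTERVAL_MS = 700;
const BACKOFF_BASE_MS = 3_000;

// Offset in ms from batch start at which the i-th agent (0-based) launches.
function launchOffsetMs(i: number): number {
  return i < INITIAL_BURST ? 0 : (i - INITIAL_BURST + 1) * LAUNCH_INTERVAL_MS;
}

// Delay after the n-th consecutive 429 (1-based): 3s, 6s, 12s, ...
function backoffMs(consecutive429s: number): number {
  return BACKOFF_BASE_MS * 2 ** (consecutive429s - 1);
}

class SwarmCapacity {
  current: number;
  readonly max: number;

  constructor(max: number) {
    this.max = max;
    this.current = max;
  }

  // Assumption: halve on a rate limit, never dropping below one slot.
  onRateLimit(): void {
    this.current = Math.max(1, Math.floor(this.current / 2));
  }

  // Call once per quiet interval (the "few minutes" above) to win back one slot.
  onQuietInterval(): void {
    this.current = Math.min(this.max, this.current + 1);
  }
}
```

The asymmetry is deliberate in this kind of controller: capacity drops fast and recovers slowly, so one bad minute does not turn into a retry storm.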

My favorite hack lives here. When you interject a side question mid-task, the engine spins up a throwaway agent with a copy of the parent’s projected context and a deny-all permission policy, yet it still ships the full tool definitions to the model. Why include tools it’s forbidden to call? To keep the prompt-cache prefix identical. That’s the kind of detail you only write after watching a cache-hit graph.
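In sketch form, the trick is that the restriction lives in the permission policy, not in the request. The request shape, the policy type, and the function name below are all hypothetical; the point is only that the system prompt and tool list are passed through unchanged.

```typescript
type ToolDef = { name: string; description: string };
type ModelRequest = { system: string; tools: ToolDef[]; messages: string[] };
type PermissionPolicy = (toolName: string) => "allow" | "deny";

const denyAll: PermissionPolicy = () => "deny";

// The side agent sees the same system prompt and tool definitions as the parent,
// so the cached prefix still matches; only the tail and the policy differ.
function sideQuestionRequest(
  parent: ModelRequest,
  question: string,
): { request: ModelRequest; policy: PermissionPolicy } {
  return {
    request: {
      system: parent.system,
      tools: parent.tools,
      messages: [...parent.messages, question],
    },
    policy: denyAll,
  };
}
```

Dropping the tools would make the request smaller and the behavior identical, but it would also change the prefix and throw away the cache hit.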

Autonomy is an outer loop, not a longer turn

The autonomous “goal” mode is, in the code’s own words, the equivalent of a user repeatedly typing continue. A driver loop runs ordinary turns back to back; the model exits by calling a single tool, UpdateGoal, with a status. That tool deliberately has no reason field — the status is the machine-readable signal, and the model explains itself in its normal reply. Budgets (tokens, turns, wall-clock) are checked between turns, and the objective is injected wrapped in <untrusted_objective> so the model treats it as data, not as instructions that could override the system prompt.
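Here is how I would sketch that driver. The status values, the budget fields, and the function names are hypothetical, the wall-clock budget is omitted, and passing the wrapped objective as the first turn's input is a simplification of however the real injector places it. What it does preserve is the structure: ordinary turns back to back, a budget check between them, and a status as the only exit signal.

```typescript
type GoalStatus = "in_progress" | "complete" | "blocked"; // hypothetical status values
type TurnOutcome = { tokensUsed: number; goalUpdate?: GoalStatus };
type GoalBudget = { maxTurns: number; maxTokens: number };

// Wrapping marks the objective as data, not as instructions.
function wrapObjective(objective: string): string {
  return `<untrusted_objective>\n${objective}\n</untrusted_objective>`;
}

function driveGoal(
  objective: string,
  budget: GoalBudget,
  runTurn: (input: string) => TurnOutcome,
): { status: GoalStatus | "budget_exhausted"; turns: number } {
  let turns = 0;
  let tokens = 0;
  let input = wrapObjective(objective);
  while (true) {
    // Budgets are checked between turns, never inside one.
    if (turns >= budget.maxTurns || tokens >= budget.maxTokens) {
      return { status: "budget_exhausted", turns };
    }
    const outcome = runTurn(input);
    turns++;
    tokens += outcome.tokensUsed;
    // The model exits by reporting a terminal status; there is no reason field to parse.
    if (outcome.goalUpdate !== undefined && outcome.goalUpdate !== "in_progress") {
      return { status: outcome.goalUpdate, turns };
    }
    input = "continue"; // the equivalent of a user typing continue
  }
}
```

Because every iteration is a complete turn, cancellation needs no special machinery: the driver simply does not start the next one.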

Figure: Autonomy as a loop of normal turns. Every turn boundary is a real, observable idle point that is easy to cancel and easy to interrupt.

I find this more honest than stretching a single turn forever. Each turn ending is a genuine idle point you can cancel or step into — the autonomy is in the cadence, not in some monster turn that never returns.

What I’m taking away

A few of these are going straight into how I build:

- Stateless loop, stateful host. One loop, many modes. The modes are the host’s problem.
- finish_reason is a hint, not a verdict. Derive “keep going?” from the presence of tool calls and you stop fighting provider quirks.
- Append-only context, cache as a first-class constraint. Reminders go on the tail as user messages; nothing rewrites the prefix. Per-step injection is an O(n²) trap.
- Compaction reuses the live request shape so the summarization call itself caches, and never cuts a tool pair in half.
- Resume = replay through the same code path. No second implementation of state to drift.
- Concurrent execution, ordered record. Push the mess into a scheduler; keep the transcript linear.
- Autonomy is an outer loop of ordinary turns, with the model holding a single status lever to exit.

None of it is exotic. It’s the accumulation of small, correct decisions — the unglamorous kind that only reveal their value the third time a session resumes cleanly or a cache stays warm. I still translate each one into my own setup rather than copy it, but the reading was worth more than the cloning.