
The Spec Is Everything: How I Built a 14-Agent Team That Actually Works

I designed a multi-agent system to audit a physics engine. The agents weren't the hard part. The orchestration protocol was.

I spent the last few months building a physics-realistic billiards engine. Not a toy demo. A real engine where spin transfer, friction cones, cushion rebounds, and sliding-to-rolling transitions all behave the way they do on an actual table. The kind of engine where a single sign convention mismatch between the contact normal and the relative surface velocity makes the math look right while the ball does something physically impossible.

By the time I had a working engine, I had a new problem. I needed to audit it. Not “run the tests and see green.” Actually audit it. I needed someone to check whether my collision model violates conservation laws under edge cases. Whether my friction regime transitions are numerically stable at high spin and low speed. Whether my cushion model matches the qualitative behavior that any decent pool player would notice. Whether the whole thing is deterministic across platforms. Whether my benchmarks are actually measuring what I think they’re measuring.

That’s not a one-person job. That’s a team.

So I built one. Out of AI agents.

The punchline, before we start

Here’s what I’m going to walk you through in this post: a 14-agent system I designed from scratch, deployed on Claude Code’s Agent Teams feature, and ran for 3 hours against my engine. I’ll show you the full agent specs, the orchestration protocol, the dependency graph, the coordination rules, and the results.

But I want to be honest about the takeaway before we get into the details, because the details are long and it would be unfair to bury this.

The agents are not the hard part. The orchestration protocol is everything.

Without the protocol, the same agents with the same capabilities produce shallow, disjointed output that’s no better than what you’d get from a single good prompt. With the protocol, they ran for 3 hours autonomously and produced the most thorough audit I’ve received from any agentic setup, and I say that having spent thousands of hours working with coding agents over the past year.

If you take one thing from this post, take that. The spec is the product.

Now let me show you how.

Why agents, and why not just “prompt harder”

I want to address the obvious question. Why not just open Claude, paste in your codebase, and say “audit my physics engine”? I’ve done that. Many times. Here’s what happens:

The model gives you a competent surface-level review. It identifies some real issues. It misses the subtle ones. It doesn’t cross-reference against the literature. It doesn’t check whether your friction model matches the Marlow or the Han model. It doesn’t run your benchmarks and check if the numbers drift over long rollouts. It doesn’t think to search for billiards physics papers in Russian or Chinese (where some of the best empirical data lives). It doesn’t red-team its own conclusions.

Single-prompt approaches hit a ceiling because complex audits require multiple cognitive modes operating in sequence and in parallel: deep research, first-principles physics reasoning, formal mathematical analysis, test engineering, epistemic validation. No single prompt can sustain all of those modes simultaneously without losing depth in each one.

I needed specialization. I needed agents that could go deep on one axis while other agents went deep on different axes, then a mechanism to synthesize everything without losing the nuance.

The design process: FigJam, pen, and a lot of thinking

I didn’t start in code. I started in FigJam.

I laid out sticky notes for every cognitive function I’d need. Not “agents” yet. Functions. What does a thorough audit actually require? I mapped it out:

  1. Someone needs to understand what’s in my research folder (papers, notes, reference implementations)
  2. Someone needs to survey the literature beyond what I already have
  3. Someone needs to find comparable open-source engines and figure out what they do well and poorly
  4. Someone needs to check whether my physics violates first principles
  5. Someone needs to check whether my numerics are stable
  6. Someone needs to design and implement new tests
  7. Someone needs to challenge all the conclusions and find the weak links

Then I asked: what conventions do all of these functions share? What handoff protocol prevents work from getting lost between them? What failure modes are most likely when multiple agents work in parallel?

That thinking produced two files:

  • agentlist.md: the full spec for 14 agents
  • instruction.md: the orchestration protocol that coordinates them

Let me walk you through both.

Part 1: The agents

Before the agents: the constitution

Before defining any individual agent, I defined the rules that every agent must follow. I call these the system-wide conventions, and they’re the foundation everything else sits on. Without them, each agent invents its own format, its own terminology, its own idea of “done,” and the synthesis step becomes a nightmare.

Three conventions matter most:

Claim hygiene. Every non-trivial statement an agent makes must be annotated with five things:

| Field | What it means |
| --- | --- |
| Claim | The statement being made |
| Support | Evidence or reasoning path behind it |
| Confidence | High / Medium / Low |
| Assumptions | What must be true for this claim to hold |
| Counterpoints | What could invalidate it |

This sounds bureaucratic until you realize that without it, agents make confident-sounding claims backed by nothing, and the synthesis agent has no way to tell which findings are solid and which are vibes.
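The claim-hygiene convention maps naturally onto a small data structure. Here is a minimal sketch in Python; the class and method names are mine for illustration, not a schema from the actual specs:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    """One annotated claim, following the five-field claim-hygiene convention."""
    claim: str                      # the statement being made
    support: str                    # evidence or reasoning path behind it
    confidence: str                 # "High" / "Medium" / "Low"
    assumptions: List[str] = field(default_factory=list)
    counterpoints: List[str] = field(default_factory=list)

    def is_solid(self) -> bool:
        # A claim counts as solid only if it has some support
        # and is not explicitly low-confidence.
        return bool(self.support.strip()) and self.confidence != "Low"

# Example (hypothetical finding, not from the actual audit):
c = Claim(
    claim="Cushion restitution is speed-dependent",
    support="Rebound measurements in two independent papers",
    confidence="High",
    assumptions=["cloth condition comparable to reference tables"],
    counterpoints=["humidity effects not controlled in either study"],
)
```

The point of forcing structure like this is that a synthesis step can filter on `is_solid()` mechanically instead of re-reading prose.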

Standard artifact types. Every agent output must be labeled as one of eight types: Plan, Research Notes, Design Proposal, Decision Memo, Spec, Eval Harness, Red-Team Report, or Synthesis. This means downstream agents always know what they’re receiving and how to process it.

Handoff protocol. When an agent finishes, it must state: what it produced, what it needs next, what questions remain open, and which agent should pick up the work. No agent is allowed to just stop and say “done.”

The 14 agents

I designed 14 agents. Each one has a mission statement, core responsibilities, explicit inputs and outputs, methods and heuristics, a quality bar, and failure mode guardrails. I’m going to show you all of them, but first, here’s the map:

| # | Agent | One-line mission |
| --- | --- | --- |
| 1 | Principal Investigator (PI) | Own the outcome: set strategy, delegate, and decide |
| 2 | Reasoner-Planner (RPA) | Turn this into an executable plan with gates and deliverables |
| 3 | Clarifier | Ask the minimum questions to remove ambiguity |
| 4 | Information Analyst | Deep research: extract insights, not links |
| 5 | Physicist | First-principles sanity check and experimental rigor |
| 6 | Mathematician | Formalize, prove, analyze complexity/optimization |
| 7 | Open Source Researcher | Find, evaluate, and integrate the best repos |
| 8 | AI/ML Engineer | Design an ML system that ships and stays reliable |
| 9 | Synthesis Editor | Merge outputs into a coherent, contradiction-free doc |
| 10 | Eval Scientist | Make quality measurable: benchmarks, rubrics, regression |
| 11 | Reflection Agent | Critique, learn, and improve the team’s process |
| 12 | Swarm Coordinator | Run parallel exploration and drive consensus |
| 13 | Epistemic Validator | Truth firewall: logic, bias, uncertainty, verification |
| 14 | File Researcher | Local corpus ingestion + routing |

Some of these are self-explanatory. Some are not. Let me walk through the ones that carry the most weight.

Agent 1: The Principal Investigator

This is the brain of the operation. The PI doesn’t do the deep technical work. Instead, it frames the problem, decides which agents to deploy and why, generates multiple hypotheses (explicitly avoiding anchoring on the first plausible plan), designs the smallest experiments that de-risk the biggest unknowns, resolves conflicts between agents, and gates progression.

The key design choice here: the PI is not allowed to advance the project without meeting defined quality bars. It can’t hand-wave past a weak finding. If the Physicist says something is concerning and the evidence is Medium confidence, the PI has to either get more evidence or acknowledge the gap in the final report.

Failure: scope creep
→ Guardrail: enforce "MVP path" + explicit de-scoping list

Failure: consensus by vibes
→ Guardrail: require evidence + eval hooks for critical claims

Agent 3: The Clarifier

This agent exists because I’ve watched agentic systems waste enormous amounts of compute going down the wrong path because nobody stopped to ask “wait, what exactly are we trying to do here?”

The Clarifier’s job is to ask the minimum set of high-information questions that unblock progress. Not twenty questions. Three to seven, max, per round. Each one optimized for information gain: how much does the answer reduce ambiguity or branching factor?

The heuristic I gave it:

Start with: goal, audience, constraints, format, deadline.
If the user can't answer: suggest reasonable defaults and label them.

That last line is important. The Clarifier is not allowed to stall the project waiting for perfect information. If it can’t get an answer, it proposes a default, labels it as assumed, and moves on.

Agent 4: The Information Analyst

This is not a web search agent. This is a deep research agent. The difference matters.

A web search agent finds relevant links. The Information Analyst reads papers, tech reports, forums, and long threads. It detects “tribal knowledge” hidden in GitHub issues and war stories and niche communities. It cross-checks claims across multiple independent sources. It identifies points of disagreement and explains why they disagree.

And critically: it runs multilingual queries. Chinese, Russian, Japanese, Korean, German, French. I added this because during my own manual research for the billiards engine, some of the best empirical measurement data came from a Russian university paper that would never show up in an English-only search.

The quality bar I set:

Findings are actionable and reconciled; no "search result dumping."

Agent 5: The Physicist

This one is personal to me. I needed an agent that thinks like a physicist, not like a software engineer. Reduce problems to core variables, units, invariants, and conservation laws. Catch impossible or inconsistent assumptions. Design falsifiable tests. Prefer causal and mechanistic narratives over correlation-only thinking.

The heuristic I love most:

Always ask: "what changes if scale increases 10x?"

That one question catches an enormous number of hidden assumptions about numerical stability, performance, and physical correctness.
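To make the 10x question concrete, here is a toy illustration (mine, not from the audit): a unit harmonic oscillator integrated with explicit Euler. The drift isn’t a fixed cost you measure once; it compounds with the number of steps, so a simulation that looks fine at one scale quietly falls apart at ten times the length.

```python
def euler_energy_drift(steps: int, dt: float = 0.01) -> float:
    """Integrate a unit harmonic oscillator with explicit Euler and
    return the relative energy drift after `steps` steps."""
    x, v = 1.0, 0.0
    e0 = 0.5 * (v * v + x * x)          # initial total energy
    for _ in range(steps):
        x, v = x + dt * v, v - dt * x   # explicit Euler update (old values on RHS)
    e = 0.5 * (v * v + x * x)
    return (e - e0) / e0

# Each step multiplies the energy by (1 + dt^2), so drift compounds:
short_run = euler_energy_drift(1_000)    # modest drift
long_run = euler_energy_drift(10_000)    # much worse, not 10x worse
```

A symplectic integrator (or a smaller dt) changes this picture entirely, which is exactly the kind of answer the "10x" question is designed to surface.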

Agent 6: The Mathematician

Separate from the Physicist because the failure modes are different. The Physicist catches things that are physically wrong. The Mathematician catches things that are numerically wrong: floating-point sensitivity, convergence failures, stability conditions, integration drift.

Quality bar: No hand-wavy math. Uncertainty labeled explicitly.

Method: Use counterexample search to test "obvious" claims.
When possible, produce small toy examples that expose failure modes.
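In that spirit, here is what such a toy counterexample might look like. This is my own classic example of floating-point cancellation, not a finding from the audit: the textbook quadratic formula loses most of its accuracy on the smaller root when `b*b` dwarfs `4*a*c`, while an algebraically equivalent rearrangement does not.

```python
import math

def roots_naive(a, b, c):
    """Textbook quadratic formula -- numerically fragile when b*b >> 4*a*c."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Cancellation-free variant: compute the large root without
    subtracting nearly equal numbers, then recover the small root
    from the product of roots (c/a)."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    return q / a, c / q

# Counterexample: a=1, b=1e8, c=1. The true small root is ~ -1e-8,
# but the naive formula computes -b + d from two nearly equal values.
naive_small, _ = roots_naive(1.0, 1e8, 1.0)
stable_large, stable_small = roots_stable(1.0, 1e8, 1.0)
```

The same pattern applies to contact resolution: any place the engine subtracts two nearly equal quantities (penetration depths, relative velocities near zero) is a candidate for this kind of counterexample search.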

Agent 10: The Eval Scientist

This is the agent that turns everything into numbers. It designs scoring rubrics, builds golden test sets and adversarial cases, creates regression harnesses, and runs slice-based performance breakdowns to identify where the engine’s performance collapses.

The quality bar for this agent is the strictest of any:

A change cannot be merged or shipped without passing defined eval gates.

And the heuristic for test design:

Prefer tests that are:
  - cheap to run
  - hard to game
  - strongly predictive of real-world quality

Agent 13: The Epistemic Validator

This is maybe my favorite agent. Its entire job is to be the truth firewall. It detects contradictions within and across agent outputs. It identifies unstated assumptions. It spots cognitive biases: confirmation bias, anchoring, survivorship bias, motivated reasoning. It forces calibrated confidence statements. It attempts to construct counterexamples for every major claim.

Quality bar:
No critical claim passes without either evidence
or an explicit "uncertain" label plus a verification plan.

Most multi-agent systems skip this role entirely, and then wonder why their outputs contain confident-sounding nonsense.

Agent 14: The File Researcher

This agent solves a specific problem: I have a local folder full of research papers, notes, reference implementations, and benchmark data. That folder is messy. It has duplicates, different file formats, papers on different subtopics.

The File Researcher scans that entire folder, classifies every file, extracts the key artifacts, and then routes each artifact to the right specialist agent. Physics papers go to the Physicist and Mathematician. Benchmark data goes to the Eval Scientist. Reference implementations go to the Open Source Researcher. Calibration data goes to the Eval Scientist and Epistemic Validator.

It produces Agent Brief Packs: one per recipient, with the top 5 most relevant files, extracted key snippets, implications for our engine, and recommended next actions.

Quality bar:
A specialist agent should be able to start work using only the brief pack.

The remaining agents (quick hits)

| Agent | What it does | Why it exists |
| --- | --- | --- |
| Reasoner-Planner (RPA) | Decomposes the PI’s strategy into atomic executable steps with explicit owners, inputs, outputs, and acceptance criteria | Plans without this agent have “magic steps” where nobody specified how to actually do the thing |
| Open Source Researcher | Finds comparable engines and tools, reads their issues/PRs for hidden limitations, builds shortlists with tradeoffs | READMEs lie. Issues don’t. This agent trusts issues over READMEs. |
| AI/ML Engineer | Bridges research and production. Separates offline metrics, online metrics, and safety metrics. Insists on baselines and fallbacks. | Because a beautiful model that can’t be deployed is not an ML system |
| Synthesis Editor | Merges all agent outputs into one coherent doc. Resolves contradictions. Normalizes terminology. | The editor writes the executive summary last, after seeing everything |
| Reflection Agent | Runs pre-mortems (assume failure happened, explain how, prevent it). Diagnoses root causes of errors. | Identifies “unknown unknowns” by listing what was not checked |
| Swarm Coordinator | Manages parallel exploration. Uses “independent first pass then merge” to reduce anchoring. | Without this, parallel agents anchor on each other’s early outputs and converge prematurely |

Team operating modes

Not every task needs all 14 agents. I designed three operating modes:

| Mode | When to use | Active agents | Duration |
| --- | --- | --- | --- |
| A: Fast Triage | Quick assessment, v0 deliverable | Clarifier → RPA → PI → Synthesis Editor | Minutes |
| B: Deep Research | Thorough investigation, full evidence | Swarm Coordinator parallelizes all research agents; Physicist + Mathematician validate; Eval Scientist measures; Epistemic Validator audits | Hours |
| C: Production Hardening | Ship-ready quality gate | Eval Scientist + AI/ML Engineer build regression harness; Reflection Agent pre-mortems; PI approves | Hours |

For the physics engine audit, I ran Mode B.

Part 2: The orchestration protocol

This is the part that matters most. The agent specs define what each agent can do. The orchestration protocol defines how they work together. And I cannot stress this enough: the protocol is the product.

I learned this the hard way. Early experiments with the same agents but weaker protocols produced mediocre results. The agents would overlap, miss gaps, write conflicting outputs, and lose context mid-way through. Once I tightened the protocol, the same agents went from “useful but messy” to “the best agentic output I’ve ever gotten.”

Consolidation: 14 specs into 7 active teammates

Running 14 agents simultaneously would blow through tokens and create coordination overhead that outweighs the benefit. For this specific task, I consolidated the 14 specs into 7 active teammates:

| # | Teammate | Consolidated from | Model |
| --- | --- | --- | --- |
| 1 | File Researcher | File Researcher spec | Sonnet |
| 2 | Literature & Prior Art Analyst | Information Analyst + multilingual research | Sonnet |
| 3 | Open Source Scout | Open Source Researcher | Sonnet |
| 4 | Physicist (Mechanics Auditor) | Physicist spec | Opus 4.6 |
| 5 | Mathematician (Numerics Auditor) | Mathematician spec | Opus 4.6 |
| 6 | Eval Scientist / Test Engineer | Eval Scientist spec | Opus 4.6 |
| 7 | Epistemic Validator / Red Team | Epistemic Validator + Reflection Agent | Opus 4.6 |

Notice the model allocation. This isn’t random. Search and retrieval tasks go to Sonnet (cheaper, faster, good enough). Novel reasoning, physics, math, and idea generation go to Opus 4.6 (more expensive, but the quality difference matters for these tasks). Balancing capability against token cost is part of the engineering.

The rules that prevent chaos

Through trial and error (mostly error), I developed a set of coordination rules that address the most common multi-agent failure modes. These are the rules I wish someone had told me before I wasted a lot of tokens learning them myself.

Rule 1: One source of truth.

Use the shared task list as the single source of truth for all work.
Lead MUST create tasks sized so each teammate can finish
within a reasonable sprint.

Without this, agents lose track of what’s been done and what hasn’t.

Rule 2: No file conflicts. Ever.

Each teammate writes ONLY to their designated output file(s).
No two teammates edit the same file. Ever.
Lead assembles the final report from teammate outputs.

The workspace layout I enforce:

docs/audit/
├── README.md
├── CONTEXT.md
├── EVIDENCE_MATRIX.md          # Living claim→evidence table
├── notes/
│   ├── file_researcher.md
│   ├── literature_prior_art.md
│   ├── open_source_scout.md
│   ├── physicist_audit.md
│   ├── math_numerics_audit.md
│   ├── eval_tests.md
│   └── epistemic_red_team.md
├── handoffs/
└── FINAL_REPORT.md             # Lead assembles at end

Each agent writes to exactly one file. The lead agent reads from all of them and assembles the final report. Simple, but it eliminates an entire class of bugs.

Rule 3: Persist continuously, not at the end.

Teammates MUST persist all intermediate findings to disk continuously
(not just at the end).

If running low on context, immediately create a handoff document with:
  - Summary of discoveries so far
  - What remains to investigate
  - Specific next steps for a replacement teammate
  - Key sources/references consulted

This is the rule that protects against context rot. LLM agents have finite context windows. If an agent burns through its context doing deep research and then loses everything because it didn’t write to disk, that’s hours of work gone. The persist-continuously rule means that even if an agent hits its limit, the work survives.
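The persist-continuously rule is easy to implement. Here is a minimal sketch of what an agent-side finding log could look like; the class and field names are mine, not part of the actual protocol files:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

class FindingLog:
    """Append-only on-disk log: every finding is persisted the moment
    it is recorded, so a context-limited agent never loses work."""

    def __init__(self, path: str):
        self.path = path

    def record(self, finding: str, evidence: str = "", confidence: str = "Medium"):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "finding": finding,
            "evidence": evidence,
            "confidence": confidence,
        }
        with open(self.path, "a") as f:          # append, never rewrite
            f.write(json.dumps(entry) + "\n")

    def handoff(self, remaining: list, next_steps: list) -> dict:
        """Build the handoff document a replacement teammate would pick up."""
        with open(self.path) as f:
            done = [json.loads(line) for line in f]
        return {
            "summary": [e["finding"] for e in done],
            "remaining": remaining,
            "next_steps": next_steps,
        }

# Usage (hypothetical findings for illustration):
log = FindingLog(os.path.join(tempfile.mkdtemp(), "physicist_audit.jsonl"))
log.record("cushion restitution looks speed-independent", confidence="Low")
log.record("spin decay matches exponential fit", evidence="bench run")
doc = log.handoff(remaining=["multi-ball impacts"], next_steps=["re-run with dt sweep"])
```

Because each `record` call hits disk immediately, a replacement agent can reconstruct the state of the investigation from the file alone.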

Rule 4: No code without approval.

Eval Scientist must propose: files to create/modify, test categories,
expected outputs, commands.
Lead reviews and approves/rejects with feedback.
No code changes until approved.

Only one agent (the Eval Scientist) is allowed to write code, and only after the Lead explicitly approves a plan. This prevents the very common failure mode where an eager agent starts rewriting your test suite and introduces bugs.

The dependency graph

This is the part I’m most proud of. The dependency graph defines which tasks can run in parallel and which block on others. Getting this right is the difference between a 3-hour run and a 12-hour run that produces worse results.

T0  Preflight + scaffold workspace           [Lead]
│
├── T1  File scan + routed summaries         [File Researcher]
├── T2  Literature research sprint           [Literature Analyst]
├── T3  Open source discovery sprint         [Open Source Scout]
├── T6  Run baseline tests + benchmarks      [Eval Scientist]
│
│   ┌── T1 + T2 complete ──┐
│   │                       │
│   T4  Physics audit       T5  Math/numerics audit
│   [Physicist]             [Mathematician]
│   │                       │
│   └───────┬───────────────┘
│           │
│   T3 + T4 + T5 + T6 complete
│           │
│   T7  Test expansion plan + harness design [Eval Scientist]
│           │
│   T8  Implement starter tests              [Eval Scientist]
│           │                                (PLAN APPROVAL REQUIRED)
│   T2-T8 complete
│           │
│   T9  Epistemic validation                 [Epistemic Validator]
│           │
│   All complete
│           │
│   T10 Final synthesis                      [Lead]

Notice: T1, T2, T3, and T6 all run in parallel. They only depend on T0 (preflight). This means four agents are working simultaneously in the first phase, which is where most of the information gathering happens.

T4 and T5 (the physics and math audits) wait for T1 and T2 to finish. This is intentional. The Physicist and Mathematician need the File Researcher’s brief packs and the Literature Analyst’s findings before they can do meaningful work. Starting them earlier would waste tokens on uninformed analysis.

T7 (test expansion planning) waits for almost everything. The Eval Scientist needs to know what the Physicist found concerning, what the Mathematician flagged as numerically unstable, and what comparable engines do differently before designing the test expansion.

T9 (epistemic validation) runs last before synthesis, because its job is to challenge everything that came before it.
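A dependency graph like this can be executed with a simple wave scheduler: repeatedly collect every task whose dependencies are already satisfied, run that wave in parallel, and repeat. Here is a sketch, with the audit’s graph transcribed from the diagram above (the dependency sets are my reading of it, not a file from the repo):

```python
def schedule_waves(deps):
    """Group tasks into waves: every task in a wave has all its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    waves, done = [], set()
    pending = dict(deps)
    while pending:
        ready = sorted(t for t, d in pending.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del pending[t]
    return waves

# The audit's dependency graph, as transcribed from the diagram.
AUDIT_DEPS = {
    "T0": set(),
    "T1": {"T0"}, "T2": {"T0"}, "T3": {"T0"}, "T6": {"T0"},
    "T4": {"T1", "T2"}, "T5": {"T1", "T2"},
    "T7": {"T3", "T4", "T5", "T6"},
    "T8": {"T7"},
    "T9": {"T2", "T3", "T4", "T5", "T6", "T7", "T8"},
    "T10": {"T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"},
}
```

Running `schedule_waves(AUDIT_DEPS)` reproduces the phasing described above: one four-task parallel wave up front, then progressively narrower waves as the work funnels toward synthesis.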

What each agent must report

Every agent, when finishing a task, must message the lead with exactly four things:

1. "What I did"          (summary)
2. "Key findings"        (bullet points)
3. "Evidence strength"   (strong/medium/weak with reasoning)
4. "What I need next"    (blockers, if any)

This isn’t optional. This is what makes the lead agent’s job possible. Without structured reporting, the lead has to read through pages of unstructured notes to figure out what actually happened.
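Because the format is fixed, the lead can even validate reports mechanically before accepting them. A minimal sketch (the field names are my rendering of the four-part format, not the protocol’s literal keys):

```python
REQUIRED_FIELDS = ("what_i_did", "key_findings", "evidence_strength", "what_i_need_next")

def validate_report(report: dict) -> list:
    """Return the list of missing or empty fields; an empty list means
    the report satisfies the four-part format."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = report.get(name)
        if value is None or (isinstance(value, (str, list)) and not value):
            problems.append(name)
    return problems

# A well-formed report (hypothetical content):
good = {
    "what_i_did": "Audited cushion rebound model",
    "key_findings": ["restitution modeled as constant; literature suggests speed-dependent"],
    "evidence_strength": "medium: two sources agree, no direct measurement",
    "what_i_need_next": "none",    # explicit 'none' rather than an empty field
}
```

Rejecting malformed reports at the door is far cheaper than discovering during synthesis that an agent never stated its evidence strength.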

The 8-axis engine audit

The protocol specifies exactly which axes the audit must cover:

| Axis | What to verify |
| --- | --- |
| Dynamics/integration stability | Energy drift, dt sensitivity, integrator choice |
| Collisions | Ball-ball with spin, stick-ball, rail/cushion, pockets |
| Friction regimes | Sliding→rolling transition, spin decay, throw/english effects |
| Material parameters | Restitution, friction coefficients, cushion response, calibration |
| Multi-contact edge cases | Simultaneous impacts, clustered rack, break shot |
| Determinism & reproducibility | Same seed → same result across runs/platforms |
| Performance/throughput | Worst-case spikes, scaling with complexity |
| Numerical robustness | Tunneling, penetration correction, jitter, floating-point sensitivity |

And the mandatory physics realism checks that must be included regardless:

| Check | What to verify |
| --- | --- |
| Ball-ball collision | Outcomes across cut angles, speeds, spins; spin transfer; tangential effects |
| Cushion rebound | Running english vs reverse english behavior |
| Sliding-to-rolling | Transition distance dependence on speed, spin, friction |
| Spin decay | Rate of spin decay on cloth over time |
| Near-grazing hits | Numerical sensitivity and stability |
| Simultaneous multi-ball | Break shot, clustered rack collision handling |
| Tunneling / dt sensitivity | Time-step sweep tests |
| Energy drift | Long rollout stability and conservation |
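To show what a tunneling / dt-sensitivity check actually tests, here is a deliberately simplified 1D illustration of mine (not code from the engine or the audit harness): a fast ball versus a thin wall. Sampling positions only at step boundaries lets the ball skip the wall entirely at a coarse timestep; a swept check over the whole segment traveled per step does not.

```python
def crosses_wall_discrete(x0, v, wall, dt, steps):
    """Naive stepping: only checks positions at sample points,
    so a fast ball can 'tunnel' through a thin wall."""
    x = x0
    for _ in range(steps):
        x += v * dt
        if abs(x - wall) < 0.01:   # wall occupies (wall-0.01, wall+0.01)
            return True
    return False

def crosses_wall_swept(x0, v, wall, dt, steps):
    """Swept check: tests the whole segment travelled in each step,
    so no dt is coarse enough to skip the wall."""
    x = x0
    for _ in range(steps):
        x_new = x + v * dt
        if min(x, x_new) <= wall <= max(x, x_new):
            return True
        x = x_new
    return False
```

A dt-sweep test runs the same scenario across a range of timesteps and asserts the outcome is invariant; any dt where the discrete and swept answers diverge is a tunneling bug.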

The final report structure

The protocol mandates exact headings for the final report. This isn’t pedantic. It’s what makes the output consistently useful across runs:

# Pool Physics Engine Audit Report

## 1. Executive Summary
## 2. What We Audited (engine scope + assumptions)
## 3. Credibility Assessment (with confidence + evidence)
## 4. Literature & Prior Art Review (with citations and "so what")
## 5. Gap Analysis & Failure Modes (ranked, with reproduction ideas)
## 6. Test Strategy & New Test Suite Plan
## 7. Metrics, Baselines, Targets (table)
## 8. Improvement Roadmap (0–2 weeks, 2–6 weeks, 6+ weeks)
## 9. Calibration & Validation Plan (real-world measurement strategy)
## 10. Epistemic Validator Report (contradictions, weak links, verification steps)
## 11. Reflection Report (process critique + next audit iteration)
## 12. Appendices
    A) Evidence Matrix
    B) Paper Notes
    C) Shot Catalog
    D) Test Harness Details
    E) Local Corpus Summary

The non-negotiables

At the bottom of the protocol, I placed a section called “Absolute Requirements” that overrides everything else:

- All non-trivial claims MUST be backed by sources or experiments
- Maintain the Evidence Matrix: claim → source → confidence → caveats
- When sources disagree, surface the disagreement AND propose
  an experiment to resolve it
- Provide reproducible test code or specs with seeded randomness
- Multilingual research is REQUIRED (7+ languages minimum)
- Do NOT produce "vibes" — produce numbers, acceptance criteria,
  and clear next steps

That last line is my favorite. “Do NOT produce vibes.” It’s the single most effective instruction I’ve found for getting agents to produce useful output instead of confident-sounding filler.
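The "no vibes" requirement can even be enforced mechanically against the Evidence Matrix. A small sketch (my own illustration; the row format and example claims are hypothetical):

```python
def vibes_check(matrix):
    """Flag evidence-matrix rows that violate the absolute requirements:
    a claim backed by neither a source nor an experiment is a 'vibe'."""
    return [
        row["claim"] for row in matrix
        if not row.get("source") and not row.get("experiment")
    ]

# Hypothetical matrix rows:
matrix = [
    {"claim": "spin decay is exponential", "source": "paper X", "confidence": "Medium"},
    {"claim": "cushion model feels right", "confidence": "High"},  # no backing at all
]
```

Any non-empty result from a check like this is a synthesis blocker: either find evidence, or relabel the claim as uncertain with a verification plan.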

What happened when I ran it

I launched the team on a Sunday afternoon. Seven agents, Mode B (Deep Research), targeting my billiards engine repo and a local folder of research papers.

The first phase (parallel research sprint) took about 45 minutes. The File Researcher scanned my local corpus and produced brief packs for each specialist. The Literature Analyst ran searches in English, Chinese, Russian, Japanese, Korean, German, and French, and found papers I hadn’t encountered in my own research. The Open Source Scout found and evaluated five comparable engines. The Eval Scientist ran my existing test suite and benchmarks to establish baselines.

The second phase (physics and math audits) took about an hour. The Physicist audited collision behavior, friction regimes, cushion dynamics, and spin transfer against first-principles expectations. The Mathematician audited numerical stability, integration drift, floating-point sensitivity, and event ordering.

The third phase (test expansion, epistemic validation, synthesis) took the remaining time. The Eval Scientist designed a layered test taxonomy, the Epistemic Validator challenged the conclusions and flagged weak links, and the Lead assembled the final report.

Total runtime: approximately 3 hours.

The output

The final report was structured exactly as the protocol specified. Twelve sections, with appendices. Here’s what I got:

  • An evidence matrix mapping every major claim to its source, confidence level, and caveats
  • A gap analysis ranked by severity, with reproduction ideas for each gap
  • A test expansion plan covering unit tests, property-based tests, regression tests, scenario tests, and adversarial tests
  • A calibration strategy specifying what real-world measurements are needed and how to fit parameters
  • An epistemic validation report flagging contradictions, weak evidence, and proposed disambiguating experiments
  • A roadmap broken into 0-2 weeks, 2-6 weeks, and 6+ weeks time horizons

Every claim had a confidence rating. Every recommendation had an acceptance criterion. Every disagreement between sources was surfaced with a proposed experiment to resolve it.

It was, by a significant margin, the most thorough output I’ve gotten from any agentic setup.

What I actually learned

1. The protocol is everything

I keep saying this because it’s the single most important lesson. The same agents, with the same capabilities, with the same models, produce dramatically different output depending on the orchestration protocol. A weak protocol produces shallow work. A strong protocol produces deep work. The agents are the engine. The protocol is the driver.

2. File conflict prevention is not optional

Early in my experiments, I let agents write to shared files. It was a disaster. Agents would overwrite each other’s work, create merge conflicts, and lose findings. The rule “each teammate writes ONLY to their designated file, no exceptions” eliminated this entire class of failures.

3. Persist-continuously saves you from context rot

LLM agents have finite context. When they hit the limit, they lose everything in working memory. The persist-continuously rule means that every significant finding is written to disk as it’s discovered, not at the end. When an agent hits its limit, it creates a handoff document and a replacement can pick up where it left off.

4. Model allocation matters

Not every task needs the most expensive model. Search and file scanning work fine on Sonnet. Physics reasoning and mathematical proofs need Opus. Matching model capability to task complexity saves tokens without sacrificing quality on the tasks that actually require intelligence.

5. Structured reporting is what makes synthesis possible

Without the four-part reporting format (what I did, key findings, evidence strength, what I need next), the lead agent has to parse unstructured prose to figure out what happened. With it, synthesis becomes mechanical: read the structured reports, resolve conflicts, assemble the document.

6. The Epistemic Validator is the most underrated role

Most people designing multi-agent systems include researchers, planners, and executors. Almost nobody includes a dedicated truth checker whose only job is to find problems with everyone else’s work. Adding the Epistemic Validator caught issues that would have shipped in the final report without it: overclaimed confidence, unstated assumptions, contradictions between the Physicist and the Mathematician that needed resolution.

7. Multilingual research finds things English-only research misses

The Literature Analyst’s multilingual queries found papers and empirical data that never appeared in English searches. If your domain has an international research community (and most technical domains do), restricting your research agents to English is leaving information on the table.

The meta-lesson

Here’s what I keep thinking about. The reason the orchestration protocol works is not because it’s clever. It’s because it encodes the same practices that make human teams effective: clear roles, explicit handoffs, shared standards, structured reporting, dedicated quality assurance, and conflict resolution mechanisms.

The agents don’t need to be geniuses. They need to be well-coordinated.

That’s true of human teams too. The best team of individually brilliant engineers, without clear process, will produce worse results than a solid team with excellent coordination. What’s different with AI agents is that the coordination protocol is explicit, written in plain text, and reproducible. You can iterate on it, version-control it, and run it again.

That’s the opportunity. Not smarter agents. Better specs.


The full agent specs and orchestration protocol are available as a standalone document. If you want to adapt this approach for your own projects, the key insight is to start with the cognitive functions you need, then design the coordination protocol before you design the agents. The protocol is the product. Everything else follows from it.
