Long-running multi-agent orchestration via Missions

An analysis of Factory AI's approach to long-running multi-agent orchestration via the "mission" pattern.

This post analyses the architectural pattern behind "missions", a multi-agent workflow for software engineering that runs over hours and days rather than minutes. The material is drawn from two recent talks given by engineers at Factory AI: a long-form podcast walkthrough with CTO Eno Reyes, and a conference presentation by Luke Alvoeiro, who leads their core agent harness. The intent here is not to summarise either talk in isolation, but to extract the architectural ideas that are worth taking seriously, regardless of which agent harness or vendor a user might be working with.

Disclaimer: I am not affiliated with Factory AI in any way, but I found the content interesting enough to attempt an informational analysis of the key takeaways from these talks.

Two anchoring data points from the talks are worth stating upfront, because the rest of the analysis depends on whether one believes them.

  1. The longest production run of a single mission described in the talks lasted sixteen consecutive days.
  2. The architecture is reportedly viable up to roughly thirty days before validation overhead and context drift make returns negative.

Whether these numbers generalise outside the specific harness they were measured in is an open question, but the pattern is interesting precisely because it claims to make multi-day agentic work possible at all.

Human attention is the bottleneck

Today's frontier models are capable enough to plan and complete fifty backlog items in a sprint, whereas a human engineering team ships only a few per day. The gap is not a matter of intelligence: every task requires a human in the loop to scope, review, and unblock it. The most valuable thing an agentic system can do is therefore not to be smarter, but to absorb more of the supervision load, so that the human decides what to build rather than how to build it.

This reframed focus matters because it cleanly separates "agent quality" from "agent throughput". A system that produces better code per token but still requires per-task human supervision has not moved the bottleneck. A system that produces slightly lower quality code per token but is supervisable in batches at milestone boundaries has, in throughput terms, won. The missions pattern is engineered for this second case.

Strategies for orchestration

The talks propose a useful taxonomy of five frontier multi-agent strategies:

  • Delegation, in which one agent spawns another to investigate or implement a sub-task, is what most tools implement first and is the basis of sub-agents in coding harnesses.
  • Creator-verifier, which separates the agent that builds something from the agent that checks it, on the theory that a fresh context is more likely to find issues than the agent that just wrote the code and is biased toward its own work.
  • Direct communication, in which agents DM each other without a central coordinator, fragments state across conversations and is consequently hard to keep coherent.
  • Negotiation, which occurs when agents must coordinate over shared resources, ideally as positive-sum trades rather than adversarial conflicts.
  • Broadcast, which covers status updates and shared constraints flowing from one agent to many, and is unglamorous but essential for keeping a long-running workflow on the same page.

Missions compose four of these strategies into a single workflow (delegation, creator-verifier, broadcast, and negotiation). Direct communication is deliberately omitted. Every agent reads from and writes to a shared mission state instead of messaging peers. Most "agent swarm" demos lean heavily on direct communication, which produces visually impressive coordination but tends to drift over long horizons because no single artifact is authoritative. Replacing peer-to-peer messaging with shared-state reads-and-writes is the kind of structural decision that does not look exciting in a demo but pays back continuously over a multi-day run. The reason is that a single authoritative artifact gives every agent the same picture of the mission, lets a crashed or swapped-in worker resume by reading state rather than replaying conversations it never saw, and forces contradictions to surface at write-time instead of compounding silently across pairwise channels until they detonate at integration.
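To make the shared-state idea concrete, here is a minimal sketch of one way contradictions could be forced to surface at write time. It assumes a versioned state object with optimistic writes; the talks do not describe the actual mechanism, and `MissionState`, `StaleWriteError`, and the `db` key are hypothetical names used only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class StaleWriteError(Exception):
    """Raised when a writer's snapshot of the state is out of date."""


@dataclass
class MissionState:
    # Hypothetical structure: one authoritative record shared by all agents.
    version: int = 0
    facts: dict[str, str] = field(default_factory=dict)

    def read(self) -> tuple[int, dict[str, str]]:
        # Return a copy of the facts plus the version they were read at.
        return self.version, dict(self.facts)

    def write(self, seen_version: int, key: str, value: str) -> int:
        # Reject writes based on an outdated snapshot, so conflicts show up
        # immediately instead of at integration time.
        if seen_version != self.version:
            raise StaleWriteError(
                f"state moved from v{seen_version} to v{self.version}"
            )
        self.facts[key] = value
        self.version += 1
        return self.version


state = MissionState()
snapshot_version, _ = state.read()
state.write(snapshot_version, "db", "postgres")    # first writer succeeds
try:
    state.write(snapshot_version, "db", "sqlite")  # second writer is stale
    conflict_detected = False
except StaleWriteError:
    conflict_detected = True
```

A resuming worker in this sketch only needs `read()` to get the current picture; it never has to replay another agent's conversation.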

Three-role orchestration architecture

The missions pattern defines three role types: orchestrators, workers, and validators.

The orchestrator handles planning. When the operator describes what they want, the orchestrator interrogates them about scope, target users, stack preferences, and integration assumptions, and then produces a plan containing features, milestones, and a validation contract. Workers handle implementation. Each worker receives a feature with a clean context window (no accumulated baggage from prior turns), reads its spec, implements the feature, commits the result via git, and hands off to the next worker. Validators handle verification, and they come in two distinct flavours.

The scrutiny validator runs the conventional checks: lint, type check, and the test suite. Critically, it also spawns dedicated code review sub-agents in parallel for each feature in the milestone, so reviews are not done by the agent that wrote the code. The user-testing validator behaves more like a QA engineer. It launches the application, interacts with it through computer use or a browser harness, fills out forms, clicks buttons, and verifies that functional flows actually work end-to-end. This second validator is where most of a mission's wall-clock time is spent, since it is waiting on real applications to render rather than generating tokens.

Neither validator has seen the implementation code before being asked to verify it. Validation is consequently adversarial, which is what allows the workflow to run for many days without quietly drifting away from the original intent. Code review by the agent that wrote the code is structurally biased no matter how much one prompts around it, and a fresh-context verifier is the cheapest available correction.

Validation contracts must be load-bearing

Of everything described in the two talks, the validation contract is the single most important construct. The contract is a structured list of assertions that defines what "done" means, and it is written by the orchestrator during planning, before any code is written. For a complex project it may contain hundreds of assertions. Each feature in the implementation plan must be mapped to one or more assertions it satisfies, and the sum of all features must cover every assertion.
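The coverage rule in this paragraph (every feature maps to at least one assertion, and the features together cover the whole contract) can be checked mechanically. The sketch below assumes assertions are identified by string IDs and that the plan is a simple mapping from feature name to assertion IDs; both the representation and the example names are assumptions, not details from the talks.

```python
def uncovered_assertions(contract: set[str], plan: dict[str, set[str]]) -> set[str]:
    """Assertions in the contract that no planned feature claims to satisfy."""
    claimed = set().union(*plan.values())
    return contract - claimed


def unmapped_features(plan: dict[str, set[str]]) -> list[str]:
    """Features that are not tied to any assertion at all."""
    return sorted(name for name, ids in plan.items() if not ids)


# Hypothetical assertion IDs and feature names, for illustration only.
contract = {"A1", "A2", "A3"}
plan = {"login-form": {"A1"}, "theme-toggle": {"A2"}, "footer": set()}

missing = uncovered_assertions(contract, plan)  # assertions nobody covers
orphans = unmapped_features(plan)               # features with no assertion
```

Here `missing` contains `A3` and `orphans` contains `footer`, so a plan like this one would be sent back for revision before any worker starts.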

The motivation for writing the contract first is straightforward. Tests written after implementation do not catch bugs; they confirm whatever decisions the implementer already made. The agent is biased toward the code it just produced, so any validation it shapes around that code will silently reflect that bias. By writing the contract upfront, with no implementation to anchor against, the assertions describe correctness independently of any particular execution path.

Each assertion also declares how it will be proved. An example from the live demo: "the application loads successfully" is paired with a procedure that runs the dev server, checks the console for errors, checks the network tab, and takes a screenshot. "Light mode renders correctly" is paired with screenshots of both light and dark modes and a console check. This is the structural shape that lets the orchestrator dispatch validation as something an agent can mechanically execute, rather than as a vague human judgment call.
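One way to picture an assertion that carries its own proof procedure is a small record with a claim and an ordered list of steps. The sketch below reuses the "application loads" example from the demo; the `Assertion` type, the `verify` function, and the stand-in step runner are hypothetical and are not the harness's actual schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assertion:
    assertion_id: str        # hypothetical identifier scheme
    claim: str               # what must be true
    proof: tuple[str, ...]   # ordered steps that demonstrate the claim


def verify(assertion: Assertion, run_step: Callable[[str], bool]) -> dict:
    """Run every proof step and report per-step results."""
    steps = {step: run_step(step) for step in assertion.proof}
    return {
        "id": assertion.assertion_id,
        "passed": all(steps.values()),
        "steps": steps,
    }


app_loads = Assertion(
    assertion_id="A1",
    claim="The application loads successfully",
    proof=(
        "run the dev server",
        "check the console for errors",
        "check the network tab",
        "take a screenshot",
    ),
)


# Stand-in executor: a real validator would drive a browser here.
def fake_runner(step: str) -> bool:
    return step != "check the console for errors"


report = verify(app_loads, fake_runner)
```

Because every step is recorded, a failed report says which part of the proof broke rather than only that the assertion failed.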

It should also be noted that this design corrects an under-appreciated training-time fact. Almost all coding agents are reinforced under a reward signal of correctness against some specification. If one can formulate an arbitrary task as a reward signal at inference time, the system is effectively meeting the model on the surface it was trained to optimise. The validation contract is, in that sense, a runtime construction of the same shape the training loop already privileges. Agentic systems that explicitly externalise their reward signal tend to outperform systems that leave the success criterion implicit, and the gap widens as the task horizon grows.

Serial execution with targeted parallelism

The intuitive design for a multi-agent system is to fan out: ten workers, ten features, ten times the throughput. The teams behind missions tried this, and it does not survive contact with software engineering tasks. Parallel workers step on each other's changes, duplicate work, and make architecturally inconsistent decisions. The coordination overhead eats the speed gains while burning tokens.

Missions consequently runs features serially. Only one worker or validator is active at any given point. Within a feature, read-only operations such as codebase search, documentation lookups, and API research are parallelised. Within a validator, code review across multiple features in a milestone is also parallelised. This is described as "controlled chaos": the main orchestration thread barrels through the problem linearly, while sub-tasks parallelise when they are read-heavy and decomposable.

There is supporting empirical work on this. Research from Google and Augment Code corroborates that one can determine in advance when to parallelise based on the read-versus-write ratio of the task and the task's complexity. Read-heavy work (e.g., deep research) parallelises well; complex work with cross-cutting context does not, because branches of context cannot be cleanly delegated out without losing the threads that tie them together.
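The shape of "serial features, parallel read-only sub-tasks" can be sketched with a thread pool. This is an illustrative toy rather than the harness's scheduler: `lookup` stands in for any read-only operation, and the source names and feature names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical read-only sources a worker might consult.
READ_SOURCES = ("codebase-search", "docs", "api-research")


def lookup(source: str, feature: str) -> str:
    # Stand-in for a read-only operation; it never modifies the repository.
    return f"{source}:{feature}"


def run_feature(feature: str, pool: ThreadPoolExecutor) -> str:
    # Read-only research fans out across the pool...
    notes = list(pool.map(lambda source: lookup(source, feature), READ_SOURCES))
    # ...while the write step happens once, on the main thread.
    return f"implemented {feature} with {len(notes)} notes"


def run_mission(features: list[str]) -> list[str]:
    log = []
    with ThreadPoolExecutor(max_workers=len(READ_SOURCES)) as pool:
        for feature in features:  # strictly one feature at a time
            log.append(run_feature(feature, pool))
    return log


log = run_mission(["auth", "billing"])
```

The outer loop is deliberately sequential, so no two features ever write at the same time; only the lookups inside a feature run concurrently.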

Skills as the pillar of evolution

In agent-harness vocabulary, skills are markdown documents that encode either reference material or workflows. They serve a similar role to AGENTS.md or instructions.md files in other harnesses, but missions leans on them more heavily because the workflow needs a stable surface to refine over time.

When a mission is planned, the orchestrator writes per-role skills (for example, a full-stack worker skill, a back-end worker skill, a front-end worker skill) that describe how to operate inside the specific project. When a worker hits a friction point during execution, for example "the back end cannot be started without Docker running first", it edits its own skill to capture the lesson. Future workers in the same mission, and any future mission run against the same codebase, inherit that fix. With these skills committed to a directory alongside the code, they become a shared, durable asset.
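The skill-rewriting loop amounts to appending a lesson to a markdown file that later workers will read. The sketch below assumes lessons are stored as bullet lines and deduplicated by exact match; the file name, the bullet format, and `record_lesson` are assumptions for illustration, not the format used by the harness.

```python
import tempfile
from pathlib import Path


def record_lesson(skill_path: Path, lesson: str) -> bool:
    """Append a lesson to a skill file unless it is already recorded."""
    existing = skill_path.read_text(encoding="utf-8") if skill_path.exists() else ""
    bullet = f"- {lesson}"
    if bullet in existing.splitlines():
        return False  # an earlier worker already captured this lesson
    with skill_path.open("a", encoding="utf-8") as handle:
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write(bullet + "\n")
    return True


# Hypothetical skill file in a temporary directory, for demonstration only.
skill = Path(tempfile.mkdtemp()) / "backend-worker.md"
skill.write_text("# Back-end worker skill\n", encoding="utf-8")

first = record_lesson(skill, "Start Docker before launching the back end.")
second = record_lesson(skill, "Start Docker before launching the back end.")
```

In a real project the file would live in the repository and be committed alongside the code, which is what lets later missions inherit the lesson.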

This is continuous learning at the project level, written into the codebase rather than into model weights. A team that runs many such workflows against a single codebase ends up with a steadily improving set of skills that encode the project's actual operational quirks.

The dynamic is also fragile in the opposite direction, and that fragility is worth naming. A workflow that runs forty tasks in sequence cannot afford to re-learn the same friction point on every task. Without the skill-rewriting loop, every individual issue would compound, and the run would degrade rapidly past about ten tasks. The continuous learning mechanism is consequently not a nice-to-have; it is what makes the multi-day horizon viable at all.

Structured handoffs and self-healing

When a worker finishes a feature, it does not simply declare success. It fills out a structured handoff containing: what was completed, what was left undone, what commands were run, the exit codes of those commands, what issues were discovered, and whether the worker abided by the orchestrator-defined procedures.
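The handoff fields listed above map naturally onto a small record type. The following is a sketch under the assumption that a handoff is plain structured data; the class names, the field names, and the `needs_attention` rule are hypothetical, and the example commands are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CommandResult:
    command: str
    exit_code: int


@dataclass
class Handoff:
    feature: str
    completed: list[str]
    left_undone: list[str] = field(default_factory=list)
    commands: list[CommandResult] = field(default_factory=list)
    issues: list[str] = field(default_factory=list)
    followed_procedures: bool = True

    def needs_attention(self) -> bool:
        # Anything unfinished, failed, or off-procedure should be looked at
        # before the next worker starts.
        failed = any(result.exit_code != 0 for result in self.commands)
        unresolved = bool(self.left_undone or self.issues)
        return failed or unresolved or not self.followed_procedures


# Illustrative values only.
handoff = Handoff(
    feature="login-form",
    completed=["form renders", "validation wired up"],
    left_undone=["password reset link"],
    commands=[CommandResult("npm test", 0), CommandResult("npm run lint", 1)],
)
clean = Handoff(feature="theme-toggle", completed=["toggle persists"])
```

Recording exit codes rather than a prose summary means a failed command cannot be glossed over in the worker's own account of what happened.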

Errors are caught at milestone boundaries. If a feature's assertions are not satisfied, the orchestrator can scope corrective work as a follow-up feature, rescope a milestone, or roll back. The system pulls itself back on track by forcing agents to write down what happened, rather than relying on them to remember across context windows. This is, again, an architectural decision that looks unremarkable in isolation but compounds favourably over long horizons: a workflow that cannot articulate why it succeeded or failed cannot be debugged, and an undebuggable workflow cannot be operated for sixteen days.

Picking the right model for each role

The missions architecture is model-agnostic by design. Planning benefits from slow, careful reasoning. Implementation benefits from fast code fluency. Validation benefits from precise instruction following. No single model family is best at all three. The recommendation from the talks is to put a different model provider in the validation seat than the implementation seat, so that the same training data does not bias both ends of the loop.

This is a structural advantage of a model-agnostic harness. A system locked into one provider is only as strong as that provider's weakest capability. As models continue to specialise, the ability to assign the right model to the right seat becomes a compounding advantage, although this is not without cost: someone on the team has to develop the intuition for which model fits which seat. The structure also helps in the other direction: the workflow can compensate for weaker models, since the validation contracts and milestone checkpoints reliably produce working code even with open-weight models in the worker seat.
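The seat assignment described in this section could be expressed as a small configuration plus a check that the worker and validator come from different providers. The provider and model names below are placeholders, and `seat_problems` is a hypothetical helper, not part of any real harness.

```python
REQUIRED_SEATS = ("orchestrator", "worker", "validator")


def seat_problems(seats: dict[str, tuple[str, str]]) -> list[str]:
    """Return a list of configuration problems; empty means acceptable."""
    problems = [f"missing seat: {seat}" for seat in REQUIRED_SEATS if seat not in seats]
    # Only compare providers once all seats are present.
    if not problems and seats["worker"][0] == seats["validator"][0]:
        problems.append("worker and validator share a provider")
    return problems


# Placeholder (provider, model) pairs; substitute real ones as needed.
seats = {
    "orchestrator": ("provider-a", "slow-reasoning-model"),
    "worker": ("provider-b", "fast-coding-model"),
    "validator": ("provider-c", "instruction-following-model"),
}
same_provider = {**seats, "validator": ("provider-b", "another-model")}

ok = seat_problems(seats)
flagged = seat_problems(same_provider)
```

Here `ok` is empty and `flagged` reports the shared provider, which is the case the talks recommend avoiding.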

Defending against evolving models and hardware

Every team building multi-agent systems lives with the fear that the next model release will collapse the architecture into a single prompt. The deliberate response described in the talks is to put almost all of the orchestration logic into prompts and skills rather than into a hard-coded state machine. The feature decomposition logic, the failure handling, and the negotiation behaviour at milestone boundaries live in roughly seven hundred lines of natural language. The deterministic code is thin and mostly handles bookkeeping (running validation, blocking progress on unaddressed handoff issues, persisting state). The structure provides the discipline, the models provide the intelligence, and as models improve the whole system improves automatically.

More generally, the parts of an agentic system that benefit from improving model capability should be expressed in prompts, and the parts that benefit from determinism (state persistence, gating, bookkeeping) should be expressed in code. Mixing the two layers tends to produce systems that get worse as models improve, because the deterministic logic encodes assumptions that no longer hold.
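As a rough illustration of how thin the deterministic layer can be, the sketch below covers only the gating and persistence named in this section. The state layout (a `validation` mapping and an `open_issues` list), the file name, and the function names are assumptions; the decision about how to fix a blocked milestone is left to the prompt-driven orchestrator.

```python
import json
import tempfile
from pathlib import Path


def may_advance(state: dict) -> bool:
    """Gate: every validation passed and no handoff issue is still open."""
    return all(state["validation"].values()) and not state["open_issues"]


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")


def load_state(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical milestone state, for illustration only.
state_file = Path(tempfile.mkdtemp()) / "mission-state.json"
save_state(state_file, {
    "milestone": "m1",
    "validation": {"lint": True, "tests": True},
    "open_issues": ["flaky login test"],
})

restored = load_state(state_file)
blocked = not may_advance(restored)   # an open issue blocks progress

restored["open_issues"].clear()       # the issue gets addressed
unblocked = may_advance(restored)
```

Nothing in this layer depends on how capable the model is, which is why it can stay in code while the decomposition and recovery logic stays in prompts.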

Brownfield first, greenfield second

A useful detail in the talks is that the validation primitives were deliberately built against brownfield workloads first. The early users were running legacy modernisation and migration tasks on real production codebases, which forced the team to build validation harnesses that survive contact with messy state. Once brownfield validation worked, greenfield fell out essentially for free, because the operator gets to pick a stack that is trivial to validate and the contract writes itself.

The corollary is a useful diagnostic for whether to use this kind of workflow at all. If one can articulate clear validation criteria for the work, the workflow will probably succeed. If one cannot validate the work, because the system has no test harness, no observable behaviour, or no inspectable output, the workflow will spend hours, burn tokens, and produce something that is roughly 85% correct. The marginal cost of cleaning up the remaining 15% will exceed the cost of doing the task manually, which is the unglamorous truth that more enthusiastic framings of agentic software engineering tend to elide.

Where the boundary lies

The talks describe an internal benchmark in which the workflow attempts to construct clones of popular productivity and dev tools end-to-end. Early results suggest the lower-tier projects (small feature sets) ship cleanly. The higher-tier projects (one hundred plus features) start failing for reasons that have nothing to do with the model's coding ability. They fail because the validation harness itself is the bottleneck.

Concretely, a Zapier-style system cannot be meaningfully validated on localhost without spinning up Firecracker VMs to execute the actual workflows. A SaaS application with auth gated behind a bot-protected OAuth flow cannot be exercised by a browser agent. A multi-region deployment cannot be verified without real cloud infrastructure access. The frontier is consequently not the model's reasoning ability, but rather the gap between localhost behaviour and production behaviour. Closing that gap requires giving the agent access to real infrastructure, real monitoring, and real deployment surfaces, with all the operational and security consequences that implies.

Takeaways

After ingesting the information from these two talks back-to-back, a handful of architectural lessons stand out regardless of the harness being used.

  • Externalise the success criterion. The validation contract works because correctness is defined before code exists, and because each assertion declares how it will be proved. Any system that leaves success implicit will drift over long horizons.
  • Separate the agent that writes from the agent that verifies. Code review by the same agent that produced the code is structurally biased, and a fresh-context verifier is the cheapest available correction.
  • Prefer serial execution with read-heavy parallel sub-tasks over parallel execution of write-heavy primary tasks. Coordination overhead between writing agents almost always exceeds the throughput gain.
  • Encode workflow logic in natural-language prompts and skills, and reserve deterministic code for state and gating. Mixing the two layers produces systems that get worse as models improve, because the deterministic logic encodes assumptions that no longer hold.
  • Make the skills layer continuously editable by the workers themselves. Project-specific friction compounds across long-running workflows, and capturing the lessons in committed markdown files is the cheapest substitute for fine-tuning.
  • Treat structured handoffs as a first-class artifact. A workflow that cannot describe its own state at each checkpoint cannot be debugged.
  • Accept that the binding constraint is now validation, not generation. The interesting open problem is not "can the model write this code" but "can we mechanically verify whether it did". If the answer to the second is no, the talks' claim is that the workflow will still produce something roughly 85% correct, but repairing the remaining 15% usually costs more than redoing the task by hand.

Open questions

A handful of questions remain open:

  • How much further can serial execution be parallelised without losing coherence? The architecture leaves room for more, but the failure modes of write-heavy parallelism are not yet well-characterised.
  • How should these workflows themselves be orchestrated into higher-order workflows? At some point the cost of running a mission for an entire project is dominated by the cost of decomposing the project into missions in the first place.
  • What shape of validation harness can close the gap between localhost behaviour and production behaviour for arbitrary applications? Without that bridge, the most interesting workloads (the ones that actually look like production SaaS) remain out of reach.
  • What governance shape is appropriate for software built primarily by agents under a human-authored constitution? Is it centralised coordination, decentralised self-healing, or some middle ground that the field has not yet articulated?

The data shown in the talks suggest that the underlying mechanism works on real projects at scale today, at least inside the harness it was built in. It remains an open question what happens once "agentic systems that run for days" becomes the floor rather than the ceiling. Anyone working in software engineering would do well to keep an eye on how this area evolves as the technology matures. Hopefully, this post proves useful for those inclined to implement their own version of this multi-agent workflow.
