# The Loop Closes: When AI Agents Start Reviewing Each Other
## Five Days, Zero Hands
Something happened this week that we've been building toward for months.
A task was created. An agent picked it up, wrote code on a branch, committed, and pushed. Another agent — a dedicated code reviewer — was automatically spawned to inspect the changes. It read the diff, scored the quality, and passed the review. The system merged the pull request on GitHub. The task closed itself.
No human touched anything.
It sounds simple when you say it fast. But the machinery behind that 48-second cycle took five intense days to build.
## The Service Pipeline
The core innovation is what we call the Service Pipeline — a system that lets tasks trigger other tasks as part of their lifecycle.
When an agent finishes coding and submits for review, the pipeline kicks in (a code sketch follows the chart below):
- The system checks the task's review mode — is it a quick scan, a thorough solo review, or a full deep analysis?
- A specialized reviewer agent is spawned with instructions calibrated to that intensity level
- The reviewer examines the code and submits a verdict
- If it passes: the branch gets merged automatically
- If it fails: feedback is injected back to the original agent, who gets re-spawned to fix the issues
- If it fails twice: the reviewer takes over the task directly
```json
{
  "type": "doughnut",
  "title": "Review Intensity Levels",
  "labels": ["Basic (~5 min)", "Normal (~15 min)", "Superpowered (~30+ min)"],
  "datasets": [
    { "label": "Typical duration", "data": [5, 15, 35] }
  ]
}
```
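To make the escalation rules concrete, here is a minimal Python sketch of the verdict handling. The names (`Verdict`, `next_step`, `failed_reviews`, `MAX_FAILED_REVIEWS`) are hypothetical stand-ins; our real pipeline's API looks different.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str = ""

MAX_FAILED_REVIEWS = 2  # after the second failed review, the reviewer takes over

def next_step(verdict: Verdict, failed_reviews: int) -> str:
    """Decide the pipeline's next action after a reviewer submits a verdict."""
    if verdict.passed:
        return "merge"              # auto-merge the branch, close the task
    if failed_reviews + 1 < MAX_FAILED_REVIEWS:
        return "respawn_author"     # inject feedback, re-spawn the original agent
    return "reviewer_takes_over"    # second failure: the reviewer owns the task now

# The three outcomes from the list above:
assert next_step(Verdict(passed=True), failed_reviews=0) == "merge"
assert next_step(Verdict(passed=False), failed_reviews=0) == "respawn_author"
assert next_step(Verdict(passed=False), failed_reviews=1) == "reviewer_takes_over"
```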
The beauty is in the composability. The same pipeline that handles code review could handle security audits, documentation checks, or test generation — any task where you want a second pair of eyes before moving forward.
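As a sketch of that composability: the service type could be little more than a lookup that calibrates the spawned inspector's instructions. Only code review exists today; the other entries below are illustrative, not real service types.

```python
# Illustrative registry: each service type maps to the instructions the
# spawned inspector agent receives. Only code review exists today; the
# other entries show how the same pipeline would generalize.
SERVICE_INSTRUCTIONS: dict[str, str] = {
    "code_review":     "Review the diff for correctness, style, and risk.",
    "security_audit":  "Inspect the diff for injection, auth bypass, leaked secrets.",
    "doc_check":       "Verify that changed public APIs are documented.",
    "test_generation": "Write tests covering the changed code paths.",
}

def build_inspector_prompt(service_type: str, diff: str) -> str:
    """Compose the prompt for the second-pair-of-eyes agent."""
    return f"{SERVICE_INSTRUCTIONS[service_type]}\n\n{diff}"
```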
## The Refactoring Sprint
While we were building pipelines, our worker agents were busy with a parallel refactoring effort:
| Component | Before (lines) | After (lines) | Reduction |
| --- | --- | --- | --- |
| MobileApiController | 1690 | 537 | 68% |
| OverlordMultichat | 1217 | 717 | 41% |
| Task Model | 645 | 513 | 21% |
| FloatingChat | 453 | 320 | 29% |
The MobileApiController — a classic "god class" mixing authentication, chat, tasks, memory search, and notifications — got split into six focused domain services. The chat components got decomposed into reusable traits. The Task model's state machine was extracted into a dedicated service.
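The post doesn't include the production code (the class names suggest a PHP codebase), but the state-machine extraction is easy to sketch in Python with invented states:

```python
# Hypothetical sketch of extracting a state machine from a fat model
# into a dedicated service. States and transitions are invented; the
# real Task model is not shown in this post.
ALLOWED_TRANSITIONS = {
    "created":     {"in_progress"},
    "in_progress": {"in_review"},
    "in_review":   {"in_progress", "done"},  # fail -> back to author, pass -> done
}

class TaskStateMachine:
    """The model keeps its data; this service owns the transition rules."""

    def transition(self, task, new_state: str) -> None:
        if new_state not in ALLOWED_TRANSITIONS.get(task.state, set()):
            raise ValueError(f"illegal transition: {task.state} -> {new_state}")
        task.state = new_state
```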
All reviewed. All deployed. All running in production.
## Learning the Hard Way
Not everything went smoothly. Some highlights from the debugging sessions:
> [!WARNING]
> **The Ephemeral Worktree Problem:** Our first agent completed a task beautifully, and then we realized the worktree it worked in was temporary. The code was never pushed. Gone. We now auto-push unpushed commits after every agent execution.
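The safety net is simple to sketch: after an agent run, check for commits the upstream doesn't have and push them before the worktree is discarded. This sketch assumes the plain `git` CLI and simplifies branch handling; the real hook differs.

```python
import subprocess

def push_unpushed_commits(worktree: str) -> None:
    """Push any local commits before an ephemeral worktree is discarded."""
    # List commits on HEAD that the upstream branch does not contain.
    # Exits non-zero if no upstream is configured yet.
    unpushed = subprocess.run(
        ["git", "-C", worktree, "log", "--oneline", "@{u}..HEAD"],
        capture_output=True, text=True,
    )
    if unpushed.returncode == 0 and not unpushed.stdout.strip():
        return  # up to date, nothing to push
    # Either unpushed commits exist, or the branch has no upstream yet.
    subprocess.run(
        ["git", "-C", worktree, "push", "-u", "origin", "HEAD"],
        check=True,
    )
```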
> [!NOTE]
> **The Identity Crisis:** One agent kept authenticating as the wrong identity because a global config file was overriding the bridge-injected settings. It took three debugging sessions to find.
> [!TIP]
> **The Self-Review Loop:** When a reviewer finishes its work, that completion event looks just like any other task finishing, so the system tried to trigger another review. Of that review. Ad infinitum. Guard clauses are important.
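The guard itself is one early return. `is_service_task` is an illustrative flag, not necessarily the field our schema uses:

```python
def on_task_completed(task, start_review) -> None:
    """Completion hook; `start_review` kicks off the service pipeline."""
    # Guard clause: a reviewer finishing looks like any other task
    # finishing. Without this early return, the review's own completion
    # would spawn a review of the review, ad infinitum.
    if getattr(task, "is_service_task", False):
        return
    start_review(task)
```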
## The Test Project
To validate the pipeline end-to-end, we ran a series of real tasks on a small Python utility project. An agent built a Rich TUI interface, added batch processing, a dry-run preview mode, version flags, and a changelog — each going through the full branch → review → merge cycle.
```json
{
  "type": "bar",
  "title": "Test Project: Pipeline Runs",
  "labels": ["TUI Phase 1", "Batch Flag", "Dry-Run", "Version Flag", "Changelog", "One-Liner"],
  "datasets": [
    { "label": "Attempts", "data": [3, 1, 2, 1, 2, 1] },
    { "label": "Reviews", "data": [2, 1, 3, 1, 2, 1] }
  ]
}
```
The last task — adding a single comment line — was the one that completed the first fully automated cycle. Sometimes the simplest test proves the most.
## What This Means
We're not replacing human judgment. The review intensity system exists precisely because not everything needs the same level of scrutiny. A one-line documentation change gets a quick scan. A security-sensitive authentication rewrite gets the full treatment with parallel analysis agents.
But the loop is closed now. Tasks can flow from creation through execution, review, and deployment without blocking on human availability. The humans decide what to build and how carefully to review it — the system handles the rest.
## What's Next
- Activity monitoring — a real-time dashboard showing what every agent is doing right now
- Review tuning — per-project defaults so teams can set their own quality bar
- Bridge improvements — better handling of agent re-spawning after service pipeline completion
- More service types — the pipeline isn't limited to code review; test generation and documentation are next
60 commits. 5 days. One closed loop.