First Blood: An SDLC Colony Running Against Local Nemotron
We ran our first real SDLC pipeline - four agent colonies coordinating through GitHub labels using a local Nemotron model. The first attempt failed. The system corrected itself. Nobody intervened. This is what stigmergy buys you.
We ran our first real SDLC pipeline last week. Four agent colonies. A local Nemotron model doing inference. GitHub as the environment. No human in the loop.
The first attempt failed. The second one succeeded. The third one hit an edge case. Nobody intervened for any of it. The system corrected itself through the same mechanism it uses to do work: signals in the environment.
This is the story of that run, and why it matters.
The setup
We have four colonies running against a GitHub repository called golem-test:
- Coder - senses `issue:needs-code` signals, generates code, opens PRs
- Fixer - senses `issue:code-fix` signals, applies targeted bug fixes
- Reviewer - senses `pr:needs-review` signals, produces structured review verdicts
- DocGen - senses `artifact:merged` signals, generates documentation
The colonies don't know about each other. They coordinate through GitHub labels mapped to signal types: a `needs-code` label on an issue becomes an `issue:needs-code` signal. A `needs-review` label on a PR becomes a `pr:needs-review` signal. The mapping is bidirectional - colonies can deposit signals that become labels, and labels added by humans become signals.
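The mapping can be sketched as a small lookup table in either direction. This is an illustrative sketch, not the actual forge adapter: `LABEL_MAP`, `label_to_signal`, and `signal_to_label` are hypothetical names.

```python
# Hypothetical label <-> signal mapping. Keys are (object kind, label name);
# values are the signal types colonies sense.
LABEL_MAP = {
    ("issue", "needs-code"): "issue:needs-code",
    ("issue", "code-fix"): "issue:code-fix",
    ("pr", "needs-review"): "pr:needs-review",
    ("artifact", "merged"): "artifact:merged",
}

def label_to_signal(kind, label):
    """A label observed on a GitHub object becomes a signal (or None)."""
    return LABEL_MAP.get((kind, label))

def signal_to_label(signal):
    """A signal deposited by a colony becomes a label on the matching object."""
    kind, _, label = signal.partition(":")
    return kind, label
```

Because the table is symmetric, a human adding `needs-code` and a colony depositing `issue:needs-code` are indistinguishable to the rest of the system.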
Inference runs locally on a DGX station via Ollama (vLLM is also supported), serving NVIDIA's Nemotron model. No cloud API calls. The entire pipeline is self-hosted.
Forge: github
Dry run: false
Dashboard: http://localhost:4040
Inference: http://localhost:11434
Model: nemotron-3-super
We labeled a GitHub issue with needs-code and watched what happened.
Attempt 1: budget exhaustion
The coder colony sensed the signal, claimed it, cloned the repo, loaded its skills, and started generating code.
[14:22:17] [coder] Cloning sparrowlabsdev/golem-test...
[14:22:18] [coder] Workspace ready
[14:22:18] [coder] Loaded 4 skill(s): git-workflow, tdd-process,
codebase-conventions, feature-implementation
Four and a half minutes later:
[14:26:47] [coder] Token budget exhausted (221828/200000)
[14:26:47] [coder] Branch feature/issue-1 not found on remote
— LLM did not push. Work incomplete.
The model burned through 221k tokens without pushing a branch. Maybe it was exploring too broadly, or maybe the cold start ate into the budget. Either way, it didn't finish the job.
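A budget guard of this shape takes only a few lines. This is an illustrative sketch, not the colony's real code; `BudgetExhausted` and `run_turns` are hypothetical names.

```python
class BudgetExhausted(Exception):
    pass

def run_turns(turn_costs, budget=200_000):
    """Spend per-turn token costs against a fixed budget.

    Returns total tokens used if all turns complete. Raises mid-run when
    the budget is blown, which is what produced the 221828/200000 log line
    above. Note there is no retry here: the unconsumed signal *is* the retry.
    """
    used = 0
    for cost in turn_costs:
        used += cost
        if used > budget:
            raise BudgetExhausted(f"Token budget exhausted ({used}/{budget})")
    return used
```

The design choice worth noticing is that exceeding the budget simply aborts the attempt; nothing catches the exception to reschedule anything.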
In an orchestrated pipeline, this is where you need error handling. A supervisor notices the failure, decides whether to retry, routes the work somewhere else, maybe pages someone. You've written this retry logic before. It's always more complex than you expected.
Here, nothing happened. The coder's lease on the signal expired. The needs-code label was still on the issue. The signal was still in the environment. It became unclaimed.
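A lease with a TTL is all the machinery this requires. A minimal sketch, assuming a monotonic clock; `Lease` and `claimable` are hypothetical names:

```python
import time

class Lease:
    """A claim on a signal that lapses on its own; no supervisor revokes it."""
    def __init__(self, holder, ttl_s, now=time.monotonic):
        self._now = now
        self.holder = holder
        self.expires_at = now() + ttl_s

    def is_active(self):
        return self._now() < self.expires_at

def claimable(lease):
    """Never claimed, or claimed under a lapsed lease: either way, fair game."""
    return lease is None or not lease.is_active()
```

When the coder's first attempt died with the lease outstanding, nothing had to notice: the clock alone returned the signal to the pool.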
Attempt 2: success
Thirty seconds later, the coder colony's next poll cycle sensed the same unclaimed signal and picked it up again. Fresh context, clean slate.
[14:27:17] [coder] Workspace ready
[14:27:17] [coder] Loaded 4 skill(s): git-workflow, tdd-process,
codebase-conventions, feature-implementation
[14:29:46] [coder] Tool loop completed after 12 turns, 11 tool calls.
[14:29:47] [coder] Branch feature/issue-1 verified on remote.
[14:29:47] [coder] Creating PR for issue #1 from feature/issue-1...
[14:29:48] [coder] PR created: https://github.com/.../pull/22
Twelve turns. Eleven tool calls. Two and a half minutes. The model generated the code, committed, pushed, and the colony created a PR. The needs-review label was applied automatically, signaling the reviewer colony.
Same issue. Same colony type. Different outcome. The only difference was that the second attempt had a warm model, didn't waste tokens on a cold start exploration phase, and already had some code written.
Attempt 3: the duplicate PR edge
Then something interesting happened. The signal got picked up a third time. The coder generated code again (17 turns, 16 tool calls - it actually wrote more code this time), pushed successfully, but:
[14:34:04] [coder] Branch feature/issue-1 verified on remote.
[14:34:04] [coder] Creating PR for issue #1 from feature/issue-1...
[14:34:05] [ERROR] [coder] Error processing signal: PR creation
failed (422): A pull request already exists for
sparrowlabsdev:feature/issue-1
A PR already existed from attempt 2. GitHub rejected the duplicate. The colony logged the error and moved on. No crash. No cascade failure. Just an error on one signal processing cycle.
This is actually a bug in our code - the coder should check for existing PRs before trying to create one, or the signal should have been consumed after the successful PR creation. We'll fix it. But the point is that the bug didn't bring down the system. The colony processed the error, released the signal, and kept polling for new work.
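The fix is a check-then-create. Here is a sketch against an in-memory stand-in for a forge client; `ensure_pr`, `list_prs`, and `create_pr` are assumed names for illustration, not a real GitHub API wrapper.

```python
class FakeForge:
    """In-memory stand-in for a forge client, for illustration only."""
    def __init__(self):
        self.prs = []

    def list_prs(self, repo, head):
        return [p for p in self.prs if p["head"] == head and p["state"] == "open"]

    def create_pr(self, repo, head, base):
        pr = {"number": len(self.prs) + 1, "head": head,
              "base": base, "state": "open"}
        self.prs.append(pr)
        return pr

def ensure_pr(forge, repo, head, base="main"):
    """Idempotent PR creation: reuse an open PR for this branch if one
    exists, instead of hitting the 422 that attempt 3 hit."""
    existing = forge.list_prs(repo, head=head)
    if existing:
        return existing[0]
    return forge.create_pr(repo, head=head, base=base)
```

Calling `ensure_pr` twice for the same branch returns the same PR both times, which is exactly the property attempt 3 was missing.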
What you're seeing
Three attempts at the same task. One budget exhaustion. One success. One edge case error. Zero human intervention. Zero retry logic in application code.
This is what stigmergy gives you that orchestration doesn't: failure is just a signal that didn't get consumed.
When the coder's first attempt fails, the work isn't "lost" or "stuck in a dead letter queue" or "waiting for a supervisor to decide what to do." The label is still on the issue. The environment hasn't changed. The next poll cycle picks it up like nothing happened. The system's recovery mechanism is the same as its normal operating mechanism.
Think about what we didn't build:
- No retry policy configuration
- No exponential backoff
- No dead letter queue
- No supervisor process
- No error routing
- No circuit breaker
The lease expired. The signal remained. Work resumed. That's it.
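The entire recovery story fits inside one poll loop. A toy sketch, where `Env` is an in-memory stand-in for GitHub labels and every name is hypothetical:

```python
class Env:
    """Toy environment: maps each signal to its claimant (None = unclaimed)."""
    def __init__(self, signals):
        self.signals = dict.fromkeys(signals, None)

    def unclaimed(self):
        return [s for s, holder in self.signals.items() if holder is None]

    def claim(self, signal, who):
        if self.signals.get(signal) is None:
            self.signals[signal] = who
            return True
        return False

    def release(self, signal):
        self.signals[signal] = None      # failure: signal goes back to the pool

    def consume(self, signal):
        self.signals.pop(signal, None)   # success: signal leaves the environment

def poll_cycle(env, name, process):
    """One poll: claim, process, consume on success, release on failure.
    There is no second code path for retries; the next cycle is the retry."""
    for signal in env.unclaimed():
        if not env.claim(signal, name):
            continue                     # another instance got there first
        try:
            process(signal)
            env.consume(signal)
        except Exception:
            env.release(signal)
```

Run this with a handler that fails once and then succeeds, and you reproduce attempts 1 and 2 above with no retry logic anywhere.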
Running local models changes the economics
Running Nemotron locally instead of calling a cloud API changes the failure calculus. When the first attempt burned 221k tokens, that cost us GPU time we were already paying for - not a $4 API bill. When the model needs a cold start, we wait a bit longer but don't pay per-token.
This matters because stigmergy expects some attempts to fail. The system is designed around the assumption that agents are unreliable. If each failed attempt costs real money at cloud API rates, you start adding guardrails that defeat the purpose - pre-validation, circuit breakers, budget caps that prevent the work from completing. With local inference, the cost of a failed attempt is just time on hardware you own.
The token budget we set (200k) was actually too low for the model's first-attempt behavior. We bumped it to 500k after this run. But even before the fix, the system self-corrected. That's the property we're optimizing for: correctness despite misconfiguration.
The colony handoff
After the coder created the PR, it applied the needs-review label. This is how colony-to-colony handoff works without direct communication:
- Coder finishes → applies `needs-review` label to PR
- Label becomes `pr:needs-review` signal in the environment
- Reviewer colony senses the unclaimed signal
- Reviewer claims it, reads the diff, produces a verdict
- Reviewer deposits `review:approved` or `review:changes-needed`
- Those become labels on the PR
No colony knows the other exists. The coder doesn't "send" work to the reviewer. It modifies the environment. The reviewer notices. The same way ants don't tell each other where food is - they leave a trail, and other ants follow it.
If the reviewer colony is down, the label stays on the PR. A human can review it. When the reviewer comes back up, it skips already-reviewed PRs. If you add a second reviewer colony instance, they split the work automatically through claim contention. Scale up, scale down, swap humans for agents - the environment doesn't care who's doing the sensing.
What we're building toward
This was our first real run. The immediate bugs to fix are clear: handle duplicate PRs, ensure signals are consumed after successful processing, tune token budgets for cold-start behavior.
But the architecture held. Four colonies started, found work, coordinated through labels, handled failures, and produced a real pull request with generated code - all running against a local model on our own hardware.
The next step is closing the loop: reviewer produces a verdict, coder responds to change requests, and the cycle continues until the PR is approved. Then the merge colony (not built yet) handles the final step. Each piece is just another colony sensing signals in the environment.
No orchestrator needed. The environment is the orchestrator.
Check out the GitHub repo to get started. We're also building Mandible Cloud to deploy your own colonies.