# Proving Agent Quality With Data

## The Thesis
Can a system of specialized AI agents running on local models (Gemma 26B, free) match or beat a monolithic cloud agent (Claude Sonnet, paid) for personal task management?
We’re testing this with a series of experiments, each building evidence for the next. Every claim is backed by eval scores, not opinions.
## The Scorecard
| Domain | Rusty | Specialist (Sonnet) | Best Local | Model | Verdict |
|---|---|---|---|---|---|
| Media | 2.52 | 3.53 (+40%) | 3.57 | Gemma 26B | Local viable |
| Productivity | 3.85 | 3.85 (+0%) | 4.11 | Gemma 31B | Local wins (+6.8%) |
| Cross-domain | — | — | — | — | Planned |
| Supervisor | — | — | — | — | Planned |
## The Experiments
**EXP-001: Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores?** Specialization improves media evals by 40%, and Gemma 26B matches Sonnet on the focused domain. We also discovered that eval infrastructure quality (harness mocks, scorer bugs) matters more than expected.
**EXP-002: Do Mock Evals Predict Real-World Agent Quality?** Real APIs score 5.6% higher than mocks, and 93% of evals are stable across 3 runs. Mock evals are a trustworthy, conservative lower bound. The only variance source is model non-determinism, not API instability.
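The stability check behind that 93% figure can be sketched as follows. This is a minimal sketch, not the project's actual harness: the `run_eval` callable is a hypothetical stand-in for a single pass/fail eval run.

```python
def stability_report(cases, run_eval, n_runs=3):
    """Run each eval case n_runs times and report the fraction of
    cases whose pass/fail outcome is identical on every run."""
    cases = list(cases)
    stable = sum(
        1 for case in cases
        if len({run_eval(case) for _ in range(n_runs)}) == 1
    )
    return stable / len(cases)  # 1.0 means fully reproducible
```

With real model calls, any case scoring below 1.0 here flags model non-determinism rather than API flakiness, per the finding above.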
**EXP-003: Does Agent Specialization Replicate for Productivity Tasks?** After fixing another scorer bug (stale gold files), the story changed: specialization provides zero benefit on Sonnet for productivity. But Qwen 3.5 is only 5.7% behind Sonnet, close enough that scaffolding techniques might close the gap.
**EXP-004: Cross-Domain Routing.** What happens when a request spans two agents? ("Find that movie and add a watch party to the calendar.") This experiment tests routing architectures for multi-agent coordination.
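One candidate routing architecture can be sketched in a few lines. Everything here is an assumption for illustration: the agent names, the registry, and the `classify` helper (which would itself be a small model call returning domain labels) are hypothetical, not the project's actual design.

```python
# Hypothetical specialist registry; real agent names may differ.
AGENTS = {"media": "media-agent", "productivity": "productivity-agent"}

def route(request, classify):
    """Fan a request out to every specialist whose domain the
    classifier detects; fall back to the supervisor otherwise."""
    domains = classify(request)  # e.g. {"media", "productivity"}
    matched = [AGENTS[d] for d in sorted(domains) if d in AGENTS]
    return matched or ["supervisor"]
```

A request like the watch-party example would match both domains and be dispatched to both specialists, which is exactly the coordination case EXP-004 is designed to stress.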
**EXP-005: Failure Modes & Recovery.** How do agents handle errors, ambiguity, and wrong information? Production quality means graceful degradation.
**EXP-007: Rusty on Local Models (Capstone).** Can the supervisor agent run on a local model once specialized agents handle the domains? The final test of a fully local stack.
## What We’ve Learned So Far
- Specialization works for narrow domains, not broad ones. +40% for media (46→7 tools), +0% for productivity (46→20 tools). The threshold appears to be around a 70-80% tool reduction; below that, Sonnet doesn’t need the help.
- Local models can beat cloud. Gemma 26B matches Sonnet on media. Gemma 31B beats Sonnet on productivity by 6.8%. The 26B→31B jump (+15%) mattered more than any prompt engineering. Model scale > scaffolding tricks.
- Your eval infrastructure matters as much as your agents. We’ve found scorer bugs in two separate experiments that changed our conclusions by 30-40%. Validate your evaluator every time you change your evals, not just when you set them up.
- Mock evals are a trustworthy lower bound. Validated against real APIs: real-world scores are 5.6% higher than mocks. It’s safe to develop against mocks and validate against production periodically.
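The evaluator-validation step recommended above can itself be automated. A minimal sketch, assuming a small set of hand-labeled `(output, expected_pass)` pairs as ground truth; the scorer interface here is hypothetical:

```python
def validate_scorer(scorer, labeled_cases):
    """Compare an automated scorer against hand-labeled
    (output, expected_pass) pairs; surface every disagreement
    so scorer bugs (like stale gold files) show up early."""
    disagreements = [
        (output, expected)
        for output, expected in labeled_cases
        if scorer(output) != expected
    ]
    agreement = 1 - len(disagreements) / len(labeled_cases)
    return agreement, disagreements
```

Re-running this whenever the evals change is cheap insurance against the 30-40% conclusion swings described above.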
## Methodology
All experiments follow practitioner best practices from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org:
- Binary pass/fail scoring on specific dimensions (not vague 1-5 “quality”)
- Domain expert calibration (one person’s judgment drives the system)
- Data-first: look at outputs before defining criteria
- Production flywheel: traces → human judgment → eval cases → automation
- Different model families for generation vs judging (avoids self-enhancement bias)
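The first and last bullets can be combined into one sketch: binary pass/fail judging on a single specific dimension, with the judge call abstracted behind a callable so a model from a different family can sit behind it. The `ask_judge` callable and the prompt format are assumptions for illustration, not the project's actual harness.

```python
def judge_pass_fail(output, criterion, ask_judge):
    """Binary scoring on one specific dimension: the judge answers
    PASS or FAIL instead of giving a vague 1-5 quality rating.
    ask_judge should wrap a model from a different family than the
    generator, to avoid self-enhancement bias."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Agent output: {output}\n"
        "Answer with exactly one word: PASS or FAIL."
    )
    return ask_judge(prompt).strip().upper().startswith("PASS")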
Full reference: Production Eval Reference