EXP-003: Does Agent Specialization Replicate for Productivity Tasks?
The Question
In EXP-001, splitting media tools into a focused agent improved eval scores by 40%, and Gemma 26B matched Sonnet on that narrow domain. Does this pattern hold for productivity — email, calendar, contacts, family management?
The Evals
18 tasks themed around real suburban family life in Downers Grove, IL:
- “There should be an email from the school about picture day. Can you add it to our calendar?” (email → read → calendar)
- “I got an email from the basketball league with a link to the season schedule. Can you grab the dates?” (email → read → browser → calendar)
- “Can you find that email about the thing at school next Friday?” (deliberately vague)
- “Email the team parents that Saturday’s soccer game is cancelled” (compose + send)
- “Can the kids do swim lessons on Wednesdays at 4? Check if we have anything” (conflict check)
These are harder than media evals — multi-step chains requiring email search → read → extract details → sometimes follow links → create calendar events. Following Hamel’s eval methodology, each is scored on specific dimensions (tool selection, completeness, actionability) rather than vague quality ratings.
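Dimension-based scoring can be sketched as follows. This is an illustrative sketch, not the actual eval harness: the dimension names come from the text above, but the function name and the 0-5 scale convention are assumptions.

```python
# Sketch: grade each task on specific dimensions (tool selection,
# completeness, actionability) and average them, rather than asking
# a judge for one vague quality rating.
from statistics import mean

DIMENSIONS = ("tool_selection", "completeness", "actionability")

def score_task(dimension_scores: dict[str, float]) -> float:
    """Average the per-dimension scores (each assumed on a 0-5 scale)."""
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return mean(dimension_scores[d] for d in DIMENSIONS)

# Example: a run that picked the right tools but missed a detail.
print(score_task({"tool_selection": 5.0, "completeness": 3.0, "actionability": 4.0}))  # 4.0
```

Scoring dimensions separately also makes scorer bugs visible: a single dimension pinned at 0 across all models (as happens below with args_accuracy) stands out immediately, where it would hide inside a blended quality number.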
Score Timeline
Initial Run (broken scorer)
The first run showed Rusty at 2.77, Sheila Sonnet at 2.89 (+4.3%), Gemma at 2.35 (-18.7%), and Qwen at 2.62 (-9.3%). It looked like specialization helped a little and local models had a big gap. But something was wrong: args_accuracy scored 0 on every eval, for every model.
The Scorer Bug (Again): stale gold files
Same class of bug as EXP-001. Gold standard files from the pre-consolidation architecture (S-01 expected "search contacts for Cindy, send email about Plex" but S-01 was now "What do we have going on this week?"). The scorer matched by task ID, applied wrong expectations, scored 0, and penalized tool_selection by up to 1.5 points. Following Hamel's validate-evaluator skill: you have to verify your scorer is correct before trusting what it says about your agents. We learned this in EXP-001 and still hit it again.
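The validate-evaluator step could have caught this mechanically. A minimal sketch, assuming each gold file records the task prompt it was written against (the `task_id`/prompt field names are assumptions, not the real file format):

```python
# Sketch: before trusting any scores, flag gold files whose recorded
# task prompt no longer matches the live eval task with the same ID.
def find_stale_gold(eval_tasks: dict[str, str], gold_files: dict[str, str]) -> list[str]:
    """Return task IDs whose gold expectations no longer match the live task."""
    stale = []
    for task_id, gold_prompt in gold_files.items():
        live_prompt = eval_tasks.get(task_id)
        if live_prompt is None or live_prompt != gold_prompt:
            stale.append(task_id)
    return sorted(stale)

# S-01 was rewritten during consolidation, so its old gold file is stale.
evals = {"S-01": "What do we have going on this week?"}
golds = {"S-01": "search contacts for Cindy, send email about Plex"}
print(find_stale_gold(evals, golds))  # ['S-01']
```

Run as a pre-flight check whenever eval tasks change, this turns "the scorer silently applied wrong expectations" into a hard failure before any model is measured.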
Rusty Baseline (fixed scorer) +39%
After removing the 88 stale gold files, Rusty scores 3.85. With 46 tools on Sonnet it handles family management well; the monolithic agent is a solid baseline.
Sheila Sonnet (fixed scorer) +0%
Specialization provides zero improvement on Sonnet for productivity. Unlike media (+40%), narrowing from 46→20 tools doesn't help when Sonnet can already handle the full set. The decision space reduction (57%) isn't enough to matter for a capable model.
Sheila Gemma 26B -7.5% vs Sonnet
Gemma's gap is much smaller than initially measured (-7.5% vs the bugged -18.7%). Multi-step chains still degrade but the model is closer to viable than we thought.
Sheila Qwen 3.5 -5.7% vs Sonnet
Best local model for productivity. Only 5.7% behind Sonnet — close enough that scaffolding techniques (chain-of-thought planning, tool call validation, retry with reflection) might close the gap entirely.
Full Model Comparison (Fixed Scorer)
| Config | Bugged Score | Fixed Score | Change |
|---|---|---|---|
| Rusty (Sonnet, 46 tools) | 2.77 | 3.85 | +39% |
| Sheila (Sonnet, 20 tools) | 2.89 | 3.85 | +33% |
| Sheila (Gemma 26B, optimized) | 2.56 | 3.56 | +39% |
| Sheila (Qwen 3.5) | 2.62 | 3.63 | +39% |
Key pattern: The scorer bug inflated specialization gains and local model gaps equally. With honest scoring, Sonnet doesn’t need specialization for productivity (46 tools is fine), and local models are much closer than initially measured.
Why This Experiment’s Story Changed
1. Specialization doesn’t always help. For media (7 tools, short chains) it was +40%. For productivity (20 tools, long chains) it’s +0%. The benefit depends on how much you narrow the scope. Sonnet handles 46 tools without confusion — narrowing to 20 doesn’t reduce its decision-making burden meaningfully. Following Braintrust’s guidance on using eval data to drive architecture decisions: the data says don’t specialize productivity on Sonnet.
2. Local models are closer than we thought. Gemma at -7.5% and Qwen at -5.7% are within striking distance. Techniques like chain-of-thought forced planning, tool call validation, and retry-with-reflection could close these gaps without switching models or adding cost.
3. Scorer bugs are the gift that keeps giving. This is the second time stale infrastructure created measurement errors that changed our conclusions. Hamel’s validate-evaluator step isn’t a one-time check — it’s something you do every time your eval tasks change. We didn’t, and it cost us a full round of wrong conclusions.
Devil’s advocate: The “zero specialization benefit” finding on Sonnet should be tested on more domains before generalizing. Media’s +40% was real. The difference might be about chain length and tool count thresholds — there could be a domain where specialization helps on Sonnet (maybe 30+ tools, or 6+ step chains). One data point isn’t a pattern.
Phase 2: Can Scaffolding Close the Gap?
The 5.7% Qwen gap (3.63 vs 3.85) seemed closeable with engineering techniques. We tested three prompt-scaffolding techniques, individually and in combination, that are commonly recommended for improving local model tool calling:
Qwen 3.5 Baseline (no scaffolding)
The number to beat. 5.7% below Sonnet.
+ Chain-of-Thought Planning -5.2%
Forced the model to output a numbered plan before tool calls. Made things worse — the planning text ate context budget and the model sometimes planned correctly but then deviated during execution.
+ Tool Call Validation -1.9%
Added pre-call validation instructions (check required fields, verify email format, etc.). Minimal impact — the model already gets args right most of the time. The extra instructions added noise.
+ Retry with Reflection -8.5%
Worst performer. Told the model to reflect on failures and retry. Instead, it started second-guessing successful results — "that search returned results but let me try different terms to be sure." More iterations, lower quality.
All Combined -2.2%
Layering all three techniques. Still worse than baseline. The combined prompt is ~400 tokens of scaffolding instructions that push the actual task context further from the model's attention.
Why Scaffolding Failed
This was the most surprising finding. The techniques are well-regarded in the community — Braintrust recommends structured prompts for tool use, and chain-of-thought is a standard practice. But the data says otherwise for this use case.
The scaffolding is solving the wrong problem. These techniques help models that know what to do but make execution errors (wrong args, skipped steps). Local models on productivity tasks have a reasoning problem — they’re uncertain about what the next step is in a 4-step chain. Adding instructions about how to execute doesn’t help when the model is uncertain about what to do next.
More prompt = less attention on the task. The scaffolding instructions added 200-400 tokens to the system prompt. For local models with smaller effective context windows, this pushes the tool descriptions and few-shot examples further away. The model pays attention to the scaffolding rules instead of the task.
Devil’s advocate: We only tested prompt-based scaffolding. Code-level scaffolding (actual validation middleware, programmatic retry loops, orchestrator patterns) might work differently because it doesn’t consume prompt budget. That’s a different experiment — engineering the harness, not the prompt.
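What code-level scaffolding might look like, as an untested sketch: argument validation and bounded retries live in the harness, so they cost zero prompt tokens. The function name, schema shape, and `dispatch` callback are hypothetical stand-ins for the real orchestrator, not anything we ran in this experiment.

```python
# Sketch: programmatic validation middleware + retry loop.
# Nothing here is injected into the model's prompt.
def validated_call(tool_name, args, schema, dispatch, max_retries=2):
    """Validate required args, then dispatch with a bounded retry loop."""
    missing = [f for f in schema.get("required", []) if f not in args]
    if missing:
        # Surface the error to the orchestrator instead of re-prompting the model.
        return {"ok": False, "error": f"{tool_name} missing args: {missing}"}
    result = {"ok": False, "error": "not dispatched"}
    for _ in range(max_retries + 1):
        result = dispatch(tool_name, args)
        if result.get("ok"):
            return result  # success is final: no second-guessing good results
    return result

schema = {"required": ["to", "subject"]}
print(validated_call("send_email", {"to": "a@b.com"}, schema, lambda n, a: {"ok": True}))
# → rejected before dispatch: 'subject' is missing
```

Note how this design sidesteps both failure modes observed above: retries fire only on an actual failed dispatch (unlike prompt-based reflection, which second-guessed successes), and validation consumes no attention budget.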
Phase 3: More Models
We tested 4 additional models to see if a different architecture closes the gap.
| Model | Size | Score | vs Sonnet | Notes |
|---|---|---|---|---|
| Sonnet 4.6 | Cloud | 3.85 | — | Baseline |
| Gemma 26B + retry | 16GB | 3.67 | -4.7% | Best local (Phase 3a) |
| Gemma 26B | 16GB | 3.56 | -7.5% | |
| Qwen 3.5 9.7B | 6.6GB | 3.54 | -8.1% | |
| Qwen 3.5 4B | 3.4GB | 3.45 | -10.4% | |
| Nemotron 3 Nano 4B | 2.8GB | 3.08 | -20.0% | Underwhelming |
| GLM4 9B | 5.5GB | 0.14 | — | Broken on Ollama |
GLM4 (Zhipu AI) scored 95% on external tool-calling benchmarks but 0.14/5 on ours — completely broken. Chat template incompatibility with Ollama makes it unusable despite strong benchmark numbers. This confirms what the Docker blog evaluation found: “BFCL leaderboard leaders have inference compatibility issues that tank real-world performance.”
Qwen 3.5 4B scored 97.5% on the jdhodges benchmark but only 3.45/5 here. Public benchmarks test simpler chains than our family management tasks. The gap between “can call a single tool correctly” and “can chain 4 tools with reasoning between steps” is significant.
Surprise winner (Phase 3a): Gemma 26B + retry scaffolding (3.67) — a technique that hurt Qwen actually helped Gemma. Scaffolding is model-specific, not universal.
Phase 4: Scaling Up
The Phase 3 results left a 4.7% gap. Could a larger local model or a different hosted provider close it?
Gemma 31B (no scaffolding) +6.8% vs Sonnet
Beats Sonnet. The jump from 26B to 31B is massive (+15%). Those extra 5B parameters make a real difference on multi-step chains. No scaffolding needed — capable models don't need the help.
Gemma 31B + retry +5.7% vs Sonnet
Retry scaffolding slightly hurts Gemma 31B — same pattern as Sonnet. When a model is capable enough, scaffolding adds noise instead of structure.
MiniMax M2.7 (hosted) -1.3% vs Sonnet
Essentially tied with Sonnet at a fraction of the cost (~$1/M vs $3-15/M tokens). A strong hosted fallback — not quite Gemma 31B, but solid vendor diversification.
Full Model Comparison (All Phases)
| Model | Type | Score | vs Sonnet |
|---|---|---|---|
| Gemma 31B | Local (20GB) | 4.11 | +6.8% |
| Gemma 31B + retry | Local (20GB) | 4.07 | +5.7% |
| Sonnet 4.6 | Cloud ($3/M) | 3.85 | — |
| MiniMax M2.7 | Cloud (~$1/M) | 3.80 | -1.3% |
| Gemma 26B + retry | Local (18GB) | 3.67 | -4.7% |
| Gemma 26B | Local (18GB) | 3.56 | -7.5% |
| Qwen 3.5 9.7B | Local (6.6GB) | 3.54 | -8.1% |
| Qwen 3.5 4B | Local (3.4GB) | 3.45 | -10.4% |
| Nemotron 3 Nano 4B | Local (2.8GB) | 3.08 | -20.0% |
| GLM4 9B | Local (5.5GB) | 0.14 | Broken |
What This Means
A local model beats Sonnet on productivity. Gemma 31B at 4.11/5 outperforms Claude Sonnet 4.6 by 6.8% on our 18 family management evals — with no scaffolding, no prompt tricks, no retry logic. Just a larger model with the same focused prompt.
Model scale matters more than engineering tricks. Gemma 26B→31B: +15%. Best scaffolding on 26B: +3.1%. Retry on 31B: -1%. When a model is large enough to reason through multi-step chains, extra instructions just add noise. Per Braintrust’s guidance on using eval data to drive architecture decisions: the data says skip scaffolding, scale the model.
Scaffolding helps weak models, hurts strong ones. Retry helped Gemma 26B (+3.1%) but hurt Gemma 31B (-1%) and Qwen (-8.5%). Validation helped Gemma 26B but hurt Qwen. The threshold appears to be around model capability — once a model can reason through the chain independently, scaffolding competes for attention budget. Following Eugene Yan’s approach: test per-model, not per-technique.
MiniMax M2.7 is a viable Sonnet alternative. At 3.80/5 (-1.3%), it’s within noise of Sonnet at ~3x lower cost. Good for vendor diversification and as a fallback when Anthropic has outages.
Don’t trust public benchmarks for multi-step tasks. Two models with 95%+ benchmark scores failed our evals. Run your own — the infrastructure to do so is the most valuable thing we’ve built.
Infrastructure quality was still the biggest lever. Scorer fixes: +39%. Model scale (26B→31B): +15%. Best scaffolding: +3.1%. Invest in eval infrastructure first, model selection second, prompt engineering last.
Part of the agent quality experiment series. Eval methodology follows our production eval reference. Eval tasks available at /ref/evals/.