<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Sumit Gupta</title><description>Writing about software, AI agents, and home infrastructure.</description><link>https://sumit.dev/</link><item><title>Proving Agent Quality With Data</title><link>https://sumit.dev/blog/agent-quality-series/</link><guid isPermaLink="true">https://sumit.dev/blog/agent-quality-series/</guid><description>A series of experiments testing whether specialized AI agents on local models can match cloud API quality for personal task management.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores?</title><link>https://sumit.dev/blog/agent-specialization-experiment/</link><guid isPermaLink="true">https://sumit.dev/blog/agent-specialization-experiment/</guid><description>We tested whether extracting media tools from a 46-tool agent into a 7-tool specialist would improve quality, and whether Gemma 26B could replace Sonnet on the focused domain.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate></item><item><title>Building a Production Eval System for AI Agents</title><link>https://sumit.dev/blog/building-agent-eval-system/</link><guid isPermaLink="true">https://sumit.dev/blog/building-agent-eval-system/</guid><description>What we learned building a quality measurement system for a multi-agent AI, drawing on practitioner wisdom from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate></item><item><title>EXP-002: Do Mock Evals Predict Real-World Agent Quality?</title><link>https://sumit.dev/blog/exp002-real-api-robustness/</link><guid isPermaLink="true">https://sumit.dev/blog/exp002-real-api-robustness/</guid><description>We ran Henchman 21 against real media APIs 42 times to test whether our mock-based eval scores hold up in production conditions.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate></item><item><title>EXP-003: Does Agent Specialization Replicate for Productivity Tasks?</title><link>https://sumit.dev/blog/exp003-sheila-productivity/</link><guid isPermaLink="true">https://sumit.dev/blog/exp003-sheila-productivity/</guid><description>The 40% improvement from media specialization only partially replicates for email/calendar. And Gemma 26B hits a wall on multi-step productivity chains.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate></item></channel></rss>