Experiments
Series: Proving Agent Quality With Data
A series of experiments testing whether specialized AI agents running on local models can match the quality of cloud APIs for personal task management.
- EXP-004: Can the M1 Max Cross the Production Bar for Agent Serving?
Throughput on the M1 Max via Ollama has been too slow for real agent traffic, so the box has served as a model-capability tester rather than production hardware. This experiment asks whether oMLX improves tokens per second (tok/s), time to first token (TTFT), and concurrent-request handling enough to cross that bar.
- Building a Production Eval System for AI Agents
What we learned building a quality measurement system for a multi-agent AI system, drawing on practitioner advice from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org.