Experiments

A series of experiments testing whether specialized AI agents running on local models can match cloud API quality for personal task management.

EXP-004: Can the M1 Max Cross the Production Bar for Agent Serving? (Apr 24, 2026)
Throughput on the M1 Max via Ollama has been too slow for real agent traffic, so the box has served as a model-capability tester rather than production hardware. This experiment asks whether oMLX improves tokens per second (tok/s), time to first token (TTFT), and concurrent-request handling enough to cross that bar.

Building a Production Eval System for AI Agents (Apr 7, 2026)
What we learned building a quality measurement system for a multi-agent AI, drawing on practitioner wisdom from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org.