EXP-004: Can the M1 Max Cross the Production Bar for Agent Serving?
The Thesis
Throughput on the 64GB M1 Max (inference on my LAN) has been too slow for real agent traffic. Via Ollama it’s been a model-capability tester — fine for “can this model handle this prompt?” probes, unacceptable for anything an agent like Hank would route through on a live Slack channel. The agents I actually ship from (Rusty, Sheila, Hank) have been running on inference-2 with a dedicated GPU for exactly that reason.
A community post about oMLX, an MLX-based inference server, made me wonder if the ceiling was the tooling and not the hardware. This experiment asks one central question: does oMLX improve tok/s, TTFT, and concurrent-request handling enough to move the M1 Max from “testing rig” to “production inference box”?
The bar isn’t “marginally faster on a microbenchmark.” The bar is “fast enough that Hank answering a mention in the Practical AI Slack doesn’t feel broken — low TTFT, decent sustained decode, and no collapse under even modest concurrency.”
Three sub-questions in service of that:
- How much faster is oMLX than Ollama on this hardware? (If the gain is a single-digit percentage, the whole exercise is a wash.)
- Are the latest models (Qwen 3.6) a better fit for the server than the Gemma variants I’ve been testing?
- What configuration does oMLX actually need to deliver on its promise? (Spoiler: the defaults are a trap.)
The Answers
Does the M1 Max cross the production bar? Yes — with tuning, and with the right model for the right workload. With stock configuration it does not, and with Ollama it still doesn’t for concurrent multi-agent traffic. But the tuned oMLX + Qwen 3.6 35B-A3B combination puts Hank-style Slack traffic at 533ms median TTFT under 4× concurrent load and 70 aggregate tokens/sec — interactive feel, no collapse under bursts, good enough to graduate the box from tester to server for agent workloads. Not for everything. For Hank today.
On the three sub-questions:
1. oMLX is only ~20% faster than Ollama on decode (+13% cold, +28% sustained in the table below), and the gap is that small because Ollama 0.20 now uses MLX kernels internally. Both are MLX-backed. The real difference is the server layer around them, not the math. Most of the “switch to MLX for 50% gains” narrative is already priced into modern Ollama.
2. Yes, Qwen 3.6 35B-A3B is the right pick. MoE with 3B active params means it decodes like a 3B model (43 t/s) while thinking like a 35B one, and it handles concurrent requests far better than Gemma did: 70 aggregate t/s under 4× load, where Gemma on oMLX collapsed to 24.
3. Stock oMLX’s marquee features are off by default. Hot cache disabled, paged SSD cache disabled, cache pool undersized. Flip three settings and a 10K-token prompt’s cold TTFT drops from 28 seconds to 2.3. Without the tuning, the “production worthy?” question answers itself in the wrong direction.
The table: stock Ollama vs tuned oMLX on the same Gemma model and the same prompts, with tuned oMLX running Qwen 3.6 in the last column:
| metric | Ollama gemma4:26b | oMLX gemma (tuned) | oMLX qwen3.6 (tuned) |
|---|---|---|---|
| Cold decode t/s | 45 | 51 (+13%) | 43 |
| Cold TTFT | 353 ms | 170 ms (−52%) | 143 ms |
| 10K-prefix cold TTFT | 21,977 ms | 2,272 ms (−90%) | 3,640 ms |
| 10K-prefix warm TTFT | 545 ms | 1,935 ms | 3,415 ms |
| Sustained decode t/s (800 tok) | 39 | 50 (+28%) | 43 |
| 4×concurrent aggregate t/s | 38 | 24 | 70 |
| 4×concurrent median TTFT | 3,868 ms | 708 ms | 533 ms |
Ollama still wins warm-prefix caching by a wide margin, which matters for repeat-turn agents. oMLX wins cold-long-prefix and concurrency, which matters for bursty multi-agent traffic and fresh-context coding tasks. The workload decides.
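The percentage deltas in the Gemma column can be re-derived from the raw values in the table. A minimal sketch of that arithmetic, using only numbers copied from the table; the helper name pct_change is made up for illustration:

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# (metric, Ollama gemma4:26b, tuned oMLX gemma), values copied from the table
rows = [
    ("Cold decode t/s", 45, 51),
    ("Cold TTFT ms", 353, 170),
    ("10K-prefix cold TTFT ms", 21977, 2272),
    ("Sustained decode t/s", 39, 50),
]

# Round to whole percentages, as the table does
deltas = {name: round(pct_change(base, cand)) for name, base, cand in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```

For throughput rows a positive delta is an improvement; for TTFT rows a negative delta is.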
How We Got There
First run: impossibly fast Ollama
Bench said 574 tokens/sec of decode on a 26B model. That's physically impossible on an M1 Max. The issue was in my harness, not the backends: Gemma 4 is a reasoning model, and Ollama's OpenAI-compat endpoint delivers its output in a reasoning field rather than content. My parser checked content, found empty strings, so TTFT never registered, and total_time − TTFT collapsed to milliseconds, giving a fake rate more than 10× too high (574 t/s against the 45 to 46 t/s measured once the harness was fixed).
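One way to avoid this class of bug is to treat a streamed chunk as "first token" if it carries text in either field. The sketch below is not the harness from this experiment: it assumes OpenAI-style server-sent-event lines ("data: {...}", terminated by "data: [DONE]"), assumes the reasoning text arrives under a delta key named reasoning as described above, and takes the generated-token count as an input rather than computing it. The function names are hypothetical.

```python
import json


def delta_text(sse_line: str) -> str:
    """Return the text carried by one streamed chunk, or '' if it has none."""
    if not sse_line.startswith("data:"):
        return ""
    payload = sse_line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return ""
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return ""
    delta = choices[0].get("delta") or {}
    # Count reasoning text as output too; checking only "content" is what
    # made TTFT and the decode window meaningless for a reasoning model.
    return (delta.get("reasoning") or "") + (delta.get("content") or "")


def measure(events, start: float, n_tokens: int):
    """events: iterable of (timestamp_seconds, sse_line).

    Returns (ttft_seconds, decode_tokens_per_second), where the decode
    window runs from the first to the last text-bearing chunk.
    """
    first = last = None
    for ts, line in events:
        if delta_text(line):
            if first is None:
                first = ts
            last = ts
    if first is None:
        raise ValueError("no text-bearing chunk in stream")
    window = last - first
    tps = n_tokens / window if window > 0 else float("nan")
    return first - start, tps
```

With this shape, a stream whose early chunks contain only reasoning text still registers TTFT at the first of those chunks, so the decode window covers the whole generation.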
Harness fixed: oMLX only ~20% faster on decode
With content-plus-reasoning counted, both backends benchmarked plausibly. 51 t/s on oMLX vs 46 on Ollama, same Gemma model. The marketing numbers I'd seen claimed 30–50% gains — why only 20%? Because Ollama 0.20.2 (the current release) already uses MLX kernels under the hood; the "MLX preview" that shipped in 0.19 became the default in 0.20. This was never MLX-vs-llama.cpp. It was two MLX-based servers, differing only in their scheduling, caching, and batching layers.
The marquee feature didn't show up
oMLX's headline is a two-tier KV cache (RAM hot + SSD cold) that's supposed to make warm-prefix TTFT tiny. But on a 2KB system prompt, oMLX's warm TTFT was 1,820 ms. Ollama's was 405 ms, more than four times faster. That's the opposite of the pitch. So either my test was wrong, or something in oMLX wasn't on.
The config audit
Looking at ~/.omlx/settings.json after install: hot_cache_max_size: "0" (disabled), ssd_cache_dir: null (disabled), initial_cache_blocks: 256 (tiny). Reading omlx serve --help confirmed it: the RAM hot cache default is 0 (disabled) and the paged SSD cache only activates if you set --paged-ssd-cache-dir explicitly. The features oMLX is marketed on are all off unless you opt in.
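The same audit can be scripted so a fresh install gets checked before any benchmarking. This is a sketch under assumptions: it treats settings.json as a flat JSON object with the three keys quoted above (the real file may nest them differently), and it uses the 1024-block value from this experiment as the threshold. The function names are hypothetical and not part of oMLX.

```python
import json
from pathlib import Path


def audit_settings(settings: dict) -> list[str]:
    """Flag the three stock values described above (flat key layout assumed)."""
    findings = []
    if str(settings.get("hot_cache_max_size", "0")) == "0":
        findings.append("hot_cache_max_size is 0: RAM hot cache disabled")
    if not settings.get("ssd_cache_dir"):
        findings.append("ssd_cache_dir is unset: paged SSD cache disabled")
    blocks = int(settings.get("initial_cache_blocks", 0))
    if blocks < 1024:
        findings.append(
            f"initial_cache_blocks is {blocks}: below the 1024 used in this experiment"
        )
    return findings


def load_settings(path: Path) -> dict:
    """Read a settings file; the caller supplies the path."""
    return json.loads(path.read_text())
```

Calling audit_settings(load_settings(Path.home() / ".omlx" / "settings.json")) on a stock install should then list all three findings, assuming the file layout matches.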
The silver-bullet setting
I enabled both caches and reran. Warm-prefix TTFT barely budged — but cold TTFT on the 10K-token scenario collapsed: 28.4 seconds → 2.3 seconds on Gemma, 21.9s → 3.6s on Qwen. The lever wasn't the caches; it was initial_cache_blocks, bumped 256 → 1,024. Preallocate enough blocks up front and oMLX can prefill long prompts without waiting on cascading dynamic allocations. Leave it at 256 and every long prompt spends 25+ seconds re-growing the pool before generating token one.
Qwen 3.6 35B-A3B under load
Single-request, Qwen at 43 t/s isn't exciting. Under 4 concurrent requests it hit 70 aggregate tokens/sec at 533 ms median TTFT — nearly double Ollama's best concurrent throughput on the same box. The 3B-active-params architecture lets different experts serve different requests without much contention. This is the number that actually answers the graduation question.
The Method
A Python harness hitting OpenAI-compatible /v1/chat/completions on both backends with streaming responses, running localhost-to-localhost on the Mac to eliminate network variance. Five scenarios per target:
- cold_short — fresh request, small prompt, 200-token generation. Baseline decode rate + TTFT.
- warm_prefix — 2KB system prompt, 5 different short user turns. Measures short-prefix cache effectiveness on turns 2–5.
- long_prefix — ~10K-token static prefix (realistic agent: role spec, tool catalog of 10 tools, context memory, prior turns), 5 different short user turns. Measures whether the KV cache engages on agent-scale prompts.
- long_decode — 800-token sustained generation. Decode rate under real output length.
- concurrency_4 — 4 parallel requests of the cold_short shape. Measures throughput under load + per-request TTFT fairness.
Each scenario runs with identical sampling (temperature=0, matched max_tokens) and both backends get a double warmup before measurement so nobody pays the first-request model-load tax. The judgment is whatever the measurement says, with one caveat: I didn’t repeat runs, so the numbers in the big table come from a single run per configuration. That’s enough to distinguish 28 seconds from 2.3, but not enough to call a 5% difference significant.
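For the concurrency_4 scenario, the two reported numbers are an aggregate rate and a median. A minimal sketch of how such a summary could be computed from per-request timings follows; the RequestResult record and its field names are assumptions for illustration, not the harness's actual data model, and the timings in the usage example are synthetic.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestResult:
    start_s: float          # when the request was sent
    first_token_s: float    # when the first text-bearing chunk arrived
    end_s: float            # when the stream finished
    completion_tokens: int  # tokens generated for this request


def summarize(results: list[RequestResult]) -> dict:
    """Aggregate tokens/sec over the shared wall-clock window, plus median TTFT."""
    wall = max(r.end_s for r in results) - min(r.start_s for r in results)
    total_tokens = sum(r.completion_tokens for r in results)
    ttfts_ms = [(r.first_token_s - r.start_s) * 1000 for r in results]
    return {
        "aggregate_tps": total_tokens / wall,
        "median_ttft_ms": median(ttfts_ms),
    }


# Synthetic example: four parallel requests, 200 tokens each, done in 10 s
example = [
    RequestResult(0.0, 0.4, 10.0, 200),
    RequestResult(0.0, 0.5, 10.0, 200),
    RequestResult(0.0, 0.6, 10.0, 200),
    RequestResult(0.0, 0.7, 10.0, 200),
]
summary = summarize(example)
print(summary)
```

Dividing total tokens by the shared wall-clock window, rather than averaging per-request rates, is what makes the figure comparable to a single-request decode rate.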
Practitioner principle worth naming: this is a throughput benchmark, not an eval. It measures how fast the server moves tokens, not whether the tokens are good. The next step, which I deliberately skipped here, is to point Hank at Qwen 3.6 and look at real trace quality over a week. Speed without quality is a regression.
What We Shipped
- oMLX installed via brew tap jundot/omlx && brew install omlx, bound to the LAN, with hot_cache_max_size=16GB, initial_cache_blocks=1024, max_concurrent_requests=16, ssd_cache_dir=~/.omlx/cache.
- Qwen 3.6 35B-A3B 4-bit MLX and Gemma 4 26B-A4B 4-bit MLX downloaded to ~/.omlx/models/.
- Hank (the community-learning agent for the Practical AI Slack workspace) swapped from Gemma 26B on inference-2 to Qwen 3.6 on the M1 Max, via three env var changes. The model runs, the soul-prompt honors instructions, and his answers look like Hank. Whether they rank like Hank across evals is the next question.
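The shipped values can be expressed as an overlay on the stock settings. The sketch below assumes settings.json is a flat JSON object keyed by the setting names listed above; that layout is an assumption, and whether oMLX expands the ~ in the cache path is not something this post verified. The helper name apply_tuning is hypothetical.

```python
import json

# Values taken from the shipped configuration listed above
TUNED = {
    "hot_cache_max_size": "16GB",
    "initial_cache_blocks": 1024,
    "max_concurrent_requests": 16,
    "ssd_cache_dir": "~/.omlx/cache",
}


def apply_tuning(settings: dict) -> dict:
    """Return a copy of settings with the tuned values layered on top."""
    return {**settings, **TUNED}


# Stock values as found in the config audit (flat layout assumed)
stock = {"hot_cache_max_size": "0", "ssd_cache_dir": None, "initial_cache_blocks": 256}
tuned = apply_tuning(stock)
print(json.dumps(tuned, indent=2))
```

Returning a new dict leaves the stock values untouched, which makes it easy to diff the two before writing anything back to disk.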
What’s Still Open
- Ollama’s warm-prefix cache is genuinely better than oMLX’s at short and long prefixes once both are hot: roughly 3.5× faster warm TTFT on the 10K case with the same Gemma model (545 ms vs 1,935 ms). For sequential agents with stable system prompts (most of mine), Ollama remains the better backend. oMLX wins under concurrency and on fresh long contexts.
- The oMLX defaults are a footgun. With hot cache off and initial_cache_blocks=256, stock installs look dramatically slower than they need to be. Worth filing upstream.
- Real eval quality for Qwen 3.6 on agent tasks is unknown. The next experiment is running Hank’s task suite against this config and comparing scores to the Gemma baseline. This post settled speed. It did not settle quality.
The raw experiment record lives at archie-core experiments as EXP-004.