I hit my Claude usage limit one afternoon and couldn’t figure out why. I hadn’t been working that hard. A few coding sessions, some writing, nothing that felt like it should have drained a daily allowance. So I did what I always do when a number doesn’t make sense: I went looking for the logs.
What I found is that my AI assistant had been keeping a very busy, very expensive bookkeeper on the payroll without telling me. This post is about how I fired that bookkeeper and handed the job to the most overqualified machine in my house: my ten-year-old’s Roblox computer.
The Hidden Cost of a Good Memory
I use claude-mem, a memory layer for Claude Code. It’s genuinely great: it watches what I do across sessions, extracts little “observations” about decisions and bug fixes, and summarizes each session when it ends. The next time I open a project, it hands the assistant a tidy briefing of everything we figured out last time. No more re-explaining my setup every morning.
But that magic isn’t free. Every one of those observations is itself an LLM call. Every session summary is another. When I actually counted, the background bookkeeping was running somewhere around 1,400 model calls a day—and all of them were billing against my Claude account, using the same quota I rely on for actual work.
That’s the trap with these convenience features: the foreground work is visible, but the background work is invisible until it shows up as a wall you’ve hit. My assistant was spending a meaningful chunk of my daily budget just taking notes about itself.
The thing is, this bookkeeping work doesn’t need a frontier model. Summarizing “we fixed the symlink config” or “the user prefers X” is a job a small, local model can do perfectly well. So why was I paying premium rates—and burning a finite quota—for it?
Enter the Mac Mini
Here’s the punchline: the machine I had in mind belongs to my ten-year-old. It’s an M4 Mac mini, and as far as its owner is concerned its entire purpose in life is running Roblox and watching YouTube. But that little M4 is the same chip the whole local-LLM crowd has been raving about—a genuinely capable inference machine with fast unified memory, the kind of thing people are buying specifically to run models at home. Mine just happened to come pre-loaded with a Minecraft obsession.
It runs 24/7 anyway, and right now it’s summer, so its rightful owner is off doing whatever ten-year-olds do with their days—which leaves the machine even more idle than usual. A kid’s gaming rig and a model doing quiet background work turn out to be very compatible roommates; even when she is parked in front of it, a 7-billion-parameter model jotting notes barely registers next to Roblox. So the plan was simple:
- Run Ollama on the Mac mini to serve a small local model.
- Point all of claude-mem’s background work at the Mac mini instead of the Claude API.
- Reach it over Tailscale, so it works from anywhere without exposing anything to the open internet.
Ollama exposes an OpenAI-compatible endpoint, which is the lingua franca these tools speak:
# On the Mac mini
brew install ollama
brew services start ollama
ollama pull qwen2.5:7b
That qwen2.5:7b model is the new bookkeeper. It’s small, it’s fast enough, and—this detail matters later—it only exists on my Mac mini. Nothing in the cloud serves that exact tag, which makes it a perfect fingerprint for verifying where my requests are actually going.
The Mac mini already runs with pmset sleep=0 so it never naps on the job, and Tailscale gives it a stable address on my private network. Total ongoing cost: the electricity it was already using.
The Catch: A Hardcoded URL
Here’s where it got interesting. I configured claude-mem to use its “OpenRouter” provider mode (OpenRouter being another OpenAI-compatible service), expecting to just override the base URL to my Mac mini.
There was no override. The endpoint was hardcoded in the worker source:
const endpoint = "https://openrouter.ai/api/v1/chat/completions";
No environment variable, no config key, no escape hatch. The tool assumed you’d only ever want to talk to OpenRouter’s servers.
When the polite door is locked, you use the window. I wrote a small patch script that rewrites that one constant to point at my Mac mini’s Tailscale address:
# patch-ollama-endpoint.sh — redirect the hardcoded endpoint
sed -i '' \
's|https://openrouter.ai/api/v1/chat/completions|http://<mac-mini-tailscale-ip>:11434/v1/chat/completions|' \
"$WORKER_SERVICE"
Then I set the provider config to feed it the right model and a dummy API key (Ollama ignores authentication anyway, which conveniently keeps my real keys off the wire entirely):
CLAUDE_MEM_PROVIDER=openrouter
CLAUDE_MEM_OPENROUTER_MODEL=qwen2.5:7b
CLAUDE_MEM_OPENROUTER_API_KEY=ollama # dummy; Ollama doesn't check
CLAUDE_MEM_OPENROUTER_MAX_TOKENS=8192
CLAUDE_MEM_CONTEXT_OBSERVATIONS=20 # was 50; trims session-start preload
That last setting was a nice side benefit. Dialing the number of observations injected at session start from 50 down to 20 cut my startup context from roughly 21k tokens to about 8k. Less noise for the assistant, faster starts for me.
Making It Survive Its Own Updates
A patched source file has an obvious weakness: the next time the tool updates, my change gets wiped and the bookkeeper quietly goes back to billing my Claude account. I’d be right back where I started, except now I wouldn’t be looking for it.
The fix is to make the patch self-healing. Claude Code lets you run a hook every time a session starts, so I wired one up to re-apply the patch idempotently:
{
"hooks": {
"SessionStart": [
{ "command": "~/.claude-mem/ensure-ollama-patch.sh" }
]
}
}
The script checks whether the endpoint is already pointed at the Mac mini. If it is, it does nothing. If an update reverted it, the script re-patches and bounces the worker so the change takes effect. Either way, every session I start silently guarantees the bookkeeper is still working for free. No calendar reminders, no manual babysitting.
Trust, but Verify
Rerouting your own traffic and assuming it worked is how you end up surprised by another usage wall. I wanted proof. This is where that distinctive model name earns its keep:
grep 'OpenRouter API usage' ~/.claude-mem/logs/claude-mem-*.log
OpenRouter API usage {model=qwen2.5:7b, inputTokens=3805,
outputTokens=992, totalTokens=4797, estimatedCostUSD=0.0263}
Because qwen2.5:7b only lives on my Mac mini, every log line naming it is a request that would have hit my Claude quota and instead went to the Roblox machine for free. (The estimatedCostUSD figure is fiction—it’s the tool guessing at OpenRouter’s prices. My real cost is zero.)
I even have Claude Code keeping a periodic eye on the whole arrangement now: a lightweight health check that confirms the Mac mini is reachable, the worker is alive, and fresh qwen2.5:7b calls are still flowing—pinging me only if something breaks. The bookkeeper has a supervisor, and I don’t have to be it.
A War Story, Free of Charge
No good infrastructure tale is complete without a yak-shave. The day I set this up, the Mac mini’s Ollama could list its models but refused to actually run any of them:
error starting llama-server: llama-server binary not found
The culprit was a half-finished Homebrew upgrade. The ollama 0.30.3 bottle had shipped incomplete—the llama-server binary that does the real inference work simply wasn’t in the package. An interrupted brew upgrade had stranded me on the broken version.
brew upgrade ollama # 0.30.3 -> 0.30.10
brew services restart ollama
The 0.30.10 bottle included the missing binary, and inference came right back. A good reminder that “it can list the models” and “it can run the models” are two very different claims.
Why This Was Worth Doing
I could have just paid for more quota. But this fits a pattern I keep coming back to, the same instinct behind running Whisper locally for captions or living in Taskwarrior instead of a SaaS to-do app: when a job can be done with hardware you already own, on data that never leaves your network, do it there.
The wins stack up nicely:
- Cost: ~1,400 daily calls moved from a metered account to free local compute.
- Privacy: my project context and session summaries get processed at home, not shipped to a third party.
- Resilience: a small local model handles the grunt work, freeing my Claude budget for the work that actually needs a frontier model.
- No lock-in: the whole thing is a config file, a one-line patch, and a hook. Easy to understand, easy to undo.
There’s a broader point here about these AI tools. They’re increasingly built on invisible background machinery—memory layers, summarizers, indexers—that quietly consume resources on your behalf. Convenient, until you’re the one paying for it without knowing. It’s worth occasionally asking where is this work actually happening, and who’s footing the bill?
In my case, the answer used to be “my Claude account, constantly.” Now it’s “a ten-year-old’s Roblox machine, for free.” That’s a much better deal—and the bookkeeper hasn’t filed a single complaint. Neither, for the record, has the ten-year-old, who remains blissfully unaware that an M4-class language model is moonlighting on her Mac mini between rounds of Blox Fruits.