06/14

Local LLM Stack

LIVE SINCE 2026-04 LATEST 2026-05-15 CATEGORY · PERSONAL INFRA
WHAT IT DOES§ 01

A private AI cloud running on two Mac Studios in the house. Daily routine queries, morning briefings, voice transcription, image OCR, and embeddings all run locally at zero per-call cost — and stay on the LAN, so the Chief of Staff and personal wiki can answer questions about my life without anything leaving the house.

HOW IT'S BUILT§ 02

Workload split across two Mac Studios along a "kitchen / pantry" line — one host handles the chat UI, vector DB, embeddings, TTS, OCR, Whisper, indexing, and self-healing health checks; the other runs the heavy generative models (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B, Gemma 4 E4B) via MLX.

Four-layer architecture: launchd ingestion daemons pull personal data to a NAS; a nightly indexer chunks content with Chonkie, embeds via local Qwen3-Embedding-4B (2560-dim), and writes to a handful of Chroma collections (~64k docs); inference flows through an llm-proxy front door with per-request InfluxDB cost tracking; Open WebUI surfaces it as a personal persona with native MCP tool servers for retrieval.

Operations layer runs Voxtral TTS, GLM-OCR, and Whisper STT alongside a self-healing health server with 300s cooldowns. External monitoring via Uptime Kuma with parity probes (/v1/models liveness + /v1/chat/completions readiness) on every backend lane.

TECH STACK§ 03
  • MLX
  • Apple Silicon
  • Open WebUI
  • Chroma
  • Gemma 4 26B-A4B
  • Qwen 3.6 35B-A3B
  • Whisper
  • GLM-OCR
  • Voxtral TTS
  • InfluxDB
  • Uptime Kuma
  • launchd
  • Docker
METRICS§ 04
~64kCHROMA DOCS
2MAC STUDIOS
3GENERATIVE MODELS
£0ROUTINE QUERY COST