Local LLM Stack
A private AI cloud running on two Mac Studios in the house. Daily routine queries, morning briefings, voice transcription, image OCR, and embeddings all run locally at zero per-call cost — and stay on the LAN, so the Chief of Staff and personal wiki can answer questions about my life without anything leaving the house.
Workload split across two Mac Studios along a "kitchen / pantry" line — one host handles the chat UI, vector DB, embeddings, TTS, OCR, Whisper, indexing, and self-healing health checks; the other runs the heavy generative models (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B, Gemma 4 E4B) via MLX.
Four-layer architecture: launchd ingestion daemons pull personal data to a NAS; a nightly indexer chunks content with Chonkie, embeds via local Qwen3-Embedding-4B (2560-dim), and writes to a handful of Chroma collections (~64k docs); inference flows through an llm-proxy front door with per-request InfluxDB cost tracking; Open WebUI surfaces it as a personal persona with native MCP tool servers for retrieval.
Operations layer runs Voxtral TTS, GLM-OCR, and Whisper STT alongside a self-healing health server with 300s cooldowns. External monitoring via Uptime Kuma with parity probes (/v1/models liveness + /v1/chat/completions readiness) on every backend lane.
- MLX
- Apple Silicon
- Open WebUI
- Chroma
- Gemma 4 26B-A4B
- Qwen 3.6 35B-A3B
- Whisper
- GLM-OCR
- Voxtral TTS
- InfluxDB
- Uptime Kuma
- launchd
- Docker