Local LLM Stack

LIVE SINCE 2026-04 | LATEST 2026-05-15 | CATEGORY · PERSONAL INFRA

WHAT IT DOES § 01

A private AI cloud running on two Mac Studios at home. Routine daily queries, morning briefings, voice transcription, image OCR, and embeddings all run locally at zero per-call cost. Everything stays on the LAN, so the Chief of Staff and the personal wiki can answer questions about my life without any data leaving the house.

HOW IT'S BUILT § 02

The workload is split across the two Mac Studios along a "kitchen / pantry" line. One host handles the chat UI, vector DB, embeddings, TTS, OCR, Whisper, indexing, and self-healing health checks; the other runs the heavy generative models (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B, Gemma 4 E4B) via MLX.
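
As a rough illustration of that split, the placement rule can be reduced to a lookup from workload to host. This is only a sketch: the host labels, the lower-case model slugs, and the `host_for` function are hypothetical names, and the write-up does not say which machine is the "kitchen" and which is the "pantry".

```python
# Hypothetical labels: the write-up does not name the two machines.
SERVICES_HOST = "services-host"  # chat UI, vector DB, embeddings, TTS, OCR, Whisper, indexing, health checks
MODELS_HOST = "models-host"      # heavy generative models served via MLX

# Hypothetical slugs for the three generative models listed in the text.
GENERATIVE_MODELS = {"gemma-4-26b-a4b", "qwen-3.6-35b-a3b", "gemma-4-e4b"}

# Hypothetical slugs for the supporting workloads listed in the text.
SUPPORT_WORKLOADS = {
    "chat-ui", "vector-db", "embeddings", "tts",
    "ocr", "whisper", "indexing", "health",
}


def host_for(workload: str) -> str:
    """Return the host label a workload is placed on, or raise KeyError."""
    key = workload.strip().lower()
    if key in GENERATIVE_MODELS:
        return MODELS_HOST
    if key in SUPPORT_WORKLOADS:
        return SERVICES_HOST
    raise KeyError(f"unknown workload: {workload!r}")
```

Keeping the mapping in one table means moving a service between machines is a one-line change rather than a code change.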

Four-layer architecture. Ingestion: launchd daemons pull personal data to a NAS. Indexing: a nightly indexer chunks content with Chonkie, embeds it with a local Qwen3-Embedding-4B model (2560-dim), and writes to a handful of Chroma collections (~64k docs). Inference: requests flow through an llm-proxy front door that records per-request cost tracking in InfluxDB. Interface: Open WebUI surfaces it all as a personal persona, with native MCP tool servers for retrieval.
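
The indexing step (chunk, embed, write) can be sketched with standard-library stand-ins. Nothing here is the real indexer: `chunk_text` is a plain overlapping character window standing in for Chonkie, a `dict` stands in for a Chroma collection, and `fake_embed` is a placeholder that only produces a vector of the stated 2560 dimensions. The deterministic chunk IDs are an assumption about how a nightly job might avoid duplicating documents on re-runs.

```python
import hashlib
from typing import Callable

EMBED_DIM = 2560  # dimension the text states for Qwen3-Embedding-4B


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap (Chonkie stand-in)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def chunk_id(source: str, index: int, chunk: str) -> str:
    """Deterministic ID, so re-indexing the same content overwrites instead of duplicating."""
    payload = f"{source}\x00{index}\x00{chunk}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def fake_embed(chunk: str) -> list[float]:
    """Placeholder embedder: NOT a model, it only returns a right-sized vector."""
    seed = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [seed[i % len(seed)] / 255.0 for i in range(EMBED_DIM)]


def index_document(
    collection: dict,
    source: str,
    text: str,
    embed: Callable[[str], list[float]] = fake_embed,
) -> int:
    """Chunk, embed, and upsert one document; returns the number of chunks written."""
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        vector = embed(chunk)
        if len(vector) != EMBED_DIM:
            raise ValueError(f"expected {EMBED_DIM} dims, got {len(vector)}")
        collection[chunk_id(source, i, chunk)] = {
            "source": source,
            "text": chunk,
            "embedding": vector,
        }
    return len(chunks)
```

Running `index_document` twice on unchanged input leaves the store the same size, which is the property a nightly re-index needs.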

The operations layer runs Voxtral TTS, GLM-OCR, and Whisper STT alongside a self-healing health server with 300 s cooldowns. External monitoring uses Uptime Kuma, with parity probes on every backend lane: /v1/models for liveness and /v1/chat/completions for readiness.
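
A minimal sketch of the two ideas in this paragraph, under stated assumptions. It assumes the 300 s cooldown is a minimum gap between restart attempts for the same service, which the text does not spell out. `RestartGate`, `lane_status`, and the status labels "down", "degraded", and "ready" are hypothetical names, not part of the real health server or of Uptime Kuma.

```python
import time
from typing import Callable, Dict

COOLDOWN_SECONDS = 300.0  # value from the text; what it gates is an assumption


class RestartGate:
    """Allow at most one restart attempt per service per cooldown window."""

    def __init__(
        self,
        cooldown: float = COOLDOWN_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last_attempt: Dict[str, float] = {}

    def try_acquire(self, service: str) -> bool:
        """Return True and record the attempt if the service is outside its cooldown."""
        now = self._clock()
        last = self._last_attempt.get(service)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_attempt[service] = now
        return True


def lane_status(liveness_ok: bool, readiness_ok: bool) -> str:
    """Combine the two probe results for one backend lane.

    liveness_ok:  the /v1/models probe succeeded (the server process answers).
    readiness_ok: the /v1/chat/completions probe succeeded (a model can generate).
    """
    if not liveness_ok:
        return "down"
    return "ready" if readiness_ok else "degraded"
```

Probing both endpoints separates "the server is up" from "the model can actually answer", so a lane that lists models but cannot complete a chat request shows up as degraded rather than healthy.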

TECH STACK § 03

MLX
Apple Silicon
Open WebUI
Chroma
Gemma 4 26B-A4B
Qwen 3.6 35B-A3B
Whisper
GLM-OCR
Voxtral TTS
InfluxDB
Uptime Kuma
launchd
Docker

METRICS § 04

~64k CHROMA DOCS

2 MAC STUDIOS

3 GENERATIVE MODELS

£0 ROUTINE QUERY COST