ronreiter@github ~/blog$

Running Claude Code and Pi on DeepSeek V4 Flash — locally on a 128GB MacBook Pro

#llm #local-ai #deepseek #claude-code #macos

A 284-billion-parameter frontier model, running entirely offline on a laptop — and wired up as a backend for two agent harnesses: Claude Code and Pi.

DeepSeek V4 Flash dropped in April 2026: a 284B-parameter Mixture-of-Experts model (13B active per token), MIT-licensed, with a 1M-token context window. The interesting part for me wasn’t the benchmarks — it was the claim, floating around the internet, that you could run it locally on an Apple Silicon Mac with enough RAM.

I have a MacBook Pro with an M3 Max and 128GB of unified memory. So I tried it. Here’s everything that worked, everything that didn’t, and the scripts I ended up with.


TL;DR

  • It works. ~21 tokens/sec generation, fully on the Metal GPU, ~81GB resident.
  • You cannot use mainline llama.cpp or Ollama yet — the deepseek4 architecture isn’t merged. You need antirez’s experimental fork.
  • The model file is an 81GB 2-bit “Dwarf Star” quant from antirez/deepseek-v4-gguf, purpose-built for 128GB Macs.
  • llama-server now speaks the Anthropic Messages API natively, so you can point Claude Code at it with zero proxies.
  • 1M context loads but crashes at inference; 256k is the reliable ceiling on this fork.

The hardware

Chip:    Apple M3 Max (12 performance + 4 efficiency cores)
Memory:  128 GB unified

The 128GB is the whole ballgame. The 2-bit quant needs ~81GB resident, which means a 64GB machine is out — you’d swap to death or OOM. 128GB is the sweet spot the quant was designed around. (There’s a bigger Q4 variant at 153GB for the 192GB Mac Studios, and DeepSeek-V4-Pro quants too, but Flash-q2 is the one that fits a laptop.)
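If you want to check a machine before committing to an 81GB download, a minimal sketch could look like this. The 81GB figure is from above; the 24GB headroom is my own assumed margin, not a measured requirement, and fits_quant is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: does a machine have enough unified memory for the 81GB quant?
MODEL_GB=81       # size of the 2-bit quant, from the post
HEADROOM_GB=24    # ASSUMPTION: margin for macOS, context and compute buffers

# fits_quant <total_memory_bytes> -> prints "fits" or "too small"
fits_quant() {
  local total_gb=$(( $1 / 1024 / 1024 / 1024 ))
  if (( total_gb >= MODEL_GB + HEADROOM_GB )); then
    echo "fits"
  else
    echo "too small"
  fi
}

fits_quant $(( 128 * 1024 * 1024 * 1024 ))   # prints: fits
fits_quant $(( 64 * 1024 * 1024 * 1024 ))    # prints: too small
```

On a real Mac you would feed it the output of sysctl -n hw.memsize instead of a hard-coded number.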

False start: the guide that didn’t work

I started from a tutorial that told me to git clone mainline llama.cpp, build it, and huggingface-cli download <some-repo>/deepseek-v4-flash. Two problems:

  1. Mainline llama.cpp doesn’t support DeepSeek V4. The deepseek4 architecture, with its sparse attention, hyper-connections, and multi-token-prediction head, isn’t in stable releases. Ollama doesn’t support it either: it should pick up support once the architecture merges upstream, but that hadn’t happened when I tried.
  2. The download command had a literal placeholder for the repo. There was no real source behind it.

So if you find a tutorial telling you to use stock llama.cpp or ollama pull deepseek-v4, close the tab. As of mid-2026 that path does not exist.

What actually works: antirez’s fork

Salvatore “antirez” Sanfilippo (creator of Redis) maintains an experimental llama.cpp fork that implements the deepseek4 architecture, plus a HuggingFace repo of matching GGUF quants. The key file:

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf   (81 GB)

That filename is a recipe. It’s IQ2_XXS (2-bit) for the routed experts — which is where almost all 284B parameters live — but keeps the attention projections, shared experts, and output layer at Q8. The parts that matter for coherence stay high-precision; the giant sparse expert tables get crushed to 2 bits. antirez calls it the “Dwarf Star” quant. His own note: “behaves very very well in the chat, frontier-model vibes, but it was not extensively tested.” That matches my experience.

Building it is standard llama.cpp:

git clone --depth 1 https://github.com/antirez/llama.cpp-deepseek-v4-flash llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

This gives you llama-cli, llama-server, and llama-completion. On first run, the Metal backend detected my M3 Max GPU correctly:

ggml_metal_device_init: GPU name:   MTL0 (Apple M3 Max)
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB

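That ~115GB working-set ceiling is the number to keep in mind: the model eats ~83GB of it, leaving ~32GB for context and compute buffers on paper. In practice the margin is tighter, because this log line counts decimal MB (115,448 MB is 110,100 MiB) while the memory breakdown shown later counts MiB and reports about 26,000 MiB free with the model loaded.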
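To keep the units straight, here is a small sketch that converts the log's decimal MB into the MiB used by llama.cpp's memory breakdown, then subtracts the model's share. The function names are mine, the inputs are the figures from the log line above and from the memory breakdown later in this post, and awk is assumed to be installed.

```shell
#!/usr/bin/env bash
# mb_to_mib <decimal_MB> -> the same amount in MiB, rounded to an integer
mb_to_mib() {
  awk -v mb="$1" 'BEGIN { printf "%.0f\n", mb * 1000000 / 1048576 }'
}

# headroom_mib <working_set_MB> <model_MiB> -> MiB left after the model is resident
headroom_mib() {
  echo $(( $(mb_to_mib "$1") - $2 ))
}

mb_to_mib 115448.73            # prints: 110100
headroom_mib 115448.73 83161   # prints: 26939
```

The 110100 it prints matches the "total" column of the memory breakdown, which is how you can tell the two logs use different units.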

Things that nearly fooled me

“It’s running on the CPU!” (it wasn’t)

My first test generation seemed to hang. top showed the process pegged at 99% on a single core for 19 minutes with no output. I was convinced the custom DeepSeek ops (the sparse-attention “indexer”, the “compressor”) had no Metal kernels and were falling back to CPU.

They weren’t. Two things were happening:

  1. I’d piped the output through tail, which cannot print the last lines until its input ends, so I saw nothing while it generated fine.
  2. The 99%-single-core is just the orchestration thread spinning while the GPU does the matmuls. The real proof came from the memory breakdown:
| memory breakdown [MiB] | total   free    self    ... |
| MTL0 (Apple M3 Max)    | 110100 = 26265 + (83161 ...) |

83GB sitting on MTL0 — the Metal GPU. It was on the GPU the whole time. Lesson: don’t pipe a streaming LLM through tail, and check the memory breakdown before blaming the CPU.
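For the piping half of that lesson, a minimal sketch of the alternative: tee passes each line through as it arrives and keeps a copy you can inspect later. Here fake_stream is a hypothetical stand-in for the model process, not a real llama.cpp command.

```shell
#!/usr/bin/env bash
# fake_stream stands in for a generating model: one line per "token", slowly.
fake_stream() {
  for tok in one two three; do
    echo "$tok"
    sleep 0.1
  done
}

log="$(mktemp)"

# tee copies each line to the terminal as it arrives and saves it to the log.
# `fake_stream | tail -n 1` would instead stay silent until fake_stream exits.
fake_stream | tee "$log"

# Look at the end of the log afterwards (or follow it live with tail -f).
tail -n 1 "$log"   # prints: three
```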

Speed and load time

  • Generation: ~21 tok/s. Prompt eval: ~32–43 tok/s.
  • Cold load: ~9 minutes (reading 81GB off disk). Warm load: ~4 seconds once the file is in the OS page cache. So your second launch is dramatically faster than your first.

How big can the context actually be?

The model supports 1M tokens. The question is what fits and computes in ~32GB of leftover working set. I measured it empirically — and DeepSeek’s sparse attention makes the KV cache shockingly cheap (sliding-window of 128 + a top-512 indexer, instead of dense full-sequence attention):

Context   Total resident    Result
2k        ~82 GB            ✅ (KV cache only ~66 MiB)
64k       ~83 GB            ✅
256k      ~88–91 GB         ✅ this is the one I settled on
1M        ~85 GB (loads)    ❌ Compute error at inference time

So memory was never the limit — even 1M loads in 85GB. But at 1M the fork fails to build the compute graph and every request returns {"error":{"code":500,"message":"Compute error."}}. 256k computes reliably, is larger than hosted Claude’s standard 200k window, and leaves headroom. That’s what I bake into the server.
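If you probe other context sizes yourself, it helps to recognize that failure mechanically rather than by eye. A minimal sketch, assuming the error body looks exactly like the one quoted above; is_compute_error is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# is_compute_error <json_body> -> exit 0 if the body is the 500 "Compute error." reply
is_compute_error() {
  printf '%s' "$1" | grep -qF '"message":"Compute error."'
}

body='{"error":{"code":500,"message":"Compute error."}}'
if is_compute_error "$body"; then
  echo "compute graph failed: try a smaller -c"
fi
```

In a real probe, body would be the output of a curl request against the server started with the context size under test.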

Wiring it into Claude Code

This was the surprise payoff. Recent llama-server exposes an Anthropic Messages API endpoint (/v1/messages) alongside the OpenAI one — so no proxy, no claude-code-router, no LiteLLM needed. You point Claude Code straight at llama-server.

A raw test against the endpoint:

curl -s http://127.0.0.1:8080/v1/messages \
  -H "content-type: application/json" -H "anthropic-version: 2023-06-01" \
  -d '{"model":"deepseek-v4-flash","max_tokens":40,
       "messages":[{"role":"user","content":"Reply with exactly: BRIDGE OK"}]}'
{"type":"message","role":"assistant",
 "content":[{"type":"thinking","thinking":"..."},{"type":"text","text":"BRIDGE OK"}],
 "stop_reason":"end_turn","usage":{"cache_read_input_tokens":0,"input_tokens":12,"output_tokens":37}}

Proper Anthropic-shaped response, thinking blocks and all — and note cache_read_input_tokens, so prompt caching works too. Two things to get right:

  • Start the server with --jinja or tool/function calling won’t work (Claude Code lives and dies by tool calls).
  • Do NOT put ANTHROPIC_BASE_URL in your global ~/.claude/settings.json. That hijacks every claude you run — including your normal cloud sessions. Set the env vars in a launcher script instead, so it’s opt-in per invocation.
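A quick way to check the first point is a tool-calling smoke test against /v1/messages. This is a sketch, not something from my original test run: get_time is a made-up tool, the payload follows the standard Anthropic Messages tool format (name, description, input_schema), and I am assuming the fork's endpoint accepts it the same way.

```shell
#!/usr/bin/env bash
# tool_smoke_payload -> prints a Messages API request that offers one trivial tool
tool_smoke_payload() {
  cat <<'JSON'
{"model":"deepseek-v4-flash","max_tokens":256,
 "tools":[{"name":"get_time","description":"Return the current time.",
           "input_schema":{"type":"object","properties":{},"required":[]}}],
 "messages":[{"role":"user","content":"Call the get_time tool."}]}
JSON
}

# has_tool_use <response_json> -> exit 0 if the reply contains a tool_use block
has_tool_use() {
  printf '%s' "$1" | grep -q '"type" *: *"tool_use"'
}

# Usage against a running server (not executed here):
#   resp="$(tool_smoke_payload | curl -s http://127.0.0.1:8080/v1/messages \
#     -H "content-type: application/json" -H "anthropic-version: 2023-06-01" -d @-)"
#   has_tool_use "$resp" && echo "tool calling OK" || echo "no tool_use block"
```

If the reply never contains a tool_use block, check that the server was started with --jinja before suspecting the model.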

The end-to-end proof: I ran Claude Code headless against the local model and asked it to reply LOCAL CLAUDE OK. The server log showed it ingesting a 20,556-token prompt (Claude Code’s system prompt + tool schemas), and after chewing through it… LOCAL CLAUDE OK. 🎉

The honest caveat: that 20k-token system prompt takes roughly 8 to 11 minutes to process on the first turn at the ~32–43 tok/s prompt-eval rate I measured. Prompt caching makes later turns faster, but this is not a snappy daily driver. It’s a 2-bit model on a laptop. It’s genuinely useful for offline/air-gapped work and experimentation; it is not going to feel like the hosted product.
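The first-turn wait is easy to estimate for any harness: divide the prompt size by the prompt-eval rate. A small sketch using the 20,556-token prompt and the 32 and 43 tok/s figures from this post; prefill_seconds is a hypothetical helper and uses integer division, so it rounds down.

```shell
#!/usr/bin/env bash
# prefill_seconds <prompt_tokens> <tokens_per_second> -> whole seconds to ingest the prompt
prefill_seconds() {
  echo $(( $1 / $2 ))
}

prefill_seconds 20556 32   # prints: 642  (about 10.7 minutes)
prefill_seconds 20556 43   # prints: 478  (about 8 minutes)
```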

Bonus: the same model in Pi (a second harness)

Pi is a minimal, provider-agnostic coding agent (@earendil-works/pi-coding-agent). Since it advertises an OpenAI provider and reads OPENAI_API_KEY, I assumed I could just set OPENAI_BASE_URL=http://localhost:8080/v1 and run pi --provider openai. That doesn’t work — Pi’s built-in openai provider ignores OPENAI_BASE_URL and goes straight to api.openai.com:

OpenAI API error (401): Incorrect API key provided: local.
You can find your API key at https://platform.openai.com/account/api-keys.

The correct way to point Pi at a local server is to register a custom provider via a tiny extension (pi.registerProvider). Once that’s loaded, pi --list-models shows your local model and everything routes locally:

provider  model              context  max-out
local     deepseek-v4-flash  262.1K   8.2K

A quick pi -p "What is 2+2?" returns 4 — through the local model, fully offline. Pi’s system prompt is much smaller than Claude Code’s, so it feels noticeably snappier on the same hardware (less prompt to chew through each turn).


The scripts

Everything lives in ~/deepseek-v4-flash/. (I also have a setup.sh that installs prereqs, builds the fork, and downloads the model — omitted here for brevity; the interesting bits are below.)

serve.sh — run the model server in the background

Exposes both the Anthropic and OpenAI APIs on 127.0.0.1:8080, at 256k context, with --jinja for tool calling.

#!/usr/bin/env bash
# Start/stop the DeepSeek server (llama-server) on http://127.0.0.1:8080.
# Exposes /v1/messages (Anthropic) and /v1/chat/completions (OpenAI).
#   ./serve.sh [start|stop|status|logs]
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"
BIN="$DIR/llama.cpp/build/bin/llama-server"
MODEL="$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf"
LOG="$DIR/server.log"; PORT=8080; HOST=127.0.0.1; CTX=262144   # 256k

case "${1:-start}" in
  stop)   pkill -f "[l]lama-server" && echo stopped || echo "not running" ;;
  status) curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 \
            && { echo "UP http://$HOST:$PORT"; ps -axo rss,command | grep "[l]lama-server" \
                 | awk '{printf "  %.1f GB\n",$1/1048576}'; } \
            || echo DOWN ;;
  logs)   tail -f "$LOG" ;;
  start)
    curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 && { echo "already running"; exit 0; }
    pkill -f "[l]lama-server" 2>/dev/null || true; sleep 1; : > "$LOG"
    nohup "$BIN" -m "$MODEL" -ngl 99 -c "$CTX" --jinja --host "$HOST" --port "$PORT" >> "$LOG" 2>&1 &
    echo "starting (pid $!), ctx=$CTX — loading ~81GB, takes a few minutes"
    printf "waiting"
    until curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1; do
      pgrep -f "[l]lama-server" >/dev/null || { echo " FAILED — see $LOG"; exit 1; }
      printf .; sleep 3
    done
    echo " UP on http://$HOST:$PORT" ;;
  *) echo "usage: $0 {start|stop|status|logs}"; exit 1 ;;
esac

Key flags: -ngl 99 offloads all layers to Metal, -c 262144 sets the 256k window, --jinja enables tool calling.

claude-local.sh — run Claude Code against the local model

The whole trick is here: set the ANTHROPIC_* env vars for this invocation only, auto-starting the server if it’s down. Your normal cloud claude in other tabs is untouched.

#!/usr/bin/env bash
# Launch Claude Code against the LOCAL DeepSeek server (this invocation only;
# your normal cloud `claude` is untouched). Args are forwarded to claude.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"; HOST=127.0.0.1; PORT=8080
command -v claude >/dev/null 2>&1 || { echo "Claude Code CLI not found."; exit 1; }
curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 || { echo "starting local server..."; "$DIR/serve.sh" start; }

export ANTHROPIC_BASE_URL="http://$HOST:$PORT"
export ANTHROPIC_API_KEY="local-no-auth"          # server ignores auth; this just skips the login flow
export ANTHROPIC_AUTH_TOKEN="local-no-auth"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"   # route the small/fast model locally too
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1  # no telemetry / update pings — fully local
exec claude "$@"

Then, in any new terminal tab:

~/deepseek-v4-flash/claude-local.sh
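Since the launcher only stays opt-in if nothing sets ANTHROPIC_BASE_URL globally, a small guard can confirm that. This is a sketch: has_global_override is a hypothetical helper, CLAUDE_SETTINGS is a made-up variable for pointing it at another file, and it does a plain text search rather than parsing the JSON.

```shell
#!/usr/bin/env bash
# has_global_override <settings_file> -> exit 0 if the file mentions ANTHROPIC_BASE_URL
has_global_override() {
  [ -f "$1" ] && grep -q 'ANTHROPIC_BASE_URL' "$1"
}

# CLAUDE_SETTINGS is a hypothetical override, handy for testing.
settings="${CLAUDE_SETTINGS:-$HOME/.claude/settings.json}"
if has_global_override "$settings"; then
  echo "warning: $settings sets ANTHROPIC_BASE_URL for every claude session"
else
  echo "ok: no global override in $settings"
fi
```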

chat.sh — plain terminal chat (no Claude Code)

For a quick conversation without the agent harness. The -cnv flag runs interactive conversation mode.

#!/usr/bin/env bash
# Interactive terminal chat with DeepSeek V4 Flash.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"
exec "$DIR/llama.cpp/build/bin/llama-cli" \
  -m "$DIR/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf" \
  -ngl 99 -c 8192 -cnv "$@"

pi-local-provider.js — register the local server as a Pi provider

Pi won’t honor OPENAI_BASE_URL, so we register a custom local provider in an extension. api: "openai-completions" matches llama-server’s OpenAI endpoint.

// Load with: pi -e ~/deepseek-v4-flash/pi-local-provider.js --provider local --model local/deepseek-v4-flash
export default async function (pi) {
  pi.registerProvider("local", {
    baseUrl: "http://127.0.0.1:8080/v1",
    apiKey: "local-no-auth",          // llama-server ignores auth; any non-empty value works
    api: "openai-completions",
    models: [
      {
        id: "deepseek-v4-flash",
        name: "DeepSeek V4 Flash (local, q2)",
        reasoning: false,
        input: ["text"],
        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
        contextWindow: 262144,
        maxTokens: 8192,
      },
    ],
  });
}

pi-local.sh — run Pi against the local model

Install Pi first (curl -fsSL https://pi.dev/install.sh | sh, or brew/npm), then:

#!/usr/bin/env bash
# Launch the Pi coding agent against the LOCAL DeepSeek server.
# Auto-starts the model server if needed. Args are forwarded to pi.
set -euo pipefail
DIR="$HOME/deepseek-v4-flash"; HOST=127.0.0.1; PORT=8080
command -v pi >/dev/null 2>&1 || { echo "Pi not installed — see https://pi.dev"; exit 1; }
curl -sf "http://$HOST:$PORT/health" >/dev/null 2>&1 || { echo "starting local server..."; "$DIR/serve.sh" start; }
exec pi -e "$DIR/pi-local-provider.js" --provider local --model local/deepseek-v4-flash "$@"

Then, in any terminal:

~/deepseek-v4-flash/pi-local.sh                       # interactive
~/deepseek-v4-flash/pi-local.sh -p "explain this repo"   # one-shot

Would I actually use this?

For day-to-day coding? No — the hosted models are an order of magnitude faster and smarter. But as a demonstration that a 284B frontier-class MoE runs offline on a laptop, and that you can drive Claude Code with zero cloud dependency, it’s remarkable. Air-gapped environments, flights, privacy-sensitive work, or just the sheer “because I can” factor — that’s where this shines.

The pieces that made it possible — antirez’s architecture port, a 2-bit quant that keeps the right layers precise, DeepSeek’s sparse attention keeping the KV cache tiny, and llama-server’s native Anthropic endpoint — are each individually clever. Stacked together, they put a frontier model on my lap. Literally.

Built and tested on macOS, Apple M3 Max, 128GB, June 2026.