Tutorial: Self-Hosting DeepSeek Coder V3 on a Consumer GPU
Get GPT-5 level coding assistance for free, running locally on your RTX 5090.
DeepSeek Coder V3 has shattered the leaderboards. It beats GPT-4o and Claude 3.5 Sonnet on HumanEval, yet its weights are fully open. Here is how to run it locally.
Why Self-Host?
- Security: Your proprietary code never leaves your network.
- Cost: One-time hardware cost vs $20/month/user.
- Customization: You can fine-tune it on your specific codebase.
Hardware Requirements
- Minimum: NVIDIA RTX 3090/4090 (24GB VRAM) for the 33B model (4-bit quantized).
- Recommended: NVIDIA RTX 5090 (32GB VRAM) for the 33B model at higher precision.
- CPU: Largely irrelevant for inference speed; you mainly need enough system RAM (64GB recommended) to load and swap model files.
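A quick way to sanity-check these requirements is a back-of-envelope VRAM estimate: weights at the quantization's average bits per weight, plus some fixed headroom for the KV cache and activations. The numbers below (~4.5 bits/weight for Q4_K_M, ~8.5 for 8-bit, 2 GB overhead) are rough assumptions, not measured values:

```python
# Back-of-envelope VRAM estimate for a quantized LLM.
# Assumptions (rough, not measured): Q4_K_M averages ~4.5 bits
# per weight, 8-bit ~8.5, and we reserve ~2 GB of headroom for
# the KV cache and activations.

def model_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    """Estimated VRAM in GB: quantized weights plus fixed overhead."""
    weight_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb

q4 = model_vram_gb(33, 4.5)  # ~20.6 GB -> fits a 24 GB card
q8 = model_vram_gb(33, 8.5)  # ~37.1 GB -> too big even for 32 GB
```

This is why the 4-bit file squeezes onto a 24GB RTX 3090/4090, while higher-precision variants of a 33B model overflow even the 5090's 32GB.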
Step 1: Install LM Studio or Ollama
For beginners, LM Studio provides a nice GUI.
- Download LM Studio for Linux/Windows.
- Search for “DeepSeek-Coder-V3-33B-GGUF”.
- Download the Q4_K_M (4-bit medium) quantization file (~20GB).
Step 2: VS Code Integration
You don’t want to chat in a separate window; you want it in your editor.
- Install the “Continue” extension in VS Code.
- Initial configuration (in Continue's config.json):

```json
{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "lmstudio",
      "model": "deepseek-coder-v3",
      "apiBase": "http://localhost:1234/v1"
    }
  ]
}
```
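Before wiring up the editor, it is worth confirming the server actually answers. Here is a minimal smoke test, assuming LM Studio's OpenAI-compatible server is running on localhost:1234 and that the model name matches what LM Studio reports (the "deepseek-coder-v3" tag here mirrors the config above and may differ on your machine):

```python
# Smoke test for the local OpenAI-compatible endpoint that the
# Continue config points at. Assumes LM Studio's server is running
# on localhost:1234; the model name must match the one LM Studio
# reports (the tag below is taken from the config above).
import json
import urllib.request

API_BASE = "http://localhost:1234/v1"

def build_request(prompt: str) -> dict:
    """Chat-completions payload in the OpenAI wire format."""
    return {
        "model": "deepseek-coder-v3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def ask(prompt: str) -> str:
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(ask("Write a Python one-liner that reverses a string."))
    except OSError as exc:  # URLError subclasses OSError
        print(f"Server not reachable: {exc}")
</imports>
</imports>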
Step 3: Context Awareness
DeepSeek supports a massive 128k context window.
In the Continue extension, add your src folder to the context (for example, via a codebase context provider).
Note: A large context eats VRAM, because the KV cache grows with every token. Use sparingly.
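Before dumping a whole tree into the prompt, it helps to estimate whether it even fits in 128k tokens. A common rule of thumb is ~4 characters per token for code and English text; the sketch below uses that heuristic (real counts depend on the model's tokenizer, and the extension list is just an example):

```python
# Rough check of whether a source tree fits in the 128k context
# window. Uses the common ~4 characters/token heuristic; actual
# counts depend on the model's tokenizer. The extension list is
# illustrative, not exhaustive.
from pathlib import Path

CONTEXT_TOKENS = 128_000

def estimate_tokens(root: str, exts=(".py", ".ts", ".js", ".go")) -> int:
    chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
    return chars // 4  # ~4 chars per token, on average

def fits_in_context(root: str) -> bool:
    return estimate_tokens(root) <= CONTEXT_TOKENS
```

If your src folder comes in well over the limit, feed the model only the subdirectories relevant to the task at hand.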
Performance Tuning
- GPU Offload: Set to “Max” (all layers on GPU). If you split layers between CPU and GPU, throughput can drop from ~50 tokens/sec to ~5 tokens/sec.
- Flash Attention: Make sure your backend supports Flash Attention 2; it can roughly double inference speed.
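To see whether a tuning change actually helped, measure throughput rather than eyeballing it. A simple sketch: count tokens coming off any streaming generator and divide by wall-clock time (the dummy stream below just simulates output; swap in your backend's streaming API):

```python
# Quick throughput check: tokens/sec from any token generator.
# Useful for comparing full-GPU offload against a CPU/GPU split.
# dummy_stream only simulates a token stream; replace it with
# your backend's real streaming API.
import time

def tokens_per_second(token_stream) -> float:
    """Consume a token stream and return observed tokens/sec."""
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

def dummy_stream(n_tokens: int, delay_s: float):
    for _ in range(n_tokens):
        time.sleep(delay_s)
        yield "tok"

rate = tokens_per_second(dummy_stream(50, 0.001))
```

Run the same prompt before and after toggling GPU offload or Flash Attention, and compare the two rates.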
Conclusion
For the price of a high-end gaming PC, you get a world-class coding assistant that lives in your basement and reads your code securely.