DeepSeek V3/R1 Local Deployment: Run the Strongest Open-Source Models on Your Computer
No expensive H100 GPUs needed. Learn to deploy DeepSeek V3 and the reasoning model R1 locally with Ollama and vLLM.
In early 2025, one name dominated the AI world: DeepSeek.
This AI lab from China released DeepSeek-V3 and DeepSeek-R1, models that go head-to-head with GPT-4o and Claude 3.5 Sonnet across many benchmarks. More importantly, their weights are openly released, and DeepSeek-R1's exceptional reasoning capabilities make short work of complex math and programming problems.
Today, we’re not just talking about it—we’re teaching you how to run it on your own machine.
Why Run DeepSeek Locally?
- Privacy: Your code, your documents—completely offline.
- Latency: No network delay; local inference speed depends on your GPU.
- No Censorship: Local models (usually) don’t have the strict moderation of cloud APIs.
- Free: No token costs beyond electricity.
Hardware Requirements
Alongside the full-size models, DeepSeek released smaller distilled checkpoints (based on Llama and Qwen), making them runnable on regular GPUs.
- DeepSeek-R1-Distill-Llama-8B:
- VRAM: ~6GB (4-bit quantized)
- Recommended: RTX 3060 / 4060
- DeepSeek-R1-Distill-Qwen-32B:
- VRAM: ~20GB (4-bit quantized)
- Recommended: RTX 3090 / 4090 or Mac M2/M3 Max (32GB+)
- DeepSeek-V3 (671B MoE):
- VRAM: Massive (multi-GPU H800 or high-memory Mac Studio). Normal users should use the API or distilled versions.
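Before downloading, you can sanity-check these numbers yourself: quantized weights take roughly `params × bits / 8` bytes, plus extra room for the KV cache and runtime buffers. A minimal sketch (the flat 1.5 GB overhead is an assumption; real usage grows with context length):

```python
def estimate_vram_gb(params_billion: float, bits: int = 4, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weights at `bits` per parameter,
    plus a flat allowance for KV cache and runtime buffers."""
    weights_gb = params_billion * bits / 8  # 1e9 params * (bits/8) bytes ~= GB
    return weights_gb + overhead_gb

for name, n in [("8B", 8), ("32B", 32), ("70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(n):.1f} GB at 4-bit")
```

The 8B estimate lands around 5.5 GB, consistent with the ~6 GB figure above; longer contexts and higher-bit quantization push these numbers up quickly.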
Method 1: Using Ollama (Simplest, Recommended)
Ollama is currently the most popular tool for running local LLMs.
1. Install Ollama
Download the installer from ollama.com (macOS and Windows), or on Linux run the official install script: `curl -fsSL https://ollama.com/install.sh | sh`.
2. Run DeepSeek Models
Open a terminal and pick a command based on your setup:

```shell
# 8B version (works on most computers)
ollama run deepseek-r1:8b

# 32B version (24GB VRAM, or an M-series Mac with 32GB+)
ollama run deepseek-r1:32b

# 70B version (e.g. dual RTX 3090/4090)
ollama run deepseek-r1:70b
```
3. Test Reasoning Capabilities
DeepSeek-R1’s signature feature is that it “thinks” (Chain of Thought). Try asking a logic puzzle:
“A pound of cotton and a pound of iron—which has larger volume? Reason step by step.”
You’ll see it first output content wrapped in <think> tags, showing its detailed reasoning process, then give a conclusion.
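If you call the model programmatically, you will usually want to split that reasoning block off from the final answer. A small helper sketch (the sample string here is made up and far shorter than a real R1 trace):

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning block from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()  # no reasoning block found
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

sample = ("<think>Both weigh one pound, but cotton is far less dense.</think>\n"
          "The cotton has the larger volume.")
reasoning, answer = split_think(sample)
print(answer)  # → The cotton has the larger volume.
```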
Method 2: Using vLLM (High-Performance Deployment)
If you’re a developer wanting to deploy a high-concurrency API service, vLLM is a better choice.
1. Install vLLM
```shell
pip install vllm
```
2. Start the Server
```shell
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --trust-remote-code \
    --port 8000
```
3. Call the API
Now your local machine becomes an OpenAI-compatible API server:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Write a Python snake game"}],
)
print(response.choices[0].message.content)
```
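For a coding prompt like this one, R1 typically wraps its reasoning in `<think>` tags and the final program in a fenced code block, so a little post-processing helps if you only want runnable code. A hypothetical helper (`extract_code_blocks` is not part of the OpenAI SDK, and the exact response shape is an assumption):

```python
import re

FENCE = "`" * 3  # literal triple backtick, built indirectly for readability here

def extract_code_blocks(text: str, lang: str = "python") -> list[str]:
    """Pull fenced code blocks (e.g. ```python ... ```) out of a model reply."""
    pattern = rf"{FENCE}{lang}\n(.*?){FENCE}"
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]

reply = f"Here is the game:\n{FENCE}python\nprint('snake')\n{FENCE}\nEnjoy!"
blocks = extract_code_blocks(reply)
print(blocks[0])  # → print('snake')
```

You could then write `blocks[0]` straight to a `.py` file and run it.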
DeepSeek vs Other Models
| Model | Strengths | Speed | Hardware Requirements |
|---|---|---|---|
| DeepSeek-V3 | General chat, multilingual | Fast (MoE) | High |
| DeepSeek-R1 | Math, coding, logical reasoning | Slow (long thinking) | Medium |
| Llama 3 | General chat, creative writing | Medium | Low |
Summary
DeepSeek's emergence breaks the monopoly of closed-source models. R1's reasoning capabilities show that reinforcement learning (RL) can dramatically raise the ceiling of what smaller models can do. Now, run this frontier-class AI right in your own terminal.
The transparency of R1's visible thinking process is also an important step toward more explainable AI.