Whisper.cpp Tutorial: Ultimate Speech Recognition That Runs on CPU
A high-performance C++ port of OpenAI Whisper. Enable real-time speech-to-text on your Mac and Raspberry Pi.
If OpenAI Whisper is the crown jewel of speech recognition, then Whisper.cpp pries that jewel off and sets it into everyone’s keychain.
Whisper’s official implementation relies on PyTorch, which is VRAM-hungry and slow on CPU. Georgi Gerganov’s whisper.cpp is a complete C/C++ rewrite with no external dependencies, and it runs smoothly even on iPhones and Raspberry Pis.
Why Choose Whisper.cpp?
- Zero Dependencies: No PyTorch, no Python, not even a GPU needed.
- Extreme Performance: With hand-tuned ARM NEON (Apple Silicon) and x86 AVX2 optimizations, an M1 Mac can transcribe a minute of audio in a few seconds with the base model.
- Cross-Platform: Linux, Mac, Windows, iOS, Android, WebAssembly… if you can think of it, it runs there.
Installation Guide (macOS/Linux)
1. Clone the Repository
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
2. Download Models
Whisper.cpp uses models converted to the ggml binary format. The script below downloads pre-converted models for you.
# Download base English model (~140MB)
bash ./models/download-ggml-model.sh base.en
# Or download multilingual version (supports Chinese)
bash ./models/download-ggml-model.sh small
3. Compile
make
That’s it. After compilation, you’ll see a main executable in the directory. (Recent versions of the project build with CMake instead and name the binary whisper-cli.)
Hands-On: Speech-to-Text
Prepare a test.wav file containing speech (it must be a 16 kHz, 16-bit PCM WAV).
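If your audio is in another format, converting it with ffmpeg (`ffmpeg -i input.mp3 -ar 16000 -ac 1 test.wav`) is the usual route. As a quick illustration of the required container format only, this stdlib-Python sketch writes a valid 16 kHz, 16-bit, mono WAV (a sine tone, so the transcript will be empty; use real speech for meaningful output):

```python
import math
import struct
import wave

# Write a 16 kHz, 16-bit, mono WAV -- the input format whisper.cpp expects.
# (A sine tone only demonstrates the format; transcription needs real speech.)
with wave.open("test.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz sample rate
    for i in range(16000):   # one second of a 440 Hz tone
        sample = int(20000 * math.sin(2 * math.pi * 440 * i / 16000))
        w.writeframes(struct.pack("<h", sample))
```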
# Run transcription
./main -m models/ggml-small.bin -f test.wav -l en
Parameter Explanation:
- -m: Specify the model file.
- -f: Input audio file.
- -l en: Specify the language as English (if omitted, it auto-detects, but specifying is faster).
- -otxt: Output as a txt file (also supports -ovtt for WebVTT and -osrt for SRT subtitles).
Advanced: Real-Time Dictation
Whisper.cpp provides a stream tool that uses your microphone for real-time dictation. It depends on SDL2 for audio capture, so install that first (e.g. brew install sdl2 or apt install libsdl2-dev).
# Compile stream tool
make stream
# Start real-time dictation
./stream -m models/ggml-small.bin -l en --step 500 --length 5000
Now speak into your microphone, and text appears in real-time in your terminal!
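The two flags control a sliding window: --step is how much fresh audio (in ms) triggers each inference pass, and --length caps how much trailing context is kept. A minimal Python sketch of that bookkeeping (an illustration of what the parameters mean, not whisper.cpp’s actual code):

```python
# Sliding-window bookkeeping implied by --step 500 --length 5000,
# assuming 16 kHz input audio.
SAMPLE_RATE = 16000
step_ms, length_ms = 500, 5000

step_samples = SAMPLE_RATE * step_ms // 1000      # fresh audio per pass
window_samples = SAMPLE_RATE * length_ms // 1000  # max context per pass

window = []

def push_chunk(chunk):
    """Append a fresh chunk, keeping at most `length_ms` of trailing context."""
    window.extend(chunk)
    del window[:-window_samples]  # drop samples older than the window
    return len(window)

# Every 500 ms the mic delivers step_samples samples; the window grows
# until it reaches 5 s of audio, then slides forward.
for _ in range(20):
    push_chunk([0] * step_samples)
```

Smaller --step values feel more responsive but re-run the model more often; a longer --length gives the model more context at the cost of latency.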
Technical Deep Dive: Why So Fast?
Whisper.cpp’s magic lies in the GGML tensor library (also the core of llama.cpp).
- 4-bit / 8-bit Quantization: Dramatically reduces model size and memory bandwidth requirements.
- SIMD Optimization: Hand-written vector intrinsics for ARM NEON and x86 AVX2.
- Hybrid Compute: On Apple Silicon, the optional Core ML backend can offload the encoder to the Neural Engine while the rest runs on CPU.
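To see why quantization helps, here is a minimal Python sketch of symmetric 8-bit quantization: each float becomes one int8 plus a shared scale, roughly a 4x saving over float32. (ggml’s real schemes are block-wise and more elaborate; this only illustrates the core idea.)

```python
# Symmetric 8-bit quantization: store int8 values plus one float scale.
# A simplified illustration of the idea behind ggml's quantized formats.

def quantize_q8(weights):
    """Map floats to int8 codes in [-127, 127] with a shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return scale, [round(w / scale) for w in weights]

def dequantize_q8(scale, codes):
    """Recover approximate floats from the int8 codes."""
    return [scale * c for c in codes]

weights = [0.12, -1.27, 0.64, 0.003, -0.5]
scale, codes = quantize_q8(weights)
restored = dequantize_q8(scale, codes)

# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Besides shrinking the file, the smaller weights mean far less memory traffic per token, which is often the real bottleneck on CPUs.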
Use Cases
- Podcast Subtitles: A 1-hour podcast transcribes in 1-2 minutes on an M2 Mac.
- Meeting Notes: Combined with stream mode, build a completely private meeting transcription assistant.
- Edge Computing: Add voice command recognition to Raspberry Pi monitoring devices.
Whisper.cpp redefines what’s possible with “edge AI.” It shows that not all AI needs expensive H100s: careful code optimization can be just as powerful.