Industrial-grade speech recognition: 170x realtime, 50+ languages, speaker diarization, emotion detection.
Industrial speech recognition. 170x realtime, about 13x faster than Whisper-large-v3. 50+ languages.
Speaker diarization · Emotion detection · Streaming · One API call
Quick Start
No local setup? Open the Colab quickstart to transcribe a public sample or upload your own audio in a browser.
pip install torch torchaudio
pip install funasr
Flagship model — Fun-ASR-Nano (LLM-ASR, 31 languages; the default recommendation, needs a GPU):
from funasr import AutoModel
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", device="cuda")
result = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav")
print(result[0]["text"])
# 欢迎大家来体验达摩院推出的语音识别模型。
# (English: "Welcome, everyone, to try the speech recognition model released by DAMO Academy.")
On CPU (or for multilingual + emotion in one pass), use SenseVoice — which also returns speaker diarization and timestamps:
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda") # use device="cpu" if you don't have a GPU
result = model.generate(
    input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
    batch_size_s=300,
)
# One call returns VAD segments with speaker id + timestamps; render them however you like:
for seg in result[0]["sentence_info"]:
    print(f"[{seg['start']/1000:.1f}s] Speaker {seg['spk']}: {rich_transcription_postprocess(seg['sentence'])}")
Output — structured text with speaker labels, timestamps, and punctuation:
[0.6s] Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型
That's it. One AutoModel pipeline, one call: VAD segmentation, speech recognition, punctuation, and speaker diarization all happen automatically.
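If you want speaker turns rather than one line per VAD segment, consecutive segments from the same speaker can be merged after the call. The following is a minimal sketch, not part of FunASR: `merge_speaker_turns` and `render` are hypothetical helpers, and the sample data is made up to mirror the `start` (milliseconds), `spk`, and `sentence` keys used in the loop above.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments from the same speaker into one turn.

    Assumes each segment has 'start' (ms), 'spk', and 'sentence' keys,
    as in the quick-start loop. Input dicts are not modified.
    """
    turns: List[Dict] = []
    for seg in segments:
        if turns and turns[-1]["spk"] == seg["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["sentence"] += " " + seg["sentence"]
        else:
            # New speaker: open a new turn that starts at this segment.
            turns.append(
                {"start": seg["start"], "spk": seg["spk"], "sentence": seg["sentence"]}
            )
    return turns


def render(turns: List[Dict]) -> List[str]:
    """Format turns in the same style as the quick-start output."""
    return [f"[{t['start'] / 1000:.1f}s] Speaker {t['spk']}: {t['sentence']}" for t in turns]


# Hypothetical sample data, shaped like result[0]["sentence_info"].
sample = [
    {"start": 600, "spk": 0, "sentence": "Hello."},
    {"start": 2100, "spk": 0, "sentence": "Welcome to the demo."},
    {"start": 5300, "spk": 1, "sentence": "Thanks."},
]

for line in render(merge_speaker_turns(sample)):
    print(line)
# [0.6s] Speaker 0: Hello. Welcome to the demo.
# [5.3s] Speaker 1: Thanks.
```

Joining with a space suits English text; for Chinese or Japanese transcripts you would likely join with an empty string instead.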
Scale & deploy the flagship
At scale, accelerate Fun-ASR-Nano with vLLM (batch processing):
from funasr.auto.auto_model_vllm import AutoModelVLLM
model = AutoModelVLLM(model="FunAudioLLM/Fun-ASR-Nano-2512", tensor_parallel_size=1)
results = model.generate(["audio1.wav", "audio2.wav"], language="auto")
Deploy as API server:
funasr-server --device cuda
# → OpenAI-compatible endpoint at localhost:8000
Use with AI agents: MCP Server for Claude/Cursor · OpenAI API for LangChain/Dify/AutoGen
Why FunASR?
| | FunASR | Whisper | Cloud APIs |
|---|---|---|---|
| Speed | 170x realtime | 13x realtime | ~1x realtime |
| Speaker ID | ✅ Built-in | ❌ Needs pyannote | ✅ Extra cost |
| Emotion | ✅ Happy/Sad/Angry | ❌ | ❌ |
| Languages | 50+ | 57 | Varies |
| Streaming | ✅ WebSocket | ❌ | ✅ |
| vLLM Acceleration | ✅ up to 16x faster | ❌ | N/A |
| Self-hosted | ✅ MIT license | ✅ MIT license | ❌ Cloud only |
| Cost | Free | Free | $0.006/min+ |
| CPU viable | ✅ 17x realtime | ❌ Too slow | N/A |
Trying FunASR for the first time? Use the Colab quickstart before setting up a local environment. Choosing a first model? Start with the model selection guide. Planning a switch from Whisper or a cloud ASR provider? Use the migration guide and benchmark example to test representative audio, map features, and roll out safely.
Benchmark
184 long-form audio files (192 min). Full report →
| Model | Chinese CER ↓ | GPU Speed | CPU Speed | vs Whisper-large-v3 |
|---|---|---|---|---|
| Fun-ASR-Nano (vLLM) | 8.20% | 340x realtime | — | 🚀 26x faster |
| SenseVoice-Small | 7.81% | 170x realtime | 17x realtime | 🚀 13x faster |
| Paraformer-Large | 10.18% | 120x realtime | 15x realtime | 🚀 9x faster |
| Whisper-large-v3-turbo | 21.71% | 46x realtime | ❌ | 3.4x faster |
| Whisper-large-v3 | 20.02% | 13x realtime | ❌ | baseline |
Key takeaway: SenseVoice-Small and Paraformer-Large run faster on CPU (17x and 15x realtime) than Whisper-large-v3 runs on GPU (13x realtime).
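To make the table concrete, the speed figures can be turned into wall-clock time for the 192-minute test set. This is a small arithmetic sketch that assumes "Nx realtime" means N seconds of audio processed per second of wall-clock time; the function names are illustrative and the numbers are copied from the GPU Speed column.

```python
def processing_seconds(audio_minutes: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_minutes` of audio."""
    return audio_minutes * 60.0 / realtime_factor


def speedup(realtime_factor: float, baseline_factor: float) -> float:
    """How many times faster one model is than a baseline model."""
    return realtime_factor / baseline_factor


# GPU speeds (x realtime) from the benchmark table.
GPU_SPEED = {
    "Fun-ASR-Nano (vLLM)": 340,
    "SenseVoice-Small": 170,
    "Paraformer-Large": 120,
    "Whisper-large-v3": 13,
}

AUDIO_MINUTES = 192  # size of the benchmark set
baseline = GPU_SPEED["Whisper-large-v3"]

for name, factor in GPU_SPEED.items():
    secs = processing_seconds(AUDIO_MINUTES, factor)
    print(f"{name}: {secs:.0f} s, {speedup(factor, baseline):.1f}x vs Whisper-large-v3")
```

At 170x realtime the full set takes roughly 68 seconds, against roughly 15 minutes at 13x realtime.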
What's new
- 2026/06/20: llama.cpp / GGUF runtime — run SenseVoice / Paraformer / Fun-ASR-Nano on CPU & edge as a single self-contained binary (a whisper.cpp-style alternative), built-in FSMN-VAD, no Python at runtime. Prebuilt binaries for Linux / macOS / Windows + q8 quantized models (~half the size, same accuracy). runtime/llama.cpp/ · Releases
- 2026/06/21: v1.3.12 on PyPI - rolling fixes (qwen3-asr language codes, glm_asr, vLLM repetition_penalty). pip install --upgrade funasr
- 2026/05/24: vLLM Inference Engine - 2-3x faster LLM decoding for Fun-ASR-Nano. Streaming WebSocket service with VAD + Speaker Diarization. Guide →
- 2026/05/24: Dynamic VAD - adaptive silence threshold (default on). Short sentences stay intact, long segments get auto-split. Details →
- 2026/05/24: v1.3.3 - funasr-server CLI, OpenAI-compatible API, MCP Server for AI agents. pip install --upgrade funasr
- 2026/05/20: Added Qwen3-ASR (0.6B/1.7B) - 52 languages, auto detection. usage
- 2026/05/20: Added GLM-ASR-Nano (1.5B) — 17 languages, dialect support. usage
- 2026/05/19: Fun-ASR-Nano and SenseVoice now support speaker diarization.
- 2025/12/15: Fun-ASR-Nano-2512 — 31 languages, tens of millions of hours training.
Older
- 2024/10/10: Whisper-large-v3-turbo support added.
- 2024/07/04: SenseVoice — ASR + emotion + audio events.
- 2024/01/30: FunASR 1.0 released.
Installation
pip install funasr
From source / Requirements
git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip install -e ./
Requirements: Python ≥ 3.8. Install PyTorch + torchaudio first (pytorch.org), then pip install funasr.
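Before the first run it can help to confirm that the interpreter and the prerequisite packages are in place. The check below is a sketch using only the standard library; `missing_modules` and `python_ok` are hypothetical helper names, and the module list simply mirrors the requirements stated above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the requirements
REQUIRED = ("torch", "torchaudio", "funasr")


def python_ok(version=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= minimum


def missing_modules(names=REQUIRED) -> list:
    """Return the top-level modules that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if not python_ok():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
missing = missing_modules()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Environment looks ready")
```

`find_spec` only locates the packages; it does not verify that PyTorch was built with CUDA support.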
Model Zoo
| Model | Task | Languages | Params | Links |
|---|---|---|---|---|
| Fun-ASR-Nano | ASR + timestamps | 31 languages | 800M | ⭐ 🤗 |
| SenseVoiceSmall | ASR + emotion + events | zh/en/ja/ko/yue | 234M | ⭐ 🤗 |
| Paraformer-zh | ASR + timestamps | zh/en | 220M | ⭐ 🤗 |
| Paraformer-zh-streaming | Streaming ASR | zh/en | 220M | ⭐ 🤗 |
| Qwen3-ASR | ASR, 52 languages | multilingual | 1.7B | usage |
| GLM-ASR-Nano | ASR, 17 languages | multilingual | 1.5B | usage |
| Whisper-large-v3 | ASR + translation | multilingual | 1550M | usage |
| Whisper-large-v3-turbo | ASR + translation | multilingual | 809M | usage |
| ct-punc | Punctuation | zh/en | 290M | ⭐ 🤗 |
| fsmn-vad | VAD | zh/en | 0.4M | ⭐ 🤗 |
| cam++ | Speaker diarization | — | 7.2M | ⭐ 🤗 |
| emotion2vec+large | Emotion recognition | — | 300M | ⭐ 🤗 |
Usage
Full examples with parameter docs: Tutorial →
from funasr import AutoModel
# Chinese production (VAD + ASR + punctuation + speaker)
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", spk_model="cam++", device="cuda")
result = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav", hotword="关键词 20")
# 31 languages with timestamps
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512",
                  vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, device="cuda")
result = model.generate(input="audio.wav", batch_size=1)
# Streaming real-time (feed audio chunk by chunk)
import soundfile as sf
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
audio, sr = sf.read("speech.wav", dtype="float32") # 16 kHz mono
chunk_size = [0, 10, 5] # 600 ms chunks
chunk_stride = chunk_size[1] * 960
cache = {}
n_chunks = (len(audio) - 1) // chunk_stride + 1
for i in range(n_chunks):
    chunk = audio[i * chunk_stride : (i + 1) * chunk_stride]
    res = model.generate(input=chunk, cache=cache, is_final=(i == n_chunks - 1),
                         chunk_size=chunk_size, encoder_chunk_look_back=4, decoder_chunk_look_back=1)
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)
# Emotion recognition
model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")
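The chunk arithmetic in the streaming example above can be checked on its own, without loading a model. The sketch below assumes 16 kHz mono audio, as the comment in that example states, so the 960-sample unit corresponds to 60 ms and `chunk_size[1] = 10` gives 600 ms chunks. `chunk_bounds` and `chunk_ms` are hypothetical helpers written for illustration.

```python
SAMPLE_RATE = 16_000   # Hz, assumed from the "16 kHz mono" comment
FRAME_SAMPLES = 960    # 60 ms at 16 kHz


def chunk_ms(chunk_size=(0, 10, 5)) -> float:
    """Duration of one streaming chunk in milliseconds."""
    return chunk_size[1] * FRAME_SAMPLES * 1000 / SAMPLE_RATE


def chunk_bounds(n_samples: int, chunk_size=(0, 10, 5)) -> list:
    """Return (start, end, is_final) sample ranges, mirroring the loop above."""
    stride = chunk_size[1] * FRAME_SAMPLES
    n_chunks = (n_samples - 1) // stride + 1  # ceiling division, 0 for empty input
    return [
        (i * stride, min((i + 1) * stride, n_samples), i == n_chunks - 1)
        for i in range(n_chunks)
    ]


print(chunk_ms())  # 600.0
for start, end, is_final in chunk_bounds(2 * SAMPLE_RATE):  # 2 s of audio
    print(start, end, is_final)
# 0 9600 False
# 9600 19200 False
# 19200 28800 False
# 28800 32000 True
```

The last chunk is usually shorter than the stride, which is why the loop passes `is_final=True` on it so the model can flush its cache.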
CLI (Agent-Friendly)
# Transcribe audio (simplest)
funasr audio.wav
# JSON output (for AI agents)
funasr audio.wav --output-format json
# SRT subtitles
funasr audio.wav --output-format srt --output-dir ./subs
# Speaker diarization + timestamps
funasr audio.wav --spk --timestamps -f json
# Choose model and language
funasr audio.wav --model paraformer --language zh
# Batch transcribe
funasr *.wav --output-format srt --output-dir ./output
Available models: sensevoice (default), paraformer, paraformer-en, fun-asr-nano
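When calling the CLI from a script or agent, it is often safer to build the argument list in code than to concatenate a shell string. The sketch below uses only the flags shown above; `funasr_command` is a hypothetical helper, not part of the package. Because no shell is involved when an argument list is passed to `subprocess`, the `*.wav` glob has to be expanded in Python.

```python
import glob
import shlex
from typing import Iterable, List, Optional


def funasr_command(
    files: Iterable[str],
    output_format: Optional[str] = None,
    output_dir: Optional[str] = None,
    model: Optional[str] = None,
    language: Optional[str] = None,
    spk: bool = False,
    timestamps: bool = False,
) -> List[str]:
    """Build an argv list for the `funasr` CLI from the flags documented above."""
    cmd = ["funasr", *files]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    if spk:
        cmd.append("--spk")
    if timestamps:
        cmd.append("--timestamps")
    if output_format:
        cmd += ["--output-format", output_format]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


# Expand the glob in Python; a list passed to subprocess is not shell-expanded.
wav_files = sorted(glob.glob("*.wav"))
cmd = funasr_command(wav_files, output_format="srt", output_dir="./output")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```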
Deploy
# OpenAI-compatible API (recommended)
pip install torch torchaudio
pip install funasr vllm fastapi uvicorn python-multipart
funasr-server --device cuda
# → POST /v1/audio/transcriptions at localhost:8000
Verify it with a public sample:
curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@sample.wav \
-F model=sensevoice \
-F response_format=verbose_json
# Docker streaming service
docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.12
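The curl verification above can also be reproduced from Python without extra dependencies. This is a sketch under assumptions: it sends the same three form fields (`file`, `model`, `response_format`) to the same endpoint, `encode_multipart` and `transcribe` are hypothetical helpers, and the shape of the returned JSON is not specified here, so the parsed response is returned as-is.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str,
               url: str = "http://localhost:8000/v1/audio/transcriptions",
               model: str = "sensevoice",
               response_format: str = "verbose_json"):
    """POST an audio file to a running funasr-server and return the parsed JSON."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            {"model": model, "response_format": response_format},
            "file", os.path.basename(path), f.read(),
        )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Requires a server started with `funasr-server --device cuda`:
# print(transcribe("sample.wav"))
```

Since the endpoint is described as OpenAI-compatible, an existing OpenAI client library pointed at this base URL may be a simpler option in practice.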
CPU / Edge — llama.cpp / GGUF (no GPU, no Python)
Run SenseVoice / Paraformer / Fun-ASR-Nano as a single self-contained binary on CPU and edge devices — this is to FunASR what whisper.cpp is to Whisper, but with ~3× lower CER than whisper.cpp on Chinese. Built-in FSMN-VAD, no Python at runtime.
# 1) Grab a prebuilt binary from Releases (Linux / macOS / Windows), then:
bash download-funasr-model.sh sensevoice ./gguf # or: paraformer | nano
llama-funasr-sensevoice -m ./gguf/SenseVoiceSmall-f16.gguf --vad ./gguf/fsmn-vad.gguf -a audio.wav
# → 欢迎大家来体验达摩院推出的语音识别模型
Prebuilt binaries: Releases · Download & quickstart: funasr.com/llama-cpp · GGUF models: Hugging Face · Docs & benchmarks: runtime/llama.cpp/
OpenAI API example → · Gradio demo → · Client recipes → · JavaScript/TypeScript recipes → · Kubernetes template → · Workflow recipes → · Postman collection → · OpenAPI spec → · Security guide → · Deployment matrix → · Deployment docs → · Agent integration →
Community
| 📖 Documentation | 🐛 Issues |
| 💬 Discussions | 🤗 HuggingFace |
| 🤝 Contributing | 🌐 funasr.com |
License
Citations
@inproceedings{gao2023funasr,
author={Zhifu Gao and others},
title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
booktitle={INTERSPEECH},
year={2023}
}
Download files
File details
Details for the file funasr-1.3.14.tar.gz.
File metadata
- Download URL: funasr-1.3.14.tar.gz
- Upload date:
- Size: 750.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fed214b60300f13470749956df0de9a1c9c213e53ceccf22b2e6f70b5fff5dfb |
| MD5 | d5d16d31e302a6d1a22ad3744acab702 |
| BLAKE2b-256 | b85046a5f1b4eb369943bb6177490a57e368a6b66075d39727575f07a5813973 |
File details
Details for the file funasr-1.3.14-py3-none-any.whl.
File metadata
- Download URL: funasr-1.3.14-py3-none-any.whl
- Upload date:
- Size: 926.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd2d451a323ce0d1bda0566bc9ca4224b4429bf46d55882cf35ad482118173fd |
| MD5 | 9670c4d7a7ae2f4b7f3f1fc1023f8f39 |
| BLAKE2b-256 | cbf9cda21e7a12d12889774191267b0348379ed5ab8d894d13cd239acd4538dc |