ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription". The paper was accepted with reviewer scores of 4/4/4.
https://github.com/user-attachments/assets/aba64174-f965-43f2-92a2-7391fb0dba5c
Introduction
ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a chunk-wise processing mechanism with relative right context and employs the Masked Batch technique to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
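To make the chunk-wise idea more concrete, the sketch below computes, for each chunk, the span of frames it could see given a left and right context size. It is an illustration only: the function name `chunk_windows` is hypothetical, indices are assumed to be encoder frames, and the actual model may handle context differently inside the encoder.

```python
from typing import List, Tuple


def chunk_windows(
    num_frames: int,
    chunk_size: int,
    left_context_size: int,
    right_context_size: int,
) -> List[Tuple[int, int, int, int]]:
    """Return (ctx_start, chunk_start, chunk_end, ctx_end) for each chunk.

    All values are frame indices; end indices are exclusive.
    This is a hypothetical helper, not part of the chunkformer package.
    """
    windows = []
    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        # Context is clipped at the boundaries of the recording.
        ctx_start = max(0, chunk_start - left_context_size)
        ctx_end = min(num_frames, chunk_end + right_context_size)
        windows.append((ctx_start, chunk_start, chunk_end, ctx_end))
    return windows


windows = chunk_windows(
    num_frames=300, chunk_size=64, left_context_size=128, right_context_size=128
)
for window in windows:
    print(window)
```

Each chunk is processed with a bounded window, so the working set per chunk does not grow with the total length of the recording.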
Key Features
- Transcribing Extremely Long Audio: ChunkFormer can transcribe audio recordings up to 16 hours in length with results comparable to existing models. It is currently the first model capable of handling this duration.
- Efficient Decoding on Low-Memory GPUs: ChunkFormer can handle long-form transcription on GPUs with limited memory without losing context or introducing a mismatch with the training setup.
- Masked Batching Technique: ChunkFormer efficiently removes the need for padding in batches with highly variable lengths. For instance, decoding a batch containing audio clips of 1 hour and 1 second costs only 1 hour + 1 second of computational and memory usage, instead of 2 hours due to padding.
| GPU Memory | Total Batch Duration (minutes) |
|---|---|
| 80GB | 980 |
| 24GB | 240 |
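The padding arithmetic behind the Masked Batching example (a 1-hour clip batched with a 1-second clip) can be checked with a short sketch. The two helper functions are hypothetical and only count seconds of audio; they do not model the actual memory behaviour of the model.

```python
from typing import Sequence


def padded_cost(durations: Sequence[float]) -> float:
    """Seconds processed when every clip is padded to the longest clip."""
    return len(durations) * max(durations) if durations else 0.0


def masked_cost(durations: Sequence[float]) -> float:
    """Seconds processed when only real audio is counted (no padding)."""
    return float(sum(durations))


batch = [3600.0, 1.0]  # one 1-hour clip and one 1-second clip
print(padded_cost(batch))  # 7200.0 seconds, i.e. 2 hours
print(masked_cost(batch))  # 3601.0 seconds, i.e. 1 hour + 1 second
```

Under the same accounting, the 240-minute budget in the table equals 14,400 seconds, which is the value the examples in this document pass as `total_batch_duration`.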
Installation
Option 1: Install from PyPI (Recommended)
pip install chunkformer
Option 2: Install from source
# Clone the repository
git clone https://github.com/your-username/chunkformer.git
cd chunkformer
# Install in development mode
pip install -e .
Pretrained Models
| Language | Model |
|---|---|
| Vietnamese | |
| Vietnamese | |
| English | |
Usage
Feature Extraction
from chunkformer import ChunkFormerModel
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)
x, x_len = model._load_audio_and_extract_features("path/to/audio") # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)
# Extract encoder features
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
)
print("feature: ", feature.shape)
print("feature_len: ", feature_len)
Python API
Classification
ChunkFormer also supports speech classification tasks (e.g., gender, dialect, emotion, age recognition).
from chunkformer import ChunkFormerModel
# Load a pre-trained classification model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("path/to/classification/model")
# Single audio classification
result = model.classify_audio(
    audio_path="path/to/audio.wav",
    chunk_size=-1,  # -1 for full attention
    left_context_size=-1,
    right_context_size=-1,
)
print(result)
Transcription
from chunkformer import ChunkFormerModel
# Load a pre-trained encoder from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")
# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True,
)
print(transcription)
# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800,  # total batch duration in seconds
)
for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
Command Line
Long-Form Audio Transcription
To transcribe a single long-form audio file, run the command below. Accepted audio file extensions are ".mp3", ".wav", ".flac", ".m4a", and ".aac":
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_file path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
Example Output:
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
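If the timestamped lines need to be consumed by another program, they can be parsed back into start time, end time, and text. The sketch below assumes the exact text layout shown in the example output; `Segment` and `parse_segments` are hypothetical names and are not part of the chunkformer package.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text"
_LINE = re.compile(
    r"^\[(\d+):(\d{2}):(\d{2}\.\d+)\] - \[(\d+):(\d{2}):(\d{2}\.\d+)\]: (.*)$"
)


def _to_seconds(hours: str, minutes: str, seconds: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(output: str) -> List[Segment]:
    """Parse timestamped lines; lines in any other format are skipped."""
    segments = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(
            Segment(_to_seconds(h1, m1, s1), _to_seconds(h2, m2, s2), text)
        )
    return segments


sample = (
    "[00:00:01.200] - [00:00:02.400]: this is a transcription example\n"
    "[00:00:02.500] - [00:00:03.700]: testing the long-form audio\n"
)
segments = parse_segments(sample)
for segment in segments:
    print(segment)
```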
Batch Audio Transcription
The data.tsv file must have at least one column named wav. Optionally, a column named txt can be included to compute the Word Error Rate (WER). Output will be saved to the same file.
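A file with this layout can be produced with Python's standard `csv` module. This is a minimal sketch: the helper `write_audio_list`, the file paths, and the transcripts are placeholders, and any further expectations the tool has about quoting or encoding are not covered here.

```python
import csv
import tempfile
from pathlib import Path


def write_audio_list(rows, tsv_path):
    """Write a tab-separated file with a 'wav' column and an optional 'txt' column."""
    has_txt = any("txt" in row for row in rows)
    fieldnames = ["wav", "txt"] if has_txt else ["wav"]
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row.get(name, "") for name in fieldnames})


# Placeholder paths and reference transcripts, for illustration only.
rows = [
    {"wav": "audio1.wav", "txt": "hello world"},
    {"wav": "audio2.wav", "txt": "good morning"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = Path(tmp_dir) / "data.tsv"
    write_audio_list(rows, tsv_path)
    lines = tsv_path.read_text(encoding="utf-8").splitlines()

print(lines[0].split("\t"))
```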
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/data.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
Example Output:
WER: 0.1234
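For reference, WER is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below implements that standard definition for a single utterance; it is not taken from the chunkformer code, and the tool may normalize text or aggregate over utterances differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("this is a test", "this is test"))  # 0.25
```

A value of 0.25 here means one error (a deleted word) against four reference words.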
Classification
To classify a single audio file:
chunkformer-decode \
    --model_checkpoint path/to/classification/model \
    --audio_file path/to/audio.wav
Training
See 🚀 Training Guide 🚀 for complete documentation.
Citation
If you use this work in your research, please cite:
@INPROCEEDINGS{10888640,
author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
doi={10.1109/ICASSP49660.2025.10888640}}
Acknowledgments
This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.