ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription


This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription". The paper was accepted with reviewer scores of 4/4/4.

Ranked #1: Speech Recognition on Common Voice Vi
Ranked #1: Speech Recognition on VIVOS

  • paper.pdf: The ICASSP 2025 paper describing ChunkFormer.
  • reviews.pdf: Reviewers' feedback from the ICASSP review process.
  • rebuttal.pdf: Our rebuttal addressing reviewer concerns.

Introduction

ChunkFormer is an automatic speech recognition (ASR) model designed to process long audio inputs efficiently on low-memory GPUs. It uses a chunk-wise processing mechanism with relative right context and employs the Masked Batch technique to minimize the memory wasted on padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.

[Figure: ChunkFormer architecture]
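To make the chunk-wise idea concrete, the sketch below splits a frame sequence into fixed-size chunks and attaches a bounded left and right context window to each one. It is a simplified illustration, not the library's implementation: the function name `chunk_windows`, the tuple layout, and the choice of frames as the unit are all assumptions made for this example.

```python
from typing import Iterator, Tuple


def chunk_windows(
    num_frames: int, chunk_size: int, left_context: int, right_context: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (ctx_start, chunk_start, chunk_end, ctx_end) for each chunk.

    Hypothetical helper: indices are half-open frame offsets, clipped to
    the bounds of the sequence.
    """
    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        ctx_start = max(0, chunk_start - left_context)
        ctx_end = min(num_frames, chunk_end + right_context)
        yield ctx_start, chunk_start, chunk_end, ctx_end


windows = list(
    chunk_windows(num_frames=200, chunk_size=64, left_context=128, right_context=128)
)
for window in windows:
    print(window)
```

Each chunk only ever sees a bounded window of frames, so the work per chunk stays constant no matter how long the recording is.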

Key Features

  • Transcribing Extremely Long Audio: ChunkFormer can transcribe audio recordings up to 16 hours in length with results comparable to existing models. It is currently the first model capable of handling this duration.
  • Efficient Decoding on Low-Memory GPUs: ChunkFormer can handle long-form transcription on GPUs with limited memory without losing context or introducing a mismatch with the training phase.
  • Masked Batching Technique: ChunkFormer removes the need for padding in batches with highly variable lengths. For instance, decoding a batch containing a 1-hour clip and a 1-second clip costs only 1 hour + 1 second of compute and memory, instead of the 2 hours that padding would require.

GPU Memory | Total Batch Duration (minutes)
80 GB      | 980
24 GB      | 240
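The padding arithmetic from the Masked Batching bullet can be checked with a few lines of Python. This is only a cost model in seconds of audio, not a measurement of the library; the function names `padded_cost` and `packed_cost` are hypothetical.

```python
from typing import Sequence


def padded_cost(durations: Sequence[float]) -> float:
    """Seconds processed when every clip is padded to the longest one."""
    return max(durations) * len(durations)


def packed_cost(durations: Sequence[float]) -> float:
    """Seconds processed when no padding is needed."""
    return sum(durations)


clips = [3600, 1]  # one 1-hour clip and one 1-second clip
print(padded_cost(clips))  # 7200 seconds, i.e. 2 hours
print(packed_cost(clips))  # 3601 seconds, i.e. 1 hour + 1 second
```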

Installation

Option 1: Install from PyPI (Recommended)

pip install chunkformer

Option 2: Install from source

# Clone the repository
git clone https://github.com/your-username/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .
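To confirm that either option worked, you can query the installed package metadata with the standard library. The helper name `installed_version` is hypothetical; the sketch uses only `importlib.metadata` and returns `None` when the package is missing.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("chunkformer"))
```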

Pretrained Models

Language   | Model
Vietnamese | khanhld/chunkformer-large-vie
English    | khanhld/chunkformer-large-en-libri-960h

Usage

Python API Usage

from chunkformer import ChunkFormerModel

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
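`total_batch_duration` is given in seconds, while the GPU memory table in Key Features lists minutes. The conversion helper below assumes the table's figures can be used directly as a budget for this parameter. The dictionary and function names are hypothetical and not part of the library.

```python
# Figures copied from the GPU memory table in Key Features (minutes).
BATCH_MINUTES_BY_GPU_GB = {80: 980, 24: 240}


def batch_duration_seconds(gpu_memory_gb: int) -> int:
    """Return the table's total batch duration for a GPU size, in seconds."""
    return BATCH_MINUTES_BY_GPU_GB[gpu_memory_gb] * 60


print(batch_duration_seconds(24))  # 14400
print(batch_duration_seconds(80))  # 58800
```

The 24 GB figure, 14400 seconds, is the same value passed to `endless_decode` in the example above.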

Command Line Usage

After installation, you can use the command line interface:

Long-Form Audio Testing

To test the model on a single long-form audio file, run the command below. Audio files with the extensions ".mp3", ".wav", ".flac", ".m4a", and ".aac" are accepted:

chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
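If you need the segments as data rather than text, the output lines can be parsed with a regular expression. This sketch assumes every segment line follows exactly the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout shown above. The names `parse_segments` and `to_seconds` are hypothetical helpers, not part of the CLI.

```python
import re
from typing import List, Tuple

# Matches "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" as shown in the example output.
LINE_RE = re.compile(
    r"\[(\d+):(\d{2}):(\d{2}\.\d+)\] - \[(\d+):(\d{2}):(\d{2}\.\d+)\]: (.*)"
)


def to_seconds(hours: str, minutes: str, seconds: str) -> float:
    """Convert the three timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, transcript) for each segment line."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match is None:
            continue  # skip lines that are not transcript segments
        h1, m1, s1, h2, m2, s2, words = match.groups()
        segments.append((to_seconds(h1, m1, s1), to_seconds(h2, m2, s2), words))
    return segments


sample = """\
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
"""
segments = parse_segments(sample)
for start, end, words in segments:
    print(start, end, words)
```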

Batch Transcription Testing

The audio_list.tsv file must have at least one column named wav. Optionally, a column named txt can be included to compute the Word Error Rate (WER). Output will be saved to the same file.
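A file in this layout can be written with Python's `csv` module. The audio paths and reference transcripts below are placeholders, and the sketch assumes that a tab-separated file with a header row is all the tool needs.

```python
import csv
from pathlib import Path

# Placeholder rows: replace with your own audio paths and reference texts.
rows = [
    {"wav": "audio1.wav", "txt": "reference transcript one"},
    {"wav": "audio2.wav", "txt": "reference transcript two"},
]

tsv_path = Path("audio_list.tsv")
with tsv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["wav", "txt"], delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()  # first line: wav<TAB>txt
    writer.writerows(rows)

print(tsv_path.read_text(encoding="utf-8"))
```

Drop the `txt` key from each row (and from `fieldnames`) if you have no reference transcripts and do not need a WER.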

chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/audio_list.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

WER: 0.1234
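For reference, WER is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below implements that textbook definition for a single utterance. How the CLI normalizes text or aggregates over files is not stated here, so treat `word_error_rate` as a hypothetical illustration rather than the tool's own scoring code.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("this is a transcription example",
                      "this is transcription example"))  # 0.2
```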

Training

For training or fine-tuning ChunkFormer models, follow the implementation in this WeNet PR.

Setting Up Your Model for Inference

After training is complete, you need to prepare your model for use with this library. Follow these steps:

Step 1: Create Model Directory

mkdir my_chunkformer_model
cd my_chunkformer_model

Step 2: Copy Required Files

Copy the following files from your training output to the model directory:

  1. Model Checkpoint

    # Supported formats: .pt, .ckpt, .bin.
    # WeNet uses .pt by default
    cp /path/to/your/final.pt pytorch_model.pt
    
  2. Training Configuration

    cp /path/to/your/train.yaml config.yaml
    
  3. CMVN Statistics

    cp /path/to/your/global_cmvn global_cmvn
    
  4. Vocabulary File (_units.txt)

    cp /path/to/your/_units.txt vocab.txt
    

Step 3: Verify Model Structure

Your model directory should look like this:

my_chunkformer_model/
├── pytorch_model.pt (or .ckpt/.bin)
├── config.yaml
├── global_cmvn
└── vocab.txt
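The layout above can also be checked programmatically before loading the model. This is a minimal sketch that only tests for the file names listed in this section. The function name `missing_model_files` is hypothetical and not part of the library.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
REQUIRED_FILES = ["config.yaml", "global_cmvn", "vocab.txt"]
CHECKPOINT_NAMES = ["pytorch_model.pt", "pytorch_model.ckpt", "pytorch_model.bin"]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Exactly which checkpoint format is present does not matter; one is enough.
    if not any((root / name).is_file() for name in CHECKPOINT_NAMES):
        missing.append("pytorch_model.pt (or .ckpt/.bin)")
    return missing


print(missing_model_files("my_chunkformer_model"))
```

An empty list means every expected file is in place.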

Step 4: Test Your Local Model Directory

import chunkformer
model = chunkformer.ChunkFormerModel.from_pretrained("./my_chunkformer_model")
result = model.endless_decode("test_audio.wav")
print(result)

Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}

Acknowledgments

We would like to thank Zalo for providing resources and support for training the model. This work was completed during my tenure at Zalo.

This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.

