ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription". The paper was accepted with reviewer scores of 4/4/4.

paper.pdf: The ICASSP 2025 paper describing ChunkFormer.
reviews.pdf: Reviewers' feedback from the ICASSP review process.
rebuttal.pdf: Our rebuttal addressing reviewer concerns.

Introduction
Key Features
Installation
Usage
Citation
Acknowledgments

Introduction

ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a chunk-wise processing mechanism with relative right context and employs the Masked Batch technique to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.

[Figure: ChunkFormer architecture]

Key Features

Transcribing Extremely Long Audio: ChunkFormer can transcribe audio recordings up to 16 hours in length with results comparable to existing models. It is currently the first model capable of handling this duration.
Efficient Decoding on Low-Memory GPUs: ChunkFormer can handle long-form transcription on GPUs with limited memory without losing context and without introducing a mismatch with the conditions used during training.
Masked Batching Technique: ChunkFormer efficiently removes the need for padding in batches with highly variable lengths. For instance, decoding a batch containing audio clips of 1 hour and 1 second costs only 1 hour + 1 second of computational and memory usage, instead of 2 hours due to padding.

GPU Memory	Total Batch Duration (minutes)
80GB	980
24GB	240
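The padding arithmetic in the Masked Batching bullet can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names `padded_cost` and `masked_cost` are made up for illustration and are not part of the chunkformer package, and "cost" here simply means seconds of audio the model has to process.

```python
def padded_cost(durations_s):
    """Seconds processed when every clip is padded to the longest clip."""
    return max(durations_s) * len(durations_s)


def masked_cost(durations_s):
    """Seconds processed when only real audio is counted (no padding)."""
    return sum(durations_s)


# One 1-hour clip and one 1-second clip, as in the example above.
batch = [3600, 1]
print(padded_cost(batch))  # 7200 seconds, i.e. 2 hours
print(masked_cost(batch))  # 3601 seconds, i.e. 1 hour + 1 second
```

The gap between the two numbers grows with the spread of clip lengths in a batch, which is why the benefit is largest for highly variable batches.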

Installation

Option 1: Install from PyPI (Recommended)

pip install chunkformer

Option 2: Install from source

# Clone the repository
git clone https://github.com/your-username/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .

Checkpoints

Language	Model
Vietnamese	khanhld/chunkformer-large-vie
English	khanhld/chunkformer-large-en-libri-960h

Dependencies

The package will automatically install all required dependencies including PyTorch, transformers, and other necessary libraries.
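To confirm what actually got installed, the standard library's `importlib.metadata` can report the version of each distribution. The sketch below assumes the dependencies are published under the distribution names `torch` and `transformers`; the helper `installed_version` is hypothetical and not part of chunkformer.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names other than "chunkformer" are assumptions.
for name in ("chunkformer", "torch", "transformers"):
    print(name, installed_version(name) or "not installed")
```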

Usage

Python API Usage

import chunkformer

# Option 1: Load a pre-trained model from Hugging Face or local directory
model = chunkformer.ChunkFormerModel.from_pretrained("khanhld/chunkformer-large-vie")

# Option 2: Load from local checkpoint directory 
model = chunkformer.ChunkFormerModel.from_pretrained("path/to/model/checkpoint")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128, 
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")

# For custom configuration
config = chunkformer.ChunkFormerConfig(
    chunk_size=32,
    left_context_size=64,
    right_context_size=64,
    vocab_size=4992
)
model = chunkformer.ChunkFormerModel(config)
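To build intuition for how `chunk_size`, `left_context_size`, and `right_context_size` interact, the following sketch enumerates the window each chunk could attend to. It is a generic illustration of chunked processing with limited context, not the library's implementation; the function `chunk_windows` is hypothetical, and it assumes all sizes are counted in the same unit (for example encoder frames), which the text above does not specify.

```python
def chunk_windows(num_frames, chunk_size, left_context_size, right_context_size):
    """Yield (context_start, chunk_start, chunk_end, context_end) per chunk.

    All bounds are half-open indices clipped to the range [0, num_frames].
    """
    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        context_start = max(0, chunk_start - left_context_size)
        context_end = min(num_frames, chunk_end + right_context_size)
        yield context_start, chunk_start, chunk_end, context_end


# 200 frames with the sizes used in the examples above.
windows = list(chunk_windows(200, chunk_size=64,
                             left_context_size=128, right_context_size=128))
for window in windows:
    print(window)
```

Each chunk sees a bounded neighborhood rather than the whole recording, which is what keeps memory use independent of total audio length in this kind of scheme.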

Command Line Usage

After installation, you can use the command line interface:

chunkformer-decode \
    --model_checkpoint path/to/local/hf/checkpoint/repo \
    --long_form_audio data/common_voice_vi_23397238.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Training the Model

For training/finetuning, follow this PR.

Long-Form Audio Testing

To test the model on a single long-form audio file, run the command below. Audio files with the extensions ".mp3", ".wav", ".flac", ".m4a", and ".aac" are accepted:

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Or using the command line tool:

chunkformer-decode \
    --model_checkpoint path/to/local/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
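When the command line tool is driven from a script, it can help to assemble the argument list programmatically. The sketch below only uses the flags shown above; the helper `build_decode_command` is hypothetical, and its default values simply mirror the example command, except for `total_batch_duration`, whose default of 1800 seconds is the one documented above.

```python
import shlex


def build_decode_command(model_checkpoint, long_form_audio,
                         total_batch_duration=1800, chunk_size=64,
                         left_context_size=128, right_context_size=128):
    """Return the chunkformer-decode invocation as an argument list."""
    return [
        "chunkformer-decode",
        "--model_checkpoint", str(model_checkpoint),
        "--long_form_audio", str(long_form_audio),
        "--total_batch_duration", str(total_batch_duration),  # seconds
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


cmd = build_decode_command("path/to/local/hf/checkpoint/repo",
                           "path/to/audio.wav",
                           total_batch_duration=14400)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Note that 14400 seconds is 240 minutes, the total batch duration listed for a 24GB GPU in the table under Key Features.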

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
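For downstream processing (subtitles, alignment, search), output in this shape can be parsed into structured segments. The sketch assumes every line follows exactly the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout of the example above; `parse_segments` is a hypothetical helper, not a chunkformer function, and lines that do not match are skipped.

```python
import re

# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text".
SEGMENT_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] - "
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]: (.*)$"
)


def to_milliseconds(hours, minutes, seconds, millis):
    """Convert timestamp fields (as strings) to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_segments(text):
    """Return a list of (start_ms, end_ms, transcript) tuples."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a segment line
        fields = match.groups()
        segments.append(
            (to_milliseconds(*fields[0:4]), to_milliseconds(*fields[4:8]), fields[8])
        )
    return segments


sample = """[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio"""
for segment in parse_segments(sample):
    print(segment)
```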

Batch Transcription Testing

The audio_list.tsv file must have at least one column named wav. Optionally, a column named txt can be included to compute the Word Error Rate (WER). Output will be saved to the same file.
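A file with this layout can be produced with the standard `csv` module. This is a minimal sketch: `write_audio_list` is a hypothetical helper, the audio paths and transcripts are placeholders, and the demo writes into a temporary directory so nothing is left behind.

```python
import csv
import os
import tempfile


def write_audio_list(path, rows):
    """Write (wav_path, transcript_or_None) pairs as a tab-separated file.

    The "txt" column is included only if at least one transcript is given.
    """
    rows = list(rows)
    include_txt = any(txt is not None for _, txt in rows)
    fieldnames = ["wav", "txt"] if include_txt else ["wav"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        for wav, txt in rows:
            record = {"wav": wav}
            if include_txt:
                record["txt"] = txt or ""
            writer.writerow(record)


with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "audio_list.tsv")
    write_audio_list(tsv_path, [
        ("audio1.wav", "reference transcript one"),
        ("audio2.wav", "reference transcript two"),
    ])
    with open(tsv_path, encoding="utf-8") as handle:
        header = handle.readline().rstrip("\r\n")

print(header)
```

Since the decoder saves its output to the same file, it may be worth keeping a copy of the original list before decoding.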

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/hf/checkpoint/repo \
    --audio_list path/to/audio_list.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Or using the command line tool:

chunkformer-decode \
    --model_checkpoint path/to/local/hf/checkpoint/repo \
    --audio_list path/to/audio_list.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

WER: 0.1234
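For reference, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The sketch below implements that textbook definition for a single utterance; it is not taken from the chunkformer code, and any text normalization the tool applies before scoring (casing, punctuation) is unknown and not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


print(f"WER: {word_error_rate('this is a test', 'this is test'):.4f}")
```

A corpus-level WER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance values.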

Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}

Acknowledgments

We would like to thank Zalo for providing resources and support for training the model. This work was completed during my tenure at Zalo.

This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.

Release history

Version	Release date
1.2.2	Nov 14, 2025
1.2.1	Oct 10, 2025
1.0.0	Sep 24, 2025
0.1.1	Sep 23, 2025
0.1.0 (this version)	Sep 23, 2025

Download files

File	Type	Size	Uploaded
chunkformer-0.1.0.tar.gz	Source distribution	1.3 MB	Sep 23, 2025
chunkformer-0.1.0-py3-none-any.whl	Built distribution (Python 3)	46.3 kB	Sep 23, 2025

File details

Details for the file chunkformer-0.1.0.tar.gz.

File metadata

Download URL: chunkformer-0.1.0.tar.gz
Upload date: Sep 23, 2025
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for chunkformer-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`966abffaa61ce23258eefbbab73a8f6ed2ecfef79c0add60d2f7faedd2c922fd`
MD5	`2180619d81cbe4bb55f049132b8e99eb`
BLAKE2b-256	`479124e23d62dcf67e462da7d509df8b27a82d4a18b7bd40b7a02656676b8228`


File details

Details for the file chunkformer-0.1.0-py3-none-any.whl.

File metadata

Download URL: chunkformer-0.1.0-py3-none-any.whl
Upload date: Sep 23, 2025
Size: 46.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for chunkformer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`65699c4f134db300de1df4f38e43263cbd1cc9ae6b26459ffca439a973f27ee1`
MD5	`127f5c57da497cd640b62b5eb08dc796`
BLAKE2b-256	`b0e93f353b246f878d3d7cb2415ba852a4be6e5eb4731c5c19d1b2602961bcd7`
