Add language model support to HF Transformers' Whisper models
Whisper-LM-Transformers
KenLM and large language model integration for Whisper ASR models, implemented on top of the Hugging Face Transformers library.
Installation
Install the package from PyPI:
pip install whisper-lm-transformers
Or clone and install locally:
git clone https://github.com/hitz-zentroa/whisper-lm-transformers.git
cd whisper-lm-transformers
pip install .
In addition, a recent version of [KenLM](https://github.com/kpu/kenlm) is required to use n-gram language models:
pip install https://github.com/kpu/kenlm/archive/master.zip
Usage Examples
1) Using Hugging Face Pipeline
The package registers a new pipeline task called "whisper-with-lm". Once the package is imported, you can do:
>>> from transformers import pipeline
>>> from huggingface_hub import hf_hub_download
>>> import whisper_lm_transformers # Required to register the new pipeline
>>> # Download the n-gram model
>>> lm_model = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
>>> # Example: KenLM-based decoding
>>> pipe = pipeline(
... "whisper-with-lm",
... model="zuazo/whisper-tiny-eu",
... lm_model=lm_model, # Provide a kenlm model path
... lm_alpha=0.33582369,
... lm_beta=0.68825565,
... language="eu",
... )
>>> # Transcribe an audio file or array
>>> pipe("tests/data/audio.wav")["text"]
'Talka diskoetxearekin grabatzen ditut beti abestien maketak.'
Note: The example above uses our Basque KenLM model. Optimize the lm_alpha, lm_beta, etc., for best results with your own models.
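The lm_alpha and lm_beta parameters follow the usual shallow-fusion recipe: the LM log-probability is scaled by lm_alpha and a length reward of lm_beta per token is added to the ASR beam score. A minimal pure-Python sketch of that combination (the function name and exact formula here are illustrative, not the package's internal API):

```python
def fused_score(asr_logprob, lm_logprob, n_tokens, lm_alpha, lm_beta):
    """Combine ASR and LM scores as in shallow fusion (illustrative).

    asr_logprob: cumulative log-probability from the Whisper beam.
    lm_logprob:  log-probability of the hypothesis under the LM.
    n_tokens:    hypothesis length, used for the length reward.
    """
    return asr_logprob + lm_alpha * lm_logprob + lm_beta * n_tokens

# A higher lm_alpha makes the LM more influential; lm_beta counteracts
# the fused score's bias toward shorter hypotheses.
short = fused_score(-4.0, -6.0, 3, lm_alpha=0.336, lm_beta=0.688)
longer = fused_score(-4.5, -5.0, 5, lm_alpha=0.336, lm_beta=0.688)
print(short, longer)
```

With these weights the longer, LM-preferred hypothesis wins the beam even though its raw ASR score is lower, which is exactly the effect the alpha/beta optimization tunes for.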
Integrating a Large Language Model
If you prefer a large language model (LLM) instead of an n-gram model:
>>> # Load the pipeline
>>> pipe = pipeline(
... "whisper-with-lm",
... model="zuazo/whisper-tiny-eu",
... llm_model="HiTZ/latxa-7b-v1.2", # Hugging Face LLM name or path
... lm_alpha=2.73329396,
... lm_beta=0.00178595,
... language="eu",
... )
>>> # Transcribe an audio file or array
>>> pipe("tests/data/audio.wav")["text"]
'Talka diskoetxearekin grabatzen ditut beti abestien maketak.'
Caution: Running large LMs side-by-side with Whisper requires sufficient GPU memory.
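As a rough rule of thumb, the weights alone need about parameter-count × bytes-per-parameter of GPU memory, and activations plus the KV cache come on top. A hedged back-of-the-envelope helper (the parameter counts and dtype sizes below are illustrative assumptions, not measurements from this package):

```python
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for model weights only, in GiB.

    bytes_per_param: 2 for float16/bfloat16, 4 for float32.
    Activations, the KV cache, and beam-search buffers are NOT included.
    """
    return n_params * bytes_per_param / 1024**3

# Whisper tiny (~39M params) next to a 7B-parameter LLM in float16:
whisper_tiny = weights_gib(39e6)  # well under 0.1 GiB
llm_7b = weights_gib(7e9)         # roughly 13 GiB for weights alone
print(f"{whisper_tiny:.3f} GiB + {llm_7b:.1f} GiB")
```

In other words, the Whisper model is negligible next to the LLM, so plan GPU capacity around the LLM.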
2) Using the WhisperWithLM Class Directly
If you prefer manual control, you can use the WhisperWithLM class:
>>> from datasets import Audio, load_dataset
>>> from transformers import WhisperProcessor
>>> from whisper.audio import load_audio
>>> from whisper_lm_transformers import WhisperWithLM
>>> # Load the model
>>> model_name = "zuazo/whisper-tiny-eu"
>>> processor = WhisperProcessor.from_pretrained(model_name)
>>> model = WhisperWithLM.from_pretrained(model_name)
>>> # Load an audio example
>>> ds = load_dataset("openslr", "SLR76", split="train", trust_remote_code=True)
>>> audio = load_audio(ds[28]["audio"]["path"])
>>> # Process the audio and generate the output
>>> inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")
>>> generated = model.generate(
... input_features=inputs["input_features"],
... tokenizer=processor.tokenizer,
... lm_model="tests/5gram-eu.bin", # Provide a kenlm model path
... lm_alpha=0.33582369,
... lm_beta=0.68825565,
... num_beams=5,
... language="eu",
... )
>>> processor.decode(generated[0], skip_special_tokens=True)
'Talka diskoetxearekin grabatzen ditut beti abestien maketak.'
Audio Processing Note
In the last example, we used OpenAI's load_audio() function for reproducibility.
You can also use standard Hugging Face audio processing methods, e.g.
ds.cast_column("audio", Audio(sampling_rate=16000)). However, keep sample rates
and preprocessing methods consistent: different audio preprocessing can yield
different internal logits, altering the final LM integration results. For
example, if you optimized the language model with our
whisper-lm repository, which is based on
OpenAI's Whisper implementation, we recommend re-running the optimization with
the scripts provided here for the best results.
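A cheap guard against silent preprocessing mismatches is to validate the audio before feature extraction. A minimal sketch (the 16 kHz requirement is Whisper's standard input rate; the helper function itself is illustrative, not part of this package):

```python
import numpy as np

WHISPER_SR = 16_000  # Whisper models expect 16 kHz mono audio

def check_audio(audio: np.ndarray, sampling_rate: int) -> np.ndarray:
    """Validate audio before feeding it to the processor."""
    if sampling_rate != WHISPER_SR:
        raise ValueError(
            f"Expected {WHISPER_SR} Hz, got {sampling_rate} Hz; resample "
            "first, e.g. with ds.cast_column('audio', Audio(sampling_rate=16000))."
        )
    if audio.ndim != 1:
        raise ValueError("Expected mono audio as a 1-D array.")
    return audio.astype(np.float32)

# One second of silence at the correct rate passes through unchanged:
ok = check_audio(np.zeros(16_000), 16_000)
```

Failing fast here is much easier to debug than a quietly degraded word error rate after LM fusion.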
Included Scripts
The package includes the following scripts:
- whisper_evaluate_with_hf: Evaluates a Whisper model on a dataset.
- whisper_lm_optimizer_with_hf: Optimizes the n-gram or large language model parameters.
Run them with --help to see how to use them.
Contributing
Contributions, bug reports, and feature requests are welcome! Please check out CONTRIBUTING.md for details on how to set up your environment and run tests before submitting changes.
Citation
If you find this helpful in your research, please cite:
@misc{dezuazo2025whisperlmimprovingasrmodels,
title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
year={2025},
eprint={2503.23542},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.23542},
}
Please check the related paper preprint, arXiv:2503.23542, for more details.