Project description

Stabilizing Timestamps for Whisper

This script modifies OpenAI's Whisper to produce more reliable timestamps.

https://user-images.githubusercontent.com/28970749/225825286-cdb14d70-566f-454b-a2b3-b61b4b3e09c9.mp4

What's new in 2.0.0?

updated to use Whisper's more reliable word-level timestamp method
the more reliable word timestamps allow regrouping all words into segments with more natural boundaries
can now suppress silence with Silero VAD (requires PyTorch 1.2.0+)
non-VAD silence suppression is also more robust
see Quick 1.X → 2.X Guide

https://user-images.githubusercontent.com/28970749/225826345-ef7115db-51e4-4b23-aedd-069389b8ae43.mp4

Features

more control over the timestamps than default Whisper
supports direct preprocessing with Demucs to isolate voice
supports dynamic quantization to decrease memory usage for inference on CPU
lower memory usage than default Whisper when transcribing very long input audio tracks

Setup

pip install -U stable-ts

To install the latest commit:

pip install -U git+https://github.com/jianfch/stable-ts.git

Command-line usage

Transcribe audio, then save the result as a JSON file that contains the original inference results. This allows the results to be reprocessed differently without redoing inference. Change audio.json to audio.srt to process the result directly into SRT.

stable-ts audio.mp3 -o audio.json

Process the JSON file of results into SRT.

stable-ts audio.json -o audio.srt
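As a rough illustration of what this JSON-to-SRT step involves, the sketch below formats timed segments as SRT text in plain Python. It assumes the saved JSON resembles Whisper's result layout, with a `segments` list whose entries carry `start`, `end`, and `text`. That layout and the helper names `format_srt_time` and `segments_to_srt` are assumptions made for illustration; they are not stable-ts internals.

```python
import json


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Stand-in for the contents of audio.json; the layout is assumed, not confirmed.
sample = json.loads(
    '{"segments": ['
    '{"start": 0.0, "end": 1.5, "text": " Hello."},'
    '{"start": 1.5, "end": 3.25, "text": " World."}]}'
)
srt_text = segments_to_srt(sample["segments"])
print(srt_text)
```

In practice the command above does this for you; the sketch only shows why keeping the JSON is enough to regenerate subtitles without rerunning inference.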

Transcribe multiple audio files then process the results directly into SRT files.

stable-ts audio1.mp3 audio2.mp3 audio3.mp3 -o audio1.srt audio2.srt audio3.srt
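For larger batches, the same command line can be assembled from a list of inputs. The sketch below only builds the argument list in the shape shown above (inputs first, then `-o` followed by one output per input); the helper name `build_stable_ts_command` is hypothetical, and actually running the command requires stable-ts to be installed.

```python
from pathlib import Path


def build_stable_ts_command(inputs, suffix=".srt"):
    """Pair every input file with an output of the same name and a new suffix."""
    outputs = [str(Path(name).with_suffix(suffix)) for name in inputs]
    return ["stable-ts", *inputs, "-o", *outputs]


command = build_stable_ts_command(["audio1.mp3", "audio2.mp3", "audio3.mp3"])
print(" ".join(command))

# To execute it (needs stable-ts on PATH):
# import subprocess
# subprocess.run(command, check=True)
```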

Python usage

import stable_whisper

model = stable_whisper.load_model('base')
# this modified model runs just like the original model but accepts additional arguments
result = model.transcribe('audio.mp3')

result.to_srt_vtt('audio.srt')
result.to_ass('audio.ass')
# word_level=False : use only segment timestamps (i.e. without the green highlight)
# segment_level=False : use only word timestamps

result.save_as_json('audio.json')
# save inference result for later processing

Tips

for reliable segment timestamps, do not disable word timestamps with word_timestamps=False, because word timestamps are also used to correct segment timestamps
use demucs=True and vad=True for music
if audio is not transcribed properly compared to Whisper, try mel_first=True, at the cost of more memory usage for long audio tracks
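One way to apply these tips consistently is to collect the suggested keyword arguments in one place. The helper below is a hypothetical convenience, not part of stable-ts; it only uses the argument names mentioned in the tips (`demucs`, `vad`, `mel_first`).

```python
def transcribe_options(is_music=False, mel_first=False):
    """Collect the keyword arguments suggested by the tips above."""
    options = {}
    if is_music:
        # Tip: isolate the voice with Demucs and suppress silence with VAD.
        options.update(demucs=True, vad=True)
    if mel_first:
        # Tip: try this when results look worse than plain Whisper;
        # it costs more memory on long audio tracks.
        options["mel_first"] = True
    return options


music_options = transcribe_options(is_music=True)
print(music_options)

# Hypothetical call site (requires a loaded model):
# result = model.transcribe('song.mp3', **music_options)
```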

Quick 1.X → 2.X Guide

results_to_sentence_srt(result, 'audio.srt') → result.to_srt_vtt('audio.srt', word_level=False)
results_to_word_srt(result, 'audio.srt') → result.to_srt_vtt('audio.srt', segment_level=False)
results_to_sentence_word_ass(result, 'audio.ass') → result.to_ass('audio.ass')
there's no need to stabilize segments after inference because they're already stabilized during inference
transcribe() returns a WhisperResult object, which can be converted to a dict with .to_dict(), e.g. result.to_dict()
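If older scripts still call the 1.X function names, thin wrappers over the 2.X methods listed above may ease the transition. This is a sketch, not something shipped by stable-ts: the wrappers reuse the 1.X names purely for illustration, `as_plain_dict` assumes the 1.X code treated the result as a plain dict, and `RecordingResult` is a stand-in so the example runs without loading a model.

```python
def results_to_sentence_srt(result, path):
    """1.X-style name: segment-level subtitles only."""
    result.to_srt_vtt(path, word_level=False)


def results_to_word_srt(result, path):
    """1.X-style name: word-level subtitles only."""
    result.to_srt_vtt(path, segment_level=False)


def results_to_sentence_word_ass(result, path):
    """1.X-style name: combined segment and word output as ASS."""
    result.to_ass(path)


def as_plain_dict(result):
    """Return a dict for code that indexes the result like a dict."""
    return result.to_dict() if hasattr(result, "to_dict") else result


class RecordingResult:
    """Stand-in for a WhisperResult that records calls instead of writing files."""

    def __init__(self):
        self.calls = []

    def to_srt_vtt(self, path, **options):
        self.calls.append(("to_srt_vtt", path, options))

    def to_ass(self, path, **options):
        self.calls.append(("to_ass", path, options))

    def to_dict(self):
        return {"segments": []}


demo = RecordingResult()
results_to_sentence_srt(demo, "audio.srt")
results_to_word_srt(demo, "audio.srt")
results_to_sentence_word_ass(demo, "audio.ass")
print(demo.calls)
```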

Regrouping Words

Stable-ts has a preset for regrouping words into segments with more natural boundaries. This preset is enabled by regroup=True. There are also other built-in regrouping methods that let you customize the regrouping logic; the preset is just a predefined combination of those methods.

https://user-images.githubusercontent.com/28970749/226504985-3d087539-cfa4-46d1-8eb5-7083f235b429.mp4

result0 = model.transcribe('audio.mp3', regroup=True) # regroup is True by default
# regroup=True is the same as the chain below
result1 = model.transcribe('audio.mp3', regroup=False)
(
    result1
    .split_by_punctuation([('.', ' '), '。', '?', '？', ',', '，'])
    .split_by_gap(.5)
    .merge_by_gap(.15, max_words=3)
    .split_by_punctuation([('.', ' '), '。', '?', '？'])
)
# result0 == result1
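To make the gap-based steps in that chain more concrete, here is a simplified, standalone sketch of splitting and merging by gap on plain word lists. It is not stable-ts's implementation: the word layout (`word`, `start`, `end`), the exact comparison rules, and the reading of `max_words` as a cap on each segment being merged are all assumptions of this sketch.

```python
def make_words(triples):
    """Build word dicts from (text, start, end) tuples."""
    return [{"word": w, "start": s, "end": e} for w, s, e in triples]


def split_by_gap(segments, max_gap):
    """Split a segment wherever the pause between adjacent words exceeds max_gap."""
    result = []
    for segment in segments:  # segments are assumed to be non-empty
        current = [segment[0]]
        for previous, word in zip(segment, segment[1:]):
            if word["start"] - previous["end"] > max_gap:
                result.append(current)
                current = []
            current.append(word)
        result.append(current)
    return result


def merge_by_gap(segments, min_gap, max_words=None):
    """Merge neighbouring segments separated by less than min_gap.

    Assumption: a merge only happens when neither segment has more
    than max_words words.
    """
    if not segments:
        return []
    result = [segments[0]]
    for segment in segments[1:]:
        gap = segment[0]["start"] - result[-1][-1]["end"]
        small = max_words is None or (
            len(result[-1]) <= max_words and len(segment) <= max_words
        )
        if gap < min_gap and small:
            result[-1] = result[-1] + segment
        else:
            result.append(segment)
    return result


def texts(segments):
    """Join each segment's words for display."""
    return [" ".join(w["word"] for w in seg) for seg in segments]


one_segment = [make_words([
    ("So", 0.0, 0.2), ("it", 0.25, 0.4),
    ("goes.", 1.2, 1.6), ("Next", 1.7, 2.0), ("line", 2.05, 2.4),
])]
split = split_by_gap(one_segment, 0.5)
print(texts(split))

pair = [make_words([("Hi", 0.0, 0.3)]), make_words([("there", 0.35, 0.7)])]
merged_pair = merge_by_gap(pair, 0.15, max_words=3)
print(texts(merged_pair))
```

The 0.8 s pause before "goes." triggers a split, while the 0.05 s pause between "Hi" and "there" is short enough for the two one-word segments to be merged.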

Visualizing Suppression

Requirement: Pillow or opencv-python

Non-VAD Suppression

import stable_whisper
# regions of the waveform colored red are where audio will likely be suppressed and marked as silent
# [q_levels=20] and [k_size=5] are the defaults for non-VAD.
stable_whisper.visualize_suppression('audio.mp3', 'image.png', q_levels=20, k_size=5)
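The README does not spell out what `q_levels` and `k_size` do. One plausible reading of the names is that loudness is smoothed over a window of `k_size` samples and then quantized into `q_levels` steps, with the lowest step treated as silence. The sketch below implements only that reading in plain Python; it is an assumption for building intuition, and the library's actual algorithm may differ.

```python
def silence_mask(samples, q_levels=20, k_size=5):
    """Return True for every sample this sketch would treat as silent."""
    loudness = [abs(s) for s in samples]
    peak = max(loudness, default=0.0)
    if peak == 0:
        return [True] * len(samples)

    # Smooth loudness with a centered moving average of up to k_size samples.
    half = k_size // 2
    smoothed = []
    for i in range(len(loudness)):
        window = loudness[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))

    # Quantize relative loudness; anything that rounds to level 0 is silent.
    return [round(value / peak * q_levels) == 0 for value in smoothed]


samples = [0.0] * 10 + [0.8, -0.9, 1.0, -0.7, 0.9] + [0.0] * 10
mask = silence_mask(samples)
print("".join("_" if silent else "#" for silent in mask))
```

Under this reading, a larger `k_size` spreads loud regions further into their surroundings, and a smaller `q_levels` makes more quiet audio round down to silence.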

VAD Suppression

# [vad_threshold=0.35] is the default for VAD.
stable_whisper.visualize_suppression('audio.mp3', 'image.png', vad=True, vad_threshold=0.35)
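The general idea of thresholding a voice activity detector's output can be sketched without any model: given per-frame speech probabilities, frames below the threshold are grouped into silent regions. The probabilities, the frame length, and the helper name `silent_regions` below are made up for illustration, and how stable-ts applies `vad_threshold` internally is not described in this README.

```python
def silent_regions(speech_probs, frame_seconds, vad_threshold=0.35):
    """Return (start, end) times of runs of frames below the threshold."""
    regions = []
    run_start = None
    for index, prob in enumerate(speech_probs):
        if prob < vad_threshold:
            if run_start is None:
                run_start = index  # a silent run begins here
        elif run_start is not None:
            regions.append((run_start * frame_seconds, index * frame_seconds))
            run_start = None
    if run_start is not None:
        # The audio ends while still inside a silent run.
        regions.append((run_start * frame_seconds, len(speech_probs) * frame_seconds))
    return regions


probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]  # hypothetical per-frame speech probabilities
regions = silent_regions(probs, frame_seconds=0.5)
print(regions)
```

Raising the threshold in this sketch classifies more frames as silent, which is the trade-off to keep in mind when tuning it against the visualization.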

Encode Comparison

import stable_whisper

stable_whisper.encode_video_comparison(
    'audio.mp3', 
    ['audio_sub1.srt', 'audio_sub2.srt'], 
    output_videopath='audio.mp4', 
    labels=['Example 1', 'Example 2']
)

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments

Includes slight modifications of the original work: Whisper

Project details

Release history

2.19.1 (Aug 16, 2025)
2.19.0 (Mar 25, 2025)
2.18.3 (Jan 29, 2025)
2.18.2 (Jan 16, 2025)
2.18.1 (Jan 9, 2025)
2.18.0 (Dec 28, 2024)
2.17.5 (Oct 13, 2024)
2.17.4 (Sep 12, 2024)
2.17.3 (Jun 1, 2024)
2.17.2 (May 14, 2024)
2.17.1 (May 4, 2024)
2.17.0 (May 3, 2024)
2.16.0 (Apr 14, 2024)
2.15.11 (Apr 1, 2024)
2.15.10 (Mar 26, 2024)
2.15.9 (Mar 8, 2024)
2.15.8 (Feb 28, 2024)
2.15.7 (Feb 26, 2024)
2.15.6 (Feb 12, 2024)
2.15.5 (Feb 8, 2024)
2.15.4 (Feb 2, 2024)
2.15.3 (Jan 31, 2024)
2.15.2 (Jan 28, 2024)
2.15.1 (Jan 27, 2024)
2.15.0 (Jan 27, 2024)
2.14.4 (Jan 14, 2024)
2.14.3 (Jan 11, 2024)
2.14.2 (Dec 31, 2023)
2.14.1 (Dec 29, 2023)
2.14.0 (Dec 29, 2023)
2.13.7 (Dec 8, 2023)
2.13.6 (Dec 4, 2023), yanked
2.13.5 (Nov 27, 2023)
2.13.4 (Nov 20, 2023)
2.13.3 (Nov 14, 2023)
2.13.2 (Nov 7, 2023)
2.13.1 (Oct 29, 2023)
2.13.0 (Oct 21, 2023)
2.12.3 (Oct 16, 2023)
2.12.2 (Oct 15, 2023)
2.12.1 (Oct 14, 2023)
2.12.0 (Oct 11, 2023)
2.11.7 (Oct 5, 2023)
2.11.6 (Oct 3, 2023)
2.11.5 (Oct 3, 2023)
2.11.4 (Sep 29, 2023)
2.11.3 (Sep 23, 2023)
2.11.2 (Sep 22, 2023)
2.11.1 (Sep 22, 2023)
2.11.0 (Sep 21, 2023)
2.10.1 (Sep 14, 2023)
2.10.0 (Sep 12, 2023)
2.9.0 (Aug 18, 2023)
2.8.1 (Aug 5, 2023)
2.8.0 (Aug 4, 2023), yanked
2.7.2 (Jul 28, 2023)
2.7.1 (Jul 20, 2023)
2.7.0 (Jul 13, 2023)
2.6.4 (Jun 9, 2023)
2.6.3 (Jun 9, 2023)
2.6.2 (May 9, 2023)
2.6.1 (May 9, 2023)
2.6.0 (May 8, 2023)
2.5.3 (Apr 30, 2023)
2.5.2 (Apr 28, 2023)
2.5.0 (Apr 24, 2023)
2.4.1 (Apr 18, 2023)
2.4.0 (Apr 16, 2023)
2.3.1 (Apr 8, 2023)
2.3.0 (Apr 4, 2023)
2.2.0 (Mar 30, 2023)
2.1.3 (Mar 29, 2023)
2.1.2 (Mar 28, 2023), this version
2.1.1 (Mar 22, 2023)
2.1.0 (Mar 21, 2023)
2.0.4 (Mar 20, 2023)
2.0.3 (Mar 19, 2023)
2.0.2 (Mar 18, 2023)
2.0.1 (Mar 17, 2023)
2.0.0 (Mar 17, 2023)
1.4.0 (Mar 10, 2023)
1.3.0 (Mar 6, 2023)
1.2.0 (Feb 26, 2023)
1.1.5 (Feb 23, 2023)
1.1.4 (Feb 16, 2023)
1.1.3 (Feb 15, 2023)
1.1.2 (Feb 15, 2023)
1.1.1 (Feb 15, 2023)
1.1.1b0 (Feb 15, 2023), pre-release, yanked
1.1.0 (Feb 15, 2023)
1.0.3 (Jan 18, 2023)
1.0.2 (Jan 6, 2023)
1.0.1 (Nov 25, 2022), yanked
1.0.0 (Nov 20, 2022), yanked

Download files

Source Distribution

stable-ts-2.1.2.tar.gz (32.2 kB)

Uploaded Mar 28, 2023 Source

File details

Details for the file stable-ts-2.1.2.tar.gz.

File metadata

Download URL: stable-ts-2.1.2.tar.gz
Upload date: Mar 28, 2023
Size: 32.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.15

File hashes

Hashes for stable-ts-2.1.2.tar.gz
Algorithm	Hash digest
SHA256	`9c566b34fd0cefd683e8fa5a37a72f58261cd55dbaeb7950e447f37adc7db148`
MD5	`bbdf43c583b269b74bf285363819cc4c`
BLAKE2b-256	`2930b3ee0a104eac0bbcfe8a03f117bac50a7586eab5f91b410dae0ac56fb627`
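A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib`. The sketch below is a generic verification routine; the helper names `sha256_of` and `matches_expected` are illustrative, and the file path in the usage comment assumes the archive was saved under its original name.

```python
import hashlib

# SHA256 digest published for stable-ts-2.1.2.tar.gz (copied from the table above).
EXPECTED_SHA256 = "9c566b34fd0cefd683e8fa5a37a72f58261cd55dbaeb7950e447f37adc7db148"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of(path) == expected.lower()


# Usage, once the archive has been downloaded:
# print(matches_expected('stable-ts-2.1.2.tar.gz'))
```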
