
whisply

Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

whisply combines faster-whisper and insanely-fast-whisper to offer an easy-to-use solution for batch processing files. It also enables word-level speaker annotation by integrating whisperX and pyannote.

Table of contents

Features

  • 🚴‍♂️ Performance: Depending on your hardware, whisply will use the fastest Whisper implementation:

    • CPU: faster-whisper or whisperX
    • GPU (Nvidia CUDA) and MPS (Metal Performance Shaders, Apple M1-M3): insanely-fast-whisper or whisperX
  • ✅ Auto device selection: When performing transcription or translation tasks without speaker annotation or subtitling, faster-whisper (CPU) or insanely-fast-whisper (MPS, Nvidia GPUs) will be selected automatically based on your hardware, unless you specify a device with the --device option.

  • 🗣️ Word-level annotations: If you choose to --subtitle or --annotate, whisperX will be used, as it supports word-level segmentation and speaker annotation. Depending on your hardware, whisperX can run on CPU or Nvidia GPU (but not on Apple MPS). Out of the box, whisperX will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"); whisply fixes those instances through timestamp approximation.

  • 💬 Subtitles: Subtitle generation is customizable. You can specify the number of words per subtitle block: choosing "5", for example, generates .srt and .webvtt files in which each subtitle block contains exactly 5 words with the corresponding timestamps (see the example after this list).

  • 🧺 Batch processing: whisply can process single files, whole folders, URLs or a combination of all three by collecting paths in a .list document. See the Batch processing section for more information.

  • ⚙️ Supported output formats: .json .txt .txt (annotated) .srt .webvtt .vtt .rttm
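For illustration, with a block length of 5 the generated .srt might look like this (text and timestamps are made up):

1
00:00:00,000 --> 00:00:02,000
This is a short example

2
00:00:02,000 --> 00:00:04,200
of five words per block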

Requirements

  • FFmpeg
  • Python >= 3.10
  • GPU processing requires:
    • Nvidia GPU (CUDA: cuBLAS and cuDNN 8 for CUDA 12)
    • Apple Metal Performance Shaders (MPS) (Mac M1-M3)
  • Speaker annotation requires a HuggingFace Access Token
GPU fix for "Could not load library libcudnn_ops_infer.so.8."

If you use whisply on a Linux system with an Nvidia GPU and get this error:

"Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory"

Run the following line in your CLI:

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`

To make this permanent, append the export line to your Python environment's activate script:

echo "export LD_LIBRARY_PATH=\`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'\`" >> path/to/your/python/env

For more information please refer to the faster-whisper GitHub page.
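To check that the CUDA libraries are now found, you can run the following (assuming PyTorch is installed in your environment):

python3 -c "import torch; print(torch.cuda.is_available())"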

Installation

1. Install ffmpeg

--- macOS ---
brew install ffmpeg

--- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg

--- Windows ---
https://ffmpeg.org/download.html
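You can verify the ffmpeg installation with:

ffmpeg -version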

2. Clone this repository and change to project folder

git clone https://github.com/tsmdt/whisply.git
cd whisply

3. Create a Python virtual environment

python3.11 -m venv venv

4. Activate the Python virtual environment

source venv/bin/activate

5. Install whisply with pip

pip install .
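To check that the installation succeeded, print the help message:

whisply --help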

Usage

 Usage: whisply [OPTIONS]

 WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --files               -f       TEXT                Path to file, folder, URL or .list to process. [default: None]           │
│ --output_dir          -o       DIRECTORY           Folder where transcripts should be saved. [default: transcriptions]      │
│ --device              -d       [auto|cpu|gpu|mps]  Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M3).    │
│                                                    [default: auto]                                                          │
│ --model               -m       TEXT                Whisper model to use (List models via --list_models). [default: large-v2]│
│ --lang                -l       TEXT                Language of provided file(s) ("en", "de") (Default: auto-detection).     │
│                                                    [default: None]                                                          │
│ --annotate            -a                           Enable speaker annotation (Saves .rttm).                                 │
│ --hf_token            -hf      TEXT                HuggingFace Access token required for speaker annotation.                │
│                                                    [default: None]                                                          │
│ --translate           -t                           Translate transcription to English.                                      │
│ --subtitle            -s                           Create subtitles (Saves .srt, .vtt and .webvtt).                         │
│ --sub_length                   INTEGER             Subtitle segment length in words [default: 5]                            │
│ --verbose             -v                           Print text chunks during transcription.                                  │
│ --config                       PATH                Path to configuration file. [default: None]                              │
│ --list_filetypes                                   List supported audio and video file types.                               │
│ --list_models                                      List available models.                                                   │
│ --install-completion                               Install completion for the current shell.                                │
│ --show-completion                                  Show completion for the current shell, to copy it or customize the       │
│                                                    installation.                                                            │
│ --help                                             Show this message and exit.                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
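A few illustrative invocations (file names are placeholders):

# Transcribe a single file, letting whisply pick the fastest device
whisply --files interview.mp4

# Translate a German video to English with a specific model
whisply --files talk.mp4 --lang de --translate --model large-v2

# Create subtitles with 7 words per block
whisply --files lecture.mp4 --subtitle --sub_length 7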

Speaker annotation and diarization

Requirements

In order to annotate speakers using --annotate, you need to provide a valid HuggingFace access token via the --hf_token option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the pyannote segmentation model. For detailed instructions, refer to the Requirements section on the pyannote model page on HuggingFace.
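For example (the token below is a placeholder):

whisply --files meeting.wav --annotate --hf_token hf_your_access_token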

How speaker annotation works

whisply uses whisperX for speaker diarization and annotation. Unlike the standard Whisper implementation, which returns chunk-level timestamps, whisperX returns word-level timestamps and annotates speakers word by word, thus producing much more precise annotations.
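Conceptually, each transcribed word ends up with its own timestamps and a speaker label, along these lines (illustrative only, not whisply's exact output schema):

{"word": "Hello", "start": 0.52, "end": 0.89, "speaker": "SPEAKER_00"}
{"word": "there", "start": 0.91, "end": 1.20, "speaker": "SPEAKER_01"}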

Out of the box, whisperX will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"); whisply fixes those instances through timestamp approximation (see the sketch after this list). Other known limitations of whisperX include:

  • inaccurate speaker diarization if multiple speakers talk at the same time
  • to provide word-level timestamps and annotations, whisperX uses language-specific alignment models; out of the box whisperX supports these languages: en, fr, de, es, it, ja, zh, nl, uk, pt.
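The timestamp approximation whisply applies to number-only words can be pictured roughly like this (a minimal sketch, not whisply's actual implementation): a word without timestamps is placed between the nearest neighbours that do have them.

def approximate_timestamps(words):
    # words: list of dicts like {"word": str, "start": float | None, "end": float | None}
    # Hypothetical sketch; whisply's actual logic may differ.
    for i, word in enumerate(words):
        if word.get("start") is None:
            # End of the closest preceding word with timestamps (0.0 at the start)
            prev_end = next((words[j]["end"] for j in range(i - 1, -1, -1)
                             if words[j].get("end") is not None), 0.0)
            # Start of the closest following word with timestamps
            next_start = next((words[j]["start"] for j in range(i + 1, len(words))
                               if words[j].get("start") is not None), prev_end)
            word["start"], word["end"] = prev_end, max(prev_end, next_start)
    return words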

Refer to the whisperX GitHub page for more information.

Batch processing

Instead of providing a file, folder or URL with the --files option, you can pass a .list containing a mix of files, folders and URLs for processing.

Example:

$ cat my_files.list

video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
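The list is then passed like any other input:

whisply --files my_files.list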

Using config files for batch processing

You can provide a .json config file via the --config option, which makes batch processing easy. An example config looks like this (the inline comments are for illustration only; remove them in an actual JSON file):

{
    "files": "./files/my_files.list",          # Path to your files
    "output_dir": "./transcriptions",          # Output folder where transcriptions are saved
    "device": "auto",                          # AUTO, GPU, MPS or CPU
    "model": "large-v3-turbo",                 # Whisper model to use
    "lang": null,                              # Null for auto-detection or language codes ("en", "de", ...)
    "annotate": false,                         # Annotate speakers 
    "hf_token": "HuggingFace Access Token",    # Your HuggingFace Access Token (needed for annotations)
    "translate": false,                        # Translate to English
    "subtitle": false,                         # Subtitle file(s)
    "sub_length": 10,                          # Length of each subtitle block in number of words
    "verbose": false                           # Print transcription segments while processing 
}
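Run it with (the path is a placeholder):

whisply --config ./my_config.json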

