
Approximate word-based subtitle timing for karaokes


Auto Karaoke

Approximate word-based subtitle timing for karaoke. The generated subtitles still need manual editing before they are usable; this is meant as a time-saving tool.



Run pip3 install git+ BEFORE installing auto_karaoke

whisper-timestamped does not have a PyPI package, so it cannot be referenced in pyproject.toml.

Installing whisper-timestamped after auto_karaoke may cause permission issues with your venv.

Additional installation docs here:


Run pip install auto-karaoke


usage: autokaraoke [-h] [--encoding {utf-8,windows-1252}]
                   [--language {...}]
                   song_path lyrics_path model_size

Create karaoke .ass file from song audio and lyric text

positional arguments:
  song_path             file path for song audio file
  lyrics_path           file path for lyric text file
  model_size {tiny,small,medium,large,large-v2}
                        model size for whisper (default: medium)

optional arguments:
  -h, --help            show this help message and exit
  --encoding {utf-8,windows-1252}
                        text encoding of lyric text file (default: utf-8)
  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,...
                        ...,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Itali...
                        language spoken in the audio, omit to perform language detection (default: None)
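
As an illustration of the interface described by the help text above, here is a minimal argparse sketch that accepts the same arguments. It is a reconstruction from the help output only, not the project's actual parser: `build_parser` is a hypothetical name, and the `--language` choices are left out because the list is truncated above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(
        prog="autokaraoke",
        description="Create karaoke .ass file from song audio and lyric text",
    )
    parser.add_argument("song_path", help="file path for song audio file")
    parser.add_argument("lyrics_path", help="file path for lyric text file")
    # The usage line lists model_size as a required positional argument,
    # so no default is set here (assumption based on the usage line).
    parser.add_argument(
        "model_size",
        choices=["tiny", "small", "medium", "large", "large-v2"],
        help="model size for whisper",
    )
    parser.add_argument(
        "--encoding",
        choices=["utf-8", "windows-1252"],
        default="utf-8",
        help="text encoding of lyric text file",
    )
    # Choices omitted on purpose: the language list is truncated in the help text.
    parser.add_argument(
        "--language",
        default=None,
        help="language spoken in the audio, omit to perform language detection",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["song.mp3", "lyrics.txt", "medium", "--language", "en"]
    )
    print(args.song_path, args.lyrics_path, args.model_size, args.language)
```

The file names `song.mp3` and `lyrics.txt` are placeholders.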

GUI - opened when running CLI tool

The transcription from openai/whisper will contain inaccuracies.

  • 1-to-1 word inaccuracies are fine because the timing is the same regardless
    • ex: searing vs cheering
  • missing or extra words are problematic because the timings don't match up
    • LYRIC: I've infiltrated vs TRANSCRIPTION: Infiltrated
    • LYRIC: Insurmountable vs TRANSCRIPTION: I see my table

There is a GUI to fix these mistakes before the subtitle file is created. The goal is to have the number of words in the lyrics be the same as the number of transcribed words so that the word timings match up 1-to-1.
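
To see why the word counts have to agree, here is a minimal sketch of 1-to-1 pairing: each lyric word takes the timing of the transcribed word at the same position, so the transcribed text itself no longer matters. The names `TimedWord`, `pair_lyrics_with_timings`, and `karaoke_text` are hypothetical, and the use of ASS `\k` karaoke tags (durations in centiseconds) is an assumption about what the .ass output could look like, not a description of the tool's internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


def pair_lyrics_with_timings(
    lyric_lines: List[str], transcribed: List[TimedWord]
) -> List[List[TimedWord]]:
    """Give each lyric word the timing of the transcribed word at the same index."""
    lyric_count = sum(len(line.split()) for line in lyric_lines)
    if lyric_count != len(transcribed):
        # Missing or extra transcribed words would shift every later timing.
        raise ValueError(
            f"{lyric_count} lyric words but {len(transcribed)} transcribed words"
        )
    paired: List[List[TimedWord]] = []
    index = 0
    for line in lyric_lines:
        row = []
        for lyric_word in line.split():
            timing = transcribed[index]
            row.append(TimedWord(lyric_word, timing.start, timing.end))
            index += 1
        paired.append(row)
    return paired


def karaoke_text(words: List[TimedWord]) -> str:
    """Render one line using ASS \\k tags (assumed output style).

    Gaps between words are ignored in this sketch.
    """
    parts = []
    for word in words:
        centiseconds = round((word.end - word.start) * 100)
        parts.append(f"{{\\k{centiseconds}}}{word.text}")
    return " ".join(parts)


if __name__ == "__main__":
    transcription = [
        TimedWord("searing", 0.0, 0.5),  # misheard, but the timing is still usable
        TimedWord("lights", 0.5, 1.0),
    ]
    lines = pair_lyrics_with_timings(["cheering lights"], transcription)
    print(karaoke_text(lines[0]))
```

With a case like "I've infiltrated" transcribed as "Infiltrated", the counts differ and the sketch raises an error instead of silently shifting every later word.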

The GUI shows the lyrics given by the CLI argument in a grid:

  • Each lyric line is a row
  • Each word in that lyric line is a column
  • Each cell in a row x column holds 2 words stacked on top of each other:
    • the top word is the lyric word
    • the bottom word is the transcribed word from openai/whisper
    • the cell is highlighted green if the two words match and red if the words don't match
  • Overflow line
    • if there are extra words left in the transcription after matching lyric words with transcribed words 1-to-1, they are shown at the bottom of the GUI
    • this is only for reference
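
The grid layout described above can be sketched as a small data structure. This is an illustration only: `Cell`, `normalize`, and `build_grid` are hypothetical names, and the comparison rule (case-insensitive, punctuation ignored) is an assumption about how "match" might be decided.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    lyric: str                  # top word
    transcribed: Optional[str]  # bottom word, None if the transcription ran out
    matches: bool               # True -> green, False -> red


def normalize(word: str) -> str:
    """Lowercase and keep only letters and digits (assumed comparison rule)."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def build_grid(
    lyric_lines: List[str], transcribed_words: List[str]
) -> Tuple[List[List[Cell]], List[str]]:
    """Return (grid, overflow): one row per lyric line, one cell per lyric word."""
    grid: List[List[Cell]] = []
    index = 0
    for line in lyric_lines:
        row = []
        for lyric in line.split():
            if index < len(transcribed_words):
                transcribed = transcribed_words[index]
            else:
                transcribed = None
            index += 1
            same = transcribed is not None and normalize(lyric) == normalize(transcribed)
            row.append(Cell(lyric, transcribed, same))
        grid.append(row)
    # Transcribed words left over after every lyric word has a partner.
    overflow = transcribed_words[index:]
    return grid, overflow


if __name__ == "__main__":
    grid, overflow = build_grid(
        ["Insurmountable odds"], ["I", "see", "my", "table", "odds"]
    )
    for row in grid:
        print([(c.lyric, c.transcribed, c.matches) for c in row])
    print("overflow:", overflow)
```

In this example "Insurmountable" was heard as four words, so both cells are red and three words land in the overflow line, which signals that words need to be merged.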

The GUI provides multiple ways to edit the bottom/transcribed word of a cell. By right-clicking a transcribed word, a menu opens up with various options:

  • Add word
    • inserts a blank word
    • words after new word are moved to the right
  • Match word with lyric
    • replaces text content of word with the lyric word above it
    • NOTE: this step is optional. As long as the number of transcribed words matches the number of lyric words, their text content does not have to match. It mainly turns the cell green, for users who prefer that.
  • Delete word
    • Deletes word
    • words after deleted word are moved to the left
  • Merge word with right
    • merges the text content and the timing with the word on the right
    • words after merged word are moved to the left
  • Split word
    • splits the word in half
    • timing is only approximate because it is literally halving the duration
    • words right of split word are moved to the right
  • Undo | shortcut: ctrl-z
  • Redo | shortcut: ctrl-y
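
The delete, merge, and split operations above can be sketched on a flat list of timed words. This is a hedged illustration, not the tool's code: `Word`, `delete_word`, `merge_with_right`, `split_word`, and `History` are hypothetical names, merging is assumed to concatenate text without a separator, and splitting is assumed to cut the text at its midpoint character.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def delete_word(words: List[Word], i: int) -> List[Word]:
    """Remove word i; later words shift left."""
    return words[:i] + words[i + 1:]


def merge_with_right(words: List[Word], i: int) -> List[Word]:
    """Merge word i with word i+1, combining text and timing."""
    left, right = words[i], words[i + 1]
    merged = Word(left.text + right.text, left.start, right.end)
    return words[:i] + [merged] + words[i + 2:]


def split_word(words: List[Word], i: int) -> List[Word]:
    """Split word i in half; the timing is approximated by halving the duration."""
    word = words[i]
    mid_char = len(word.text) // 2
    mid_time = word.start + (word.end - word.start) / 2
    first = Word(word.text[:mid_char], word.start, mid_time)
    second = Word(word.text[mid_char:], mid_time, word.end)
    return words[:i] + [first, second] + words[i + 1:]


class History:
    """Undo/redo via snapshots; works because each edit returns a new list."""

    def __init__(self, initial: List[Word]) -> None:
        self.current = initial
        self._undo: List[List[Word]] = []
        self._redo: List[List[Word]] = []

    def apply(self, op: Callable[[List[Word], int], List[Word]], i: int) -> None:
        self._undo.append(self.current)
        self._redo.clear()  # a new edit invalidates the redo stack
        self.current = op(self.current, i)

    def undo(self) -> None:
        if self._undo:
            self._redo.append(self.current)
            self.current = self._undo.pop()

    def redo(self) -> None:
        if self._redo:
            self._undo.append(self.current)
            self.current = self._redo.pop()


if __name__ == "__main__":
    heard = [Word("I", 0.0, 0.5), Word("see", 0.5, 1.0),
             Word("my", 1.0, 1.5), Word("table", 1.5, 2.0)]
    history = History(heard)
    for _ in range(3):
        history.apply(merge_with_right, 0)  # collapse four words into one
    print(history.current)
```

Merging the four words transcribed for "Insurmountable" leaves a single word whose timing spans the whole original stretch, which is what the lyric word needs.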

Most of the time, you can hear in the song how these inaccuracies were made. It can be helpful to listen to sections of the song again while using the GUI.


If you have CUDA-enabled hardware, you can replace the torch packages with:

pip3 install torch torchvision torchaudio --index-url --force-reinstall


  • Add pictures to README
  • Update overflow line when edits are made
  • Support different karaoke styles
    • Choir
    • Down
    • Duo (Voice1, Voice2, Voice1+2)
