Video transcription with speaker diarization and HTML output
transcribe-with-whisper
This set of tools is for people who need to transcribe video (or audio) files but must protect the privacy of the people in the recordings. It uses free AI tools and models to transcribe video and audio files into an HTML file that displays the transcript in your web browser and lets you click on a word to jump to that point in the original recording. A script to convert the HTML to docx is also included.
The docx files include the speaker and a timestamp for each section, so they should be compatible with MAXQDA's timestamps.
It works on macOS (Intel & Apple Silicon), Linux, and Windows (not well tested).
I've tried very hard to make it work for people whose computer expertise extends little beyond installing programs from a web page and clicking on things in a web browser.
Quick start
There are two ways to use this project:
- MercuryScribe (Web UI): best for editing and reviewing in your browser
  - Install: pip install "transcribe-with-whisper[web]"
  - Run: mercuryscribe, then open http://localhost:5001
  - More: see docs/README-mercuryscribe.md
- transcribe-with-whisper (CLI): best for batch processing from the command line
  - Install: pip install transcribe-with-whisper
  - Run: transcribe-with-whisper yourfile.mp4 [Speaker1 Speaker2 ...]
  - More: see docs/README-transcribe-with-whisper.md
What this does
TL;DR: it takes a video file and makes an HTML page that tracks the transcription as the video plays and makes the video jump to any text you click. It also creates a .docx file with timestamps, which should be suitable for use with packages like MAXQDA.
There is a command-line Python version, best if you just want to process a bunch of files, and an interactive version that runs a web server on your computer and lets you edit the text and speakers in your web browser.
📺 View Live Demo - Interactive HTML transcription with synchronized video playback
- Takes a video file (.mp4, .mov, or .mkv) and creates an audio-only file (.wav) for Whisper to process. I believe only .mp4 files are likely to play in your browser, but I haven't verified that. It should also work on audio-only files, though that may need some fairly simple modifications.
- Separates who is speaking when (speaker diarization using pyannote/speaker-diarization, a free AI model: https://huggingface.co/pyannote/segmentation-3.0)
- Transcribes each speaker's speech using the Faster Whisper Python library
- Produces an HTML file: click on part of the transcript and the video jumps to that moment
- Both the HTML file and the original video file are required to view the transcription in a web browser
Faster-Whisper doesn't know about different speakers, so the code uses another model to split the recording into per-speaker pieces that are then handed off to Whisper.
I can't find an authoritative list of supported languages, but a source that seemed only mildly dubious claimed it was close to 100.
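One common way to combine the outputs of a diarization model and a transcription model is to match transcribed segments to diarized speaker turns by timestamp overlap. The sketch below is illustrative only: `assign_speakers`, the tuple shapes, and the sample data are invented for this example and are not the package's actual code (which splits the audio before transcribing).

```python
# Illustrative sketch (not the package's code): label transcript segments
# with speakers by finding the diarized turn that overlaps each segment most.

def assign_speakers(turns, segments):
    """turns: (start, end, speaker) tuples from diarization;
    segments: (start, end, text) tuples from transcription."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap is positive only when the two intervals intersect.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

turns = [(0.0, 4.2, "Harper"), (4.2, 9.0, "Jordan")]
segments = [(0.3, 3.9, "Hi, thanks for joining."), (4.5, 8.7, "Glad to be here.")]
print(assign_speakers(turns, segments))
```

Matching by largest overlap (rather than, say, segment midpoint) is robust when a transcription segment straddles a speaker change.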
What Is Required? An Overview
tl;dr:
However you use this, you need to have a Hugging Face Auth Token to download the AI model (What is a model?) that does diarization (distinguishing multiple speakers in the transcript). Details below.
This is a Python package. If you're comfortable with Python, you can probably just pip3 install transcribe-with-whisper and the rest (like installing ffmpeg with brew) will make sense. After installing, you'd run something like "transcribe-with-whisper myvideofile.mp4 Harper Jordan Riley", and it'll create an HTML file with the transcript and a player for the video.
If you're not comfortable with Python, you can install Docker Desktop (or Docker engine) and use a Docker container that's updated automatically, and similarly run a command, or start up a container that will let you provide the file and speaker names in your web browser.
If you don't know which of those you are more comfortable with, the answer is probably Docker. If you don't know what brew is, you probably want Docker.
Hugging Face Auth Token is required (You have to read this!)
A couple of AI models available at Hugging Face are required to make this work. Hugging Face requires you to create an account and request permission to use these models (permission is granted immediately). An Auth Token (a fancy name for, sort of, a combined username and password) is required for this program to download those models. Here's how to get the HUGGING_FACE_AUTH_TOKEN.
- Create a free Hugging Face account
- Request access to each of the required models. On each model page linked below, click "Use this model" and select "pyannote.audio" (pyannote.audio is a Python library), then accept the terms. After accepting, you should see "Gated Model: You have been granted access to this model". You can check which models you have access to at https://huggingface.co/settings/gated-repos.
Request Access for these Models!
- Required: pyannote/speaker-diarization-3.1 → https://huggingface.co/pyannote/speaker-diarization-3.1
- Required: pyannote/segmentation-3.0 → https://huggingface.co/pyannote/segmentation-3.0
- Required: pyannote/speaker-diarization-community-1 → https://huggingface.co/pyannote/speaker-diarization-community-1
- Create a read-access token
- Go to https://huggingface.co/settings/tokens
- Click “Create new token” and then select the "Read" token type.
- Enter a token name (maybe the computer you're using and/or the date) and click the "Create token" button.
- Copy the token (it looks like hf_...) and paste it somewhere safe. Keep it private. It will not be displayed again, so if you lose it you'll have to create another one (there's an option to invalidate and refresh; it's not a big deal).
- Set the token as an environment variable
- Linux/Windows WSL (bash):
export HUGGING_FACE_AUTH_TOKEN=hf_your_token_here
echo "export HUGGING_FACE_AUTH_TOKEN=$HUGGING_FACE_AUTH_TOKEN" >> ~/.bashrc
- For Mac (which uses zsh by default) use this to have it automatically added to your environment
export HUGGING_FACE_AUTH_TOKEN=hf_your_token_here
echo "export HUGGING_FACE_AUTH_TOKEN=$HUGGING_FACE_AUTH_TOKEN" >> ~/.zshrc
For both of the above examples, the first line sets the variable for the current terminal session and the second one adds it to a file that is read so that it will be set automatically in new terminal sessions.
- Windows (Command Prompt):
set HUGGING_FACE_AUTH_TOKEN=hf_your_token_here
setx HUGGING_FACE_AUTH_TOKEN "%HUGGING_FACE_AUTH_TOKEN%"
Note: the set command sets the value for the current session; the setx command copies that value to make it permanent for future sessions. (In PowerShell, use $env:HUGGING_FACE_AUTH_TOKEN = "hf_your_token_here" instead of set.)
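After setting the variable, a tiny sanity check like this catches the two most common mistakes: forgetting to set the variable at all, and pasting something that isn't a token. (`check_token` is a hypothetical helper for illustration, not part of the package.)

```python
# Hypothetical helper (not part of the package): confirm the environment
# variable is present and looks like a Hugging Face token (they start "hf_").
import os

def check_token(env=os.environ):
    token = env.get("HUGGING_FACE_AUTH_TOKEN", "")
    if not token:
        return "HUGGING_FACE_AUTH_TOKEN is not set"
    if not token.startswith("hf_"):
        return "Token is set but does not start with 'hf_'; double-check it"
    return "Token looks plausible"

print(check_token({"HUGGING_FACE_AUTH_TOKEN": "hf_example"}))
```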
Notes
- Only the pyannote diarization and segmentation models require the token; Faster-Whisper itself does not use Hugging Face auth.
- If you see a 401/403 error, ensure the token is set in your environment and that you accepted the model terms above.
Got Docker? (It's Easier for most people)
If you don't have Docker installed, head over to the Docker Desktop page and find the installation instructions. Or, if you don't care what Docker is, jump straight to the download instructions for Mac, Windows, or Linux.
If you use Windows, Docker requires you to install WSL ([Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/about)). The instructions below assume that you are running bash as your shell; I haven't tested the Windows Terminal setup myself.
Remember above when it said that you needed to do this?
export HUGGING_FACE_AUTH_TOKEN=hf_your_token_here
Well, that's what makes the second line of the command below work.
You'll need to open a terminal and paste this in. On a Mac you can type "command-space" and then "terminal".
Web User Interface
Linux/Mac (bash/zsh):
docker pull ghcr.io/literatecomputing/transcribe-with-whisper-web:latest
docker run --rm -p 5001:5001 \
-e HUGGING_FACE_AUTH_TOKEN=$HUGGING_FACE_AUTH_TOKEN \
-v "$(pwd)/transcription-files:/app/transcription-files" \
ghcr.io/literatecomputing/transcribe-with-whisper-web:latest
Windows (PowerShell):
If you can't figure out how to get Windows Terminal to run bash, this should work in PowerShell.
docker run --rm -p 5001:5001 `
-e HUGGING_FACE_AUTH_TOKEN=$env:HUGGING_FACE_AUTH_TOKEN `
-v "${PWD}/transcription-files:/app/transcription-files" `
ghcr.io/literatecomputing/transcribe-with-whisper-web:latest
This command will get a newer Docker image if one is available (should work in all shells).
docker pull ghcr.io/literatecomputing/transcribe-with-whisper-web:latest
After that, you can open http://localhost:5001 in your web browser. The transcribed file will open in your browser and also be in the transcription-files folder that is created in the folder/directory where you run the above command. Both HTML and DOCX files are automatically generated for each transcription.
Command Line Interface
You do not need to edit this; it uses the HUGGING_FACE_AUTH_TOKEN set above.
docker run --rm -it \
-e HUGGING_FACE_AUTH_TOKEN=$HUGGING_FACE_AUTH_TOKEN \
-v "$(pwd):/data" \
ghcr.io/literatecomputing/transcribe-with-whisper-cli:latest \
myfile.mp4 "Speaker 1" "Speaker 2"
This assumes that "myfile.mp4" is in the same directory/folder that you are in when you run that command (pro tip: the -v $(pwd):/data part gives docker access to the current directory).
Shell scripts exist in bin/
These are some shortcuts that will run the commands above. The above are more flexible, but these have sensible defaults and don't require you to know anything. If you don't know how to clone this repository, then just download the file you want from here.
- bin/transcribe-with-whisper.sh: runs the Web UI
- bin/transcribe-with-whisper-cli.sh: runs the CLI
- bin/html-to-docx.sh: converts the HTML file into a docx
Usage:
# Make sure they’re executable (first time only)
chmod +x bin/*.sh
# Web UI (then open http://localhost:5001)
export HUGGING_FACE_AUTH_TOKEN=hf_xxx
./bin/transcribe-with-whisper.sh
# CLI
export HUGGING_FACE_AUTH_TOKEN=hf_xxx
./bin/transcribe-with-whisper-cli.sh myfile.mp4 "Speaker 1" "Speaker 2"
Environment overrides:
- TWW_PORT: web port (default: 5001)
- TWW_transcription-files_DIR: host transcription-files directory for the web server (default: ./transcription-files)
- TWW_CLI_MOUNT_DIR: host directory to mount at /data for the CLI (default: current directory)
These scripts pull and run the prebuilt multi-arch images from GHCR, so you don’t need to build locally.
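These overrides follow the standard environment-variable-with-default pattern. A minimal sketch of the idea (`setting` is a hypothetical helper, not the scripts' actual code):

```python
# Hypothetical sketch of environment-variable overrides with defaults.
import os

def setting(name, default, env=os.environ):
    # Use the override when it is set; otherwise fall back to the default.
    return env.get(name, default)

print(setting("TWW_PORT", "5001", {}))                                # default wins
print(setting("TWW_CLI_MOUNT_DIR", ".", {"TWW_CLI_MOUNT_DIR": "/videos"}))  # override wins
```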
🛠️ Running without Docker
If you know a bit about Python and the command line, you might prefer the Python version and skip the overhead of Docker (and handle the dependencies yourself!).
On a fresh Ubuntu 24.04 installation, this works:
sudo apt update
sudo apt install -y python3-pip python3.12-venv ffmpeg
python3 -m venv venv
source venv/bin/activate
pip install transcribe-with-whisper
This should work on a Mac:
brew update
brew install python ffmpeg
python3 -m venv venv
source venv/bin/activate
pip install transcribe-with-whisper
You can safely copy/paste the above, but the following commands (the same on all platforms) need your attention: insert your own token and filename.
export HUGGING_FACE_AUTH_TOKEN=hf_your_access_token
transcribe-with-whisper your-video.mp4
The script checks for anything that may be missing and tries to tell you what to do, so there's no harm in running it just to see if it works. If it doesn't, you can come back and follow this guide. The commands that install the various pieces also won't hurt anything if you run them when the tool is already installed.
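The kind of preflight check described above can be pictured like this; `preflight` is a hypothetical stand-in for the script's real checks, not its actual code:

```python
# Hypothetical preflight sketch: look for ffmpeg and the auth token, and
# report anything missing before doing real work.
import os
import shutil

def preflight(env=os.environ):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found: install it first (e.g. brew install ffmpeg)")
    if not env.get("HUGGING_FACE_AUTH_TOKEN"):
        problems.append("HUGGING_FACE_AUTH_TOKEN is not set: see the token section above")
    return problems

for problem in preflight():
    print(problem)
```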
The Windows installation instructions are written by ChatGPT and are not tested. The last version of Windows that I used for more than 15 minutes at a time was Windows 95, and that was mostly to make it work for other people.
| Requirement | Why it's needed |
|---|---|
| Python 3 | The script is written in Python. |
| ffmpeg | To convert video/audio files so the script can read them. |
| Hugging Face account + access token | For using the speech / speaker models. |
| Access to specific Hugging Face models | Some models have terms or require you to request access. |
| Some Python package-manager experience | You might have to fuss with dependencies |
✅ Installation & Setup — Step by Step
Below are clear steps by platform. Do them in order. Each “terminal / command prompt” line is something you type and run.
To open a Terminal on a Mac, press Command-Space and type "terminal". This opens what some people call a "black box" where you type commands for the system to run.
1. Install basic tools
macOS (Intel or Apple Silicon)
- Install Homebrew (if you don't already have it). Open Terminal and paste:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Use Homebrew to install ffmpeg:
brew install ffmpeg
- Make sure you have Python 3:
brew install python
Linux (Ubuntu / Debian)
Open Terminal and run:
sudo apt update
sudo apt install ffmpeg python3 python3-pip -y
Windows
I think that if you install WSL, the Ubuntu instructions should work without changes.
2. Configure your token on your computer
You need to tell your computer what your Hugging Face token is. This is so the script can access the models when it runs. Hopefully you got the token above and already did the "export" part once. The instructions below will put that in a place that will automatically get executed when you open a new terminal.
- macOS / Linux (in Terminal)
PAY ATTENTION HERE! See where it says "your_token_here" in the commands below? You'll need to edit them. The easiest way is to paste the command, then hit the up arrow to get back to the "export" line, use the arrow keys to move the cursor (YOUR MOUSE WILL NOT WORK!!!), and paste your token (Command-V) where "your_token_here" was.
echo 'export HUGGING_FACE_AUTH_TOKEN=your_token_here' >> ~/.zshrc
source ~/.zshrc
If you use Linux or WSL, you use bash instead of zsh, so do this instead:
echo 'export HUGGING_FACE_AUTH_TOKEN=your_token_here' >> ~/.bashrc
source ~/.bashrc
What you get
After the script runs:
- An HTML file, e.g. myvideo.html: open this in your web browser
- The resulting page will show the video plus a transcript; clicking on transcript sections jumps the video to that moment
- The first time you run this, it may download some large model files. That is normal; it might take a few minutes depending on your internet speed. Subsequent runs will be much faster since those files will already have been downloaded.
- On Macs with Apple Silicon (M1/M2/M3/M4), the default setup will still work, but performance may be slower than if you install optional "GPU / CoreML"-enabled packages (and have any idea what that means).
- If something fails (missing library, inaccessible model, missing token), the script will try to give a friendly error message. If you see a message you don't understand, you can share it with someone technical or open an issue.
Converting the HTML to a Word Processing document
While the HTML is great for viewing the data, it's not convenient for other tools you might want to use. There is an html-to-docx script that converts the HTML into a docx file by default (you can also specify other formats, e.g. html-to-docx file.html file.odt or html-to-docx file.html file.pdf).
Note that some tools can work with the .vtt files that are created in a directory with the same name as the original file (minus the filename extension). If you edit the .vtt files, you can re-run the script and it will create a new HTML file from their contents. The .vtt files, however, do not include speaker information, which makes them less useful.
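For reference, a WebVTT cue is just a timestamp line followed by the cue text. A minimal sketch of reading cues out of such a file (illustrative only; a real parser such as the webvtt-py package handles many more edge cases):

```python
# Illustrative sketch: pull (start, end, text) cues out of a WebVTT document
# like the ones the tool writes alongside the transcript.

def parse_vtt(vtt_text):
    """Return (start, end, text) tuples for each cue block."""
    cues = []
    for block in vtt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:  # the timing line marks a cue
                start, _, end = line.partition("-->")
                cues.append((start.strip(), end.strip(), " ".join(lines[i + 1:])))
                break
    return cues

sample = """WEBVTT

00:00:01.000 --> 00:00:03.500
Hi, thanks for joining.

00:00:04.000 --> 00:00:06.000
Glad to be here."""
print(parse_vtt(sample))
```

Note that, as the section above says, nothing in the cue itself records who is speaking; that information lives only in the HTML and docx outputs.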
Recent Updates
- ✅ Auto-DOCX Generation: the web interface now automatically creates a .docx file alongside the HTML transcript
- ✅ Fixed Video Player: the video player stays pinned at the top of the browser window while scrolling through transcripts
- ✅ Enhanced Timestamps: transcripts include speaker names and timestamps for better DOCX export
TODO
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file transcribe_with_whisper-0.5.0.tar.gz.
File metadata
- Download URL: transcribe_with_whisper-0.5.0.tar.gz
- Upload date:
- Size: 44.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fb3eb97980c13f69977d9a8c3d4c7be3f669ad3dba5513debe2e07b1381eac3c |
| MD5 | 99efb52873f1c4d2083bdc5716c5ed7d |
| BLAKE2b-256 | 56fcfd55f108055de2b5d15e6a5e80c1dfbd229ac858f3c624dca11d0cccb0bd |
Provenance
The following attestation bundles were made for transcribe_with_whisper-0.5.0.tar.gz:
Publisher: publish-pypi.yml on literatecomputing/transcribe-with-whisper
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: transcribe_with_whisper-0.5.0.tar.gz
- Subject digest: fb3eb97980c13f69977d9a8c3d4c7be3f669ad3dba5513debe2e07b1381eac3c
- Sigstore transparency entry: 573901464
- Sigstore integration time:
- Permalink: literatecomputing/transcribe-with-whisper@5bc0a47766fb771def702baae5c0657bdf81ec8d
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/literatecomputing
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@5bc0a47766fb771def702baae5c0657bdf81ec8d
- Trigger Event: push
File details
Details for the file transcribe_with_whisper-0.5.0-py3-none-any.whl.
File metadata
- Download URL: transcribe_with_whisper-0.5.0-py3-none-any.whl
- Upload date:
- Size: 35.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 392f024a04375ed463fabbd6289fbbb77d905705e79046bd8d77bdb530cd422b |
| MD5 | 96cf28ae0384e5440f3d5346afc883bb |
| BLAKE2b-256 | f38560c377d39339c644836a75e8cdd411d136c0b1eb3100d5e37999f68c3380 |
Provenance
The following attestation bundles were made for transcribe_with_whisper-0.5.0-py3-none-any.whl:
Publisher: publish-pypi.yml on literatecomputing/transcribe-with-whisper
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: transcribe_with_whisper-0.5.0-py3-none-any.whl
- Subject digest: 392f024a04375ed463fabbd6289fbbb77d905705e79046bd8d77bdb530cd422b
- Sigstore transparency entry: 573901475
- Sigstore integration time:
- Permalink: literatecomputing/transcribe-with-whisper@5bc0a47766fb771def702baae5c0657bdf81ec8d
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/literatecomputing
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@5bc0a47766fb771def702baae5c0657bdf81ec8d
- Trigger Event: push