A tool to identify TV episodes from video files and rename them to match Plex naming conventions.
Project description
TVIdentify
A Python tool for automatically identifying and renaming TV show episodes for Plex, given the video files and the series name. It identifies episodes by running OCR on PGS subtitles and analyzing the extracted text with an LLM.
Get Started
To install the package and use it as a command-line utility:
python3 -m venv tvidentify
cd tvidentify
source bin/activate
pip install tvidentify
tvidentify /path/to/TVShows/Game\ Of\ Thrones/Season\ 02/ --max-frames 10 --offset 3 --series-name "Game Of Thrones" --scan-duration 5 --output-dir ~/gots2 --model gemini-3-pro-preview --rename --skip-already-named
To modify the sources, or to build and work from source:
git clone https://github.com/ram-nat/tvidentify tvidentify
cd tvidentify
python3 -m venv venv
source venv/bin/activate
pip install -e .
tvidentify /path/to/TVShows/Game\ Of\ Thrones/Season\ 02/ --max-frames 10 --offset 3 --series-name "Game Of Thrones" --scan-duration 5 --output-dir ~/gots2 --model gemini-3-pro-preview --rename --skip-already-named
Usage
usage: tvidentify [-h] --series-name SERIES_NAME [--size-threshold SIZE_THRESHOLD] [--provider {google,openai,perplexity}] [--model MODEL] [--max-frames MAX_FRAMES] [--subtitle-track SUBTITLE_TRACK] [--offset OFFSET] [--scan-duration SCAN_DURATION] [--output-dir OUTPUT_DIR]
[--rename] [--rename-format RENAME_FORMAT] [--skip-already-named]
input_dir
Batch identify TV show episodes in a directory.
positional arguments:
input_dir The directory containing video files.
options:
-h, --help show this help message and exit
--series-name SERIES_NAME
The name of the TV series.
--size-threshold SIZE_THRESHOLD
Size similarity threshold for filtering episodes (default: 0.7).
--provider {google,openai,perplexity}
LLM provider to use (default: google).
--model MODEL Model name. If not provided, defaults based on provider.
--max-frames MAX_FRAMES
Maximum number of subtitle events to process (default: 10).
--subtitle-track SUBTITLE_TRACK
The subtitle track index to use (default: 0).
--offset OFFSET Skip the first N minutes for subtitle extraction (default: 0).
--scan-duration SCAN_DURATION
How many minutes to scan for subtitles from the offset (default: 15).
--output-dir OUTPUT_DIR
Optional directory to save JSON output files (one per video) instead of printing to console.
--rename Rename files to "<series_name> S<season>E<episode>" format if identification is successful.
--rename-format RENAME_FORMAT
Format for renamed files. Available placeholders: {series}, {season}, {episode}. Default: "{series} S{season:02d}E{episode:02d}"
--skip-already-named Skip files that are already in the expected naming format (only when --rename is specified).
Features
- Subtitle Extraction:
  - subtitle_extractor.py is the stand-alone module for this.
  - Extracts the English subtitle stream (expects and handles PGS only).
  - You can specify the starting offset and duration of the subtitle stream to extract (so the entire file is not processed), as well as the maximum number of subtitle events to extract:
    - --offset to specify the starting offset in minutes.
    - --scan-duration to specify how many minutes to extract from the starting offset.
    - --max-frames to specify how many subtitle events to extract.
  - Uses OCR (pytesseract) and some very basic regex clean-up to get subtitle text.
  - PGS parsing code is from https://github.com/EzraBC/pgsreader
  - Use --output-dir to store output in JSON format.
- Episode Identification:
  - episode_identifier.py is the stand-alone module for this.
  - With the extracted subtitles and the series name, uses LLMs to identify the episode of the series.
  - Supports different LLM providers: Google Gemini, OpenAI, or Perplexity.
  - Use --model to pass the model to use for episode identification.
  - You can pass an MKV (or other container format) file, or the JSON output from subtitle_extractor.py, as input to this stage.
  - Use --output-dir to store output in JSON format.
- Batch Identification:
  - batch_identifier.py is the stand-alone module for this.
  - Pass an entire season folder to identify all episodes in the folder.
  - Identifies and ignores non-episode files (assumes the largest files are episodes).
  - Identifies and skips duplicate episode files (uses subtitle similarity to detect duplicates).
  - Use --rename to rename identified episodes to match Plex episode naming requirements.
  - Use --output-dir to store output in JSON format. Stores both batch results and results for individual files.
- File Renaming:
  - file_renamer.py is the stand-alone module for this.
  - Use --rename-format to specify the rename format. Series, season, and episode are the available variables for the format string.
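The default format string uses `:02d` padding, which suggests a Python str.format-style template. A minimal sketch of how such a template expands into a filename (`build_name` is an illustrative helper, not part of the package):

```python
# Minimal sketch: expand a str.format-style rename template into a filename.
# `build_name` is an illustrative helper, not part of tvidentify itself.
def build_name(template, series, season, episode, ext=".mkv"):
    """Fill the template with series/season/episode and append the extension."""
    return template.format(series=series, season=season, episode=episode) + ext

DEFAULT_TEMPLATE = "{series} S{season:02d}E{episode:02d}"
print(build_name(DEFAULT_TEMPLATE, "Game Of Thrones", 2, 5))
# → Game Of Thrones S02E05.mkv
```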
Installation
- Clone the repository
- Set up and activate a Python virtual environment
- Install required packages:
pip install -r requirements.txt
Configuration
Set the appropriate API key environment variables:
# Google Gemini
export GOOGLE_API_KEY="your-google-api-key"
# OpenAI
export OPENAI_API_KEY="your-openai-api-key"
# Perplexity
export PERPLEXITY_API_KEY="your-perplexity-api-key"
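Each provider reads its key from the corresponding environment variable above. A minimal sketch of that lookup (`PROVIDER_ENV_VARS` and `get_api_key` are illustrative names, not the package's API):

```python
import os

# Illustrative mapping from provider name to the documented environment variable.
PROVIDER_ENV_VARS = {
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "perplexity": "PERPLEXITY_API_KEY",
}

def get_api_key(provider):
    """Return the API key for `provider`, failing loudly if it is unset."""
    var = PROVIDER_ENV_VARS[provider]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before using provider '{provider}'")
    return key
```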
Usage
Extract Subtitles from Video
python subtitle_extractor.py /path/to/video.mkv \
--output-dir ./subtitles
Identify Episode from Video File
python episode_identifier.py /path/to/video.mkv \
--series-name "Game of Thrones" \
--provider google
Identify Episode from Pre-extracted Subtitles
python episode_identifier.py \
--series-name "Game of Thrones" \
--subtitles-json subtitles.json \
--provider openai
Batch Processing an Entire Season
python batch_identifier.py /path/to/episodes/directory \
--series-name "Game of Thrones" \
--provider google \
--rename
Command-line Options
subtitle_extractor.py
- input_file: Path to the video file to extract subtitles from
- --max-frames: Maximum number of subtitle events to extract
- --subtitle-track: Subtitle track index to use (default: 0)
- --offset: Skip first N minutes (default: 0)
- --scan-duration: Minutes to scan from offset (default: 15)
- --output-dir: Directory to save JSON output file
episode_identifier.py
- input_file (optional): Path to video file (required if --subtitles-json is not provided)
- --series-name (required): Name of the TV series
- --provider: LLM provider (default: google). Options: google, openai, perplexity
- --model: Model name. Defaults: gemini-2.5-flash (google), gpt-4 (openai), sonar (perplexity)
- --subtitles-json: Path to JSON file with pre-extracted subtitles (alternative to video input)
- --max-frames: Maximum number of subtitle events to process (default: 10)
- --subtitle-track: Subtitle track index to use (default: 0)
- --offset: Skip first N minutes (default: 0)
- --scan-duration: Minutes to scan from offset (default: 15)
- --output-dir: Directory to save JSON output file
batch_identifier.py
- input_dir: Directory containing video files to process
- --series-name (required): Name of the TV series
- --size-threshold: File size similarity threshold for filtering episodes (default: 0.7)
- --provider: LLM provider (default: google). Options: google, openai, perplexity
- --model: Model name. Defaults: gemini-2.5-flash (google), gpt-4 (openai), sonar-pro (perplexity)
- --max-frames: Maximum number of subtitle events to process (default: 10)
- --subtitle-track: Subtitle track index to use (default: 0)
- --offset: Skip first N minutes (default: 0)
- --scan-duration: Minutes to scan from offset (default: 15)
- --output-dir: Directory to save JSON output files
- --rename: Rename identified episodes to match Plex naming format
- --rename-format: Format string for renamed files (default: {series} S{season:02d}E{episode:02d})
- --skip-already-named: Skip files that are already in the expected naming format (only when --rename is specified)
file_renamer.py
- --batch-results (required): Path to batch_results.json from batch_identifier
- --series-name (required): Name of the TV series
- --rename-format: Format string for renamed files. Available placeholders: {series}, {season}, {episode} (default: {series} S{season:02d}E{episode:02d})
- --dry-run: Show what would be renamed without actually renaming
Components
subtitle_extractor.py
Handles extraction of subtitles from video files:
- Detects subtitle tracks using ffprobe
- Extracts frames for each subtitle event
- Performs OCR on frames using Tesseract
- Filters gibberish using character pattern analysis
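The exact gibberish filter is not documented here; a plausible character-pattern heuristic for OCR output might look like the following (`looks_like_text` and its threshold are assumptions, not the project's actual code):

```python
import re

# Illustrative sketch only: reject OCR lines dominated by non-letter
# characters, a common character-pattern filter for noisy subtitle OCR.
def looks_like_text(line, min_alpha_ratio=0.6):
    """Return True if the line is mostly letters after stripping whitespace."""
    stripped = re.sub(r"\s", "", line)
    if len(stripped) < 3:
        return False
    alpha = sum(c.isalpha() for c in stripped)
    return alpha / len(stripped) >= min_alpha_ratio

print(looks_like_text("Sorry, Your Grace."))  # mostly letters: True
print(looks_like_text("~}{|;#@%^&*"))         # symbol noise: False
```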
episode_identifier.py
Identifies TV show episodes from subtitles:
- Loads subtitles from video or JSON file
- Sends subtitles to LLM with identifying prompt
- Parses LLM response for season/episode information
- Supports multiple LLM providers
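The prompt/response contract is not shown on this page; assuming the LLM is asked to reply with a JSON object shaped like the Example Output section, the parsing step could be sketched as (`parse_identification` is a hypothetical helper):

```python
import json
import re

# Hedged sketch: assumes the LLM replies with a JSON object like the one in
# the Example Output section; the project's real contract may differ.
def parse_identification(response_text):
    """Extract (season, episode) from an LLM reply containing a JSON object."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
        return int(data["season"]), int(data["episode"])
    except (ValueError, KeyError):  # malformed JSON or missing fields
        return None
```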
batch_identifier.py
Processes multiple video files:
- Discovers episode files by size similarity
- Processes each file with episode_identifier
- Outputs results in JSON format
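The size-similarity discovery can be sketched as keeping every file within the --size-threshold ratio of the largest file (`filter_by_size` is an illustrative name; the real discovery logic may differ):

```python
# Sketch of the documented "largest files are episodes" heuristic: keep files
# whose size is at least `threshold` times the largest file's size.
def filter_by_size(sizes, threshold=0.7):
    """`sizes` maps path -> size in bytes; returns likely episode paths."""
    if not sizes:
        return []
    largest = max(sizes.values())
    return sorted(p for p, s in sizes.items() if s / largest >= threshold)

# Gathering sizes from a directory would look like:
# sizes = {p: os.path.getsize(p) for p in video_paths}
```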
Requirements
System Dependencies
- ffmpeg - For video processing
- ffprobe - For reading video metadata (comes with ffmpeg)
- tesseract-ocr - For optical character recognition (OCR) on subtitle images
Install on Ubuntu/Debian:
sudo apt-get install ffmpeg tesseract-ocr
Install on macOS:
brew install ffmpeg tesseract
Python Dependencies
See requirements.txt - includes:
- opencv-python-headless - For video frame processing
- pytesseract - Python interface to Tesseract OCR
- openai - OpenAI API client
- google-genai - Google Generative AI client
Example Output
{
"season": 1,
"episode": 2,
"subtitles": [
"Sorry, Your Grace.",
"My deepest apologies.",
"No. No, Your Grace."
],
"provider": "google",
"model": "gemini-2.5-flash"
}
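Records written with --output-dir can be consumed directly; a minimal sketch that parses a record shaped like the example above and rebuilds the Plex-style tag:

```python
import json

# Parse a per-file identification record shaped like the example above.
record = json.loads("""
{
  "season": 1,
  "episode": 2,
  "subtitles": ["Sorry, Your Grace."],
  "provider": "google",
  "model": "gemini-2.5-flash"
}
""")
print(f'S{record["season"]:02d}E{record["episode"]:02d}')  # → S01E02
```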
Notes
- PGS subtitles are image-based, so OCR quality depends on video resolution and subtitle clarity
- In my tests, gemini-3-pro-preview has been the best model at identifying episodes consistently and correctly.
- About 5 minutes of subtitle input has been sufficient to identify GOT episodes in my testing.
License
MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
File details
Details for the file tvidentify-0.1.3.tar.gz.
File metadata
- Download URL: tvidentify-0.1.3.tar.gz
- Upload date:
- Size: 33.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b047566184598453cfe8b49b570db3218ec20bf9798dfd2d2be7d76c0891a1e |
| MD5 | f3cdd7f0c6aa5a00054ec9f873e6a7fb |
| BLAKE2b-256 | 1f54812b9aad2daaa3fb59d773717a06cb0eb718fb91972bf363be76928e225a |
Provenance
The following attestation bundles were made for tvidentify-0.1.3.tar.gz:
Publisher: wheels.yml on ram-nat/tvidentify

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tvidentify-0.1.3.tar.gz
- Subject digest: 1b047566184598453cfe8b49b570db3218ec20bf9798dfd2d2be7d76c0891a1e
- Sigstore transparency entry: 789940013
- Sigstore integration time:
- Permalink: ram-nat/tvidentify@e1cc48e302aa4055490eb6cef82c567e7f823620
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/ram-nat
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: wheels.yml@e1cc48e302aa4055490eb6cef82c567e7f823620
- Trigger Event: push
File details
Details for the file tvidentify-0.1.3-py3-none-any.whl.
File metadata
- Download URL: tvidentify-0.1.3-py3-none-any.whl
- Upload date:
- Size: 25.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2f0b79774eb65dd4d134104fb5db745483845f962e3100fc4c4dfb24899be9cd |
| MD5 | 8e87f2494332097c50be7fd427194f61 |
| BLAKE2b-256 | 6836316a7066d5a4d2d0a3f9307b491c934b2cd2d0addc7216d8a1acc051c3ef |
Provenance
The following attestation bundles were made for tvidentify-0.1.3-py3-none-any.whl:
Publisher: wheels.yml on ram-nat/tvidentify

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tvidentify-0.1.3-py3-none-any.whl
- Subject digest: 2f0b79774eb65dd4d134104fb5db745483845f962e3100fc4c4dfb24899be9cd
- Sigstore transparency entry: 789940016
- Sigstore integration time:
- Permalink: ram-nat/tvidentify@e1cc48e302aa4055490eb6cef82c567e7f823620
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/ram-nat
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: wheels.yml@e1cc48e302aa4055490eb6cef82c567e7f823620
- Trigger Event: push