Automated Japanese vocabulary mining from anime subtitles with Anki integration
Anki Miner
Automated Japanese vocabulary mining from anime subtitles. Extracts unknown words, fetches definitions, pulls screenshots and audio from video files, and creates Anki flashcards automatically.
Showcase
App Showcase (latest v2.2.0 release)
Cards Created with Anki Miner
How It Works
- Parse subtitles: tokenize Japanese text with MeCab morphological analysis.
- Filter words: keep content words (nouns, verbs, adjectives, adverbs) and drop words already in your Anki collection.
- Extract media: capture screenshots and audio clips from the video at each subtitle's timestamp via ffmpeg.
- Fetch definitions: look up English definitions from JMdict (offline) or the Jisho API.
- Create cards: batch upload to Anki via AnkiConnect.
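The filtering step above can be sketched in a few lines. This is a minimal illustration, not Anki Miner's actual code: `Token`, `CONTENT_POS`, and `filter_new_words` are hypothetical names, the part-of-speech tag names are an assumption, and the tokens are hard-coded stand-ins for what a MeCab tokenizer would return.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # form as it appears in the subtitle line
    lemma: str    # dictionary form
    pos: str      # top-level part-of-speech tag


# Tags treated as content words: noun, verb, adjective, adverb (assumed tag names).
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}


def filter_new_words(tokens, known_words, min_word_length=2):
    """Return dictionary forms of unknown content words, in first-seen order."""
    seen = set()
    result = []
    for tok in tokens:
        if tok.pos not in CONTENT_POS:
            continue  # particles, auxiliaries, and so on
        if len(tok.lemma) < min_word_length:
            continue
        if tok.lemma in known_words or tok.lemma in seen:
            continue
        seen.add(tok.lemma)
        result.append(tok.lemma)
    return result


# Hand-written tokens for the sentence 猫が速く走った.
tokens = [
    Token("猫", "猫", "名詞"),
    Token("が", "が", "助詞"),
    Token("速く", "速い", "形容詞"),
    Token("走っ", "走る", "動詞"),
    Token("た", "た", "助動詞"),
]
new_words = filter_new_words(tokens, known_words={"速い"})
```

With the default minimum length of 2, the single-character noun is dropped along with the particle, the auxiliary, and the already-known adjective, so only the verb remains.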
Features
- Desktop GUI: cross-platform PyQt6 application.
- Batch processing: process a full anime series at once with automatic video/subtitle pairing.
- Offline dictionary: fast JMdict lookups with Jisho API fallback.
- Parallel media extraction: concurrent ffmpeg processes for speed.
- Preview mode: see which words would be mined without creating cards.
- Smart filtering: skips particles, pronouns, onomatopoeia, sound effects, and words you already know.
- Theming: four GUI themes (Light, Dark, Sakura, Tokyo Night) on a JSON-based system that supports custom themes.
- Analytics dashboard: track mining statistics, series difficulty rankings, and milestone achievements.
- Word curation: pick which discovered words to mine via an interactive dialog.
- Export: write results to CSV, TSV, or vocabulary list formats.
- Pitch accent data: optional pitch accent position and category fields on cards.
- Word frequency rankings: filter or annotate words by frequency rank.
- Known words database: persistent SQLite cache of known vocabulary, synced with Anki.
- Update checker: automatic check for new versions via GitHub Releases.
- Blacklist/whitelist: custom word lists to always include or exclude specific words.
- Cross-episode frequency analysis: prioritize words that appear across multiple episodes.
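As an illustration of the cross-episode idea, the sketch below counts how many distinct episodes each word appears in and keeps those above a threshold. It is a guess at the general approach, not the project's implementation; `cross_episode_words` is a hypothetical helper, and its threshold parameter borrows the name of the `min_episode_appearances` setting.

```python
from collections import Counter


def cross_episode_words(episodes, min_episode_appearances=2):
    """Rank words by the number of distinct episodes they appear in."""
    counts = Counter()
    for words in episodes:
        counts.update(set(words))  # count each word once per episode
    ranked = [(w, n) for w, n in counts.items() if n >= min_episode_appearances]
    # Most widespread first; ties broken by the word itself for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# One list of mined words per episode (made-up data).
episodes = [
    ["約束", "魔法", "約束"],
    ["魔法", "約束", "剣"],
    ["魔法", "城"],
]
ranked = cross_episode_words(episodes)
```

Counting a word once per episode keeps a word that is repeated many times in a single episode from outranking one that recurs across the whole series.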
Installation
Requirements
- Python 3.10+: download from python.org.
- ffmpeg: must be on your PATH.
  - macOS: brew install ffmpeg
  - Ubuntu/Debian: sudo apt install ffmpeg
  - Windows: download from ffmpeg.org and add to PATH.
- Anki with AnkiConnect installed.
  - In Anki, go to Tools > Add-ons > Get Add-ons and paste code 2055492159.
  - Restart Anki. AnkiConnect runs in the background while Anki is open.
YouTube mining requires yt-dlp and psutil (installed automatically with the package). Both piggyback on the same ffmpeg you already have on PATH.
Install Anki Miner
Install with pipx (recommended, creates an isolated environment):
pipx install anki-miner
Don't have pipx? Install it first:
pip install pipx && pipx ensurepath, then restart your terminal.
Or install with pip directly:
pip install anki-miner
Download standalone executable (no Python required)
Download the latest release for your platform:
| Platform | Download |
|---|---|
| Windows | AnkiMiner-Windows-x86_64.zip |
| macOS | AnkiMiner-macOS-arm64.tar.gz |
| Linux | AnkiMiner-Linux-x86_64.tar.gz |
Note: You still need ffmpeg installed and Anki running with the AnkiConnect add-on.
Manual installation (from source)
git clone https://github.com/0xzerolight/anki_miner.git
cd anki_miner
python -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows
pip install .
Desktop Shortcut
A desktop shortcut is created automatically on first launch. You can recreate it anytime from Tools → Create Desktop Shortcut... inside the app.
- Linux: adds "Anki Miner" to your application menu.
- Windows: creates an "Anki Miner" shortcut on your Desktop and Start Menu.
Recommended Setup
These steps are optional but improve the experience.
Lapis Note Type
Anki Miner uses the Lapis note type fields by default (an open-source Anki note type for Japanese learning).
- Download the latest .apkg from Lapis releases.
- In Anki, go to File > Import and select the .apkg file.
The default field mapping:
| Anki Miner Field | Note Field | Content |
|---|---|---|
| word | Expression | Dictionary form of the word |
| sentence | Sentence | Original subtitle line |
| definition | MainDefinition | English definitions |
| picture | Picture | Screenshot from the video |
| audio | SentenceAudio | Audio clip of the sentence |
| expression_furigana | ExpressionFurigana | Word with furigana reading |
| sentence_furigana | SentenceFurigana | Sentence with furigana reading |
| pitch_position | (unmapped) | Pitch accent position number |
| pitch_category | (unmapped) | Pitch accent category |
| frequency | (unmapped) | Word frequency rank |
Fields marked (unmapped) have no default Lapis mapping. Map them in Settings if your note type supports them.
You can use a different note type by changing the field mappings in the GUI settings. Any note type should work as long as it has a field to map each Anki Miner field to.
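To make the mapping concrete, here is a sketch of how mined cards could be turned into an AnkiConnect `addNotes` request using the default Lapis mapping from the table. The `addNotes` action and its `deckName` / `modelName` / `fields` note shape come from the AnkiConnect API; `FIELD_MAP` and `build_add_notes_request` are hypothetical names, and the app itself may build its requests differently. The sketch only constructs the JSON body and does not send it.

```python
import json

# Default mapping from the table above: Anki Miner field -> Lapis note field.
FIELD_MAP = {
    "word": "Expression",
    "sentence": "Sentence",
    "definition": "MainDefinition",
    "picture": "Picture",
    "audio": "SentenceAudio",
    "expression_furigana": "ExpressionFurigana",
    "sentence_furigana": "SentenceFurigana",
}


def build_add_notes_request(cards, deck="Anki Miner", model="Lapis", field_map=FIELD_MAP):
    """Build the JSON-serializable body of an AnkiConnect addNotes call."""
    notes = []
    for card in cards:
        # Only mapped fields that the card actually carries are sent.
        fields = {
            note_field: card[key]
            for key, note_field in field_map.items()
            if key in card
        }
        notes.append({"deckName": deck, "modelName": model, "fields": fields})
    return {"action": "addNotes", "version": 6, "params": {"notes": notes}}


request = build_add_notes_request(
    [{"word": "走る", "sentence": "猫が走った。", "definition": "to run"}]
)
body = json.dumps(request, ensure_ascii=False)
```

By default AnkiConnect listens on http://127.0.0.1:8765, so `body` would be POSTed there while Anki is running. Switching note types then amounts to passing a different `model` and `field_map`.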
JMdict Offline Dictionary
For fast offline lookups, download JMdict:
mkdir -p ~/.anki_miner
wget -O ~/.anki_miner/JMdict_e.gz http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz
gunzip ~/.anki_miner/JMdict_e.gz
Without JMdict, Anki Miner falls back to the Jisho API (slower, requires internet, rate-limited).
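For a sense of what an offline lookup involves, the sketch below streams through JMdict-style XML and collects the English glosses for a word. The element names (`entry`, `keb`, `reb`, `gloss`) are those used by JMdict; `lookup_glosses` is a hypothetical helper, and the inline sample stands in for the real file, which is far larger and declares entities in its DTD that this sketch has not been tested against.

```python
import io
import xml.etree.ElementTree as ET


def lookup_glosses(xml_source, word):
    """Return glosses of entries whose kanji (keb) or reading (reb) equals `word`."""
    glosses = []
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "entry":
            continue
        forms = {e.text for e in elem.iter("keb")} | {e.text for e in elem.iter("reb")}
        if word in forms:
            glosses.extend(g.text for g in elem.iter("gloss"))
        elem.clear()  # release the entry's children; the real file is large
    return glosses


# Tiny hand-written sample in JMdict's structure.
SAMPLE = """<JMdict>
<entry><ent_seq>1</ent_seq><k_ele><keb>走る</keb></k_ele><r_ele><reb>はしる</reb></r_ele>
<sense><gloss>to run</gloss><gloss>to dash</gloss></sense></entry>
<entry><ent_seq>2</ent_seq><k_ele><keb>猫</keb></k_ele><r_ele><reb>ねこ</reb></r_ele>
<sense><gloss>cat</gloss></sense></entry>
</JMdict>""".encode("utf-8")

glosses = lookup_glosses(io.BytesIO(SAMPLE), "はしる")
```

Scanning the whole file for every word would be slow in practice; a real tool would more likely parse it once and keep an in-memory or on-disk index.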
Quick Start
Launch the desktop application:
anki_miner_gui
The GUI provides five tabs:
- Single Episode: mine one video/subtitle pair with file selectors and progress tracking.
- Batch Processing: queue multiple series for sequential processing.
- YouTube: paste a URL, fetch metadata, then mine (see below).
- Analytics: mining statistics dashboard with overview cards, recent sessions, series difficulty rankings, and milestone achievements.
- Settings: configure Anki connection, media extraction, dictionary, and word filtering options.
YouTube mining
Paste a YouTube URL, click Fetch Info to probe metadata (title, duration, sub availability), then click Mine. The fetch downloads the video plus its Japanese subtitle track into a per-run temp directory, then hands both files to the same pipeline used for file-based mining. Cards land in Anki the same way.
Auto-captions are accepted only when they are native Japanese. Tracks that YouTube generates by machine-translating from English (or another language) are filtered out — mining those produces garbage. Cards derived from native auto-captions may still be lower quality than cards from manual subtitles, since auto-captions have no sentence boundaries.
- Bot-detection prompts: if YouTube asks "Sign in to confirm you're not a bot", open Settings → Cookies → Browser and pick Firefox or Chrome. yt-dlp pulls cookies from that browser's profile on every fetch.
- Age-restricted videos: same fix — set the cookies-from-browser option to the browser you use to watch YouTube.
- Max duration: defaults to 120 minutes. The probe aborts before any download if the video is longer. Adjust in Settings.
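The probe checks described above could look roughly like the following. This is a sketch under stated assumptions rather than the app's code: `check_probe` is a hypothetical helper, `info` is a metadata dict shaped like the output of yt-dlp's `extract_info()` (`duration` in seconds, plus `subtitles` and `automatic_captions` keyed by language), and it assumes yt-dlp labels the native-language auto-caption track with an `-orig` suffix.

```python
def check_probe(info, max_duration_minutes=120):
    """Validate probed metadata and pick a Japanese subtitle source.

    Returns ("manual" | "auto", language_key); raises ValueError when the
    video is too long or no native Japanese track is available.
    """
    duration = info.get("duration") or 0  # seconds
    if duration > max_duration_minutes * 60:
        raise ValueError(
            f"video is {duration / 60:.0f} min, limit is {max_duration_minutes} min"
        )
    if "ja" in info.get("subtitles", {}):
        return ("manual", "ja")
    # Assumption: "ja-orig" marks native auto-captions, while a plain "ja"
    # auto-caption may be a machine translation from another language.
    if "ja-orig" in info.get("automatic_captions", {}):
        return ("auto", "ja-orig")
    raise ValueError("no native Japanese subtitle track found")


manual = check_probe({"duration": 1400, "subtitles": {"ja": []}})
auto = check_probe({"duration": 1400, "automatic_captions": {"ja-orig": [], "ja": []}})
```

Running both checks on the metadata alone means a rejected video costs one probe request and no download.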
Configuration
All settings can be adjusted in the GUI Settings tab. Here are the key options:
| Setting | Default | Description |
|---|---|---|
| anki_deck_name | "Anki Miner" | Target Anki deck |
| anki_note_type | "Lapis" | Note type to use |
| audio_padding | 0.3 | Seconds added before/after audio clips |
| screenshot_offset | 1.0 | Seconds after subtitle start for screenshot |
| min_word_length | 2 | Minimum characters per word |
| max_parallel_workers | 6 | Concurrent ffmpeg processes |
| use_offline_dict | true | Use JMdict instead of Jisho API |
| subtitle_offset | 0.0 | Global subtitle timing adjustment |
| use_pitch_accent | false | Enable pitch accent data on cards |
| use_frequency_data | false | Enable word frequency ranking |
| max_frequency_rank | 0 | Frequency rank cutoff (0 = no filter) |
| use_known_words_db | false | Persistent known word cache |
| enable_history | true | Track mining history with undo support |
| use_cross_episode | false | Prioritize cross-episode words |
| min_episode_appearances | 2 | Minimum episodes for cross-episode filter |
| use_blacklist | false | Enable blacklist word filtering |
| use_whitelist | false | Enable whitelist word filtering |
GUI settings are saved to ~/.anki_miner/gui_config.json.
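To show how the timing settings might interact, the sketch below builds ffmpeg command lines for one subtitle line. The ffmpeg flags used (`-ss`, `-i`, `-t`, `-frames:v`, `-vn`, `-y`) are standard, but `screenshot_cmd` and `audio_cmd` are hypothetical helpers, and the way `subtitle_offset`, `screenshot_offset`, and `audio_padding` are combined here is an assumption based on their descriptions, not the app's actual logic. The commands are only constructed, not executed.

```python
def screenshot_cmd(video, start, out_path, screenshot_offset=1.0, subtitle_offset=0.0):
    """Command that grabs one frame shortly after the subtitle appears."""
    t = max(0.0, start + subtitle_offset + screenshot_offset)
    return ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video, "-frames:v", "1", out_path]


def audio_cmd(video, start, end, out_path, audio_padding=0.3, subtitle_offset=0.0):
    """Command that cuts the subtitle's audio with padding on both sides."""
    clip_start = max(0.0, start + subtitle_offset - audio_padding)
    clip_end = end + subtitle_offset + audio_padding
    return [
        "ffmpeg", "-y",
        "-ss", f"{clip_start:.3f}",            # seek before opening the input
        "-i", video,
        "-t", f"{clip_end - clip_start:.3f}",  # clip duration in seconds
        "-vn",                                 # drop the video stream
        out_path,
    ]


# A subtitle shown from 12.5 s to 14.0 s.
shot = screenshot_cmd("ep01.mkv", 12.5, "shot.jpg")
clip = audio_cmd("ep01.mkv", 12.5, 14.0, "clip.mp3")
```

Each list can be passed to `subprocess.run`. Because every command is independent, several can run at once, which is what a worker limit such as `max_parallel_workers` would cap.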
Troubleshooting
| Issue | Solution |
|---|---|
| "Cannot connect to Anki" | Start Anki and ensure AnkiConnect is installed |
| "Deck not found" | Create the deck in Anki or update the deck name in settings |
| "Note type not found" | Import the Lapis note type (see Recommended Setup above) or configure your own |
| "ffmpeg not found" | Install ffmpeg and add to PATH |
| "JMdict file not found" | Download to ~/.anki_miner/ (see Recommended Setup above) or disable offline dictionary |
| Audio is wrong language | The tool tries Japanese audio tracks first, then falls back to the default track |
| Subtitles out of sync | Use the subtitle offset control in the GUI to adjust timing |
Issues and Contributing
Found a bug or have an idea for a feature? Open an issue. Bug reports and suggestions are welcome.
Pull requests are also welcome. See CONTRIBUTING.md for development setup and guidelines.
License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
Download files
File details
Details for the file anki_miner-2.3.0.tar.gz.
File metadata
- Download URL: anki_miner-2.3.0.tar.gz
- Upload date:
- Size: 327.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56ea8828fddab683c5d240954596c1b68dd408c19d4443eb52a721f4212cce11 |
| MD5 | 30ef0d57cfa408bc8b3669ce821ac16c |
| BLAKE2b-256 | 2b3bfb3ec559a91e532af611fab2e22ac3e8162a5ee9e75ee2fd2d9b9cf01fc9 |
Provenance
The following attestation bundles were made for anki_miner-2.3.0.tar.gz:
Publisher: publish.yml on 0xzerolight/anki_miner
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anki_miner-2.3.0.tar.gz
- Subject digest: 56ea8828fddab683c5d240954596c1b68dd408c19d4443eb52a721f4212cce11
- Sigstore transparency entry: 1362556083
- Sigstore integration time:
- Permalink: 0xzerolight/anki_miner@e5026603e8709598a86c70e0079cc8d6f2193711
- Branch / Tag: refs/tags/v2.3.0
- Owner: https://github.com/0xzerolight
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5026603e8709598a86c70e0079cc8d6f2193711
- Trigger Event: push
File details
Details for the file anki_miner-2.3.0-py3-none-any.whl.
File metadata
- Download URL: anki_miner-2.3.0-py3-none-any.whl
- Upload date:
- Size: 372.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc93cd29b46fadc3cfd2180b1a98921108de3e073ab35f5f26a5a0d141568fa9 |
| MD5 | 3ec72b673f950395fce40679c2e159fc |
| BLAKE2b-256 | c79a309b33e17a342ccb5e00e6bf40a14812b0f7b788d36011d85743339ef857 |
Provenance
The following attestation bundles were made for anki_miner-2.3.0-py3-none-any.whl:
Publisher: publish.yml on 0xzerolight/anki_miner
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anki_miner-2.3.0-py3-none-any.whl
- Subject digest: dc93cd29b46fadc3cfd2180b1a98921108de3e073ab35f5f26a5a0d141568fa9
- Sigstore transparency entry: 1362556156
- Sigstore integration time:
- Permalink: 0xzerolight/anki_miner@e5026603e8709598a86c70e0079cc8d6f2193711
- Branch / Tag: refs/tags/v2.3.0
- Owner: https://github.com/0xzerolight
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5026603e8709598a86c70e0079cc8d6f2193711
- Trigger Event: push