A utility to extract vocabulary lists from manga.
Japanese Vocabulary Extractor
This script automatically scans various types of Japanese media and generates a CSV file containing every word found, ready for studying. Currently supported formats are:
- Manga (as images)
- Subtitles (ASS/SRT files) from anime, shows or movies
- PDF and EPUB files
- Text (txt) files
It can also add the English definition of each word to the CSV, as well as furigana if desired.
The resulting CSV can be imported into Anki (if you add the English definitions) or Bunpro. If you wish to add furigana to your Anki deck, use this addon: https://ankiweb.net/shared/info/678316993
Installation
You need to have Python installed on your computer. I recommend using Python 3.12.
To install the Japanese Vocabulary Extractor, follow these steps:
- Open a terminal or command prompt on your computer.
- Type the following command and press Enter:
pip install japanese-vocabulary-extractor
This will download and install the necessary files for the tool to work.
Usage
To use the Japanese Vocabulary Extractor, follow these steps:
- Open a terminal or command prompt on your computer.
- Type the following command and press Enter:
jpvocab-extractor --type TYPE input_path
Replace TYPE with the type of media you are scanning: 'manga', 'subtitle', 'pdf', 'epub', 'txt' or 'generic'.
Replace input_path as follows:
- For manga, provide a folder containing the images.
- For other types, provide the file or a folder with multiple files. Use quotation marks if the path has spaces.
This will create a vocab.csv file with all the words found.
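To check the result before importing it anywhere, a short Python sketch like the one below can preview the file. It makes no assumption about the column layout, which is not documented here. It does assume that vocab.csv is UTF-8 encoded and sits in the current directory. The helper name preview_csv is hypothetical and not part of the tool.

```python
import csv
from pathlib import Path


def preview_csv(path, limit=5):
    """Return (first `limit` rows, total row count) of a CSV file."""
    rows = []
    total = 0
    # Assumption: the file is UTF-8. "utf-8-sig" also tolerates an optional
    # byte-order mark, and newline="" lets the csv module handle line endings.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        for row in csv.reader(handle):
            total += 1
            if len(rows) < limit:
                rows.append(row)
    return rows, total


# Assumption: vocab.csv was written to the current working directory.
if Path("vocab.csv").exists():
    first_rows, total = preview_csv("vocab.csv")
    print(f"{total} rows in vocab.csv")
    for row in first_rows:
        print(row)
```

If the printed rows look garbled, the encoding assumption is the first thing to revisit.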
To add English definitions to the CSV, include the --add-english option:
jpvocab-extractor --add-english --type TYPE input_path
If you wish to add furigana (in the current implementation, just the reading of the whole word in hiragana), add the --furigana option in the same way as the --add-english option. The two options can also be combined.
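To illustrate what a whole-word reading in hiragana looks like, here is a minimal sketch that converts a katakana reading (the form dictionaries such as UniDic commonly report) into hiragana. This is not taken from the tool's source and may differ from how it actually produces readings. The function name katakana_to_hiragana is hypothetical.

```python
def katakana_to_hiragana(text):
    """Map katakana characters to hiragana; leave everything else as is."""
    converted = []
    for char in text:
        code = ord(char)
        # Katakana ァ (U+30A1) through ヶ (U+30F6) sit exactly 0x60 above
        # their hiragana counterparts ぁ (U+3041) through ゖ (U+3096).
        if 0x30A1 <= code <= 0x30F6:
            converted.append(chr(code - 0x60))
        else:
            converted.append(char)
    return "".join(converted)


print(katakana_to_hiragana("タベル"))  # prints たべる
```

Characters outside that range, such as the long vowel mark ー or kanji, pass through unchanged.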
For manga only: If you have a parent folder with multiple volumes in separate folders, add --parent before the type:
jpvocab-extractor --parent --type manga input_path
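If you want to run the extractor from a script, for example over several inputs, the options described above can be assembled programmatically. The sketch below only builds the argument list and prints it. The helper build_command is hypothetical, and it assumes that the options may appear in any order before --type, which the examples above suggest but do not state.

```python
import shlex


def build_command(media_type, input_path, add_english=False,
                  furigana=False, parent=False):
    """Assemble the argument list for one jpvocab-extractor run."""
    allowed = {"manga", "subtitle", "pdf", "epub", "txt", "generic"}
    if media_type not in allowed:
        raise ValueError(f"unsupported type: {media_type!r}")
    if parent and media_type != "manga":
        # --parent is described as a manga-only option.
        raise ValueError("--parent applies to manga only")
    command = ["jpvocab-extractor"]
    if parent:
        command.append("--parent")
    if add_english:
        command.append("--add-english")
    if furigana:
        command.append("--furigana")
    command += ["--type", media_type, str(input_path)]
    return command


command = build_command("manga", "My Manga/Series",
                        add_english=True, parent=True)
# shlex.join quotes the path for a POSIX shell, since it contains a space.
print(shlex.join(command))
```

Passing the list itself to subprocess.run(command, check=True) avoids quoting issues entirely, because no shell is involved.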
Bonus: Using this script with manga will also generate .mokuro and .html files for each volume, allowing you to read the manga with selectable text in your browser. For more details, visit the mokuro GitHub page linked at the bottom.
Notices
If you run into errors, check the mokuro repository linked at the bottom. There might be some issues with Python version compatibility.
Also important: This script is not perfect. The text recognition can make mistakes, and some of the extracted vocabulary can be wrong. If this proves to be a big issue, I will look for a different method to parse vocabulary from the text. Do not be alarmed by the warning about words with no definition. These are likely names, hallucinations or mistakes by the OCR algorithm, or Chinese characters (sometimes found in subtitles).
TODO
- Better furigana
- Separate outputs for each volume
- More advanced dictionary lookup functionality
- Support more input formats (games, VNs?). Please suggest any you might want!
- Support other output formats
- Improve dictionary result accuracy to include one-character kana words when translating to English (currently filtered out because the results are mostly useless)
Acknowledgements
This is hardly my own work; I just strung together some amazing libraries:
- mokuro, to extract lines of text from manga - https://github.com/kha-white/mokuro
- mecab-python3, to tokenize Japanese text and extract the dictionary forms - https://github.com/SamuraiT/mecab-python3
- unidic_lite, for data necessary for mecab to work - https://github.com/polm/unidic-lite
- jamdict and jmdict, for the dictionary data - https://github.com/neocl/jamdict, https://www.edrdg.org/jmdict/j_jmdict.html