A utility to extract vocabulary lists from manga.
Japanese Vocabulary Extractor
This script automatically scans various types of Japanese media (currently manga and ebooks) and generates a CSV file with all the words they contain.
It is intended to be used with the community deck feature of Bunpro, hence the CSV output format. Once Bunpro publishes its CSV import feature, I will adjust the format of the CSV to match. If any other outputs are desired, let me know!
Installation
You need to have Python installed (ideally Python 3.12).
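To see which interpreter you are running before installing, a small check like the one below can help. This is only a sketch: the `RECOMMENDED` constant and the `version_note` helper are names made up for illustration, and nothing in the package requires this step.

```python
import sys

# The version the text above calls ideal; other versions are untested here.
RECOMMENDED = (3, 12)


def version_note(current=None) -> str:
    """Describe how the given (or running) Python version compares to 3.12."""
    major, minor = (current or sys.version_info)[:2]
    if (major, minor) == RECOMMENDED:
        return f"Python {major}.{minor}: matches the recommended version."
    return f"Python {major}.{minor}: may work, but 3.12 is the recommended version."


print(version_note())
```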
Using pip
Install the package using
pip install japanese-vocabulary-extractor
Usage
main.py [-h] [--parent] --type TYPE input_path
Specify the type of media: 'manga', 'pdf', 'epub' or 'text'. Replace input_path with the path containing the files (or, if not a manga, the file directly). Make sure to surround it with quotation marks if there are spaces in the path!
Only for manga: If you enter a parent folder containing multiple volumes in their own folders, add "--parent" before the folder path.
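The argument layout described above can be mirrored with `argparse`. The following is a minimal sketch, not the package's actual code: `build_parser` and `volume_dirs` are hypothetical names, and the assumption that `--parent` means "one sub-folder per volume" is taken from the description only.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage line (a sketch, not the package's code)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--parent", action="store_true",
                        help="input_path holds one sub-folder per manga volume")
    parser.add_argument("--type", required=True, metavar="TYPE",
                        choices=["manga", "pdf", "epub", "text"],
                        help="type of media to scan")
    parser.add_argument("input_path", type=Path,
                        help="folder (manga) or file (other types) to scan")
    return parser


def volume_dirs(input_path: Path, parent: bool) -> list[Path]:
    """Hypothetical helper: list the volume folders a manga run would visit."""
    if not parent:
        return [input_path]
    # With --parent, every direct sub-folder is treated as one volume.
    return sorted(p for p in input_path.iterdir() if p.is_dir())


# Passing arguments as a list sidesteps shell quoting for paths with spaces.
args = build_parser().parse_args(["--parent", "--type", "manga", "My Manga"])
print(args.type, args.parent, args.input_path)
```

On a real command line the same path would need quotation marks, for example `--parent --type manga "My Manga"`.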
Running the script generates a vocab.csv file containing all words.
Bonus: Since this script uses mokuro, it also generates a .mokuro and .csv file for each volume, allowing you to read the manga with selectable text in your browser. For more info, visit the mokuro GitHub page linked at the bottom.
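If you want to post-process the word list yourself, it can be read back with Python's `csv` module. The sketch below assumes that each row of vocab.csv starts with the word itself and that the file is UTF-8 encoded; neither is confirmed by the description, so adjust it to the real layout. `load_words` is a hypothetical helper, and the demo runs on a stand-in file.

```python
import csv
import tempfile
from pathlib import Path


def load_words(csv_path: Path) -> list[str]:
    """Return the first column of every non-empty row.

    Assumption: each row of vocab.csv starts with the word itself.
    """
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row and row[0].strip()]


# Demo on a stand-in file; point load_words at your own vocab.csv instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "vocab.csv"
    sample.write_text("食べる\n猫\n\n学校\n", encoding="utf-8")
    words = load_words(sample)

print(len(words), "words loaded")
```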
Notices
If you run into errors, check the mokuro repository linked at the bottom. There might be some issues with Python version compatibility.
Also important: This script is not perfect. The text recognition can make mistakes, and some of the extracted vocab can be wrong. If this proves to be a big issue, I will look for a different method to parse vocabulary from the text.
TODO
- Live output from mokuro (it can take a very long time)
- Separate outputs for each volume
- Add translations through dictionary lookup?
- Support more input formats (please suggest any you might want!)
- Support other output formats
Acknowledgements
This is hardly my work. I just strung together some amazing libraries:
- mokuro, to extract lines of text from manga - https://github.com/kha-white/mokuro
- mecab-python3, to tokenize Japanese text and extract the dictionary forms - https://github.com/SamuraiT/mecab-python3
- unidic_lite, for data necessary for mecab to work - https://github.com/polm/unidic-lite
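The libraries above handle text recognition and tokenization. What remains is collapsing the resulting dictionary forms into a word list. The sketch below illustrates that last step under stated assumptions: the `lemmas` list stands in for tokenizer output (no MeCab call is made here), the one-word-per-row layout is a guess at the CSV format, and `unique_in_order` and `to_csv` are hypothetical names rather than functions from this package.

```python
import csv
import io


def unique_in_order(lemmas):
    """Drop repeated dictionary forms, keeping first-seen order."""
    return list(dict.fromkeys(lemmas))


def to_csv(words) -> str:
    """Serialise one word per row, as a stand-in for the vocab.csv layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows([word] for word in words)
    return buffer.getvalue()


# Hypothetical tokenizer output: dictionary forms in reading order.
lemmas = ["猫", "食べる", "猫", "学校", "食べる"]
print(to_csv(unique_in_order(lemmas)), end="")
```

Keeping first-seen order means the list follows the order in which words appear in the source text, which may be convenient when studying alongside it.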
File details
Details for the file japanese_vocabulary_extractor-0.3.2.tar.gz.
File metadata
- Download URL: japanese_vocabulary_extractor-0.3.2.tar.gz
- Upload date:
- Size: 17.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | a665f9f3f5d0828aab484e819b3a4820f10e73bc5260ffc50da9fe4049fcf829
MD5 | cce6b90dbb648f77d081dc532b6ecf19
BLAKE2b-256 | be72a3b79463e91e3fb00e56f92eaaa6034b216d25a7202087770815a68d19ed
File details
Details for the file japanese_vocabulary_extractor-0.3.2-py3-none-any.whl.
File metadata
- Download URL: japanese_vocabulary_extractor-0.3.2-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1f7ca419435578557e6d3b93679c662f5343d09271b692a239a5abd2a153f366
MD5 | 58f4a1d15a480efaf0c29741f14e3a30
BLAKE2b-256 | ba8485f72fbfc4e201290971b6e7ea32b023c9c68977dfb016999c7aedeaee44