A utility to extract vocabulary lists from manga.
Manga Wordlist Extractor
This script automatically scans through manga and generates a CSV file with all the words it contains.
It is intended to be used with the community deck feature of Bunpro, hence the CSV format. Once the CSV import feature is published, I will adjust the format of the CSV. If any other outputs are desired, let me know!
Usage
You need to have Python installed (ideally Python 3.12).
Download this repository (using the "Code -> Download ZIP" option above the file list at the top), extract it, and open a command prompt in the extracted folder.
Run this to install all dependencies:
pip install -r requirements.txt
Once this is done, navigate to the src/main folder in your command prompt. You can now run the tool with this command:
python main.py "FOLDER_PATH"
Replace FOLDER_PATH with the path to the folder containing the manga files. If you enter a parent folder containing multiple volumes, add "--parent" before the folder path.
This will generate a vocab.csv file containing all words.
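To take a quick look at the result, the generated file can be read with Python's built-in csv module. This is only a minimal sketch: the column layout of vocab.csv is not described here, so the code makes no assumption about it and returns each row as a plain list of strings. The function name preview_csv and the UTF-8 encoding are assumptions, not part of this tool.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, limit=5):
    """Return the first `limit` rows of a CSV file as lists of strings."""
    # Assumption: the file is UTF-8 encoded; utf-8-sig also accepts a leading BOM.
    with Path(path).open(newline="", encoding="utf-8-sig") as handle:
        return list(islice(csv.reader(handle), limit))
```

Calling preview_csv("vocab.csv") from the folder where the tool was run should then return the first few rows for inspection.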
Notices
If you run into errors, look into the mokuro repository linked at the bottom. There might be some issues with Python version compatibility.
Also important: This script is not perfect. The text recognition can make mistakes and some of the extracted vocab can be wrong. If this proves to be a big issue I will look for a different method to parse vocabulary from the text.
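Until then, obvious recognition noise can be removed from a word list after the fact. The following is a rough sketch of one possible filter, not something this script does itself: it keeps only entries that contain at least one kana or kanji character. The names looks_japanese and filter_words are hypothetical, and a check like this cannot catch characters that were misread as other valid Japanese characters.

```python
import re

# Hiragana, katakana, the common CJK ideograph block, and the repetition mark 々.
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005]")


def looks_japanese(word):
    """Return True if the word contains at least one kana or kanji character."""
    return bool(JAPANESE_CHAR.search(word))


def filter_words(words):
    """Drop entries made up only of punctuation, digits, or Latin letters."""
    return [word for word in words if looks_japanese(word)]
```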
TODO
- Upload to PyPI and make usage much simpler
- Live output from mokuro (it can take a very long time)
- Separate outputs for each volume
- Add translations through dictionary lookup?
Acknowledgements
This is hardly my work; I just strung together some amazing libraries:
- mokuro, to extract lines of text from manga - https://github.com/kha-white/mokuro
- nagisa, to extract words from those lines of text - https://pypi.org/project/nagisa/
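The way these two libraries fit together can be sketched roughly as follows. This is not the actual code of this script: the helper names (nagisa_tokenize, collect_vocab, write_vocab_csv) and the one-word-per-row CSV layout are assumptions made for illustration. The only library call used is nagisa.tagging(text).words, which returns the list of words found in a string. The lines of text are assumed to have already been extracted by mokuro.

```python
import csv


def nagisa_tokenize(line):
    """Split one line of Japanese text into words using nagisa."""
    import nagisa  # third-party, installed separately (pip install nagisa)

    return nagisa.tagging(line).words


def collect_vocab(lines, tokenize):
    """Return the unique words found in `lines`, in first-seen order."""
    seen = set()
    vocab = []
    for line in lines:
        for word in tokenize(line):
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def write_vocab_csv(vocab, path):
    """Write one word per row (assumed layout, not the tool's real format)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for word in vocab:
            writer.writerow([word])
```

With nagisa installed, collect_vocab(lines, nagisa_tokenize) would produce the word list for a list of extracted text lines. Passing the tokenizer as a parameter keeps the deduplication logic independent of the library, so a different tokenizer could be swapped in later.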
File details
Details for the file manga_wordlist_extractor-0.1.11.tar.gz.
File metadata
- Download URL: manga_wordlist_extractor-0.1.11.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | f171f943957920fafaa1f7df1237e3de75ea0d5d3aab21ab676911e5e6822f56
MD5 | 67283704ef4944f269c59fa57b96ee81
BLAKE2b-256 | c709f4dfdd8acdb6bcc8c6b5846f7232215d0c7071b9b7996b5dc85eeae93528
File details
Details for the file manga_wordlist_extractor-0.1.11-py3-none-any.whl.
File metadata
- Download URL: manga_wordlist_extractor-0.1.11-py3-none-any.whl
- Upload date:
- Size: 17.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 50018c660b6198861d0ce06d68cdb38baafc2686628611233e08bdbe07201df7
MD5 | 921297bbef90c9e6e6daf3f461d663e8
BLAKE2b-256 | 3b55120bfc10bfe28ba6d446429e9733d3a26f0436b763c391b61411345fe9f7
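A downloaded file can be checked against the SHA256 digests listed above using Python's standard hashlib module. This is a small sketch; the function names sha256_of_file and matches_sha256 are hypothetical, and the file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, matches_sha256 would be called with the path of the downloaded .tar.gz file and the SHA256 value from its table.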