A utility to extract vocabulary lists from manga.

Manga Wordlist Extractor

This script automatically scans various types of Japanese media (currently manga and ebooks) and generates a CSV file containing all the words it finds.

It is intended to be used with the community deck feature of Bunpro, hence the CSV output format. Once Bunpro's CSV import feature is released, I will adjust the format of the CSV accordingly. If you would like any other output formats, let me know!

Installation

You need to have Python installed (ideally Python 3.12).

Using pip

Install the package with:

pip install manga-wordlist-extractor

Using the source code directly

Download this repository (use the "Code -> Download ZIP" option above the file list), extract it, and open a command prompt in the extracted folder.

Run this to install all dependencies:

pip install -r requirements.txt

You can now run the tool from the src/main/main.py file.

Usage

main.py [-h] [--parent] --type TYPE input_path

Specify the type of media with --type: 'manga', 'pdf', 'epub' or 'text'. Replace input_path with the path to the folder containing the files (or, for anything other than manga, the path to the file itself). Surround the path with quotation marks if it contains spaces!
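
To make the usage line concrete, here is a minimal argparse sketch that accepts the same arguments. It is an illustration only, not the project's actual source: the function name build_parser, the help texts, and the use of choices to restrict --type are assumptions.

```python
import argparse

# Media types named in the usage text above.
MEDIA_TYPES = ("manga", "pdf", "epub", "text")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring: main.py [-h] [--parent] --type TYPE input_path"""
    parser = argparse.ArgumentParser(prog="main.py")
    # Optional flag, only meaningful for manga.
    parser.add_argument(
        "--parent",
        action="store_true",
        help="input_path is a parent folder with one subfolder per volume",
    )
    # Assumption: restricting the values with `choices` is an illustrative choice.
    parser.add_argument(
        "--type",
        required=True,
        choices=MEDIA_TYPES,
        metavar="TYPE",
        help="type of media: manga, pdf, epub or text",
    )
    parser.add_argument("input_path", help="folder (manga) or file (other types)")
    return parser


# A path with spaces arrives as one argument when it is quoted on the command line.
args = build_parser().parse_args(["--parent", "--type", "manga", "My Manga Folder"])
print(args.type, args.parent, args.input_path)
```

On a real command line, the equivalent call would pass the quoted folder path as the last argument.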

Only for manga: if input_path is a parent folder containing multiple volumes, each in its own folder, add "--parent" before the folder path.
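
The difference between a single volume folder and a parent folder can be sketched as follows. This is a hypothetical helper, not the project's code: the name find_volume_dirs and the rule that every immediate subfolder counts as one volume are assumptions.

```python
from pathlib import Path


def find_volume_dirs(input_path: str, parent: bool) -> list[Path]:
    """Return the folders that would be treated as manga volumes.

    Assumption: with parent=True, every immediate subfolder is one volume;
    otherwise input_path itself is the single volume.
    """
    root = Path(input_path)
    if not parent:
        return [root]
    # Ignore loose files in the parent folder and sort for a stable order.
    return sorted(p for p in root.iterdir() if p.is_dir())
```

With a layout such as "My Manga/vol1" and "My Manga/vol2", calling find_volume_dirs("My Manga", parent=True) would return both volume folders, while parent=False would return only "My Manga" itself.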

This will generate a vocab.csv file containing all words.
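
If you want to inspect the result from Python, a sketch like the one below reads the words back in. The column layout of vocab.csv is not documented here, so this assumes the word sits in the first column and that there is no header row; load_vocab is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_vocab(csv_path: str) -> list[str]:
    """Read words from a CSV file, assuming the word is in the first column."""
    words = []
    # newline="" is what the csv module recommends for files it reads.
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Skip empty rows and rows whose first cell is blank.
            if row and row[0].strip():
                words.append(row[0].strip())
    return words
```

Calling load_vocab("vocab.csv") in the folder where the file was generated would return the words as a list of strings; adjust the column index if the actual file is laid out differently.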

Bonus: Because this script uses mokuro, it also generates a .mokuro and a .csv file for each volume, which lets you read the manga with selectable text in your browser. For more info, see the mokuro GitHub page.

Notices

If you run into errors, check the mokuro repository first. There may be issues with Python version compatibility.

Also important: this script is not perfect. Text recognition can make mistakes, so some of the extracted vocabulary may be wrong. If this proves to be a big issue, I will look for a different method to parse vocabulary from the text.

TODO

  • Live output from mokuro (it can take a very long time)
  • Separate outputs for each volume
  • Add translations through dictionary lookup?
  • Support more input formats (please suggest any you might want!)
  • Support other output formats

Acknowledgements

This is hardly my work; I just strung together some amazing libraries, mokuro among them.
