A utility to extract vocabulary lists from manga.

Japanese Vocabulary Extractor

This script automatically scans various types of Japanese media and generates a CSV file of all the words they contain, for studying. Currently supported formats are:

  • Manga (as images)
  • Subtitles (ASS/SRT files) from anime, shows or movies
  • PDF and EPUB files
  • Text files

It can also add the English definition of each word to the CSV automatically.

The resulting CSV can be imported into Anki (if you add the English definitions) or Bunpro.

Installation

You need to have Python installed (ideally Python 3.12).

Install the package with pip:

pip install japanese-vocabulary-extractor

Usage

jpvocab-extractor [-h] [--parent] [--add-english] --type TYPE input_path

Specify the type of media: 'manga', 'subtitle', 'pdf', 'epub' or 'text'. Replace input_path with the path containing the files (or, if not a manga, the file directly). Make sure to surround it with quotation marks if there are spaces in the path!

This will generate a vocab.csv file containing all words. If you want definitions in the second column of the CSV, add the "--add-english" argument.

Only for manga: If you enter a parent folder containing multiple volumes in their own folders, add "--parent" before the type.
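
If you want to run the extractor from a script, the usage line above can be assembled as an argument list and handed to Python's subprocess module, which also avoids the quoting problem for paths with spaces. This is only a sketch: build_command and run_extractor are hypothetical helpers, not part of the package, and it assumes that jpvocab-extractor is on your PATH after installation.

```python
import subprocess

# Media types listed in the usage section above.
VALID_TYPES = {"manga", "subtitle", "pdf", "epub", "text"}


def build_command(input_path, media_type, parent=False, add_english=False):
    """Assemble the jpvocab-extractor argument list (hypothetical helper)."""
    if media_type not in VALID_TYPES:
        raise ValueError(f"unsupported type: {media_type!r}")
    if parent and media_type != "manga":
        raise ValueError("--parent only applies to manga")
    cmd = ["jpvocab-extractor"]
    if parent:
        cmd.append("--parent")
    if add_english:
        cmd.append("--add-english")
    cmd += ["--type", media_type, str(input_path)]
    return cmd


def run_extractor(input_path, media_type, **options):
    """Run the CLI; assumes the package is installed and on PATH."""
    # Passing a list means the path needs no manual quoting, even with spaces.
    return subprocess.run(build_command(input_path, media_type, **options), check=True)


# Example (not executed here): run_extractor("My Manga/Volume 1", "manga", add_english=True)
```

Building the list separately from running it makes it easy to print or check the command before anything is executed.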

Bonus: Since this script uses mokuro, it also generates a .mokuro and an .html file for each volume, allowing you to read the manga with selectable text in your browser. For more info, visit the mokuro GitHub page linked at the bottom.

Notices

If you run into errors, check the mokuro repository linked at the bottom. There may be issues with Python version compatibility.

Also important: This script is not perfect. The text recognition can make mistakes, and some of the extracted vocab can be wrong. If this proves to be a big issue, I will look for a different method to parse vocabulary from the text. Do not be alarmed by the warning about words with no definition: these are likely names, hallucinations/mistakes by the OCR algorithm, or Chinese characters (sometimes found in subtitles).
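
If you used "--add-english", one way to deal with the entries that have no definition is to separate them out before importing. The sketch below assumes that vocab.csv holds the word in the first column and the definition in the second, and that a missing definition appears as an empty or absent second column. Check your own file first, since the exact layout (including whether there is a header row) is an assumption here. split_by_definition is a hypothetical helper, not part of the package.

```python
import csv
import io


def split_by_definition(csv_file):
    """Split rows into (word, definition) pairs and words lacking a definition."""
    defined, undefined = [], []
    for row in csv.reader(csv_file):
        # Skip blank lines and rows without a word.
        if not row or not row[0].strip():
            continue
        word = row[0].strip()
        # Assumption: the definition, if any, is in the second column.
        definition = row[1].strip() if len(row) > 1 else ""
        if definition:
            defined.append((word, definition))
        else:
            undefined.append(word)
    return defined, undefined


# Small in-memory sample standing in for vocab.csv.
sample = io.StringIO("猫,cat\nナルト,\n食べる,to eat\n")
defined, undefined = split_by_definition(sample)
print(defined)    # words that are ready to import
print(undefined)  # likely names or OCR mistakes, to review by hand
```

For a real file, open it with open("vocab.csv", newline="", encoding="utf-8") and pass the file object to split_by_definition.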

TODO

  • Separate outputs for each volume
  • More advanced dictionary lookup functionality
  • Support more input formats (Games, VNs?) Please suggest any you might want!
  • Support other output formats

Acknowledgements

This is hardly my work; I just strung together some amazing libraries:
