Create frequency dictionaries for yomichan out of a variety of media

yomidict

Create frequency dictionaries for yomichan out of media.
Currently supported formats are: epub, html, srt, ass, txt

pip install yomidict

Minimal working example:

import yomidict
dm = yomidict.DictMaker()
filelist = ["test.html"]*5 + ["test.epub"]*2 + ["test.srt"]*2
dm.feed_files(filelist)
dm.save("zipfile.zip", "name_in_yomichan", use_suffix=True)
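In practice the file list usually comes from a directory rather than being typed out by hand. A minimal sketch, assuming the media sits in one folder tree; the helper name collect_media_files is hypothetical and not part of yomidict, and the extension set simply mirrors the supported formats listed above:

```python
from pathlib import Path

# Extensions mirroring the formats listed as supported above.
SUPPORTED_SUFFIXES = {".epub", ".html", ".srt", ".ass", ".txt"}


def collect_media_files(root):
    """Return the sorted paths (as strings) of supported files under root.

    Hypothetical helper for illustration; not part of yomidict.
    """
    root = Path(root)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The returned list can then be passed to dm.feed_files() in place of the hand-written filelist.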

Docs:

DictMaker Object

wcounter

A Counter that stores the number of occurrences of each token found during feeding.

refcounter

Tracks how many files each token was found in. For example, a value of 0.5 (if normalized) means that the token occurs in 50% of all files that were fed.
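The difference between the two counters can be illustrated without the library. This is only a sketch of the idea, assuming each file has already been split into tokens; the function count_tokens is hypothetical and is not how yomidict is implemented internally:

```python
from collections import Counter


def count_tokens(documents):
    """Build both counters from a list of token lists (one list per file).

    Hypothetical helper for illustration; not part of yomidict.
    """
    wcounter = Counter()    # total occurrences of each token
    refcounter = Counter()  # number of files each token appears in
    for tokens in documents:
        wcounter.update(tokens)          # every occurrence counts
        refcounter.update(set(tokens))   # each file counts once per token
    return wcounter, refcounter


# Three "files": 猫 occurs three times in total, but only in two files.
docs = [["猫", "犬", "猫"], ["猫"], ["鳥"]]
wcounter, refcounter = count_tokens(docs)
```

Here wcounter["猫"] is 3 while refcounter["猫"] is 2, because the second occurrence inside the first file does not add to the file count.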

DictMaker.feed_files()

def feed_files(
        self,
        filelist,
        skip_errors=True,
        reset_refcounter=True,
        normalize_refcounter=True,
    )

skip_errors: skips errors, as the name suggests. While a batch of files is being processed, all sorts of errors can occur that would otherwise abort the feeding. Since this may be undesirable, they can be skipped. Files that raised errors are still taken into consideration when calculating DictMaker.refcounter.

reset_refcounter: resets the refcounter before feeding files.

normalize_refcounter: divides each count by the total number of files (count/total_number_of_files). If a token comes up in 8 out of 10 books, the value of the counter is therefore 0.8 instead of 8. This makes the value easier to read even without knowing the total number of files that were fed into DictMaker.
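The normalization itself is a single division per token. A minimal sketch of that arithmetic, assuming a plain Counter as input; the function name normalize is hypothetical and only mirrors the count/total_number_of_files formula above, it is not the library's code:

```python
from collections import Counter


def normalize(refcounter, total_files):
    """Divide every file count by the total number of files fed.

    Hypothetical helper for illustration; not part of yomidict.
    """
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    return {token: count / total_files for token, count in refcounter.items()}


# A token found in 8 of 10 books ends up with the value 0.8.
normalized = normalize(Counter({"猫": 8, "鳥": 1}), total_files=10)
```

Going by the description of skip_errors, total_files would include files that were skipped because of errors.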

DictMaker.save()

def save(
        self,
        filepath,
        dictname,
        only_rank_and_freq=False,
        use_suffix=True,
        use_suffix_rank=False,
        use_suffix_freq=False,
    )

only_rank_and_freq: by default, the word rank, the word frequency and the refcounter_value are saved. Setting this to True leaves out the refcounter_value.

use_suffix: activates use_suffix_rank and use_suffix_freq.

use_suffix_rank: if the rank is above 1000, it is replaced by an abbreviated form ("num/1000 K"), e.g. 2530 becomes 2K and 2434455 becomes 2M.

use_suffix_freq: same as use_suffix_rank, but for the frequency.
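The suffix behaviour can be sketched as follows. The function abbreviate is hypothetical; truncating instead of rounding and the exact thresholds are assumptions inferred from the two examples above (2530 becomes 2K, 2434455 becomes 2M), so the library may handle edge cases differently:

```python
def abbreviate(number):
    """Shorten large integers with a K or M suffix.

    Hypothetical helper for illustration; not part of yomidict.
    Truncation and the thresholds are assumptions based on the examples.
    """
    if number >= 1_000_000:
        return f"{number // 1_000_000}M"
    if number >= 1_000:
        return f"{number // 1_000}K"
    return str(number)
```

With these assumptions, abbreviate(2530) gives "2K" and abbreviate(2434455) gives "2M", matching the examples, while numbers below 1000 are left unchanged.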

DictMaker.feed_text()

def feed_text(self, text, refcounter_add=False)

Can be used to feed a string into DictMaker.

refcounter_add: if True, adds 1 occurrence in refcounter for every token that was found in the fed text.

How to feed a large text file

Do you want to use refcounter? If so, do you know the number of works inside the large text file? If not, don't use refcounter.

If you do know the number of works, do you also know where one work ends and the next begins? If so, read the file in chunks, let each chunk add to the refcounter, and normalize it at the end. If not, don't use refcounter.

To feed a large text file, you can read it line by line or sentence by sentence and use the DictMaker._clean_txt() function.

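import yomidict
dm = yomidict.DictMaker()
# "large.txt" is a placeholder path; iterating the file yields one line at a time
with open("large.txt", encoding="utf-8") as large_txt_file:
    for line in large_txt_file:
        dm.feed_text(dm._clean_txt(line))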

If you know the boundaries of each work and can feed it in chunks, you could do something like this:

dm = yomidict.DictMaker()
for work in large_txt_file:
    dm.feed_text(dm._clean_txt(work), refcounter_add=True)
dm.normalize_refcounter(works_in_large_txt_file)
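How the works are separated depends on the file. A minimal sketch, assuming the works are divided by a line consisting only of a known marker; both the helper iter_works and the "=====" marker are hypothetical and not part of yomidict:

```python
def iter_works(lines, separator="====="):
    """Yield each work as one string, splitting on separator lines.

    Hypothetical helper for illustration; not part of yomidict.
    `lines` can be any iterable of lines, e.g. an open text file.
    """
    chunk = []
    for line in lines:
        if line.rstrip("\n") == separator:
            if chunk:                 # skip empty chunks between separators
                yield "".join(chunk)
                chunk = []
        else:
            chunk.append(line)
    if chunk:                         # last work has no trailing separator
        yield "".join(chunk)
```

Each yielded string could be fed with dm.feed_text(dm._clean_txt(work), refcounter_add=True) as in the snippet above, and the number of yielded works is the value to normalize by at the end.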
