Project description

Helper functions to find word trends (i.e. extract tokens, lemmatize and filter).

Installation

pip install -U git+https://github.com/adbar/shoten.git

(Use pip3 instead of pip if that is how Python 3 is set up on your system.)

Usage

Input

Two possibilities for input data:

  • XML-TEI files as generated by trafilatura:

    from shoten import gen_wordlist
    myvocab = gen_wordlist(mydir, ['de', 'en'])

  • TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source)

    from shoten import load_wordlist
    myvocab = load_wordlist(myfile, ['de', 'en'])
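To make the TSV layout described above concrete, here is a minimal sketch that produces such a file using only the standard library. The sample words, the source names and the helper `write_wordlist` are hypothetical and not part of shoten; only the column order (word form, date, optional source) comes from the description above.

```python
import csv
import io

# Hypothetical sample rows: (word form, date as YYYY-MM-DD, source).
# The third column is optional according to the format description.
rows = [
    ("Impfpass", "2021-11-02", "example.org"),
    ("booster", "2021-12-15", "example.com"),
    ("Lockdown", "2021-10-20", "example.net"),
]

def write_wordlist(handle, entries):
    """Write one tab-separated line per entry to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for entry in entries:
        writer.writerow(entry)

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_wordlist(buffer, rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To obtain a file on disk, the same helper can be called with a handle from `open(path, "w", encoding="utf-8", newline="")`; the resulting path is then what would be passed as `myfile` to `load_wordlist`.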

Language codes: optional list of languages to be considered for lemmatization, ordered by relevance. ISO 639-1 codes, see the list of supported languages.

Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
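The effect of a day limit such as maxdiff can be pictured with a small standalone sketch. This is not shoten's implementation: the function `within_maxdiff`, its `today` parameter and the decision to reject dates in the future are assumptions made for illustration.

```python
from datetime import date, timedelta

def within_maxdiff(day_string, maxdiff=1000, today=None):
    """Return True if a YYYY-MM-DD date lies at most `maxdiff` days before `today`."""
    today = today or date.today()
    day = date.fromisoformat(day_string)
    # Assumption of this sketch: dates after `today` are rejected as well.
    return timedelta(0) <= today - day <= timedelta(days=maxdiff)

# A fixed reference day makes the outcome reproducible.
reference = date(2022, 1, 31)
print(within_maxdiff("2021-12-01", today=reference))                # recent date
print(within_maxdiff("2019-01-01", today=reference))                # more than 1000 days back
print(within_maxdiff("2019-01-01", maxdiff=2000, today=reference))  # wider window
```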

Filters

from shoten.filters import *

  • hapax_filter(myvocab, freqcount=2): discard rare words (default: frequency <= 2)

  • shortness_filter(myvocab, threshold=20): length threshold in percent of word lengths

  • frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent

  • oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)

  • freshness_filter(myvocab, percentage=10): keep the X% freshest words

  • ngram_filter(myvocab, threshold=90, verbose=False): retains X% of the words based on character n-gram frequencies; may run out of memory if the vocabulary is too large (8 GB RAM recommended)

  • sources_freqfilter(myvocab, threshold=2): remove words which are present in fewer than x sources

  • sources_filter(myvocab, myset): only keep the words for which the source contains a string listed in the input set

  • wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list
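As a rough illustration of what frequency-based and source-based filtering mean, the sketch below applies two toy filters to a plain dictionary. Everything here is hypothetical: shoten's internal vocabulary structure is not documented on this page, and `toy_vocab`, `toy_hapax_filter` and `toy_sources_freqfilter` are stand-ins that only mirror the parameter names and defaults listed above.

```python
# Hypothetical stand-in for a vocabulary: word -> list of (date, source) sightings.
toy_vocab = {
    "booster": [("2021-12-01", "a.example"), ("2021-12-05", "b.example"),
                ("2021-12-09", "a.example")],
    "zzyx": [("2021-12-02", "a.example")],
    "Lockdown": [("2021-11-20", "c.example"), ("2021-11-25", "c.example"),
                 ("2021-12-01", "c.example")],
}

def toy_hapax_filter(vocab, freqcount=2):
    """Drop words seen `freqcount` times or fewer."""
    return {word: seen for word, seen in vocab.items() if len(seen) > freqcount}

def toy_sources_freqfilter(vocab, threshold=2):
    """Drop words attested in fewer than `threshold` distinct sources."""
    return {
        word: seen
        for word, seen in vocab.items()
        if len({source for _, source in seen}) >= threshold
    }

print(sorted(toy_hapax_filter(toy_vocab)))
print(sorted(toy_sources_freqfilter(toy_vocab)))
```

Both toy functions return a new dictionary rather than modifying their input, which matches the reassignment style used with the real filters (`myvocab = some_filter(myvocab)`).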

Reduce vocabulary size with a filter:

myvocab = oldest_filter(myvocab)

They can be chained:

myvocab = oldest_filter(shortness_filter(myvocab))
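When more than two filters are involved, nested calls become hard to read. One possible alternative is to apply a list of filters in order, as in the sketch below. The helper `apply_filters` and the two stand-in filters are hypothetical and operate on a plain set of words, so the example runs without shoten installed.

```python
from functools import reduce

def apply_filters(vocab, filters):
    """Apply each filter in order, feeding the result of one into the next."""
    return reduce(lambda current, func: func(current), filters, vocab)

# Stand-in filters on a plain set of words, used only for this illustration.
def drop_short(words, min_length=4):
    return {word for word in words if len(word) >= min_length}

def drop_capitalized(words):
    return {word for word in words if not word[:1].isupper()}

words = {"ab", "booster", "Lockdown", "vaccine"}
filtered = apply_filters(words, [drop_short, drop_capitalized])
print(sorted(filtered))
```

With the real filters, the list [shortness_filter, oldest_filter] would correspond to oldest_filter(shortness_filter(myvocab)); non-default arguments can be bound beforehand with functools.partial, for example partial(oldest_filter, threshold=50).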

Output

# print the words one by one, in sorted order
for word in sorted(myvocab):
    print(word)

# transfer the words to a list
results = list(myvocab)

CLI

shoten --help

Additional information

Shoten = focal point in Japanese (焦点).

Project webpage: Webmonitor.
