Project description
Helper functions to find word trends, i.e. to extract tokens, lemmatize them, and filter the results.
Installation
pip/pip3 install -U git+https://github.com/adbar/shoten.git
Usage
Input
Two possibilities for input data:
- XML-TEI files as generated by trafilatura:
from shoten import gen_wordlist
myvocab = gen_wordlist(mydir, ['de', 'en'])
- TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source)
from shoten import load_wordlist
myvocab = load_wordlist(myfile, ['de', 'en'])
Language codes: optional list of languages to be considered for lemmatization, ordered by relevance. ISO 639-1 codes, see the list of supported languages.
Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
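To make the expected TSV layout and the maxdiff cutoff concrete, here is a minimal plain-Python sketch that parses such a file and drops entries older than a given number of days. It does not use shoten itself; the function name and return format are purely illustrative:

```python
import csv
from datetime import datetime, timedelta

def parse_wordlist(path, maxdiff=1000):
    """Parse a TSV word list (word TAB date [TAB source]) and
    keep only entries no older than `maxdiff` days."""
    cutoff = datetime.today() - timedelta(days=maxdiff)
    entries = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            word = row[0]
            date = datetime.strptime(row[1], "%Y-%m-%d")
            source = row[2] if len(row) > 2 else None  # optional 3rd column
            if date >= cutoff:
                entries.append((word, date, source))
    return entries
```

A file with one word per line in this format can then be loaded directly by shoten's load_wordlist.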
Filters
from shoten.filters import *
hapax_filter(myvocab, freqcount=2): discard rare words (default: frequency <= 2)
shortness_filter(myvocab, threshold=20): length threshold in percent of word lengths
frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent
oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)
freshness_filter(myvocab, percentage=10): keep the X% freshest words
ngram_filter(myvocab, threshold=90, verbose=False): retain X% of words based on character n-gram frequencies; can run out of memory if the vocabulary is too large (8 GB RAM recommended)
sources_freqfilter(myvocab, threshold=2): remove words which appear in fewer than X sources
sources_filter(myvocab, myset): only keep the words for which the source contains a string listed in the input set
wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list
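shoten's filters operate on its internal vocabulary structure; as a rough concept sketch in plain Python, a hapax-style frequency filter over a simple word-to-count mapping could look like this (the function name, threshold semantics and data structure are illustrative, not shoten's actual internals):

```python
from collections import Counter

def hapax_filter_sketch(counts, freqcount=2):
    """Discard words occurring at most `freqcount` times."""
    return Counter({w: c for w, c in counts.items() if c > freqcount})

counts = Counter({"the": 40, "neologism": 2, "typo": 1})
filtered = hapax_filter_sketch(counts)
# with the default threshold (<= 2), only "the" survives
```

The real filters follow the same pattern: each takes the vocabulary and returns a reduced copy, which is what makes them chainable.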
Reduce vocabulary size with a filter:
myvocab = oldest_filter(myvocab)
They can be chained:
myvocab = oldest_filter(shortness_filter(myvocab))
Output
# print one-by-one
for word in sorted(myvocab):
    print(word)
# transfer to a list
results = [w for w in myvocab]
CLI
shoten --help
Additional information
Shoten = focal point in Japanese (焦点).
Project webpage: Webmonitor.
Project details
Download files
Source Distribution: shoten-0.1.0.tar.gz
Built Distribution: shoten-0.1.0-py3-none-any.whl
File details
Details for the file shoten-0.1.0.tar.gz.
File metadata
- Download URL: shoten-0.1.0.tar.gz
- Size: 29.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | c842a31fa261563a34243a0a778723619425990fb7624c414730e2021791f285
MD5 | 1df43badfc0180f55d3c568207bb12cf
BLAKE2b-256 | bf10e83b5dbe5241de4acd5c4d847559f3637f18da535c5cfcfef5523cf6c2ed
File details
Details for the file shoten-0.1.0-py3-none-any.whl.
File metadata
- Download URL: shoten-0.1.0-py3-none-any.whl
- Size: 25.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1ccb2c3a7cdd01fea74fe2329246cc3b2e29c84ae3332b7af6dc553b51b276a5
MD5 | 55a663c4ffcbdf34153598d205fedb64
BLAKE2b-256 | 6386a6dfeb02cd178ebd863898d5c4b681b8ac90b5a9fe08df9ceaf8eaae708f