
Multilingual word frequency statistics for Python based on subtitles corpora

Project description

Statistics about word frequencies in different languages based on a corpus of movie subtitles as extracted by the Frequency Words (https://github.com/hermitdave/FrequencyWords) project.
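To make the notion of a frequency rank concrete, here is a minimal sketch of how rank can be derived from a frequency list. It assumes the input has one `word count` pair per line, which is how the Frequency Words lists are understood to be laid out. The function name `parse_frequency_lines` and the sample data are hypothetical and are not part of wordstats.

```python
from typing import Dict, Iterable, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each word to (rank, count); rank 1 is the most frequent word.

    Hypothetical helper for illustration only, not the wordstats API.
    Assumes each line looks like "word count".
    """
    counts: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        counts[word] = int(count)

    # Sort by descending count, then number the words starting at 1.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: (rank, count) for rank, (word, count) in enumerate(ordered, start=1)}


# Made-up sample data, not taken from the real corpus.
sample = ["de 1000", "je 800", "bleu 12", ""]
stats = parse_frequency_lines(sample)
print(stats["bleu"])  # (3, 12)
```

The rank reported by wordstats may be computed differently; this only illustrates the general idea of ranking words by their corpus counts.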

Currently supported languages (more precisely, language codes):

"da", "de", "el", "en", "es", "fr", "it", "nl", "no", "pl", "pt", "ro", "zh-CN" 

Usage Examples

Getting the info about a given word
>> from wordstats import Word
>> print(Word.stats('bleu', 'fr'))
bleu: (lang: fr, rank: 1521, freq: 9.42, imp: 9.42, diff: 0.03, klevel: 2)
Comparing the difficulty of two German words
>> from wordstats import Word
>> Word.stats('blauzungekrankenheit','de').difficulty > Word.stats('blau','de').difficulty
True
Top 10 most used words in Dutch
>> from wordstats import LanguageInfo
>> Dutch = LanguageInfo.load('nl')
>> print(Dutch.all_words()[:10])
['ik', 'je', 'het', 'de', 'dat', 'is', 'een', 'niet', 'en', 'van']
Words common across all the languages

Because the corpus is based on subtitles, some common proper names have slipped into every language's word list. The common_words() function returns the words shared across all the languages as a list.

>> from wordstats.common_words import common_words
>> for each in common_words():
>>     if len(each) > 9:
>>         print(each)
washington
christopher
enterprise
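One way to think about "words common across all the languages" is as a set intersection over the per-language vocabularies. The sketch below illustrates that idea on made-up toy data. The function `words_in_all_languages` is hypothetical, and this is not necessarily how wordstats builds its common_words() list.

```python
from typing import Dict, List, Set


def words_in_all_languages(vocabularies: Dict[str, List[str]]) -> Set[str]:
    """Return the words that appear in every language's word list.

    Hypothetical helper for illustration only, not the wordstats API.
    """
    sets = [set(words) for words in vocabularies.values()]
    if not sets:
        return set()
    # A word survives only if it is present in every vocabulary.
    return set.intersection(*sets)


# Toy vocabularies invented for this example.
toy = {
    "pl": ["telefon", "moment", "washington", "tak"],
    "ro": ["telefon", "moment", "washington", "da"],
    "nl": ["telefoon", "moment", "washington", "ja"],
}
shared = words_in_all_languages(toy)
print(sorted(shared))  # ['moment', 'washington']
```

Here "telefon" is dropped because the toy Dutch list spells it differently, while a name such as "washington" survives because it looks the same in every list.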
Words that are the same in Polish and Romanian
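>> from wordstats import LanguageInfo
>> from wordstats.common_words import common_words
>> Polish = LanguageInfo.load("pl")
>> Romanian = LanguageInfo.load("ro")
>> # Build the lookup sets once instead of on every iteration
>> romanian_words = set(Romanian.all_words())
>> common = set(common_words())
>> for each in Polish.all_words():
>>     if len(each) > 5 and each in romanian_words and each not in common:
>>         print(each)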
telefon
moment
prezent
interes
...

Installation

pip install wordstats

