
Multilingual word frequency statistics for Python based on subtitles corpora


Statistics about word frequencies in different languages, based on a corpus of movie subtitles as extracted by the Frequency Words project.

Currently supported languages (or, more precisely, language codes):

"da", "de", "el", "en", "es", "fr", "it", "nl", "no", "pl", "pt", "ro", "zh-CN" 

Usage Examples

Getting the info about a given word
>> from wordstats import Word
>> print(Word.stats('bleu', 'fr'))
bleu: (lang: fr, rank: 1521, freq: 9.42, imp: 9.42, diff: 0.03, klevel: 2)
Comparing the difficulty of two German words
>> from wordstats import Word
>> Word.stats('blauzungekrankenheit','de').difficulty > Word.stats('blau','de').difficulty
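The same difficulty score can be used to order several words at once. The sketch below is a hypothetical helper, not part of wordstats: `sort_by_difficulty` and its `difficulty_of` parameter are names introduced here, and the scores in the stand-in dictionary are invented for illustration. With the package installed, one could presumably pass `lambda w: Word.stats(w, 'de').difficulty` as the scoring function instead.

```python
from typing import Callable, Iterable, List


def sort_by_difficulty(words: Iterable[str],
                       difficulty_of: Callable[[str], float]) -> List[str]:
    """Return the words ordered from easiest to hardest."""
    return sorted(words, key=difficulty_of)


# Stand-in scores, invented for this example (not real wordstats output).
example_scores = {"blau": 0.05, "haus": 0.02, "krankheit": 0.31}

# Iterating a dict yields its keys; .get looks up each key's score.
ordered = sort_by_difficulty(example_scores, example_scores.get)
print(ordered)
```

Passing the scoring function as a parameter keeps the helper independent of any particular language or data source.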
Top 10 most used words in Dutch
>> from wordstats import LanguageInfo
>> Dutch = LanguageInfo.load('nl')
>> print(Dutch.all_words()[:10])
['ik', 'je', 'het', 'de', 'dat', 'is', 'een', 'niet', 'en', 'van']
Words common across all the languages

Given that the corpus is based on subtitles, some common names have slipped in. The common_words() function returns the words shared across all the languages as a list.

>> from wordstats.common_words import common_words
>> for each in common_words():
>>     if len(each) > 9:
>>         print(each)
Words that are the same in Polish and Romanian
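>> from wordstats import LanguageInfo
>> from wordstats.common_words import common_words
>> Polish = LanguageInfo.load("pl")
>> Romanian = LanguageInfo.load("ro")
>> romanian_words = set(Romanian.all_words())  # build once for fast lookups
>> common = set(common_words())
>> for each in Polish.all_words():
>>     if each in romanian_words and len(each) > 5 and each not in common:
>>         print(each)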
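The comparison above can be wrapped in a reusable function. This is a minimal sketch that works on plain lists, so it runs without the package. `shared_words` and its parameters are hypothetical names, and the word lists are small stand-ins invented for the example, not real output of `all_words()` or `common_words()`.

```python
from typing import Iterable, List


def shared_words(words_a: Iterable[str],
                 words_b: Iterable[str],
                 excluded: Iterable[str] = (),
                 min_length: int = 6) -> List[str]:
    """Return words from words_a that also occur in words_b.

    Words shorter than min_length or listed in excluded are dropped.
    The order of words_a is preserved.
    """
    in_b = set(words_b)   # set membership avoids rescanning the list
    skip = set(excluded)
    return [w for w in words_a
            if len(w) >= min_length and w in in_b and w not in skip]


# Stand-in data, invented for this example.
polish = ["telefon", "tak", "profesor", "america"]
romanian = ["telefon", "da", "profesor", "america"]
cross_language_names = ["america"]

print(shared_words(polish, romanian, cross_language_names))
```

The default `min_length=6` mirrors the `len(each) > 5` test used above.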


pip install wordstats


Files for wordstats, version 1.0.7
wordstats-1.0.7.tar.gz (3.6 MB), file type: Source, Python version: None
