Library that returns the word frequency (ipm) of almost all Russian words
Description
The Python library ruword_frequency
returns the frequency (ipm, items per million) of Russian words. Lookups are case insensitive.
It is based on a huge collection of Russian documents and on previously prepared word frequency sources. Full list:
- Wikipedia dump, russian segment
- Flibusta dump, more than 200 GB of texts
- Pyhlyi's library
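- Новый частотный словарь русской лексики (New Frequency Dictionary of Russian Vocabulary)
- Словарь русской литературы (Dictionary of Russian Literature), from http://speakrus.ru/dict/index.htm
- Частотный словарь Марка фон Хагена (Mark von Hagen's frequency dictionary), see description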
The ipm of each word was extracted from all the listed sources, and the mean value is used. The full index contains more than 7 billion word forms, unfortunately including misspellings from the raw data sources.
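To make the averaging step concrete, here is a minimal sketch of how an ipm value can be computed from a raw count and then averaged over several sources. It is not the library's build code: the helper names ipm and mean_ipm and the counts are made up for illustration, and it assumes a plain arithmetic mean over the sources that list the word.

```python
from statistics import mean


def ipm(count, corpus_size):
    """Occurrences of a word per million tokens of a corpus."""
    return count * 1_000_000 / corpus_size


def mean_ipm(per_source_ipm):
    """Average one word's ipm over the sources that list it (assumed rule)."""
    return mean(per_source_ipm)


# Hypothetical data for one word: (count, corpus size) in three made-up corpora
sources = [(50, 1_000_000), (120, 2_000_000), (12, 300_000)]
values = [ipm(count, size) for count, size in sources]
print(values)            # [50.0, 60.0, 40.0]
print(mean_ipm(values))  # 50.0
```

Because the sources differ greatly in size, averaging the per-source ipm values rather than pooling raw counts keeps the largest corpus from dominating the result.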
Requirements:
- Python 3
- The word index occupies about 50 MB on disk and is downloaded the first time you invoke the frequency.load() method
Installation
# TODO
Usage
from ruword_frequency import Frequency
freq = Frequency()
freq.load()  # downloads the word index on the first call
freq.ipm('привет')
# 53.51823806762695
freq.ipm('неттакогослова')  # a word that does not exist
# 0.0
# Get the maximum ipm value, e.g. for weight normalization
freq.max_ipm()
# 42329.2890625
# Iterate over the most used words, those with ipm above 10000
for w in freq.iterate_words(10000):
    print(w)
For other useful methods, see the marisa-trie documentation.
The tree index is available as freq.tree
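The weight normalization mentioned next to max_ipm() could look roughly like the sketch below. To keep it runnable without downloading the index, it uses a hypothetical StubFrequency class with made-up values that only mimics the ipm() and max_ipm() calls shown above; the normalized_weight helper is likewise an illustration, not part of the library.

```python
class StubFrequency:
    """Hypothetical stand-in exposing the same ipm() and max_ipm() calls."""

    def __init__(self, table):
        self._table = table

    def ipm(self, word):
        # Case-insensitive lookup; unknown words get 0.0
        return self._table.get(word.lower(), 0.0)

    def max_ipm(self):
        return max(self._table.values())


def normalized_weight(freq, word):
    """Scale a word's ipm into [0, 1] by the largest ipm in the index."""
    return freq.ipm(word) / freq.max_ipm()


# Made-up ipm values, for illustration only
freq = StubFrequency({'и': 40000.0, 'привет': 50.0})
print(normalized_weight(freq, 'и'))               # 1.0
print(normalized_weight(freq, 'Привет'))          # 0.00125
print(normalized_weight(freq, 'неттакогослова'))  # 0.0
```

With a loaded Frequency object in place of the stub, the same helper should work unchanged, assuming ipm() and max_ipm() behave as in the usage example.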
Rebuilding the tree yourself
import socket
from ruword_frequency import Frequency
from ruword_frequency.source_reader import SourceReader
# Increase the socket timeout; sometimes helpful when downloading huge files
socket.setdefaulttimeout(60)
reader = SourceReader()
reader.download_all_sources()
tree = reader.build_tree_from_dictionaries()
reader.save_tree(tree)
# Use the rebuilt tree
freq = Frequency()
freq.load()  # as in Usage; assumed to pick up the saved tree
freq.ipm('привет')