
Library that returns the frequency (ipm) of almost all Russian words

Description

The Python library ruword_frequency returns the frequency (ipm, items per million) of Russian words; lookups are case-insensitive. It is based on a large collection of Russian documents and on prepared word-frequency sources. Full list:

The ipm of each word was extracted from all of the listed sources, and the mean value is used. The full index contains more than 7 billion word forms, which unfortunately includes misspellings carried over from the raw data sources.
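
The averaging described above can be sketched as follows. This is only an illustration, not the library's actual code: the helper names to_ipm and mean_ipm, the sample counts, and the choice to skip sources that do not contain the word are all assumptions.

```python
from statistics import mean


def to_ipm(count, corpus_size):
    """Convert a raw occurrence count to items per million."""
    return count * 1_000_000 / corpus_size


def mean_ipm(word, sources):
    """Average a word's ipm over the sources that contain it.

    `sources` is a list of dicts mapping lower-cased words to ipm values.
    Skipping sources that lack the word is an assumption of this sketch.
    """
    key = word.lower()  # case-insensitive lookup
    values = [src[key] for src in sources if key in src]
    return mean(values) if values else 0.0


# Made-up counts for two hypothetical sources
source_a = {"привет": to_ipm(120, 2_000_000)}  # 60.0 ipm
source_b = {"привет": to_ipm(200, 5_000_000)}  # 40.0 ipm

print(mean_ipm("Привет", [source_a, source_b]))  # 50.0
```

Unknown words yield 0.0 in this sketch, mirroring the value the library returns for a word that is not in the index.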

Requirements:

  • Python 3
  • The word index occupies about 50 MB on disk and is downloaded the first time you call the Frequency.load() method
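
A download-on-first-use step like the one above is commonly written as a "fetch only if missing" check. The sketch below shows that general pattern; it is not taken from the library, and ensure_index, fake_download, and the file name index.bin are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_index(path, download):
    """Return `path`, calling `download(path)` only if the file is missing."""
    path = Path(path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path


calls = []


def fake_download(target):
    """Stand-in for a real network download; writes a few placeholder bytes."""
    calls.append(target)
    target.write_bytes(b"index")


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "cache" / "index.bin"
    ensure_index(index, fake_download)  # file missing: triggers the download
    ensure_index(index, fake_download)  # file present: no second download
    print(len(calls))  # 1
```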

Installation

# TODO

Usage

from ruword_frequency import Frequency
freq = Frequency()
freq.load()

freq.ipm('привет')
# 53.51823806762695

freq.ipm('неттакогослова')
# 0.0

# Get the maximum ipm value, e.g. for normalizing weights
freq.max_ipm()
# 42329.2890625

# Print the most frequent words, those with an ipm above 10000
for w in freq.iterate_words(10000):
    print(w)

For other useful methods, see the marisa-trie documentation. The trie index is available as freq.tree.
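
As an illustration of the weight normalization mentioned for max_ipm(), the sketch below divides a word's ipm by the maximum so that weights fall between 0 and 1. Only the ipm() and max_ipm() methods come from the usage above; normalized_weight and StubFrequency are hypothetical, and the stub's values are made up so the example runs without downloading the index.

```python
def normalized_weight(freq, word):
    """Scale a word's ipm into the range [0, 1] using the largest ipm."""
    top = freq.max_ipm()
    return freq.ipm(word) / top if top > 0 else 0.0


class StubFrequency:
    """Stand-in with the same two methods as Frequency; values are made up."""

    def __init__(self, table):
        self._table = table

    def ipm(self, word):
        return self._table.get(word.lower(), 0.0)

    def max_ipm(self):
        return max(self._table.values(), default=0.0)


freq = StubFrequency({"и": 40000.0, "привет": 50.0})
print(normalized_weight(freq, "привет"))  # 50 / 40000 = 0.00125
```

With the real library, a loaded Frequency instance would presumably be passed in place of the stub.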

Rebuilding the tree yourself

import socket

from ruword_frequency import Frequency
from ruword_frequency.source_reader import SourceReader

# Increase the socket timeout; this sometimes helps when downloading huge files
socket.setdefaulttimeout(60)

reader = SourceReader()
reader.download_all_sources()
tree = reader.build_tree_from_dictionaries()
reader.save_tree(tree)

# Use the rebuilt tree
freq = Frequency()
freq.ipm('привет')

