Tools for analysing Zipf's law from text samples

Tools in Python for analysing Zipf’s law from text samples.

Install the package from the Python Package Index (PyPI) with the terminal command:

pip3 install zipfanalysis

Usage

The package can be used from within Python scripts to estimate Zipf exponents, assuming a simple power law model for word frequencies and ranks. To use the package, import it with:

import zipfanalysis

Simple Method

The easiest way to carry out an analysis on a book or text file, using different estimators, is:

alpha_clauset = zipfanalysis.clauset("path_to_book.txt")

alpha_pdf = zipfanalysis.ols_pdf("path_to_book.txt", min_frequency=3)

alpha_cdf = zipfanalysis.ols_cdf("path_to_book.txt", min_frequency=3)

alpha_abc = zipfanalysis.abc("path_to_book.txt")

In Depth Method

Convert a book or text file to a vector of word frequencies, ordered from the most frequent word to the least frequent:

word_counts = zipfanalysis.preprocessing.preprocessing.get_rank_frequency_from_text("path_to_book.txt")
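
As a rough illustration of what this step produces, the sketch below builds a rank-frequency vector using only the standard library. The function name `rank_frequency_from_string` and the tokenisation rule (lowercase, runs of letters and apostrophes) are assumptions made for this example and may differ from the package's own preprocessing.

```python
import re
from collections import Counter

def rank_frequency_from_string(text):
    """Return word counts sorted from most to least frequent."""
    # Assumed tokenisation: lowercase, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Position i in the result holds the count of the word with rank i + 1.
    return sorted(counts.values(), reverse=True)

sample = "the cat and the dog and the bird"
print(rank_frequency_from_string(sample))
```

The estimators that follow only need this vector of counts, not the words themselves.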

Carry out different types of analysis to fit a power law to the data:

# Clauset et al estimator
alpha_clauset = zipfanalysis.estimators.clauset.clauset_estimator(word_counts)
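
The approach of Clauset et al. is based on maximum likelihood. The sketch below shows the idea for the simple model p(rank) = rank^(-alpha) / zeta(alpha), and is not the package's implementation. The names `zeta_approx`, `log_likelihood` and `mle_exponent` are hypothetical, and the sketch assumes that the fit starts at rank 1 and that alpha lies between 1.01 and 5.

```python
import math

def zeta_approx(alpha, terms=1000):
    """Approximate the Riemann zeta function for alpha > 1."""
    # Direct sum of the first terms - 1 values plus an Euler-Maclaurin tail correction.
    head = sum(k ** -alpha for k in range(1, terms))
    tail = terms ** (1 - alpha) / (alpha - 1) + 0.5 * terms ** -alpha
    return head + tail

def log_likelihood(alpha, word_counts):
    """Log-likelihood of p(rank) = rank^(-alpha) / zeta(alpha)."""
    n_tokens = sum(word_counts)
    weighted_log_ranks = sum(
        n * math.log(r) for r, n in enumerate(word_counts, start=1)
    )
    return -alpha * weighted_log_ranks - n_tokens * math.log(zeta_approx(alpha))

def mle_exponent(word_counts, low=1.01, high=5.0, tol=1e-6):
    """Maximise the log-likelihood by golden-section search on [low, high]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = low, high
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, word_counts) > log_likelihood(d, word_counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2
```

The log-likelihood is concave in alpha, so a one-dimensional search such as this is enough; a library optimiser could be swapped in.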

# Ordinary Least Squares regression on log(rank) ~ log(frequency)
# Optional low frequency cut-off
alpha_pdf = zipfanalysis.estimators.ols_regression_pdf.ols_regression_pdf_estimator(word_counts, min_frequency=2)
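
A minimal sketch of a log-log regression with a frequency cut-off is given below, using only the standard library. It assumes that log(frequency) is regressed on log(rank) and that the exponent is the negated slope; the package may set up the regression differently. The names `ols_slope` and `ols_pdf_exponent` are hypothetical.

```python
import math

def ols_slope(xs, ys):
    """Slope of the ordinary least squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def ols_pdf_exponent(word_counts, min_frequency=1):
    """Estimate alpha from the slope of log(frequency) against log(rank)."""
    # Keep only ranks whose count reaches the cut-off.
    pairs = [
        (r, n) for r, n in enumerate(word_counts, start=1) if n >= min_frequency
    ]
    log_ranks = [math.log(r) for r, _ in pairs]
    log_freqs = [math.log(n) for _, n in pairs]
    # frequency ~ rank^(-alpha) gives a slope of -alpha on log-log axes.
    return -ols_slope(log_ranks, log_freqs)
```

The cut-off removes the long tail of words that occur only once or twice, which otherwise form flat steps on the log-log plot and pull the fitted slope.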

# Ordinary least squares regression on the complementary cumulative distribution function of ranks
# OLS on log(P(R>rank)) ~ log(rank)
# Optional low frequency cut-off
alpha_cdf = zipfanalysis.estimators.ols_regression_cdf.ols_regression_cdf_estimator(word_counts)
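
The sketch below shows one way to build the complementary cumulative distribution of ranks and regress on it. It is an illustration under assumptions, not the package's code: the names `ccdf_of_ranks` and `ols_cdf_exponent` are hypothetical, and the conversion alpha = 1 - slope relies on the continuous approximation P(R > rank) ~ rank^(-(alpha - 1)).

```python
import math

def ccdf_of_ranks(word_counts):
    """Return P(R > rank) for rank = 1, 2, ..., len(word_counts)."""
    total = sum(word_counts)
    remaining = total
    ccdf = []
    for n in word_counts:
        remaining -= n  # tokens whose rank is strictly greater than this one
        ccdf.append(remaining / total)
    return ccdf

def ols_cdf_exponent(word_counts):
    """Estimate alpha from the slope of log(P(R > rank)) against log(rank)."""
    ccdf = ccdf_of_ranks(word_counts)
    # The final rank has P(R > rank) = 0, which has no logarithm, so drop it.
    points = [
        (math.log(r), math.log(p)) for r, p in enumerate(ccdf, start=1) if p > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return 1 - sxy / sxx
```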

# Approximate Bayesian computation (regression method)
# Assumes model of p(rank) = C prob_rank^(-alpha)
# prob_rank is a word's rank in an underlying probability distribution
alpha_abc = zipfanalysis.estimators.approximate_bayesian_computation.abc_estimator(word_counts)
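
To illustrate the idea behind approximate Bayesian computation, the sketch below uses plain rejection ABC rather than the regression method named above. Everything in it is an assumption made for the example: the function names, the uniform prior on alpha, the truncation of the power law at `max_rank`, and the choice of mean log rank as the summary statistic. Simulated samples are re-ranked by observed frequency, so that a word's empirical rank can differ from its rank in the underlying distribution.

```python
import math
import random
from collections import Counter

def simulate_rank_frequency(alpha, n_tokens, max_rank, rng):
    """Sample tokens from p(rank) ~ rank^(-alpha), then re-rank by observed count."""
    weights = [r ** -alpha for r in range(1, max_rank + 1)]
    draws = rng.choices(range(1, max_rank + 1), weights=weights, k=n_tokens)
    return sorted(Counter(draws).values(), reverse=True)

def mean_log_rank(word_counts):
    """Summary statistic: average log(empirical rank) over all tokens."""
    total = sum(word_counts)
    return sum(n * math.log(r) for r, n in enumerate(word_counts, start=1)) / total

def abc_rejection_exponent(word_counts, n_draws=200, keep=20,
                           prior=(1.01, 3.0), max_rank=1000, seed=0):
    """Average the prior draws whose simulations best match the observed data."""
    rng = random.Random(seed)
    n_tokens = sum(word_counts)
    target = mean_log_rank(word_counts)
    scored = []
    for _ in range(n_draws):
        alpha = rng.uniform(*prior)
        simulated = simulate_rank_frequency(alpha, n_tokens, max_rank, rng)
        scored.append((abs(mean_log_rank(simulated) - target), alpha))
    scored.sort()
    accepted = [alpha for _, alpha in scored[:keep]]
    return sum(accepted) / len(accepted)
```

The cost grows with the number of draws times the number of tokens, since each draw simulates a full sample of the same size as the observed text.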

Development - Next Steps

  1. Speed up abc. The current bottleneck is sampling from an infinite power law. This could be sped up because only the frequency vector of ranks is needed, not the whole sample. For example, sample from a uniform distribution, then drop the values into integer-ranked buckets based on the inverse CDF.

  2. Build in frequency rank analysis. Convert to frequency counts representation, then carry out fit on that.

  3. Add significance testing

  4. Add ability to calculate x_min and truncated power laws.

  5. Speed up OLS on the cdf
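
The bucket idea in item 1 could be sketched as follows. This is an illustration only, not code from the package: the function name `sample_rank_frequencies` is hypothetical, and the power law is truncated at `max_rank` so that the cumulative distribution can be tabulated, whereas the item above concerns an infinite power law.

```python
import bisect
import itertools
import random

def sample_rank_frequencies(alpha, n_tokens, max_rank, rng):
    """Return counts per rank without storing the individual samples."""
    weights = [r ** -alpha for r in range(1, max_rank + 1)]
    # Unnormalised cumulative distribution over ranks 1..max_rank.
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    frequencies = [0] * max_rank
    for _ in range(n_tokens):
        u = rng.random() * total
        # Inverse CDF lookup: first bucket whose cumulative weight exceeds u.
        index = bisect.bisect_right(cumulative, u)
        # Guard against floating-point rounding pushing u up to total.
        frequencies[min(index, max_rank - 1)] += 1
    return frequencies

rng = random.Random(0)
freqs = sample_rank_frequencies(2.0, 1000, 50, rng)
print(freqs[:5])
```

Each draw costs one binary search over the tabulated distribution, and only the frequency vector is kept in memory.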
