Skip to main content

Coarse estimation of the probability of observing a string in a given body of text.

Project description

TextProbability

This project provides utilities for roughly estimating the probability that a given string would be observed in a corpus that has some specified set of properties. No serious attempt is to be made to specify the theoretical significance of this so-called "probability." Instead, the use of "probabilities" as program output is motivated by one or two common use cases, such as

  • Distinguishing between mostly correct English text and randomly produced characters, or
  • Determining whether a very short snippet of text is English, German, or French, given that it comes from a corpus that is (say) 60% English, 30% German, and 10% French.

Please consider the use of the term "probability" as a pragmatic abuse of language that is used to make make certain calculations easier to explain.

Data

This project uses data collected from Wikipedia on the following languages:

  • German (de)
  • English (en)
  • Spanish (es)
  • French (fr)
  • Italian (it)
  • Portuguese (pt)
  • Turkish (tr)

Feel free to read the data collection logs to see what kinds of sources were used for language data.

This includes on the order of 10 MB of data per language. This incurs a one-time cost on program startup when data is initially loaded from JSON files. This quantity of data is from after summarizing the original data, a process which can reduce its size by up to an order of magnitude. It is not yet clear what the consequences are of summarizing the language data. For example, it may have a helpful de-noising effect, or it may adversely affect the quality of the model by limiting its worldly knowledge -- probably the latter.

Usage

To determine the language of a string:

from textprobability.classify import default_classifier

probabilities_by_language_with_default_priors = default_classifier(snippet)

The most probable language will be the argmax of the resulting map.

To determine a rough "probability" of observing a particular string in a corpus having some language:

bcp_47_langcode = "fr"
p_given_french = markov(bcp_47_langcode)  # The result is a function.
my_text = "le sigle"
probability_of_my_text = p_given_french(my_text)  # The result is a float in [0, 1].

To run examples, run:

python3 -m textprobability.examples.classification

Or:

python3 -m textprobability.examples.defaults

For help collecting new language data, run:

python3 -m textprobability.data.get_data --help

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textprobability-0.0.3.tar.gz (22.9 MB view hashes)

Uploaded Source

Built Distribution

textprobability-0.0.3-py3-none-any.whl (23.8 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page