TextProbability

Coarse estimation of the probability of observing a string in a given body of text.
This project provides utilities for roughly estimating the probability that a given string would be observed in a corpus with some specified set of properties. No serious attempt is made to specify the theoretical significance of this so-called "probability." Instead, the use of "probabilities" as program output is motivated by a few common use cases, such as
- Distinguishing between mostly correct English text and randomly produced characters, or
- Determining whether a very short snippet of text is English, German, or French, given that it comes from a corpus that is (say) 60% English, 30% German, and 10% French.
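The second use case amounts to weighting each language's likelihood by its share of the corpus. The sketch below shows the idea with Bayes' rule; the likelihood numbers are made up for illustration and are not produced by this project.

```python
# Corpus priors from the example above: 60% English, 30% German, 10% French.
priors = {"en": 0.60, "de": 0.30, "fr": 0.10}

# Hypothetical likelihoods of a short snippet under each language model;
# in practice these would come from the per-language data.
likelihoods = {"en": 1e-9, "de": 4e-9, "fr": 2e-9}

# Posterior is proportional to prior times likelihood; normalize to sum to 1.
unnormalized = {lang: priors[lang] * likelihoods[lang] for lang in priors}
total = sum(unnormalized.values())
posteriors = {lang: p / total for lang, p in unnormalized.items()}

best = max(posteriors, key=posteriors.get)  # most probable language
```

Even though English dominates the priors here, the snippet's higher likelihood under the German model wins out.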
Please consider the use of the term "probability" a pragmatic abuse of language, used to make certain calculations easier to explain.
Data
This project uses data collected from Wikipedia on the following languages:
- German (de)
- English (en)
- Spanish (es)
- French (fr)
- Italian (it)
- Portuguese (pt)
- Turkish (tr)
Feel free to read the data collection logs to see what kinds of sources were used for language data.
This amounts to on the order of 10 MB of data per language, which incurs a one-time cost at program startup when the data is loaded from JSON files. This figure reflects the data after summarization, a process that can reduce its size by up to an order of magnitude. The consequences of summarizing the language data are not yet clear: it may have a helpful de-noising effect, or it may degrade the model by limiting its worldly knowledge -- probably the latter.
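One simple way such summarization can shrink frequency data by an order of magnitude is to drop rare entries. The helper below is a hypothetical illustration of that idea, not the project's actual summarization code.

```python
from collections import Counter

def summarize(counts: Counter, min_count: int = 5) -> Counter:
    """Drop entries observed fewer than min_count times.

    Rare n-grams dominate the tail of frequency data, so pruning them
    can shrink the table dramatically while keeping most of the mass.
    (Hypothetical helper for illustration only.)
    """
    return Counter({k: v for k, v in counts.items() if v >= min_count})

raw = Counter({"th": 120, "he": 95, "qz": 1, "xj": 2})
small = summarize(raw)
```

This is also where the de-noising-versus-lost-knowledge trade-off comes from: "qz" may be noise, or it may be a rare but real signal.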
Usage
To determine the language of a string:
```python
from textprobability.classify import default_classifier

probabilities_by_language_with_default_priors = default_classifier(snippet)
```
The most probable language will be the argmax of the resulting map.
To determine a rough "probability" of observing a particular string in a corpus having some language:
```python
bcp_47_langcode = "fr"
p_given_french = markov(bcp_47_langcode)  # The result is a function.
my_text = "le sigle"
probability_of_my_text = p_given_french(my_text)  # The result is a float in [0, 1].
```
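To give a rough sense of what a Markov-chain string probability looks like, here is a minimal character-level bigram sketch. It is a stand-in written for this README, not the project's actual `markov` implementation, and the training corpus is a toy string.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus: str):
    """Return a function mapping a string to its probability under a
    character-level bigram (first-order Markov) model of the corpus."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        pair_counts[prev][cur] += 1

    def probability(text: str) -> float:
        # Product of conditional probabilities P(next char | current char).
        p = 1.0
        for prev, cur in zip(text, text[1:]):
            total = sum(pair_counts[prev].values())
            if total == 0 or pair_counts[prev][cur] == 0:
                return 0.0  # unseen transition: assign probability zero
            p *= pair_counts[prev][cur] / total
        return p

    return probability

p_given_corpus = train_bigram_model("le sigle le style le single")
```

A real model would need smoothing for unseen transitions and would normalize for string length, which is part of why the output is best read as a rough score rather than a true probability.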
To run the examples, run:

```shell
python3 -m textprobability.examples.classification
```

Or:

```shell
python3 -m textprobability.examples.defaults
```

For help collecting new language data, run:

```shell
python3 -m textprobability.data.get_data --help
```
Hashes for textprobability-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c76a05442bff4ddb1f1cb170f4131a488b0bb56e36fa84edf5fad8cd7f01df8a
MD5 | 6dbbefc44858e17cc608af6ab4e090c8
BLAKE2b-256 | a67a5e576417d63046366eb171e560bfaf37e07ef31cb4e75c3960c1564f986a