A package for random string detection

random-string-detector

This package helps you identify random strings within text data by analyzing the frequency of bigrams (two-letter combinations). It leverages the fact that certain bigrams are more common in natural language than others. By comparing the bigram frequencies in your text data to those in a reference language corpus, you can spot strings of characters that deviate from typical language patterns.

See Explanation for more information.

Features

Detect random strings of English and other languages.
Specify thresholds to control the sensitivity of the detection.
Supported Languages: English and Portuguese.

Do you want to add support for another language? Open an issue or a pull request.

See the Contributing section for more information.

Installation

You can install the package using pip:

pip install random-string-detector

Usage

Example 1

from random_string_detector import RandomStringDetector

detector = RandomStringDetector()
print(detector("Hello World"))  # False
print(detector("aowkaoskaos"))  # True

Example 2

from random_string_detector import RandomStringDetector

detector = RandomStringDetector(allow_numbers=True)
print(detector("Hello World"))  # False
print(detector("aowkaoskaos"))  # True
print(detector("aoekaoekaoe"))  # True
print(detector("aoekaoekaoe1d2e"))  # True
print(detector("Hello World 123"))  # False

Explanation

There are 26 × 26 = 676 possible two-letter combinations (bigrams) of English letters, counting both pairs of identical letters and pairs of distinct letters. Many of them rarely or never occur in real English text, so a string that contains many low-frequency bigrams is likely to be random.
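The count of 676 comes from pairing each of the 26 letters with every letter, itself included. A quick check using only the standard library (the variable names are illustrative):

```python
from itertools import product
from string import ascii_lowercase

# Every ordered pair of lowercase letters, including doubles such as "aa".
all_bigrams = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]

# Split the pairs into identical-letter and distinct-letter combinations.
identical = [b for b in all_bigrams if b[0] == b[1]]
distinct = [b for b in all_bigrams if b[0] != b[1]]

print(len(all_bigrams), len(identical), len(distinct))  # 676 26 650
```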

According to Peter Norvig's analysis, the most frequent bigram in English is "th", while "zx" is rare. By comparing the frequency of different bigrams in your text data to those in the English language corpus, you can identify strings of characters that do not fit typical language patterns.
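To make the comparison concrete, here is a minimal sketch of splitting a word into bigrams and looking each one up in a frequency table. The bigrams helper and the TOY_SCORES values are invented for illustration; they are not the package's internals or Norvig's actual figures.

```python
def bigrams(word: str) -> list[str]:
    """Return the overlapping two-letter windows of a single word."""
    word = word.lower()
    return [word[i:i + 2] for i in range(len(word) - 1)]

# Toy scores on a 0-100 scale, made up for this example only.
TOY_SCORES = {"th": 100.0, "he": 90.0, "in": 85.0, "zx": 0.5}

for bg in bigrams("the"):
    # Bigrams missing from the toy table are treated as score 0.0.
    print(bg, TOY_SCORES.get(bg, 0.0))
```

A word of n letters yields n - 1 bigrams, so a single letter yields none.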

This package provides a class named RandomStringDetector and language-specific bigram frequency dictionaries, which together detect random strings in English and other languages. The threshold value (between 0 and 100) controls the sensitivity of the detection. Higher values correspond to more frequent bigrams (like "th") and lower values to less frequent bigrams (like "zx").
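One way a threshold on bigram scores could be turned into a yes/no decision is sketched below. The decision rule (flag a word when more than half of its bigrams score below the threshold), the looks_random function, the max_rare_share parameter, and the TOY_SCORES table are all assumptions made for illustration; the package's actual rule may differ.

```python
# Hypothetical 0-100 scores; a real table would cover many more bigrams.
TOY_SCORES = {"he": 90.0, "el": 60.0, "ll": 55.0, "lo": 50.0, "zx": 0.5, "xq": 0.1}

def looks_random(word: str, threshold: float = 10.0, max_rare_share: float = 0.5) -> bool:
    """Flag a word when too many of its bigrams score below `threshold`.

    This rule is an illustrative assumption, not the package's source code.
    """
    word = word.lower()
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    if not pairs:
        return False
    # Bigrams missing from the table count as score 0.0, i.e. always rare.
    rare = sum(1 for p in pairs if TOY_SCORES.get(p, 0.0) < threshold)
    return rare / len(pairs) > max_rare_share

print(looks_random("hello"))  # False
print(looks_random("zxqzx"))  # True
```

In this sketch, raising the threshold makes the check stricter: with threshold=95.0 even "hello" is flagged, because none of its bigrams reach that score in the toy table.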

Only words with length greater than 4 are considered.
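The length rule can be pictured as a filter applied before any bigram lookup. The sketch below assumes that words are separated by whitespace, which the text above does not confirm; candidate_words and min_length are hypothetical names.

```python
def candidate_words(text: str, min_length: int = 5) -> list[str]:
    """Keep only the words long enough to be analyzed (length greater than 4)."""
    return [word for word in text.split() if len(word) >= min_length]

print(candidate_words("Hello World is big"))  # ['Hello', 'World']
```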

When the boolean allow_numbers argument (default False) is enabled, numbers present in the input are ignored. This is useful when validating usernames, which often combine valid words with numbers, such as "chicagofan23".
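Ignoring numbers can be thought of as removing digits before the bigrams are examined. The helper below is a sketch of that idea only; strip_digits is a hypothetical name, and the package may handle digits differently internally.

```python
def strip_digits(text: str) -> str:
    """Drop digit characters so that only the remaining text is analyzed."""
    return "".join(ch for ch in text if not ch.isdigit())

print(strip_digits("chicagofan23"))  # chicagofan
```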

Contributing

We happily accept any contributions and feedback. 😊

Adding support for a new language

To add support for a new language, you need to follow these steps:

Find a large text corpus in the language you want to add support for.
Compute the bigram frequencies for the corpus (see /notebooks/portuguese.ipynb for an example).
Add the bigram frequencies to /random_string_detector/bigrams/.py.
Import the bigram frequencies in /random_string_detector/bigrams/__init__.py.

Note: The bigram frequencies should be a dictionary with the bigram as the key and the normalized frequency as the value. The bigram should be a string with the two letters concatenated. The normalized frequency should be a float between 0 and 100.
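As a rough sketch of how such a dictionary could be built from a corpus, the function below counts bigrams inside words and scales the counts so that the most common bigram gets 100. That scaling is an assumption; the project's notebook may normalize differently. The bigram_frequencies name and the tiny sample corpus are illustrative.

```python
import re
from collections import Counter

def bigram_frequencies(corpus: str) -> dict[str, float]:
    """Count letter bigrams inside words and scale them to the range 0-100.

    Scaling by the top bigram's count is an assumed normalization.
    """
    counts = Counter()
    # Only unaccented a-z is matched here; extend the pattern for
    # languages that use accented letters.
    for word in re.findall(r"[a-z]+", corpus.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    if not counts:
        return {}
    top = max(counts.values())
    return {bg: 100.0 * n / top for bg, n in counts.items()}

freqs = bigram_frequencies("the theory then thaws")
print(freqs["th"])  # 100.0
print(freqs["he"])  # 75.0
```

Bigrams are counted within words only, so pairs that span a space or punctuation mark never enter the table.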

If you have any questions, issues, or suggestions, please feel free to contact us.

License

This package is distributed under the MIT license.

See LICENSE for more information.

Release history

1.0.2: Mar 12, 2024 (this version)
1.0.1: Mar 12, 2024
1.0.0: Sep 22, 2023
0.0.8: May 10, 2023
0.0.7: May 10, 2023
0.0.6: May 10, 2023
0.0.5: May 9, 2023
0.0.4: May 9, 2023

Download files

Source Distribution

random_string_detector-1.0.2.tar.gz (26.8 kB)

Uploaded Mar 12, 2024 Source

Built Distribution

random_string_detector-1.0.2-py3-none-any.whl (24.6 kB)

Uploaded Mar 12, 2024 Python 3

Hashes for random_string_detector-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`98cf7eb97a2fcdb87edda4638ccb51659857410dc2588c57504bab1134232977`
MD5	`05c8f1c37693a01118a94e6d091b9c89`
BLAKE2b-256	`a435220e202d65e47690b8e486da934aad276b704f51fca398bfcd018956ba31`

Hashes for random_string_detector-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0c3bc48868d74b354ce4ae714b04d26bb4283beeb97a5779b6f70588624520d`
MD5	`dd76ef0ac185c09b12273d29ac55899f`
BLAKE2b-256	`e095c9b8031295785d1ecbc093591b7f41eb4ce7b42c6a5771b4605352b59d49`