Add noise to text at the character level

textnoisr: Adding random noise to a dataset

textnoisr is a Python package that adds random noise to a text dataset and lets you control the quality of the result very accurately.

You can install it using pip:

pip install textnoisr

Here is an example where the dataset consists of the first few lines of the Zen of Python:

Raw text

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
...

Noisy text

TheO Zen of Python, by Tim Pfter

BzeautiUful is ebtter than ugly.
Eqxplicin is better than imlicit.
Simple is beateUr than comdplex.
Complex is better than comwlicated.
Flat is bejAter than neseed.
...

Four types of "actions" are implemented:

  • insert a random character, e.g. STEAM → STREAM,
  • delete a random character, e.g. STEAM → TEAM,
  • substitute a random character, e.g. STEAM → STEAL,
  • swap two consecutive characters, e.g. STEAM → STEMA.
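
The four actions above can be illustrated with a few plain string helpers. This is only a sketch: the function names are hypothetical and are not part of the textnoisr API, and the positions and characters are fixed by hand here, whereas the package picks them at random.

```python
# Hypothetical helpers illustrating the four actions; not the textnoisr API.

def insert_char(text: str, index: int, char: str) -> str:
    """Insert `char` before position `index`."""
    return text[:index] + char + text[index:]


def delete_char(text: str, index: int) -> str:
    """Remove the character at position `index`."""
    return text[:index] + text[index + 1:]


def substitute_char(text: str, index: int, char: str) -> str:
    """Replace the character at position `index` with `char`."""
    return text[:index] + char + text[index + 1:]


def swap_chars(text: str, index: int) -> str:
    """Swap the characters at positions `index` and `index + 1`."""
    if index + 1 >= len(text):
        return text  # nothing to swap with
    return text[:index] + text[index + 1] + text[index] + text[index + 2:]


print(insert_char("STEAM", 2, "R"))      # STREAM
print(delete_char("STEAM", 0))           # TEAM
print(substitute_char("STEAM", 4, "L"))  # STEAL
print(swap_chars("STEAM", 3))            # STEMA
```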

The general philosophy of the package is that only one single parameter is needed to control the noise level. This "noise level" is applied character-wise, and corresponds roughly to the probability for a character to be impacted.

More precisely, this noise level is calibrated so that the Character Error Rate of a noised dataset converges to this value as the amount of text increases.
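
To make the metric concrete, here is a minimal sketch of a Character Error Rate computation, assuming the usual definition (Levenshtein edit distance divided by the length of the reference text). The function names are illustrative and the package may rely on a different implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(character_error_rate("STEAM", "STREAM"))  # one insertion over 5 characters: 0.2
print(levenshtein("STEAM", "STEMA"))            # a swap costs 2 edits under this metric
```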

Why a whole package for such a simple task?

In the case of inserting, deleting and substituting characters at random with a probability $p$, the Character Error Rate is simply the average number of those operations per character, so it will converge to the input value $p$ due to the Law of Large Numbers.
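
A small simulation illustrates this convergence for the delete action. It is a sketch under simple assumptions: each character is deleted independently with probability $p$, and the helper name is hypothetical. For pure deletions the edit distance equals the number of deleted characters, so the error rate can be read directly from the lengths.

```python
import random


def delete_at_random(text: str, p: float, rng: random.Random) -> str:
    """Drop each character independently with probability p."""
    return "".join(char for char in text if rng.random() >= p)


rng = random.Random(0)          # fixed seed so the run is reproducible
text = "abcdefghij" * 10_000    # 100,000 characters
p = 0.1

noisy = delete_at_random(text, p, rng)

# With deletions only, edit distance == number of deleted characters.
observed_rate = (len(text) - len(noisy)) / len(text)
print(f"target: {p}, observed: {observed_rate:.4f}")
```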

However, the case of swapping consecutive characters is not trivial at all for two reasons:

  • First, swapping two characters is not an "atomic operation" with respect to the Character Error Rate metric.

  • Second, we do not want to repeatedly swap the same character over and over again if the probability of applying the swap action is high:
    STEAM → TSEAM
    TSEAM → TESAM
    TESAM → TEASM
    TEASM → TEAMS
    This would be equivalent to STEAM → TEAMS, so this cannot be considered "swapping consecutive characters". To avoid this behavior, we must avoid swapping a character if it has just been swapped. This breaks the independence between one character and the following one, and makes the Law of Large Numbers inapplicable.
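
One way to sketch the "do not swap a character that has just been swapped" rule is to scan the text left to right and jump over both characters after each swap. This is an illustrative sketch with a hypothetical function name, not necessarily how the package implements it.

```python
import random


def swap_noise(text: str, p: float, rng: random.Random) -> str:
    """Swap consecutive characters with probability p, never moving a character twice."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the character that was just moved
        else:
            i += 1
    return "".join(chars)


# Even with p = 1.0 the leading "S" moves by one position only,
# instead of drifting to the end of the word.
print(swap_noise("STEAM", 1.0, random.Random(0)))  # TSAEM
```

Because the decision at one position depends on what happened at the previous one, the per-character events are no longer independent, which is the difficulty described above.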

We use Markov Chains to model the swapping of characters. This allows us to compute and correct the corresponding bias in order to make it straightforward for the user to get the desired Character Error Rate, as if the Law of Large Numbers could be applied!
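
The kind of bias involved can be seen in a toy model. Assume a scan with two states, "free" and "just swapped": from a free position a swap happens with probability $p$ and consumes two characters, otherwise one character is consumed. The long-run number of swaps per character is then $p / (1 + p)$ rather than $p$, and inverting that relation gives the probability to use for a target swap rate. This is only a simplified illustration with hypothetical names: the package's actual correction also has to account for how swaps are counted by the Character Error Rate, and is described in its documentation.

```python
import random


def count_swaps(n_chars: int, p: float, rng: random.Random) -> int:
    """Count swaps in a left-to-right scan that skips just-swapped characters."""
    swaps = 0
    i = 0
    while i < n_chars - 1:
        if rng.random() < p:
            swaps += 1
            i += 2  # "just swapped" state: the next character is not eligible
        else:
            i += 1  # "free" state: move on to the next character
    return swaps


def corrected_probability(target_rate: float) -> float:
    """Invert rate = p / (1 + p) to find the p that yields target_rate (toy model)."""
    if not 0 <= target_rate <= 0.5:
        raise ValueError("in this toy model the swap rate cannot exceed 0.5")
    return target_rate / (1 - target_rate)


n_chars = 200_000
p = 0.3
observed = count_swaps(n_chars, p, random.Random(0)) / n_chars
expected = p / (1 + p)
print(f"naive guess: {p}, toy-model prediction: {expected:.4f}, observed: {observed:.4f}")
```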

All the details of this unbiasing are here. The goal of this package is for the user to be confident in the result without worrying about the implementation details.


The documentation follows this plan:

  • You may want to follow a quick tutorial to learn the basics of the package.
  • The Results page illustrates how no calibration is needed in order to add noise to a corpus with a target Character Error Rate.
  • The How this works section explains the mechanisms and some design choices of this package. We have been extra careful to explain how some statistical biases have been avoided, so that the package is both user-friendly and correct. A dedicated page dives deep into the case of the swap action.
  • The API Reference details all the technical descriptions needed.

There is also a Medium article about this project.
