A Python library for generating errors in machine learning datasets

Pucktrick

Pucktrick is a Python library that provides utility functions for introducing errors into your dataframe. The library is named after Puck, the elf in William Shakespeare's “A Midsummer Night’s Dream”, who is famous for causing trouble and playing tricks on mortals and fairies alike.

Features

Pucktrick is organized in modules, one for each error type. Each module includes a main function (or an injector class) that takes as parameters the dataset to modify, the strategy dictionary, and, if mode="extended", the original dataset. Functions return two values: an error code (0 for success, 1 for failure or no modifications) and the generated dataset.
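
The calling convention described above can be sketched as follows. The function inject_stub and its body are hypothetical stand-ins, not part of Pucktrick's API; only the argument order (dataset, strategy, optional original dataset) and the (error code, dataset) return shape come from the description. Plain lists of dicts stand in for a dataframe to keep the sketch dependency-free.

```python
import copy


def inject_stub(data, strategy, original=None):
    """Hypothetical stand-in for a Pucktrick injector (not the real API).

    Returns (error_code, dataset): 0 on success, 1 if nothing was modified.
    """
    if strategy.get("mode") == "extended" and original is None:
        # Assumption: extended mode cannot proceed without the original dataset.
        return 1, data
    features = strategy.get("affected_features", [])
    if not data or not features:
        return 1, data
    corrupted = copy.deepcopy(data)
    # Toy corruption: blank out the first affected feature of the first row.
    corrupted[0][features[0]] = None
    return 0, corrupted


rows = [{"age": 31, "city": "Rome"}, {"age": 45, "city": "Turin"}]
strategy = {
    "affected_features": ["age"],
    "selection_criteria": "all",
    "percentage": 0.5,
    "mode": "new",
    "perturbate_data": {"sampling": "random"},
}

# Unpack the two return values and branch on the error code.
code, noisy_rows = inject_stub(rows, strategy)
if code == 0:
    print("corrupted:", noisy_rows)
else:
    print("no modifications were made")
```

Checking the error code before using the returned dataset avoids silently training on data that was never corrupted.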

The Strategy Configuration

The core of Pucktrick is the strategy configuration, which is passed as a JSON object or a Python dictionary. It allows you to precisely define the error model.

Base Parameters

{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {
    "sampling": "random"
  }
}
  • affected_features: A list of strings specifying the columns to be corrupted.
  • selection_criteria: A predicate (e.g., "age > 30") to target specific rows, or "all" to target the entire dataset.
  • percentage: A float (0.0 to 1.0) indicating the proportion of targeted rows to corrupt.
  • mode:
    • "new": Applies errors to a clean dataset.
    • "extended": Incrementally adds errors to a previously corrupted dataset, reading the original_df to avoid double-corrupting rows.
  • perturbate_data: A dictionary containing the noise injection logic.
    • sampling: How rows are chosen ("random", "uniform", "normal", "exponential").
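
Since the strategy can arrive as JSON or as a dictionary, a quick sanity check before injection can catch typos early. The sketch below loads the base configuration with the standard json module and checks it against the constraints listed above. The helper check_strategy is hypothetical and written only for illustration; Pucktrick may validate strategies differently or not at all.

```python
import json

ALLOWED_MODES = {"new", "extended"}
ALLOWED_SAMPLING = {"random", "uniform", "normal", "exponential"}


def check_strategy(strategy):
    """Return a list of problems found in a strategy dict (empty if none).

    Hypothetical helper, not part of Pucktrick.
    """
    problems = []
    features = strategy.get("affected_features")
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        problems.append("affected_features must be a list of strings")
    if not isinstance(strategy.get("selection_criteria"), str):
        problems.append("selection_criteria must be a predicate string or 'all'")
    pct = strategy.get("percentage")
    if isinstance(pct, bool) or not isinstance(pct, (int, float)) or not 0.0 <= pct <= 1.0:
        problems.append("percentage must be a number between 0.0 and 1.0")
    if strategy.get("mode") not in ALLOWED_MODES:
        problems.append("mode must be 'new' or 'extended'")
    perturbate = strategy.get("perturbate_data")
    sampling = perturbate.get("sampling") if isinstance(perturbate, dict) else None
    if sampling not in ALLOWED_SAMPLING:
        problems.append("perturbate_data.sampling is missing or not recognized")
    return problems


strategy_json = """
{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {"sampling": "random"}
}
"""

strategy = json.loads(strategy_json)
print(check_strategy(strategy))  # prints []
```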

Modules & Specific Configurations

1. Missing (missing.py)

Replaces values with NaN. Specifics: No special parameters required in perturbate_data.
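
To make the effect concrete, here is a minimal sketch of the idea (not Pucktrick's implementation): a random share of rows has the affected features replaced with NaN. The function name inject_missing, its parameters, and the use of plain lists of dicts instead of a dataframe are all assumptions made for illustration.

```python
import math
import random


def inject_missing(rows, features, percentage, seed=0):
    """Set `features` to NaN in a random `percentage` of rows (illustrative only)."""
    rng = random.Random(seed)
    n_target = round(len(rows) * percentage)
    chosen = set(rng.sample(range(len(rows)), n_target))
    # Build new row dicts so the input dataset is left untouched.
    return [
        {**row, **{f: math.nan for f in features}} if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]


rows = [{"age": 20 + i, "income": 1000.0 * i} for i in range(10)]
noisy = inject_missing(rows, ["income"], percentage=0.2)

n_missing = sum(math.isnan(r["income"]) for r in noisy)
print(n_missing)  # 2 of the 10 rows now hold NaN
```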

2. Outliers (outliers.py)

Injects outliers using a 3-sigma rule for continuous numeric data, domain expansion for categorical integers, or specific string tokens for text. Specifics: No special parameters required in perturbate_data.
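
For continuous numeric data, the 3-sigma rule treats a value as an outlier when it lies more than three standard deviations from the mean. The sketch below illustrates that rule under stated assumptions: the function name inject_outliers is hypothetical, and placing the new values between 3.5 and 4.5 standard deviations from the mean is an arbitrary choice, since the exact values Pucktrick generates are not documented here.

```python
import random
import statistics


def inject_outliers(values, percentage, seed=0):
    """Replace a share of values with points beyond 3 sigma (illustrative only)."""
    rng = random.Random(seed)
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    n_target = round(len(values) * percentage)
    result = list(values)
    for i in rng.sample(range(len(values)), n_target):
        sign = rng.choice((-1, 1))
        # Assumption: land between 3.5 and 4.5 standard deviations from the mean.
        result[i] = mean + sign * rng.uniform(3.5, 4.5) * std
    return result, mean, std


values = [float(v) for v in range(50, 70)]
noisy, mean, std = inject_outliers(values, percentage=0.1)

# Count values that violate the 3-sigma rule of the clean data.
outliers = [v for v in noisy if abs(v - mean) > 3 * std]
print(len(outliers))  # 2
```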

3. Duplicated (duplicated.py)

Duplicates existing rows and optionally applies text transformations. Specifics: Set "function" in the main strategy to apply text transformations like "shuffle_words", "abbreviate_text", "replace_punctuation", "remove_replace", or "upper_lower".
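
The sketch below shows one way row duplication with an optional text transformation could work. The transformation names "shuffle_words" and "upper_lower" come from the list above, but their behavior here (reordering words, swapping letter case) is a guess, and inject_duplicates and its parameters are hypothetical rather than Pucktrick's API.

```python
import random


def shuffle_words(text, rng):
    """Assumed behavior: return the words of `text` in random order."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def upper_lower(text, rng):
    """Assumed behavior: swap the case of every letter."""
    return text.swapcase()


TRANSFORMS = {"shuffle_words": shuffle_words, "upper_lower": upper_lower}


def inject_duplicates(rows, feature, percentage, function=None, seed=0):
    """Append copies of a random share of rows, optionally altering `feature`."""
    rng = random.Random(seed)
    n_target = round(len(rows) * percentage)
    duplicates = []
    for i in rng.sample(range(len(rows)), n_target):
        copy_row = dict(rows[i])
        if function is not None:
            copy_row[feature] = TRANSFORMS[function](copy_row[feature], rng)
        duplicates.append(copy_row)
    return rows + duplicates


rows = [{"id": i, "name": f"Item Number {i}"} for i in range(4)]
noisy = inject_duplicates(rows, "name", percentage=0.5, function="upper_lower")
print(len(noisy))  # 6: four originals plus two transformed copies
```

Transformed copies are near-duplicates rather than exact ones, which is what makes them harder for deduplication steps to catch.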

4. Noisy (noisy.py)

Adds random noise or a systematic shift to data (numeric, string, or datetime). Specifics: In perturbate_data, set "distribution": "shift" to apply systematic shifting. You must provide a "param" dictionary:

  • "shift_value": Numeric value to add (or days for dates).
  • "shift_unit": "absolute" or "std" (standard deviations).
  • "shift_sign": "positive", "negative", or "random".

Use "distribution": "random" for standard uniform noise.
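
The shift parameters can be read as a single signed offset applied to the selected values. The sketch below illustrates that reading. It assumes that "param" is nested inside perturbate_data and that the "std" unit refers to the standard deviation of the affected column; shift_amount is a hypothetical helper, not a Pucktrick function.

```python
import random
import statistics
from datetime import date, timedelta

perturbate_data = {
    "sampling": "random",
    "distribution": "shift",
    # Assumption: "param" sits inside perturbate_data.
    "param": {"shift_value": 2, "shift_unit": "std", "shift_sign": "negative"},
}


def shift_amount(values, param, rng):
    """Signed offset described by shift_value, shift_unit and shift_sign."""
    amount = param["shift_value"]
    if param["shift_unit"] == "std":
        # Assumption: scale by the standard deviation of the column itself.
        amount *= statistics.stdev(values)
    sign = {"positive": 1, "negative": -1}.get(param["shift_sign"])
    if sign is None:  # "random": pick a direction
        sign = rng.choice((-1, 1))
    return sign * amount


rng = random.Random(0)
values = [10.0, 12.0, 14.0, 16.0, 18.0]
offset = shift_amount(values, perturbate_data["param"], rng)
shifted = [v + offset for v in values]
print(shifted)

# For dates, shift_value is read as a number of days.
day = date(2024, 1, 31) + timedelta(days=3)
print(day)  # 2024-02-03
```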

5. Labels (labels.py)

Flips labels for binary or multi-class classification. Specifics: For multi-class labels in perturbate_data, set "noise_model" to:

  • "NCAR" (Noise Completely At Random): Uniform random flip.
  • "NAR" (Noise At Random): Class-dependent flip. Provide "flip_distribution" in param.
  • "NNAR" (Nearest Neighbor At Random): Flips labels of instances close to decision boundaries. Provide "features_for_similarity" in param.
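
The difference between the first two noise models can be sketched in a few lines. Under NCAR a flipped label moves to any other class with equal probability; under NAR the new class depends on the true class. The function names and the layout of flip_distribution (true class mapped to new-class probabilities) are assumptions for illustration, not Pucktrick's documented format.

```python
import random


def flip_ncar(labels, classes, percentage, seed=0):
    """NCAR: each chosen label moves to a different class picked uniformly."""
    rng = random.Random(seed)
    n_target = round(len(labels) * percentage)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_target):
        flipped[i] = rng.choice([c for c in classes if c != labels[i]])
    return flipped


def flip_nar(labels, flip_distribution, percentage, seed=0):
    """NAR: the new class is drawn from a distribution tied to the true class."""
    rng = random.Random(seed)
    n_target = round(len(labels) * percentage)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_target):
        targets = flip_distribution[labels[i]]
        flipped[i] = rng.choices(list(targets), weights=list(targets.values()))[0]
    return flipped


labels = ["cat", "dog", "bird"] * 10
noisy = flip_ncar(labels, ["cat", "dog", "bird"], percentage=0.2)
print(sum(a != b for a, b in zip(labels, noisy)))  # 6 labels changed

# Hypothetical layout: true class -> {new class: probability}.
flip_distribution = {
    "cat": {"dog": 0.9, "bird": 0.1},
    "dog": {"cat": 1.0},
    "bird": {"dog": 0.5, "cat": 0.5},
}
noisy_nar = flip_nar(labels, flip_distribution, percentage=0.2)
```

With the distribution above, a flipped "dog" always becomes "cat", which models a systematic confusion between two classes rather than uniform noise.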

Version

version 0.6.0.1

  • Codebase fully refactored using Object-Oriented Programming (OOP) with the Template Method Pattern.
  • Added systematic shift ("distribution": "shift") error type to the noisy module.
  • Standardized the strategy interface and improved the extended mode logic across all modules.

version 0.5.1

  • Added the multiclass definition.

version 0.5

  • Added the strategy: a JSON file that defines an error model by specifying the affected features (one or more), a selection criterion (a Boolean predicate that selects the subset of rows to corrupt), the mode, the percentage, and the distribution function used to inject errors.

version 0.4

  • Error type added: missing values

version 0.3

  • Error type added: duplicated

version 0.2

  • Error type added: outliers

version 0.1

  • Error types added: noisy errors and inconsistent labels

Installation

You can install pucktrick using pip:

pip install pucktrick

Contributing

We welcome contributions from the community. To contribute:

  • Fork the repository.
  • Create a new branch (git checkout -b feature/your-feature).
  • Commit your changes (git commit -am 'Add new feature').
  • Push to the branch (git push origin feature/your-feature).
  • Create a new Pull Request.

Please ensure your code adheres to our coding standards and includes appropriate tests.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). See the LICENSE file for details.

Acknowledgements

Thanks to the contributors and open-source community for their support.
