A Python library for generating errors in machine learning datasets

Pucktrick

Pucktrick is a Python library that provides utility functions for introducing errors into your dataframe. The library is named after Puck, the mischievous sprite in William Shakespeare's "A Midsummer Night's Dream", who is famous for causing trouble and playing tricks on mortals and fairies alike.

Features

Pucktrick is organized into modules, one per error type. Each module exposes a main function (or an injector class) that takes the dataset to modify, the strategy dictionary, and, when mode="extended" or mode="composed", the original dataset. Each function returns two values: an error descriptor and the generated dataset.

For standard error modules, the error descriptor is an integer code (0 = success, 1 = no modifications applied). For the drift module, it is a dictionary with richer information (see Drift Simulation).
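
As a minimal sketch of how a caller might handle both descriptor shapes, the helper below maps either form to a boolean. The name was_modified is hypothetical and not part of Pucktrick; it assumes only the conventions stated in this README: integer 0 for success and 1 for no modifications, and, for drift, a dictionary whose "errore" key is "yes" or "no".

```python
def was_modified(error):
    """Return True if the descriptor reports that the dataset was changed.

    Hypothetical helper, not part of Pucktrick.
    """
    if isinstance(error, dict):
        # Drift module: dictionary with an "errore" key set to "yes" or "no".
        return error.get("errore") == "yes"
    # Standard modules: 0 = success, 1 = no modifications applied.
    return error == 0


print(was_modified(0))                  # True
print(was_modified(1))                  # False
print(was_modified({"errore": "yes"}))  # True
```

Checking the descriptor before using the returned dataset avoids silently training on data that was never corrupted.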

The Strategy Configuration

The core of Pucktrick is the strategy configuration, which is passed as a JSON object or a Python dictionary. It allows you to precisely define the error model.

Base Parameters

{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {
    "sampling": "random"
  }
}
  • affected_features: A list of strings specifying the columns to be corrupted.
  • selection_criteria: A predicate (e.g., "age > 30") to target specific rows, or "all" to target the entire dataset.
  • percentage: A float (0.0 to 1.0) indicating the proportion of targeted rows to corrupt.
  • mode:
    • "new": Applies errors independently to the clean baseline dataset D0. Each call is stateless.
    • "extended": Incrementally adds errors to a previously corrupted dataset. Reads original_df (the clean D0) to identify rows already modified and adds corruption only to unmodified rows, up to the cumulative percentage target. No row is corrupted twice.
    • "composed": Applies errors exclusively to rows that have already been modified by a previous operator, using original_df (the clean D0) to identify them via a row-level, NaN-aware comparison across all columns. The percentage parameter controls what fraction of the already-modified set to corrupt. This enables cross-type corruption pipelines where heterogeneous errors are layered on the same row subset.
  • perturbate_data: A dictionary containing the noise injection logic.
    • sampling: How rows are chosen ("random", "uniform", "normal", "exponential").
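
The parameters above can be assembled and sanity-checked before they reach the library. The sketch below uses a hypothetical make_strategy helper (not part of Pucktrick) that only enforces the value ranges listed in this section, then round-trips the result through JSON to show that both representations carry the same content.

```python
import json

ALLOWED_MODES = {"new", "extended", "composed"}
ALLOWED_SAMPLING = {"random", "uniform", "normal", "exponential"}


def make_strategy(features, percentage, mode="new",
                  selection_criteria="all", sampling="random"):
    """Build a strategy dictionary after basic checks (hypothetical helper)."""
    if not features:
        raise ValueError("affected_features must list at least one column")
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if sampling not in ALLOWED_SAMPLING:
        raise ValueError(f"unknown sampling: {sampling!r}")
    return {
        "affected_features": list(features),
        "selection_criteria": selection_criteria,
        "percentage": percentage,
        "mode": mode,
        "perturbate_data": {"sampling": sampling},
    }


strategy = make_strategy(["age", "income"], 0.2, selection_criteria="age > 30")

# The same configuration as a JSON document, e.g. for storing in a file.
strategy_json = json.dumps(strategy, indent=2)
print(json.loads(strategy_json) == strategy)  # True
```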

Accumulation Modes: Summary

| Mode | Eligible rows | percentage applies to | Requires original_df |
|---|---|---|---|
| new | All rows | Full eligible set | No |
| extended | Rows not yet modified | Full eligible set (cumulative) | Yes |
| composed | Rows already modified in any column | Already-modified set | Yes |
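
To make the three modes concrete, the following pure-Python sketch computes which rows are eligible and how many would be corrupted. It is an illustration of the rules described above, not Pucktrick's implementation: rows are plain tuples, the function names are hypothetical, selection_criteria is ignored, and the cumulative target in "extended" mode is assumed to be percentage times the total number of rows.

```python
import math


def same_value(a, b):
    """NaN-aware equality: two NaN values count as equal."""
    a_nan = isinstance(a, float) and math.isnan(a)
    b_nan = isinstance(b, float) and math.isnan(b)
    return (a_nan and b_nan) or a == b


def modified_rows(current, original):
    """Indices of rows that differ from the clean baseline in any column."""
    return [
        i for i, (cur, orig) in enumerate(zip(current, original))
        if not all(same_value(c, o) for c, o in zip(cur, orig))
    ]


def plan(current, original, mode, percentage):
    """Return (eligible row indices, number of rows to corrupt)."""
    n = len(current)
    if mode == "new":
        return list(range(n)), round(percentage * n)
    changed = modified_rows(current, original)
    if mode == "extended":
        changed_set = set(changed)
        untouched = [i for i in range(n) if i not in changed_set]
        # Cumulative target: top up to `percentage` of all rows.
        return untouched, max(0, round(percentage * n) - len(changed))
    if mode == "composed":
        return changed, round(percentage * len(changed))
    raise ValueError(f"unknown mode: {mode!r}")


nan = float("nan")
original = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
current = [(1, 10), (nan, 20), (3, 30), (nan, 40), (5, 50)]  # rows 1 and 3 hold NaN

print(plan(current, original, "extended", 0.6))  # ([0, 2, 4], 1)
print(plan(current, original, "composed", 1.0))  # ([1, 3], 2)
```

With two of five rows already corrupted, an "extended" call at 60% only needs one more row, while a "composed" call at 100% targets exactly the two rows that were already touched.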

Example: Composed Pipeline

from pucktrick.missing import missing
from pucktrick.outliers import outlier

# df is assumed to be the clean baseline DataFrame (D0) with columns c1 and c2.
# Step 1: inject missing values on c1 (20% of rows), mode="new"
strategy_s1 = {
    "affected_features": ["c1"],
    "selection_criteria": "c1 == c1",
    "percentage": 0.20,
    "mode": "new",
    "perturbate_data": {"sampling": "random"}
}
err1, D1 = missing(df, strategy_s1)

# Step 2 — inject outliers on c2, mode="composed"
# Acts exclusively on the rows already modified by Step 1
strategy_s2 = {
    "affected_features": ["c2"],
    "selection_criteria": "c2 == c2",
    "percentage": 1.0,
    "mode": "composed",
    "perturbate_data": {"sampling": "random"}
}
err2, D2 = outlier(D1, strategy_s2, original_df=df)
# D2: rows with NaN in c1 coincide exactly with rows with outliers in c2

Modules & Specific Configurations

Error Injection Modules

1. Missing (missing.py)

Replaces values with NaN. Specifics: No special parameters required in perturbate_data.

2. Outliers (outliers.py)

Injects outliers using a 3-sigma rule for continuous numeric data, domain expansion for categorical integers, or specific string tokens for text. Specifics: No special parameters required in perturbate_data.
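
For the continuous numeric case, a 3-sigma rule means the injected value lies more than three standard deviations from the column mean. The sketch below shows one way to produce such a value; the function name and the extra margin parameter are assumptions made for illustration, and the exact formula Pucktrick uses may differ.

```python
import random
import statistics


def three_sigma_outlier(values, rng, extra=0.5):
    """Return a value more than 3 standard deviations from the mean.

    `extra` is a hypothetical margin that pushes the value past the boundary.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    sign = rng.choice([-1, 1])  # outlier above or below the mean
    return mean + sign * (3 + extra) * std


rng = random.Random(0)
ages = [31, 35, 29, 40, 38, 33, 36]
outlier_value = three_sigma_outlier(ages, rng)

# The generated value violates the 3-sigma bound by construction.
print(abs(outlier_value - statistics.mean(ages)) > 3 * statistics.stdev(ages))  # True
```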

3. Duplicated (duplicated.py)

Duplicates existing rows and optionally applies text transformations. Specifics: Set "function" in the main strategy to apply text transformations like "shuffle_words", "abbreviate_text", "replace_punctuation", "remove_replace", or "upper_lower".

4. Noisy (noisy.py)

Adds random noise or a systematic shift to data (numeric, string, or datetime). Specifics: In perturbate_data, set "distribution": "shift" to apply systematic shifting. You must provide a "param" dictionary:

  • "shift_value": Numeric value to add (or days for dates).
  • "shift_unit": "absolute" or "std" (standard deviations).
  • "shift_sign": "positive", "negative", or "random".

(Use "distribution": "random" for standard uniform noise).
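
The shift parameters can be read as a small formula: the amount is shift_value, optionally multiplied by the column's standard deviation, with a sign chosen by shift_sign. The sketch below illustrates this for numeric data only. The apply_shift function is hypothetical, and it assumes that a "random" sign is drawn once for the whole column; the library may handle that case differently.

```python
import random
import statistics

# Configuration fragment using the keys described above.
perturbate_data = {
    "sampling": "random",
    "distribution": "shift",
    "param": {"shift_value": 2.0, "shift_unit": "std", "shift_sign": "negative"},
}


def apply_shift(values, shift_value, shift_unit="absolute",
                shift_sign="positive", rng=None):
    """Shift every value by a fixed amount (illustrative, numeric data only)."""
    rng = rng or random.Random()
    if shift_unit == "std":
        # Express the shift in standard deviations of the column.
        amount = shift_value * statistics.stdev(values)
    elif shift_unit == "absolute":
        amount = shift_value
    else:
        raise ValueError(f"unknown shift_unit: {shift_unit!r}")
    if shift_sign == "positive":
        sign = 1
    elif shift_sign == "negative":
        sign = -1
    elif shift_sign == "random":
        sign = rng.choice([-1, 1])  # assumption: one draw for the whole column
    else:
        raise ValueError(f"unknown shift_sign: {shift_sign!r}")
    return [v + sign * amount for v in values]


print(apply_shift([1.0, 2.0, 3.0], 2.0))                          # [3.0, 4.0, 5.0]
print(apply_shift([1.0, 2.0, 3.0], **perturbate_data["param"]))   # [-1.0, 0.0, 1.0]
```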

5. Labels (labels.py)

Flips labels for binary or multi-class classification. Specifics: For multi-class labels in perturbate_data, set "noise_model" to:

  • "NCAR" (Noise Completely At Random): Uniform random flip.
  • "NAR" (Noise At Random): Class-dependent flip. Provide "flip_distribution" in param.
  • "NNAR" (Nearest Neighbor At Random): Flips labels of instances close to decision boundaries. Provide "features_for_similarity" in param.
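
As an illustration of the simplest of the three models, the sketch below implements a uniform random flip: a fixed fraction of labels is chosen at random and each is replaced by a different class picked uniformly. The flip_ncar function is a hypothetical stand-in written for this README, not the library's code, and it does not cover the class-dependent or similarity-based models.

```python
import random


def flip_ncar(labels, fraction, rng):
    """Flip `fraction` of the labels to a different class chosen uniformly."""
    classes = sorted(set(labels))
    n_flip = round(fraction * len(labels))
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        # Exclude the current class so that every selected label really changes.
        alternatives = [c for c in classes if c != labels[i]]
        flipped[i] = rng.choice(alternatives)
    return flipped


rng = random.Random(42)
labels = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog", "bird", "cat"]
noisy = flip_ncar(labels, 0.3, rng)

changed = [i for i, (a, b) in enumerate(zip(labels, noisy)) if a != b]
print(len(changed))  # 3
```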

Drift Simulation

Pucktrick supports the simulation of dataset drift modelled as temporal corruption policies applied to dataset segments. Drift is exposed through a single unified function drift() that follows the same calling convention as all other modules:

from pucktrick.drift import drift

strategy = {
    "affected_features": ["f1", "f2"],
    "selection_criteria": "all",
    "percentage": 0.35,          # fraction of rows affected per chunk
    "mode": "new",
    "perturbate_data": {
        "sampling": "random",
        "target_col": "target",  # omit for auto-detection
        "chunks": {
            "0": None,           # baseline segment, no drift
            "1": None,
            "2": { ... },        # drift configuration for segment 2
            "3": { ... },
        }
    }
}

error, df_modified = drift(df, strategy)

Unlike other modules, the error return value is a dictionary:

{
    "errore": "yes",              # "yes" if any modification occurred, "no" otherwise
    "change_points": [100, 200, 300],   # row indices delimiting chunk boundaries
    "chunks": {
        "0": {"start": 0,   "end": 100, "drift_applied": False},
        "1": {"start": 100, "end": 200, "drift_applied": False},
        "2": {"start": 200, "end": 300, "drift_applied": True},
        "3": {"start": 300, "end": 400, "drift_applied": True},
    }
}
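
A descriptor of this shape is easy to post-process. The sketch below reuses the example dictionary above and lists the segments where drift was applied; drifted_chunks is a hypothetical helper, and it makes no assumption about whether "end" is inclusive or exclusive.

```python
error = {
    "errore": "yes",
    "change_points": [100, 200, 300],
    "chunks": {
        "0": {"start": 0, "end": 100, "drift_applied": False},
        "1": {"start": 100, "end": 200, "drift_applied": False},
        "2": {"start": 200, "end": 300, "drift_applied": True},
        "3": {"start": 300, "end": 400, "drift_applied": True},
    },
}


def drifted_chunks(error):
    """Return (chunk id, start, end) for each chunk where drift was applied."""
    ordered = sorted(error["chunks"].items(), key=lambda item: int(item[0]))
    return [
        (chunk_id, info["start"], info["end"])
        for chunk_id, info in ordered
        if info["drift_applied"]
    ]


print(drifted_chunks(error))  # [('2', 200, 300), ('3', 300, 400)]
```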

Note: sampling and distribution parameters inside perturbate_data are not used by the drift module — row selection within each chunk is always random and controlled exclusively by percentage.
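
For reference, a fully spelled-out drift strategy might look like the sketch below, which fills segments 2 and 3 with the covariate noise and covariate offset configurations documented in the following sections. The feature names are placeholders, and whether affected_features must match each chunk's "features" list is an assumption here. The JSON round trip shows that baseline segments set to None survive as null, so the same strategy can be stored in a file.

```python
import json

strategy = {
    "affected_features": ["temp", "humidity"],
    "selection_criteria": "all",
    "percentage": 0.35,
    "mode": "new",
    "perturbate_data": {
        "target_col": "target",
        "chunks": {
            "0": None,  # baseline segment, no drift
            "1": None,
            "2": {
                "drift_type": "covariate_noise",
                "features": ["temp", "humidity"],
                "noise_mode": "relative",
                "noise_std": 0.08,
                "shape": "segment",
            },
            "3": {
                "drift_type": "covariate_offset",
                "features": ["temp", "humidity"],
                "offset_mode": "relative",
                "offset_scale": 0.20,
                "direction": "up",
                "shape": "step",
            },
        },
    },
}

# None is written as null in JSON and read back as None.
restored = json.loads(json.dumps(strategy, indent=2))
drifting = [cid for cid, cfg in restored["perturbate_data"]["chunks"].items()
            if cfg is not None]
print(drifting)  # ['2', '3']
```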

Drift Types Summary

| Drift type | Distribution affected | Module | drift_type value |
|---|---|---|---|
| Data drift (covariate noise) | $P(X)$ changes | covariate_noise_drift | "covariate_noise" |
| Data drift (offset) | $P(X)$ changes | covariate_offset_drift | "covariate_offset" |
| Concept drift (target offset) | $P(Y \mid X)$ changes | offset_drift | "concept" |
| Concept drift (feature rotation) | $P(Y \mid X)$ changes | concept_drift | "concept_rotation" |
| Label drift (prior shift) | $P(Y)$ changes | prior_multinomial_drift | "prior_multinomial" |
| Target scaling | $P(Y)$ changes | target_scaling_drift | "target_scaling" |
| Generic (all types) | configurable | drift_generic | any |

6. Covariate Noise Drift (covariate_noise_drift.py)

Adds progressive Gaussian noise to selected features, simulating data drift where $P(X)$ shifts over time.

"2": {
    "drift_type": "covariate_noise",
    "features": ["temp", "humidity"],
    "noise_mode": "relative",
    "noise_std": 0.08,
    "shape": "segment"
}
  • "noise_mode": "relative" (noise proportional to feature std) or "absolute"
  • "noise_std": magnitude of Gaussian noise
  • "shape": "segment" (this chunk only) or "step" (persists in subsequent chunks)
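
The difference between the two noise modes can be sketched in a few lines. The function below is a hypothetical illustration, not the module's code: in "relative" mode the Gaussian standard deviation is noise_std times the feature's own standard deviation, in "absolute" mode it is noise_std itself. The progressive aspect of the drift is not modelled here.

```python
import random
import statistics


def add_gaussian_noise(values, noise_std, noise_mode="relative", rng=None):
    """Add zero-mean Gaussian noise to a list of numbers (illustrative)."""
    rng = rng or random.Random()
    if noise_mode == "relative":
        # Scale the noise by the feature's own standard deviation.
        scale = noise_std * statistics.stdev(values)
    elif noise_mode == "absolute":
        scale = noise_std
    else:
        raise ValueError(f"unknown noise_mode: {noise_mode!r}")
    return [v + rng.gauss(0.0, scale) for v in values]


rng = random.Random(7)
temp = [20.1, 21.4, 19.8, 22.0, 20.7, 21.1]
noisy_temp = add_gaussian_noise(temp, noise_std=0.08, rng=rng)
print(len(noisy_temp) == len(temp))  # True
```

Relative mode keeps the perturbation comparable across features with very different ranges, which is why it is a convenient default for mixed sensor data.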

7. Covariate Offset Drift (covariate_offset_drift.py)

Applies a systematic directional offset to selected features, simulating sensor calibration drift.

"2": {
    "drift_type": "covariate_offset",
    "features": ["temp", "humidity"],
    "offset_mode": "relative",
    "offset_scale": 0.20,
    "direction": "up",
    "shape": "step"
}
  • "offset_mode": "relative" or "absolute"
  • "offset_scale": magnitude of the offset
  • "direction": "up", "down", or "random"
  • "shape": "segment" or "step"
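
The "shape" option decides how long a drift stays in effect: "segment" limits it to the configured chunk, while "step" lets it persist in subsequent chunks. The hypothetical helper below illustrates that rule for a single drift configuration; how the library combines a persisting step with a later chunk's own configuration is not covered.

```python
def active_chunks(chunk_ids, drift_chunk, shape):
    """Chunks in which a drift configured on `drift_chunk` is in effect."""
    ordered = sorted(chunk_ids, key=int)
    start = ordered.index(drift_chunk)
    if shape == "segment":
        return [drift_chunk]    # this chunk only
    if shape == "step":
        return ordered[start:]  # persists in subsequent chunks
    raise ValueError(f"unsupported shape: {shape!r}")


chunks = ["0", "1", "2", "3"]
print(active_chunks(chunks, "2", "segment"))  # ['2']
print(active_chunks(chunks, "2", "step"))     # ['2', '3']
```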

8. Concept Drift — Target Offset (offset_drift.py)

Shifts the target variable using a percentage offset, simulating concept drift where $P(Y \mid X)$ changes.

"2": {
    "drift_type": "concept",
    "features": ["<TARGET>"],
    "offset_perc": 0.50,
    "offset_mode": "add",
    "base": "mean",
    "shape": "step",
    "direction": "up"
}
  • "offset_perc": fractional offset applied to the base value
  • "offset_mode": "add" or "mul" (multiplicative)
  • "base": "mean", "median", "std", or "quantile"
  • "shape": "step", "ramp", "spike", or "sin"
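
One plausible reading of these parameters, shown here purely as an assumption, is that "add" shifts the target by offset_perc times a base statistic, while "mul" scales the target by 1 plus offset_perc. The offset_target function below is hypothetical, handles only a constant (step-like) offset, and omits the "quantile" base and the other shapes.

```python
import statistics

BASES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.stdev,
}


def offset_target(y, offset_perc, offset_mode="add", base="mean", direction="up"):
    """Apply a constant offset to the target (assumed semantics, see lead-in)."""
    sign = 1 if direction == "up" else -1
    if offset_mode == "add":
        # Additive: shift by a fraction of the chosen base statistic.
        delta = sign * offset_perc * BASES[base](y)
        return [v + delta for v in y]
    if offset_mode == "mul":
        # Multiplicative: scale every value by the same factor.
        return [v * (1 + sign * offset_perc) for v in y]
    raise ValueError(f"unknown offset_mode: {offset_mode!r}")


y = [10.0, 20.0, 30.0]
print(offset_target(y, 0.50))                     # [20.0, 30.0, 40.0]
print(offset_target(y, 0.50, offset_mode="mul"))  # [15.0, 30.0, 45.0]
```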

9. Concept Drift — Feature Rotation (concept_drift.py)

Permutes or cycles feature values across instances, breaking the feature-label relationship without altering marginal distributions.

"2": {
    "drift_type": "concept_rotation",
    "severity": 0.65,
    "rotation_mode": "cycle",
    "shape": "step"
}
  • "severity": fraction of features involved (0.0–1.0)
  • "rotation_mode": "cycle" or "permute"
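
The key property of this drift is that a feature's values are reassigned to other rows, so the column keeps the same set of values while its link to the label is broken. The sketch below shows both modes on a single column; rotate_feature is a hypothetical helper, the cycle is assumed to move values by one position, and "severity" is not modelled.

```python
import random


def rotate_feature(values, rotation_mode="cycle", rng=None):
    """Reassign a column's values across rows without changing the value set."""
    rng = rng or random.Random()
    if rotation_mode == "cycle":
        # Move every value one row down; the last value wraps to the top.
        return values[-1:] + values[:-1]
    if rotation_mode == "permute":
        shuffled = list(values)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown rotation_mode: {rotation_mode!r}")


x = [1, 2, 3, 4, 5]
cycled = rotate_feature(x, "cycle")
permuted = rotate_feature(x, "permute", random.Random(0))

print(cycled)                          # [5, 1, 2, 3, 4]
print(sorted(permuted) == sorted(x))   # True
```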

10. Label Drift — Prior Multinomial (prior_multinomial_drift.py)

Resamples the class distribution according to a user-specified probability vector, simulating prior probability shift $P(Y)$.

"2": {
    "drift_type": "prior_multinomial",
    "features": ["<TARGET>"],
    "bins": 3,
    "class_probs_list": [0.05, 0.15, 0.80],
    "temperature": 0.6
}
  • "bins": number of bins for numeric columns
  • "class_probs_list": probability vector for each bin/class
  • "temperature": sharpens (< 1.0) or flattens (> 1.0) the distribution
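
A common way to implement a temperature with this behaviour is to raise each probability to the power 1/T and renormalize. The sketch below uses that formula as an assumption; the library's exact computation is not documented here, and apply_temperature is a hypothetical name.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens, T > 1 flattens."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


probs = [0.05, 0.15, 0.80]
sharp = apply_temperature(probs, 0.6)
flat = apply_temperature(probs, 2.0)

# The dominant class gains mass when sharpened and loses mass when flattened.
print(sharp[2] > probs[2] > flat[2])  # True
```

A temperature of exactly 1.0 leaves the vector unchanged, which makes it a natural neutral setting.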

11. Target Scaling (target_scaling_drift.py)

Applies a multiplicative scaling factor to the numeric target variable.

"2": {
    "drift_type": "target_scaling",
    "scale_perc": 0.10,
    "shape": "segment"
}
  • "scale_perc": fractional increase (e.g., 0.10 multiplies target by 1.10)
  • "scale_factor": direct multiplicative factor (alternative to scale_perc)
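
The two parameters describe the same multiplication in different terms. The sketch below resolves either one to a single factor and applies it; the helper names are hypothetical, and giving scale_factor precedence when both keys are present is an assumption.

```python
def resolve_scale_factor(config):
    """Return the multiplicative factor implied by a target_scaling config."""
    if "scale_factor" in config:
        return config["scale_factor"]  # assumption: takes precedence if present
    return 1.0 + config["scale_perc"]


def scale_target(y, config):
    """Multiply every target value by the resolved factor."""
    factor = resolve_scale_factor(config)
    return [v * factor for v in y]


config = {"drift_type": "target_scaling", "scale_perc": 0.10, "shape": "segment"}
print(round(resolve_scale_factor(config), 2))      # 1.1
print(scale_target([100.0], {"scale_factor": 2}))  # [200.0]
```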

12. Generic Drift (drift_generic.py)

A unified module supporting all drift types above plus additional specialized types (conditional, offset_time, seasonal_shift, prior_bool, concept_ord_shift, and others).


Version

version 1.0.1

  • Unified drift interface: drift() now follows the same calling convention as all other error modules (error, df_modified = drift(df, strategy)).
  • The error return value for drift is now a dictionary containing errore (yes/no), change_points, and per-chunk metadata (start, end, drift_applied).
  • Removed debug print statements from drift.py and drift_generic.py.

version 1.0.0

  • Added drift simulation modules: covariate_noise_drift, covariate_offset_drift, offset_drift, concept_drift, prior_multinomial_drift, target_scaling_drift, drift_generic. All modules support both strategy_path (JSON file) and strategy (Python dict) as input.
  • Added unified drift.py wrapper exposing a drift() function compatible with the Pucktrick strategy interface.
  • All drift modules integrated and tested with synthetic datasets.

version 0.6.1.1

  • Added composed mode to all modules.
  • Added _is_row_modified method to BaseErrorInjector for row-level modification tracking.
  • Fixed _get_modifiable_mask in MissingErrorInjector and LabelErrorInjector.
  • Fixed type normalization in NAR label flip for integer target columns.

version 0.6.0.1

  • Codebase fully refactored using Object-Oriented Programming with the Template Method Pattern.
  • Added systematic shift ("distribution": "shift") to the noisy module.
  • Standardized the strategy interface and improved extended mode logic across all modules.

version 0.5.1

  • Added multi-class definition.

version 0.5

  • Added strategy JSON configuration.

version 0.4

  • Error type added: missing values.

version 0.3

  • Error type added: duplicated.

version 0.2

  • Error type added: outliers.

version 0.1

  • Error types added: noisy errors and inconsistent labels.

Installation

pip install pucktrick

Contributing

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/your-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/your-feature)
  5. Create a new Pull Request

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0); see the LICENSE file for details.

Acknowledgements

Thanks to the contributors and open-source community for their support.
