A Python library for generating errors in machine learning datasets

These details have not been verified by PyPI

Project links

Homepage

License
- Other/Proprietary License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Pucktrick

Pucktrick is a Python library that provides utility functions for introducing errors into your dataframe. The library is named after Puck, the mischievous sprite in William Shakespeare's "A Midsummer Night's Dream", who is famous for causing trouble and playing tricks on mortals and fairies alike.

Features

Pucktrick is organized into modules, one for each error type. Each module includes a main function (or an injector class) that takes the dataset to modify, the strategy dictionary, and, when mode="extended" or mode="composed", the original dataset. Each function returns two values: an error descriptor and the generated dataset.

For standard error modules, the error descriptor is an integer code (0 = success, 1 = no modifications applied). For the drift module, it is a dictionary with richer information (see Drift Simulation).
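
Because the descriptor's type differs between modules, calling code may want to normalize it before branching on it. The following is a minimal sketch that relies only on the conventions stated above (integer 0 means success, and the drift dictionary carries an "errore" key set to "yes" or "no"). The helper name was_modified is hypothetical and is not part of Pucktrick.

```python
def was_modified(error):
    """Return True if the descriptor reports that the dataset was changed."""
    if isinstance(error, dict):
        # Drift module: dictionary descriptor with an "errore" flag.
        return error.get("errore") == "yes"
    # Standard modules: 0 = success, 1 = no modifications applied.
    return error == 0


print(was_modified(0))                  # True
print(was_modified(1))                  # False
print(was_modified({"errore": "yes"}))  # True
```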

The Strategy Configuration

The core of Pucktrick is the strategy configuration, which is passed as a JSON object or a Python dictionary. It allows you to precisely define the error model.

Base Parameters

{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {
    "sampling": "random"
  }
}

affected_features: A list of strings specifying the columns to be corrupted.
selection_criteria: A predicate (e.g., "age > 30") to target specific rows, or "all" to target the entire dataset.
percentage: A float (0.0 to 1.0) indicating the proportion of targeted rows to corrupt.
mode:
- "new": Applies errors independently to the clean baseline dataset D0. Each call is stateless.
- "extended": Incrementally adds errors to a previously corrupted dataset. Reads original_df (the clean D0) to identify rows already modified and adds corruption only to unmodified rows, up to the cumulative percentage target. No row is corrupted twice.
- "composed": Applies errors exclusively to rows that have already been modified by a previous operator, using original_df (the clean D0) to identify them via a row-level, NaN-aware comparison across all columns. The percentage parameter controls what fraction of the already-modified set to corrupt. This enables cross-type corruption pipelines where heterogeneous errors are layered on the same row subset.
perturbate_data: A dictionary containing the noise injection logic.
- sampling: How rows are chosen ("random", "uniform", "normal", "exponential").
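
Since a strategy may arrive as JSON text, it can help to parse and sanity-check it before passing it to a module. The sketch below uses only the standard library. The helper check_strategy is hypothetical, and treating all five base keys as mandatory is an assumption made for illustration; Pucktrick may apply its own defaults and validation.

```python
import json

# Assumption: all base parameters listed above are treated as required here.
REQUIRED_KEYS = {"affected_features", "selection_criteria",
                 "percentage", "mode", "perturbate_data"}
VALID_MODES = {"new", "extended", "composed"}


def check_strategy(strategy):
    """Return a list of problems found in a strategy dict (empty if none)."""
    problems = [f"missing key: {k}"
                for k in sorted(REQUIRED_KEYS - set(strategy))]
    pct = strategy.get("percentage")
    if not isinstance(pct, (int, float)) or not 0.0 <= pct <= 1.0:
        problems.append("percentage must be a number between 0.0 and 1.0")
    if strategy.get("mode") not in VALID_MODES:
        problems.append("mode must be 'new', 'extended', or 'composed'")
    return problems


strategy_json = """
{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {"sampling": "random"}
}
"""
strategy = json.loads(strategy_json)  # JSON text -> Python dictionary
print(check_strategy(strategy))       # []
```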

Accumulation Modes: Summary

Mode	Eligible rows	`percentage` applies to	Requires `original_df`
`new`	All rows	Full eligible set	No
`extended`	Rows not yet modified	Full eligible set (cumulative)	Yes
`composed`	Rows already modified in any column	Already-modified set	Yes
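
To make the table concrete, the sketch below derives the eligible row set for each mode from a clean baseline and a corrupted copy, using a NaN-aware row comparison. It is an illustration in plain Python (rows as dictionaries), not Pucktrick's implementation; the helper names same_value, modified_rows, and eligible_rows are hypothetical.

```python
import math


def same_value(a, b):
    """NaN-aware equality: two NaNs count as equal."""
    both_nan = (isinstance(a, float) and isinstance(b, float)
                and math.isnan(a) and math.isnan(b))
    return both_nan or a == b


def modified_rows(original, current):
    """Indices of rows that differ from the clean baseline in any column."""
    return [i for i, (o, c) in enumerate(zip(original, current))
            if any(not same_value(o[k], c[k]) for k in o)]


def eligible_rows(mode, original, current):
    """Row indices a new operator may touch under each accumulation mode."""
    changed = set(modified_rows(original, current))
    if mode == "new":
        return list(range(len(current)))
    if mode == "extended":
        return [i for i in range(len(current)) if i not in changed]
    if mode == "composed":
        return sorted(changed)
    raise ValueError(f"unknown mode: {mode}")


# D0 is the clean baseline; row 3 already contains a genuine NaN.
D0 = [{"c1": 1.0, "c2": 10}, {"c1": 2.0, "c2": 20},
      {"c1": 3.0, "c2": 30}, {"c1": float("nan"), "c2": 40}]
D1 = [dict(row) for row in D0]
D1[1]["c1"] = float("nan")  # simulate a missing value injected into row 1

print(eligible_rows("new", D0, D1))       # [0, 1, 2, 3]
print(eligible_rows("extended", D0, D1))  # [0, 2, 3]
print(eligible_rows("composed", D0, D1))  # [1]
```

Row 3 is not reported as modified because its NaN is present in both datasets, which is why the comparison has to be NaN-aware.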

Example: Composed Pipeline

from pucktrick.missing import missing
from pucktrick.outliers import outlier

# df is the clean baseline dataset D0, with columns c1 and c2
# Step 1: inject missing values on c1 (20% of rows), mode="new"
strategy_s1 = {
    "affected_features": ["c1"],
    "selection_criteria": "c1 == c1",
    "percentage": 0.20,
    "mode": "new",
    "perturbate_data": {"sampling": "random"}
}
err1, D1 = missing(df, strategy_s1)

# Step 2 — inject outliers on c2, mode="composed"
# Acts exclusively on the rows already modified by Step 1
strategy_s2 = {
    "affected_features": ["c2"],
    "selection_criteria": "c2 == c2",
    "percentage": 1.0,
    "mode": "composed",
    "perturbate_data": {"sampling": "random"}
}
err2, D2 = outlier(D1, strategy_s2, original_df=df)
# D2: rows with NaN in c1 coincide exactly with rows with outliers in c2

Modules & Specific Configurations

Error Injection Modules

1. Missing (`missing.py`)

Replaces values with NaN. Specifics: No special parameters required in perturbate_data.

2. Outliers (`outliers.py`)

Injects outliers using a 3-sigma rule for continuous numeric data, domain expansion for categorical integers, or specific string tokens for text. Specifics: No special parameters required in perturbate_data.
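
As an illustration of the 3-sigma idea for continuous data, the sketch below produces a replacement value that lies more than three standard deviations from the column mean. This is not the library's code: the helper three_sigma_outlier, the use of the population standard deviation, and the 3.5 to 5.0 sigma range are all assumptions chosen for the example.

```python
import random
import statistics


def three_sigma_outlier(values, rng):
    """Return a value lying more than 3 standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    sign = rng.choice([-1, 1])  # outlier above or below the mean
    # Assumption: land between 3.5 and 5 standard deviations away.
    return mean + sign * std * rng.uniform(3.5, 5.0)


rng = random.Random(42)
column = [10.0, 12.0, 11.0, 13.0, 9.0, 10.5]
outlier_value = three_sigma_outlier(column, rng)
distance = abs(outlier_value - statistics.mean(column))
print(distance > 3 * statistics.pstdev(column))  # True
```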

3. Duplicated (`duplicated.py`)

Duplicates existing rows and optionally applies text transformations. Specifics: Set "function" in the main strategy to apply text transformations like "shuffle_words", "abbreviate_text", "replace_punctuation", "remove_replace", or "upper_lower".

4. Noisy (`noisy.py`)

Adds random noise or a systematic shift to data (numeric, string, or datetime). Specifics: In perturbate_data, set "distribution": "shift" to apply systematic shifting. You must provide a "param" dictionary:

"shift_value": Numeric value to add (or days for dates).
"shift_unit": "absolute" or "std" (standard deviations).
"shift_sign": "positive", "negative", or "random".

(Use "distribution": "random" for standard uniform noise).
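
The three shift parameters can be read as "how much", "in which unit", and "in which direction". The sketch below shows one plausible interpretation on a plain list of numbers. The helper apply_shift is hypothetical, and two details are assumptions: "std" is taken to mean multiples of the population standard deviation, and "random" picks a single direction for the whole shift.

```python
import random
import statistics


def apply_shift(values, param, rng=random):
    """Shift every value by the amount described in a "param" dictionary."""
    amount = param["shift_value"]
    if param["shift_unit"] == "std":
        # Interpret shift_value as a number of standard deviations.
        amount *= statistics.pstdev(values)
    sign = {"positive": 1, "negative": -1}.get(param["shift_sign"])
    if sign is None:
        # "random": assumption, one direction is drawn for the whole shift.
        sign = rng.choice([-1, 1])
    return [v + sign * amount for v in values]


absolute = apply_shift(
    [1.0, 2.0, 3.0],
    {"shift_value": 2, "shift_unit": "absolute", "shift_sign": "negative"},
)
in_std = apply_shift(
    [0.0, 2.0],  # population standard deviation is 1.0
    {"shift_value": 3, "shift_unit": "std", "shift_sign": "positive"},
)
print(absolute)  # [-1.0, 0.0, 1.0]
print(in_std)    # [3.0, 5.0]
```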

5. Labels (`labels.py`)

Flips labels for binary or multi-class classification. Specifics: For multi-class labels in perturbate_data, set "noise_model" to:

"NCAR" (Noise Completely At Random): Uniform random flip.
"NAR" (Noise At Random): Class-dependent flip. Provide "flip_distribution" in param.
"NNAR" (Nearest Neighbor At Random): Flips labels of instances close to decision boundaries. Provide "features_for_similarity" in param.

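The difference between the first two noise models can be sketched in a few lines. This is an illustration only: the helpers flip_ncar and flip_nar are hypothetical, and the nested-dictionary layout used for flip_distribution (true class mapped to target-class weights) is an assumption, since the exact format expected by Pucktrick is not described here.

```python
import random


def flip_ncar(label, classes, rng):
    """NCAR: replace the label with a different class chosen uniformly."""
    return rng.choice([c for c in classes if c != label])


def flip_nar(label, flip_distribution, rng):
    """NAR: draw the new label from weights that depend on the true class."""
    targets = flip_distribution[label]
    return rng.choices(list(targets), weights=list(targets.values()), k=1)[0]


rng = random.Random(0)
classes = ["cat", "dog", "bird"]

# Assumed layout: {true_class: {new_class: weight}}.
flip_distribution = {"cat": {"dog": 1.0, "bird": 0.0}}

ncar_label = flip_ncar("cat", classes, rng)          # "dog" or "bird"
nar_label = flip_nar("cat", flip_distribution, rng)  # always "dog" here
print(ncar_label, nar_label)
```
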
Drift Simulation

Pucktrick supports the simulation of dataset drift modelled as temporal corruption policies applied to dataset segments. Drift is exposed through a single unified function drift() that follows the same calling convention as all other modules:

from pucktrick.drift import drift

strategy = {
    "affected_features": ["f1", "f2"],
    "selection_criteria": "all",
    "percentage": 0.35,          # fraction of rows affected per chunk
    "mode": "new",
    "perturbate_data": {
        "sampling": "random",
        "target_col": "target",  # omit for auto-detection
        "chunks": {
            "0": None,           # baseline segment, no drift
            "1": None,
            "2": { ... },        # drift configuration for segment 2
            "3": { ... },
        }
    }
}

error, df_modified = drift(df, strategy)

Unlike other modules, the error return value is a dictionary:

{
    "errore": "yes",              # "yes" if any modification occurred, "no" otherwise
    "change_points": [100, 200, 300],   # row indices delimiting chunk boundaries
    "chunks": {
        "0": {"start": 0,   "end": 100, "drift_applied": False},
        "1": {"start": 100, "end": 200, "drift_applied": False},
        "2": {"start": 200, "end": 300, "drift_applied": True},
        "3": {"start": 300, "end": 400, "drift_applied": True},
    }
}

Note: sampling and distribution parameters inside perturbate_data are not used by the drift module — row selection within each chunk is always random and controlled exclusively by percentage.
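
The descriptor can be inspected directly to find which segments were altered. The sketch below works on the example dictionary shown above; the helper name drifted_segments is hypothetical.

```python
def drifted_segments(error):
    """List (chunk_id, start, end) for every chunk where drift was applied."""
    return [(chunk_id, info["start"], info["end"])
            for chunk_id, info in error["chunks"].items()
            if info["drift_applied"]]


# The example descriptor from this section.
error = {
    "errore": "yes",
    "change_points": [100, 200, 300],
    "chunks": {
        "0": {"start": 0,   "end": 100, "drift_applied": False},
        "1": {"start": 100, "end": 200, "drift_applied": False},
        "2": {"start": 200, "end": 300, "drift_applied": True},
        "3": {"start": 300, "end": 400, "drift_applied": True},
    },
}

print(drifted_segments(error))  # [('2', 200, 300), ('3', 300, 400)]
```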

Drift Types Summary

Drift type	Distribution affected	Module	`drift_type` value
Data drift (covariate noise)	$P(X)$ changes	`covariate_noise_drift`	`"covariate_noise"`
Data drift (offset)	$P(X)$ changes	`covariate_offset_drift`	`"covariate_offset"`
Concept drift (target offset)	$P(Y \mid X)$ changes	`offset_drift`	`"concept"`
Concept drift (feature rotation)	$P(Y \mid X)$ changes	`concept_drift`	`"concept_rotation"`
Label drift (prior shift)	$P(Y)$ changes	`prior_multinomial_drift`	`"prior_multinomial"`
Target scaling	$P(Y)$ changes	`target_scaling_drift`	`"target_scaling"`
Generic (all types)	configurable	`drift_generic`	any

6. Covariate Noise Drift (`covariate_noise_drift.py`)

Adds progressive Gaussian noise to selected features, simulating data drift where $P(X)$ shifts over time.

"2": {
    "drift_type": "covariate_noise",
    "features": ["temp", "humidity"],
    "noise_mode": "relative",
    "noise_std": 0.08,
    "shape": "segment"
}

"noise_mode": "relative" (noise proportional to feature std) or "absolute"
"noise_std": magnitude of Gaussian noise
"shape": "segment" (this chunk only) or "step" (persists in subsequent chunks)

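The two noise_mode values differ only in how the noise scale is derived. The sketch below illustrates this on a plain list, assuming that "relative" multiplies noise_std by the feature's population standard deviation; the helper add_gaussian_noise is hypothetical and is not the module's actual code.

```python
import random
import statistics


def add_gaussian_noise(values, noise_std, noise_mode, rng):
    """Add zero-mean Gaussian noise to every value."""
    scale = noise_std
    if noise_mode == "relative":
        # Noise proportional to the spread of the feature itself.
        scale *= statistics.pstdev(values)
    return [v + rng.gauss(0.0, scale) for v in values]


rng = random.Random(0)
temp = [20.0, 21.5, 19.0, 22.0]
noisy = add_gaussian_noise(temp, 0.08, "relative", rng)
print(noisy)

# A constant feature has zero spread, so relative noise leaves it unchanged.
constant = add_gaussian_noise([5.0, 5.0, 5.0], 0.08, "relative", rng)
print(constant)  # [5.0, 5.0, 5.0]
```
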
7. Covariate Offset Drift (`covariate_offset_drift.py`)

Applies a systematic directional offset to selected features, simulating sensor calibration drift.

"2": {
    "drift_type": "covariate_offset",
    "features": ["temp", "humidity"],
    "offset_mode": "relative",
    "offset_scale": 0.20,
    "direction": "up",
    "shape": "step"
}

"offset_mode": "relative" or "absolute"
"offset_scale": magnitude of the offset
"direction": "up", "down", or "random"
"shape": "segment" or "step"

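The "segment" and "step" shapes described for the covariate modules determine how long a drift stays in effect. The sketch below walks a "chunks" dictionary and reports which configurations are active in each chunk. It assumes that a "step" drift simply remains active alongside any later drift; the helper active_drifts is hypothetical, and the library may combine persisting drifts differently.

```python
def active_drifts(chunks):
    """For each chunk, list the chunk ids whose drift config is in effect."""
    active, persistent = {}, []
    for chunk_id in sorted(chunks, key=int):
        config = chunks[chunk_id]
        current = list(persistent)  # "step" drifts carried over
        if config is not None:
            current.append(chunk_id)
            if config.get("shape") == "step":
                persistent.append(chunk_id)  # keeps applying from now on
        active[chunk_id] = current
    return active


chunks = {
    "0": None,
    "1": {"drift_type": "covariate_offset", "shape": "step"},
    "2": {"drift_type": "covariate_noise", "shape": "segment"},
    "3": None,
}
print(active_drifts(chunks))
# {'0': [], '1': ['1'], '2': ['1', '2'], '3': ['1']}
```
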
8. Concept Drift — Target Offset (`offset_drift.py`)

Shifts the target variable using a percentage offset, simulating concept drift where $P(Y \mid X)$ changes.

"2": {
    "drift_type": "concept",
    "features": ["<TARGET>"],
    "offset_perc": 0.50,
    "offset_mode": "add",
    "base": "mean",
    "shape": "step",
    "direction": "up"
}

"offset_perc": fractional offset applied to the base value
"offset_mode": "add" or "mul" (multiplicative)
"base": "mean", "median", "std", or "quantile"
"shape": "step", "ramp", "spike", or "sin"

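One way to read offset_perc, offset_mode, and base together is sketched below. This interpretation is an assumption, not the module's documented formula: "add" adds offset_perc times the chosen base statistic, while "mul" multiplies each value by 1 + offset_perc. The helper offset_target is hypothetical, and the "quantile" base and the shape profiles are left out because their parameters are not described here.

```python
import statistics

BASES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.pstdev,
}


def offset_target(y, offset_perc, offset_mode, base="mean", direction="up"):
    """Shift a numeric target by a fraction of a base statistic."""
    sign = 1.0 if direction == "up" else -1.0
    if offset_mode == "mul":
        return [v * (1.0 + sign * offset_perc) for v in y]
    base_value = BASES[base](y)
    return [v + sign * offset_perc * base_value for v in y]


y = [10.0, 20.0, 30.0]
added = offset_target(y, 0.50, "add", base="mean")  # mean is 20.0
scaled = offset_target(y, 0.50, "mul")
print(added)   # [20.0, 30.0, 40.0]
print(scaled)  # [15.0, 30.0, 45.0]
```
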
9. Concept Drift — Feature Rotation (`concept_drift.py`)

Permutes or cycles feature values across instances, breaking the feature-label relationship without altering marginal distributions.

"2": {
    "drift_type": "concept_rotation",
    "severity": 0.65,
    "rotation_mode": "cycle",
    "shape": "step"
}

"severity": fraction of features involved (0.0–1.0)
"rotation_mode": "cycle" or "permute"

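The key property of rotation is that a feature keeps exactly the same set of values while their assignment to rows changes. The sketch below demonstrates this on a single column, assuming that "cycle" shifts values by one position and "permute" shuffles them; the helper rotate_feature is hypothetical.

```python
import random
from collections import Counter


def rotate_feature(values, rotation_mode, rng):
    """Reassign a feature's values to different rows."""
    if rotation_mode == "cycle":
        # Assumption: shift every value down by one row, wrapping around.
        return values[-1:] + values[:-1]
    if rotation_mode == "permute":
        shuffled = list(values)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown rotation_mode: {rotation_mode}")


rng = random.Random(7)
feature = [1, 2, 3, 4]
cycled = rotate_feature(feature, "cycle", rng)
permuted = rotate_feature(feature, "permute", rng)

print(cycled)                                # [4, 1, 2, 3]
print(Counter(permuted) == Counter(feature))  # True: marginal unchanged
```
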
10. Label Drift — Prior Multinomial (`prior_multinomial_drift.py`)

Resamples the class distribution according to a user-specified probability vector, simulating prior probability shift $P(Y)$.

"2": {
    "drift_type": "prior_multinomial",
    "features": ["<TARGET>"],
    "bins": 3,
    "class_probs_list": [0.05, 0.15, 0.80],
    "temperature": 0.6
}

"bins": number of bins for numeric columns
"class_probs_list": probability vector for each bin/class
"temperature": sharpens (< 1.0) or flattens (> 1.0) the distribution

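A common way to implement a temperature is to raise each probability to the power 1/temperature and renormalize, which matches the behaviour described above (values below 1.0 sharpen, values above 1.0 flatten). Whether Pucktrick uses exactly this formula is an assumption; the helper apply_temperature is hypothetical.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), then renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


class_probs = [0.05, 0.15, 0.80]
sharper = apply_temperature(class_probs, 0.6)  # dominant class gains mass
flatter = apply_temperature(class_probs, 2.0)  # dominant class loses mass

print(max(sharper) > 0.80)  # True
print(max(flatter) < 0.80)  # True
```
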
11. Target Scaling (`target_scaling_drift.py`)

Applies a multiplicative scaling factor to the numeric target variable.

"2": {
    "drift_type": "target_scaling",
    "scale_perc": 0.10,
    "shape": "segment"
}

"scale_perc": fractional increase (e.g., 0.10 multiplies target by 1.10)
"scale_factor": direct multiplicative factor (alternative to scale_perc)

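The relationship between the two parameters is simple arithmetic: a scale_perc of 0.10 corresponds to a scale_factor of 1.10. The sketch below shows it on a plain list. The helper scale_target is hypothetical, and letting scale_factor take precedence when both are given is an assumption.

```python
def scale_target(values, scale_perc=None, scale_factor=None):
    """Multiply a numeric target by scale_factor, or by 1 + scale_perc."""
    if scale_factor is None:
        if scale_perc is None:
            raise ValueError("provide scale_perc or scale_factor")
        scale_factor = 1.0 + scale_perc
    return [v * scale_factor for v in values]


target = [100.0, 200.0]
by_perc = scale_target(target, scale_perc=0.10)
by_factor = scale_target(target, scale_factor=1.10)
print(by_perc == by_factor)  # True
```
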
12. Generic Drift (`drift_generic.py`)

A unified module supporting all drift types above plus additional specialized types (conditional, offset_time, seasonal_shift, prior_bool, concept_ord_shift, and others).

Version

version 1.0.1

Unified drift interface: drift() now follows the same calling convention as all other error modules (error, df_modified = drift(df, strategy)).
The error return value for drift is now a dictionary containing errore (yes/no), change_points, and per-chunk metadata (start, end, drift_applied).
Removed debug print statements from drift.py and drift_generic.py.

version 1.0.0

Added drift simulation modules: covariate_noise_drift, covariate_offset_drift, offset_drift, concept_drift, prior_multinomial_drift, target_scaling_drift, drift_generic. All modules support both strategy_path (JSON file) and strategy (Python dict) as input.
Added unified drift.py wrapper exposing a drift() function compatible with the PuckTrick strategy interface.
All drift modules integrated and tested with synthetic datasets.

version 0.6.1.1

Added composed mode to all modules.
Added _is_row_modified method to BaseErrorInjector for row-level modification tracking.
Fixed _get_modifiable_mask in MissingErrorInjector and LabelErrorInjector.
Fixed type normalization in NAR label flip for integer target columns.

version 0.6.0.1

Codebase fully refactored using Object-Oriented Programming with the Template Method Pattern.
Added systematic shift ("distribution": "shift") to the noisy module.
Standardized the strategy interface and improved extended mode logic across all modules.

version 0.5.1

Added multi-class label definition.

version 0.5

Added strategy JSON configuration.

version 0.4

Error type added: missing values.

version 0.3

Error type added: duplicated.

version 0.2

Error type added: outliers.

version 0.1

Error types added: noisy errors and inconsistent labels.

Installation

pip install pucktrick

Contributing

Fork the repository
Create a new branch (git checkout -b feature/your-feature)
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/your-feature)
Create a new Pull Request

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) — see the LICENSE file for details.

Acknowledgements

Thanks to the contributors and open-source community for their support.

Release history

1.0.1

May 27, 2026

1.0.0

May 27, 2026

0.6.0.2

May 3, 2026

0.6.0.1

Apr 16, 2026

0.5.1.2

Aug 25, 2025

0.5.1.1

Aug 25, 2025

0.5.1

Aug 22, 2025

0.5.0

Aug 21, 2025

0.4.2.1

Aug 24, 2024

0.4.2

Aug 24, 2024

0.4.1

Aug 23, 2024

0.4

Aug 14, 2024

0.3

Aug 12, 2024

0.2

Aug 6, 2024

0.1.0.1

Aug 2, 2024

0.1

Aug 2, 2024

0.0.0

Aug 22, 2025

Download files

Source Distribution

pucktrick-1.0.1.tar.gz (61.7 kB view details)

Uploaded May 27, 2026 Source

Built Distribution

pucktrick-1.0.1-py3-none-any.whl (60.7 kB view details)

Uploaded May 27, 2026 Python 3

File details

Details for the file pucktrick-1.0.1.tar.gz.

File metadata

Download URL: pucktrick-1.0.1.tar.gz
Upload date: May 27, 2026
Size: 61.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pucktrick-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`55fff4b4b949c73568602be47b190eebc5a7184a7b07563224ee1a5767eb87dd`
MD5	`9acf83583f82e1103eebc540806f808d`
BLAKE2b-256	`9be28d075d6da7a7463edfc6d14d41577cf21c23625c2736627b3082f99f400e`

File details

Details for the file pucktrick-1.0.1-py3-none-any.whl.

File metadata

Download URL: pucktrick-1.0.1-py3-none-any.whl
Upload date: May 27, 2026
Size: 60.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pucktrick-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4f76e5a728dcd616967d22217c6390ebfef9ce24a6a2e91f1cd351abb7bf962`
MD5	`52d8811200b64e5073f12309c2af5171`
BLAKE2b-256	`b9bb3bb64fd9821ab7beb9605cb5498be37b9136402dfae25d230398724f0bd0`
