Inject configurable data quality chaos into clean datasets to stress-test DQ frameworks.

chaos-engine

Inject configurable, reproducible data quality chaos into clean DataFrames — then prove your DQ framework catches it.

Why this exists

Every data team needs to stress-test its data quality framework (Great Expectations, Soda, dbt tests), but production data can't be used in dev, and hand-crafted bad data is tedious to build and impossible to reproduce.

chaos-engine solves this: give it a clean DataFrame and a YAML config, and it injects precisely the anomalies you want — nulls, duplicates, type mismatches, schema drift, late-arriving rows, statistical outliers — with a deterministic seed so CI is always reproducible.

The ChaosReport tells you exactly what was injected, so you can assert your DQ suite caught it:

from chaos_engine import ChaosEngine

# clean_df is your clean DataFrame; ge_suite is your existing Great Expectations suite
engine = ChaosEngine.from_yaml("chaos_config.yaml")
corrupted_df, report = engine.run(clean_df)

# Now prove Great Expectations detected what we injected
assert "email" in report.null_columns
assert ge_suite.validate(corrupted_df)["statistics"]["unsuccessful_expectations"] > 0

GE Detection Matrix

For each injector listed below, Great Expectations detected the injected anomaly:

| Injector    | Anomaly injected           | GE result |
|-------------|----------------------------|-----------|
| nulls       | 10% of email column nulled | DETECTED  |
| duplicates  | 5% duplicate customer IDs  | DETECTED  |
| outliers    | revenue values at 10σ      | DETECTED  |
| schema_drop | revenue column removed     | DETECTED  |

Run pytest tests/test_ge_suite.py -v to reproduce this matrix.
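
To get a feel for what each row of the matrix checks without installing Great Expectations, the four anomaly classes can be approximated with plain pandas. This is only a rough stand-in, not the GE suite from tests/test_ge_suite.py: the `detect` helper, the `customer_id` column name, the sample data, and the robust z-score threshold are all assumptions made for illustration.

```python
import pandas as pd

# Tiny hand-built frame standing in for a corrupted dataset (illustrative data only).
corrupted = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5, 6, 7, 8, 9, 10],
    "email": ["a@x.io", None, "c@x.io", "d@x.io", "e@x.io",
              "f@x.io", "g@x.io", "h@x.io", "i@x.io", "j@x.io"],
    "revenue": [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 5000.0],
})

def detect(df: pd.DataFrame, expected_columns: list[str]) -> dict[str, bool]:
    """Report which anomaly classes are visible in df (hypothetical helper)."""
    found = {"schema_drop": any(c not in df.columns for c in expected_columns)}
    found["nulls"] = "email" in df.columns and bool(df["email"].isna().any())
    found["duplicates"] = ("customer_id" in df.columns
                           and bool(df["customer_id"].duplicated().any()))
    if "revenue" in df.columns:
        # Robust z-score: median and MAD are not dragged around by the outlier itself.
        median = df["revenue"].median()
        mad = (df["revenue"] - median).abs().median()
        found["outliers"] = bool(((df["revenue"] - median).abs() > 10 * 1.4826 * mad).any())
    else:
        found["outliers"] = False
    return found

expected = ["customer_id", "email", "revenue"]
result = detect(corrupted, expected)
dropped = detect(corrupted.drop(columns=["revenue"]), expected)
```

A median-based score is used here because a single extreme value inflates the ordinary standard deviation enough to hide itself in a small sample.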


Installation

# Core library
pip install chaos-engine

# With Great Expectations support
pip install "chaos-engine[ge]"

# With Soda support
pip install "chaos-engine[soda]"

For local development:

git clone https://github.com/yourhandle/chaos-engine
cd chaos-engine
pip install -e ".[dev,ge]"
pytest

Quick start

import pandas as pd
from chaos_engine import ChaosEngine

# 1. Load a clean dataset
df = pd.read_csv("clean_customers.csv")

# 2. Create engine from YAML config
engine = ChaosEngine.from_yaml("examples/chaos_config.yaml")

# 3. Run — input is never mutated
corrupted_df, report = engine.run(df)

# 4. Inspect what changed
print(report.summary())
# ChaosReport (seed=42)
#   Total mutations : 87
#   Injectors used  : duplicates, late_arriving, nulls, outliers, schema_drift, type_mismatch
#   [nulls] Injected 10 nulls into 'email' via random
#   [nulls] Injected 10 nulls into 'phone' via random
#   [duplicates] Injected 6 near duplicate rows
#   ...

# 5. Save output
corrupted_df.to_csv("corrupted_customers.csv", index=False)
report_json = report.to_json()

Or programmatically without YAML:

engine = ChaosEngine(seed=42, injectors={
    "nulls":      {"enabled": True, "rate": 0.05, "columns": ["email"]},
    "duplicates": {"enabled": True, "rate": 0.03, "mode": "near"},
    "outliers":   {"enabled": True, "columns": ["revenue"], "sigma": 6},
})
corrupted_df, report = engine.run(df)

CLI

# Run chaos injection from the command line
chaos-engine run examples/chaos_config.yaml clean_customers.csv \
    --output corrupted.parquet --format parquet \
    --report chaos_report.json

# Show which injectors are enabled in a config
chaos-engine inspect examples/chaos_config.yaml

Injectors

nulls — random or pattern-based nulls

nulls:
  enabled: true
  rate: 0.05          # fraction of rows to null per column
  columns: [email, phone]
  strategy: random    # or: pattern
  # pattern: "^test_"
  # pattern_column: email
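
The two strategies can be sketched in a few lines of pandas. This is not chaos-engine's implementation: `inject_nulls` is a hypothetical helper, and it assumes that `rate` applies to the eligible rows (all rows for `random`, only rows matching the regex for `pattern`) and that each column gets its own draw.

```python
import numpy as np
import pandas as pd

def inject_nulls(df, columns, rate, rng, pattern=None, pattern_column=None):
    """Return a copy of df with nulls injected; the input frame is left untouched."""
    out = df.copy()
    if pattern is None:
        eligible = np.arange(len(out))  # random: every row is a candidate
    else:
        matches = out[pattern_column].astype(str).str.contains(pattern, regex=True)
        eligible = np.flatnonzero(matches.to_numpy(dtype=bool))  # pattern: matching rows only
    n = int(round(rate * len(eligible)))
    for col in columns:
        rows = rng.choice(eligible, size=n, replace=False)  # independent draw per column
        out[col] = out[col].astype(object)
        out.iloc[rows, out.columns.get_loc(col)] = None
    return out

clean = pd.DataFrame({
    "email": [f"test_{i}@example.com" if i < 5 else f"user{i}@example.com" for i in range(20)],
    "phone": [f"555-01{i:02d}" for i in range(20)],
})

random_out = inject_nulls(clean, ["email", "phone"], rate=0.10,
                          rng=np.random.default_rng(42))
pattern_out = inject_nulls(clean, ["phone"], rate=1.0, rng=np.random.default_rng(42),
                           pattern="^test_", pattern_column="email")
```

With 20 rows and a rate of 0.10, each listed column receives two nulls; the pattern run nulls `phone` only where `email` starts with `test_`.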

duplicates — exact and near-duplicate rows

duplicates:
  enabled: true
  rate: 0.03
  mode: near          # exact | near (near fuzzes one field slightly)
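
As a rough illustration of the difference between the two modes, the sketch below appends copies of sampled rows and, in `near` mode, fuzzes one field. The helper name, the `fuzz_column` parameter, and the choice of a trailing space as the fuzz are assumptions; the library may fuzz differently.

```python
import numpy as np
import pandas as pd

def inject_duplicates(df, rate, rng, mode="exact", fuzz_column=None):
    """Append duplicated rows to a copy of df; return (new_df, appended_row_positions)."""
    n = int(round(rate * len(df)))
    picked = rng.choice(len(df), size=n, replace=False)
    dupes = df.iloc[picked].copy()
    if mode == "near":
        # Assumed fuzz: a trailing space, invisible to the eye but fatal to exact matching.
        dupes[fuzz_column] = dupes[fuzz_column] + " "
    out = pd.concat([df, dupes], ignore_index=True)
    return out, list(range(len(df), len(out)))

clean = pd.DataFrame({
    "customer_id": range(100),
    "name": [f"Customer {i}" for i in range(100)],
})

exact_df, exact_idx = inject_duplicates(clean, 0.03, np.random.default_rng(42))
near_df, near_idx = inject_duplicates(clean, 0.03, np.random.default_rng(42),
                                      mode="near", fuzz_column="name")

print(exact_df.duplicated().sum())                 # full-row duplicates in exact mode
print(near_df.duplicated().sum())                  # near duplicates evade a full-row check
print(near_df["customer_id"].duplicated().sum())   # but a key-level check still sees them
```

Near duplicates are the more useful stress test: a DQ rule that only checks full-row equality passes them, while a uniqueness rule on the key column does not.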

type_mismatch — wrong types in typed columns

type_mismatch:
  enabled: true
  rate: 0.04
  columns:
    age: string        # inject words into int column
    revenue: negative  # inject negative numbers
    status: boolean    # inject "yes"/"no" into a categorical

Supported targets: string, boolean, negative, future_date, empty_string.
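
One way to picture the targets is as a lookup from target name to a replacement function. The sketch below follows that idea; the concrete replacement values (`"not_a_number"`, `"yes"`, the year 2999) and the `inject_type_mismatch` helper are placeholders chosen for illustration, not the values the library injects.

```python
import numpy as np
import pandas as pd

# Assumed replacement per target; the library's actual values may differ.
REPLACEMENTS = {
    "string": lambda v: "not_a_number",
    "boolean": lambda v: "yes",
    "negative": lambda v: -abs(v) if v != 0 else -1,
    "future_date": lambda v: pd.Timestamp("2999-01-01"),
    "empty_string": lambda v: "",
}

def inject_type_mismatch(df, columns, rate, rng):
    """columns maps column name -> target name, mirroring the YAML above."""
    out = df.copy()
    n = int(round(rate * len(out)))
    for col, target in columns.items():
        rows = rng.choice(len(out), size=n, replace=False)
        out[col] = out[col].astype(object)  # object dtype lets one column hold mixed types
        j = out.columns.get_loc(col)
        for i in rows:
            out.iloc[i, j] = REPLACEMENTS[target](out.iloc[i, j])
    return out

clean = pd.DataFrame({
    "age": list(range(20, 70)),
    "revenue": np.linspace(10.0, 500.0, 50),
})
corrupted = inject_type_mismatch(clean, {"age": "string", "revenue": "negative"},
                                 rate=0.04, rng=np.random.default_rng(42))

# Rows that no longer parse as numbers are the injected type mismatches.
bad_age_rows = pd.to_numeric(corrupted["age"], errors="coerce").isna()
```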

outliers — statistical anomalies

outliers:
  enabled: true
  rate: 0.02
  columns: [revenue, quantity]
  sigma: 6             # standard deviations beyond the mean
  mode: both           # high | low | both
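
Read literally, `sigma: 6` means each injected value sits six standard deviations from the column mean. A minimal sketch under that reading follows; `inject_outliers` is a hypothetical helper, and it assumes the mean and standard deviation are taken from the clean column and that `both` picks the direction at random per row.

```python
import numpy as np
import pandas as pd

def inject_outliers(df, columns, rate, sigma, rng, mode="both"):
    out = df.copy()
    n = int(round(rate * len(out)))
    for col in columns:
        mean, std = out[col].mean(), out[col].std()  # statistics of the clean column
        rows = rng.choice(len(out), size=n, replace=False)
        if mode == "high":
            signs = np.ones(n)
        elif mode == "low":
            signs = -np.ones(n)
        else:
            signs = rng.choice([-1.0, 1.0], size=n)  # both: random direction per row
        out[col] = out[col].astype(float)
        out.iloc[rows, out.columns.get_loc(col)] = mean + signs * sigma * std
    return out

clean = pd.DataFrame({"revenue": np.random.default_rng(0).normal(100, 15, size=200)})
corrupted = inject_outliers(clean, ["revenue"], rate=0.02, sigma=6,
                            rng=np.random.default_rng(42))

# z-scores measured against the clean column: injected rows land at exactly +/- sigma.
z = (corrupted["revenue"] - clean["revenue"].mean()) / clean["revenue"].std()
n_injected = int(np.isclose(z.abs(), 6).sum())
```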

late_arriving — shifted timestamps

late_arriving:
  enabled: true
  rate: 0.02
  columns: [created_at]
  max_delay_days: 14
  direction: past      # past | future
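
The effect of this config can be sketched as shifting a sample of timestamps by a random whole number of days. The helper below is illustrative only and assumes the delay is drawn uniformly from 1 to `max_delay_days` inclusive, which the README does not specify.

```python
import numpy as np
import pandas as pd

def inject_late_arriving(df, columns, rate, max_delay_days, rng, direction="past"):
    out = df.copy()
    n = int(round(rate * len(out)))
    sign = -1 if direction == "past" else 1
    for col in columns:
        rows = rng.choice(len(out), size=n, replace=False)
        days = rng.integers(1, max_delay_days + 1, size=n)  # 1..max_delay_days inclusive
        stamps = out[col].to_numpy().copy()                 # datetime64 array, writable copy
        stamps[rows] = stamps[rows] + sign * days.astype("timedelta64[D]")
        out[col] = stamps
    return out

clean = pd.DataFrame({"created_at": pd.date_range("2024-01-01", periods=100, freq="D")})
late = inject_late_arriving(clean, ["created_at"], rate=0.02, max_delay_days=14,
                            rng=np.random.default_rng(42))

# Positive values are rows whose timestamp moved into the past.
shift_days = (clean["created_at"] - late["created_at"]).dt.days
```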

schema_drift — structural changes

schema_drift:
  enabled: true
  rename: {customer_id: cust_id}   # break joins
  drop: [internal_flag]             # remove expected columns
  add: {mystery_column: null}       # add unexpected columns
  reorder: false                    # set to true to shuffle column order
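
Each of the four keys maps onto a one-line pandas operation. The sketch below shows one plausible ordering (drop, rename, add, then reorder); the `apply_schema_drift` helper and that ordering are assumptions, not the library's documented behavior.

```python
import numpy as np
import pandas as pd

def apply_schema_drift(df, rng, rename=None, drop=None, add=None, reorder=False):
    out = df.drop(columns=drop or [])            # drop returns a new frame
    out = out.rename(columns=rename or {})       # renamed keys break downstream joins
    for name, value in (add or {}).items():
        out[name] = value                        # YAML null arrives as None
    if reorder:
        out = out.iloc[:, rng.permutation(out.shape[1])]
    return out

clean = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.io", "b@x.io", "c@x.io"],
    "internal_flag": [True, False, True],
})

drifted = apply_schema_drift(
    clean, np.random.default_rng(42),
    rename={"customer_id": "cust_id"},
    drop=["internal_flag"],
    add={"mystery_column": None},
)
shuffled = apply_schema_drift(clean, np.random.default_rng(42), reorder=True)
```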

ChaosReport API

corrupted_df, report = engine.run(df)

report.total_mutations         # int — total cells/rows affected
report.injector_names          # ['duplicates', 'nulls', 'outliers', ...]
report.null_columns            # ['email', 'phone']
report.duplicate_row_indices   # [200, 201, 202, ...]
report.schema_changes          # [{'rename_map': {...}}, ...]

report.by_injector("nulls")    # list[InjectionRecord]
report.summary()               # human-readable string
report.to_json()               # JSON string
report.to_dataframe()          # tidy pandas DataFrame
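
The convenience attributes above can be thought of as views over a flat list of injection records. The sketch below uses a stand-in `Record` dataclass with the four fields that appear in the custom injector example (`injector`, `description`, `affected_rows`, `affected_columns`); how ChaosReport actually derives its attributes is not documented here, so treat the derivations as assumptions.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Record:
    """Hypothetical stand-in for InjectionRecord."""
    injector: str
    description: str
    affected_rows: list[int] = field(default_factory=list)
    affected_columns: list[str] = field(default_factory=list)

records = [
    Record("nulls", "Injected 2 nulls into 'email' via random", [3, 17], ["email"]),
    Record("nulls", "Injected 2 nulls into 'phone' via random", [5, 9], ["phone"]),
    Record("duplicates", "Injected 1 near duplicate rows", [20], []),
]

def by_injector(name: str) -> list[Record]:
    return [r for r in records if r.injector == name]

# Assumed derivations: columns touched by the nulls injector, and one mutation per affected row.
null_columns = sorted({c for r in by_injector("nulls") for c in r.affected_columns})
total_mutations = sum(len(r.affected_rows) for r in records)
as_json = json.dumps([asdict(r) for r in records], indent=2)
```

Keeping the records flat like this makes it easy to write targeted assertions, for example that every column in `null_columns` is flagged by a not-null check in your DQ suite.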

Custom injectors

Register your own injector with a decorator:

from chaos_engine import ChaosEngine, BaseInjector, InjectionRecord
import pandas as pd
import numpy as np

@ChaosEngine.register("encoding_chaos")
class EncodingInjector(BaseInjector):
    name = "encoding_chaos"

    def inject(self, df, rng, config):
        col = config.get("column", "name")
        rows = self._sample_rows(df, config.get("rate", 0.02), rng)
        df[col] = df[col].astype(object)
        for i in rows:
            df.at[df.index[i], col] = "Ren\u00e9e M\u00fcller \u4e2d\u6587"  # unicode chaos
        record = InjectionRecord(
            injector=self.name,
            description=f"Injected encoding chaos into '{col}'",
            affected_rows=rows.tolist(),
            affected_columns=[col],
        )
        return df, [record]

# Now use it in a config
engine = ChaosEngine(seed=42, injectors={
    "encoding_chaos": {"enabled": True, "column": "name", "rate": 0.03},
})
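
For readers unfamiliar with decorator-based registration, the pattern behind `@ChaosEngine.register(...)` can be sketched with a toy class. `MiniEngine` below is purely illustrative and is not how chaos-engine is implemented; in particular, the duplicate-name and unknown-name checks are assumptions.

```python
class MiniEngine:
    """Toy engine showing the class-decorator registry pattern."""
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name):
        def decorator(injector_cls):
            if name in cls._registry:
                raise ValueError(f"injector {name!r} already registered")
            cls._registry[name] = injector_cls
            return injector_cls  # return the class unchanged so it stays usable directly
        return decorator

    def __init__(self, injectors):
        unknown = set(injectors) - set(self._registry)
        if unknown:
            raise KeyError(f"unknown injectors: {sorted(unknown)}")
        # Keep only the injectors switched on in the config.
        self.enabled = [n for n, cfg in injectors.items() if cfg.get("enabled", False)]

@MiniEngine.register("encoding_chaos")
class EncodingInjector:
    name = "encoding_chaos"

engine = MiniEngine({"encoding_chaos": {"enabled": True, "column": "name", "rate": 0.03}})
print(engine.enabled)
```

Because registration happens when the decorated class is defined, the module containing a custom injector has to be imported before an engine that references it is constructed.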

Running tests

# All tests
pytest

# Unit tests only (no GE dependency)
pytest tests/test_injectors.py -v

# GE integration tests + detection matrix
pytest tests/test_ge_suite.py -v -s

Architecture

ChaosEngine.run(df)
    │
    ├── NullInjector       → random / pattern nulls
    ├── DupeInjector       → exact / near duplicates
    ├── TypeInjector       → type mismatches
    ├── OutlierInjector    → statistical anomalies
    ├── LateInjector       → timestamp shifts
    └── SchemaInjector     → rename / drop / add columns (always last)
              │
              ▼
    ChaosReport            → audit trail of every mutation

Injectors run in a canonical order (schema drift last, since it renames columns others reference). The seeded numpy.random.Generator is threaded through every injector so the full pipeline is deterministic.
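
The two properties described above, fixed ordering and a single threaded generator, can be demonstrated with a toy pipeline. The step functions here are simplified stand-ins for the real injectors and the names are hypothetical; the point is only that one `numpy.random.Generator` consumed in a fixed order yields identical output for the same seed.

```python
import numpy as np
import pandas as pd

def null_step(df, rng):
    rows = rng.choice(len(df), size=2, replace=False)
    df.loc[df.index[rows], "email"] = None
    return df

def outlier_step(df, rng):
    row = rng.integers(len(df))
    df.loc[df.index[row], "revenue"] = df["revenue"].mean() + 6 * df["revenue"].std()
    return df

def rename_step(df, rng):
    # Must run last: the earlier steps still refer to the column as "revenue".
    return df.rename(columns={"revenue": "rev"})

PIPELINE = [null_step, outlier_step, rename_step]  # fixed order, schema change at the end

def run(df, seed):
    rng = np.random.default_rng(seed)  # one generator shared by every step
    out = df.copy()                    # the caller's frame is never mutated
    for step in PIPELINE:
        out = step(out, rng)
    return out

clean = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(10)],
    "revenue": [100.0, 98.0, 103.0, 101.0, 97.0, 99.0, 102.0, 100.0, 104.0, 96.0],
})
first, second = run(clean, seed=42), run(clean, seed=42)
```

Moving `rename_step` earlier in the list would make `outlier_step` fail with a KeyError, which is the reason for running schema changes last.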


Stack

  • Python 3.10+, pandas, numpy, PyArrow
  • Great Expectations v1.x (integration tests)
  • pydantic, PyYAML, click, rich
  • pytest + pytest-cov (CI)

License

MIT
