Inject configurable data quality chaos into clean datasets to stress-test DQ frameworks.
chaos-engine
Inject configurable, reproducible data quality chaos into clean DataFrames — then prove your DQ framework catches it.
Why this exists
Every data team needs to stress-test its data quality framework (Great Expectations, Soda, dbt tests), but production data is usually off-limits in dev, and hand-crafting bad data is tedious and non-reproducible.
chaos-engine solves this: give it a clean DataFrame and a YAML config, and it injects precisely the anomalies you want — nulls, duplicates, type mismatches, schema drift, late-arriving rows, statistical outliers — with a deterministic seed so CI is always reproducible.
The ChaosReport tells you exactly what was injected, so you can assert your DQ suite caught it:
engine = ChaosEngine.from_yaml("chaos_config.yaml")
corrupted_df, report = engine.run(clean_df)
# Now prove Great Expectations detected what we injected
assert "email" in report.null_columns
assert ge_suite.validate(corrupted_df)["statistics"]["unsuccessful_expectations"] > 0
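To make the reproducibility claim concrete, here is a minimal sketch of seeded null injection, written independently of chaos-engine's internals. The helper `inject_nulls` and its parameters are illustrative assumptions, not the library's API; the sketch only shows why a fixed seed produces identical corruption on every run.

```python
import numpy as np
import pandas as pd


def inject_nulls(df: pd.DataFrame, column: str, rate: float, seed: int) -> pd.DataFrame:
    """Hypothetical helper: null out roughly `rate` of `column` in a copy of df."""
    rng = np.random.default_rng(seed)      # same seed -> same random stream
    out = df.copy()                        # leave the caller's frame untouched
    n = int(round(len(out) * rate))
    positions = rng.choice(len(out), size=n, replace=False)
    out[column] = out[column].astype(object)
    out.loc[out.index[positions], column] = None
    return out


clean = pd.DataFrame({"email": [f"user{i}@example.com" for i in range(100)]})
run_a = inject_nulls(clean, "email", rate=0.10, seed=42)
run_b = inject_nulls(clean, "email", rate=0.10, seed=42)

# Both runs null exactly the same rows, so a CI assertion on them is stable.
same_rows = run_a["email"].isna().equals(run_b["email"].isna())
```

Because the generator is created from the seed inside the call, two runs with the same inputs pick the same rows, which is the property a CI job needs.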
GE Detection Matrix
Great Expectations detected the injected anomaly in each of the following cases:
| Injector | Anomaly injected | GE result |
|---|---|---|
| nulls | 10% of email column nulled | DETECTED |
| duplicates | 5% duplicate customer IDs | DETECTED |
| outliers | revenue values at 10σ | DETECTED |
| schema_drop | revenue column removed | DETECTED |
Run pytest tests/test_ge_suite.py -v to reproduce this matrix.
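Each row of the matrix is an inject-then-detect pair. The sketch below illustrates that idea with plain pandas checks as a stand-in for a Great Expectations suite; the `detect` function, the `EXPECTED_COLUMNS` schema, and the hand-made corruptions are all assumptions for illustration, and it covers three of the four rows (outliers are omitted).

```python
import pandas as pd

EXPECTED_COLUMNS = {"customer_id", "email", "revenue"}  # assumed schema


def detect(df: pd.DataFrame) -> dict[str, bool]:
    """Stand-in for a DQ suite: report which anomaly classes a few checks flag."""
    findings = {
        "schema_drop": not EXPECTED_COLUMNS.issubset(df.columns),
        "nulls": False,
        "duplicates": False,
    }
    if "email" in df.columns:
        findings["nulls"] = bool(df["email"].isna().any())
    if "customer_id" in df.columns:
        findings["duplicates"] = bool(df["customer_id"].duplicated().any())
    return findings


clean = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@x.io", "b@x.io", "c@x.io", "d@x.io"],
    "revenue": [10.0, 12.0, 11.0, 9.0],
})

# Hand-made corruption mimicking three rows of the matrix.
with_null = clean.assign(email=[None, "b@x.io", "c@x.io", "d@x.io"])
with_dupe = pd.concat([clean, clean.iloc[[0]]], ignore_index=True)
without_revenue = clean.drop(columns=["revenue"])
```

A useful sanity check is that the clean frame triggers no findings at all; otherwise a "DETECTED" result says nothing about the injected anomaly.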
Installation
# Core library
pip install chaos-engine
# With Great Expectations support
pip install "chaos-engine[ge]"
# With Soda support
pip install "chaos-engine[soda]"
For local development:
git clone https://github.com/yourhandle/chaos-engine
cd chaos-engine
pip install -e ".[dev,ge]"
pytest
Quick start
import pandas as pd
from chaos_engine import ChaosEngine
# 1. Load a clean dataset
df = pd.read_csv("clean_customers.csv")
# 2. Create engine from YAML config
engine = ChaosEngine.from_yaml("examples/chaos_config.yaml")
# 3. Run — input is never mutated
corrupted_df, report = engine.run(df)
# 4. Inspect what changed
print(report.summary())
# ChaosReport (seed=42)
# Total mutations : 87
# Injectors used : duplicates, late_arriving, nulls, outliers, schema_drift, type_mismatch
# [nulls] Injected 10 nulls into 'email' via random
# [nulls] Injected 10 nulls into 'phone' via random
# [duplicates] Injected 6 near duplicate rows
# ...
# 5. Save output
corrupted_df.to_csv("corrupted_customers.csv", index=False)
report_json = report.to_json()
Or programmatically without YAML:
engine = ChaosEngine(seed=42, injectors={
    "nulls": {"enabled": True, "rate": 0.05, "columns": ["email"]},
    "duplicates": {"enabled": True, "rate": 0.03, "mode": "near"},
    "outliers": {"enabled": True, "columns": ["revenue"], "sigma": 6},
})
corrupted_df, report = engine.run(df)
CLI
# Run chaos injection from the command line
chaos-engine run examples/chaos_config.yaml clean_customers.csv \
  --output corrupted.parquet --format parquet \
  --report chaos_report.json
# Show which injectors are enabled in a config
chaos-engine inspect examples/chaos_config.yaml
Injectors
nulls — random or pattern-based nulls
nulls:
  enabled: true
  rate: 0.05          # fraction of rows to null per column
  columns: [email, phone]
  strategy: random    # or: pattern
  # pattern: "^test_"
  # pattern_column: email
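The `pattern` strategy is only hinted at in the commented-out keys. One plausible reading, sketched below, is that rows whose `pattern_column` value matches the regex have the target columns nulled. This interpretation and the helper `null_by_pattern` are assumptions, not documented behavior of the library.

```python
import pandas as pd


def null_by_pattern(df: pd.DataFrame, columns: list[str],
                    pattern: str, pattern_column: str) -> pd.DataFrame:
    """Hypothetical helper: null `columns` on rows where `pattern_column` matches."""
    out = df.copy()
    # Compute the mask first, so nulling pattern_column itself cannot change it.
    mask = out[pattern_column].astype(str).str.contains(pattern, regex=True)
    for col in columns:
        out.loc[mask, col] = None
    return out


df = pd.DataFrame({
    "email": ["test_a@x.io", "bob@x.io", "test_c@x.io"],
    "phone": ["555-0001", "555-0002", "555-0003"],
})
result = null_by_pattern(df, ["email", "phone"], pattern="^test_", pattern_column="email")
```

Unlike the random strategy, this selection needs no generator: the same input always nulls the same rows.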
duplicates — exact and near-duplicate rows
duplicates:
  enabled: true
  rate: 0.03
  mode: near    # exact | near (near fuzzes one field slightly)
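Near-duplicates matter because they slip past exact-match uniqueness checks. The following sketch shows one way such rows could be produced and why normalization is needed to catch them; the helper `add_near_duplicates` and its choice of fuzzing (upper-casing plus a trailing space) are assumptions, not the library's actual fuzzing logic.

```python
import numpy as np
import pandas as pd


def add_near_duplicates(df: pd.DataFrame, rate: float,
                        fuzz_column: str, seed: int) -> pd.DataFrame:
    """Hypothetical helper: append copies of sampled rows with one field altered."""
    rng = np.random.default_rng(seed)
    n = max(1, int(round(len(df) * rate)))
    picked = rng.choice(len(df), size=n, replace=False)
    dupes = df.iloc[picked].copy()
    # Fuzz a single field: change case and add trailing whitespace.
    dupes[fuzz_column] = dupes[fuzz_column].astype(str).str.upper() + " "
    return pd.concat([df, dupes], ignore_index=True)


clean = pd.DataFrame({
    "customer_id": range(100),
    "name": [f"customer {i}" for i in range(100)],
})
result = add_near_duplicates(clean, rate=0.03, fuzz_column="name", seed=42)

exact_hits = int(result.duplicated().sum())
normalized = result.assign(name=result["name"].str.strip().str.lower())
normalized_hits = int(normalized.duplicated().sum())
```

Here an exact row comparison finds nothing, while the same check after trimming and lower-casing finds every injected row.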
type_mismatch — wrong types in typed columns
type_mismatch:
  enabled: true
  rate: 0.04
  columns:
    age: string        # inject words into int column
    revenue: negative  # inject negative numbers
    status: boolean    # inject "yes"/"no" into a categorical
Supported targets: string, boolean, negative, future_date, empty_string.
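As a small illustration of the `string` target, the sketch below hand-injects a word into an integer column and then locates it with `pandas.to_numeric`. The sample values are made up, and the detection step is just one possible check, not something chaos-engine provides.

```python
import pandas as pd

ages = pd.Series([34, 51, 27, 45], name="age")

# Injecting a word forces the column to a generic object dtype.
corrupted = ages.astype(object)
corrupted.iloc[1] = "fifty-one"

# Values that cannot be parsed as numbers become NaN under errors="coerce".
bad_mask = pd.to_numeric(corrupted, errors="coerce").isna()
bad_values = corrupted[bad_mask].tolist()
```

The dtype change alone is a signal worth testing for, since a single bad cell silently turns an integer column into `object`.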
outliers — statistical anomalies
outliers:
  enabled: true
  rate: 0.02
  columns: [revenue, quantity]
  sigma: 6      # standard deviations beyond the mean
  mode: both    # high | low | both
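Reading `sigma` as "standard deviations beyond the mean", an injected value can be sketched as `mean ± sigma * std` of the clean column. The helper `make_outliers` below is a hypothetical illustration of that arithmetic, assuming the population standard deviation; the library may compute it differently.

```python
import numpy as np


def make_outliers(values: np.ndarray, n: int, sigma: float,
                  mode: str, seed: int) -> np.ndarray:
    """Hypothetical helper: build n values sitting sigma std-devs from the mean."""
    rng = np.random.default_rng(seed)
    mean, std = values.mean(), values.std()
    if mode == "high":
        signs = np.ones(n)
    elif mode == "low":
        signs = -np.ones(n)
    else:  # "both": pick a side at random for each value
        signs = rng.choice([-1.0, 1.0], size=n)
    return mean + signs * sigma * std


values = np.array([90.0, 100.0, 110.0, 100.0, 95.0, 105.0])
high = make_outliers(values, n=3, sigma=6, mode="high", seed=42)
low = make_outliers(values, n=3, sigma=6, mode="low", seed=42)
both = make_outliers(values, n=3, sigma=6, mode="both", seed=42)
```

Note that a detector which recomputes mean and standard deviation on the corrupted column will see z-scores smaller than `sigma`, because the injected values inflate the standard deviation.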
late_arriving — shifted timestamps
late_arriving:
  enabled: true
  rate: 0.02
  columns: [created_at]
  max_delay_days: 14
  direction: past    # past | future
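The timestamp shift can be sketched with the standard library alone. The function `shift_timestamp` is hypothetical and assumes whole-day delays drawn uniformly from 1 to `max_delay_days`; the library's actual distribution and granularity are not documented here.

```python
import random
from datetime import datetime, timedelta


def shift_timestamp(ts: datetime, max_delay_days: int,
                    direction: str, rng: random.Random) -> datetime:
    """Hypothetical helper: move ts by a random whole number of days."""
    delay = timedelta(days=rng.randint(1, max_delay_days))  # inclusive bounds
    return ts - delay if direction == "past" else ts + delay


original = datetime(2024, 1, 15, 12, 0)

rng = random.Random(42)
late = shift_timestamp(original, 14, "past", rng)
early = shift_timestamp(original, 14, "future", rng)

# A fresh generator with the same seed reproduces the first shift.
late_again = shift_timestamp(original, 14, "past", random.Random(42))
```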
schema_drift — structural changes
schema_drift:
  enabled: true
  rename: {customer_id: cust_id}    # break joins
  drop: [internal_flag]             # remove expected columns
  add: {mystery_column: null}       # add unexpected columns
  reorder: false                    # shuffle column order
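In pandas terms, the three structural changes map onto `rename`, `drop`, and column assignment. The sketch below applies them in that order and then diffs the result against an expected schema; `apply_schema_drift` and the order of operations are assumptions for illustration.

```python
import pandas as pd


def apply_schema_drift(df: pd.DataFrame, rename=None, drop=None, add=None) -> pd.DataFrame:
    """Hypothetical helper: rename, then drop, then add columns on a new frame."""
    out = df.rename(columns=rename or {})                 # returns a new frame
    out = out.drop(columns=drop or [], errors="ignore")   # skip names already gone
    for name, value in (add or {}).items():
        out[name] = value
    return out


clean = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["a@x.io", "b@x.io"],
    "internal_flag": [True, False],
})
drifted = apply_schema_drift(
    clean,
    rename={"customer_id": "cust_id"},
    drop=["internal_flag"],
    add={"mystery_column": None},
)

# A schema check reduces to two set differences.
expected = set(clean.columns)
missing = expected - set(drifted.columns)
unexpected = set(drifted.columns) - expected
```

Because renaming happens first in this sketch, a `drop` entry would have to use the new column name to remove a renamed column.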
ChaosReport API
corrupted_df, report = engine.run(df)
report.total_mutations # int — total cells/rows affected
report.injector_names # ['duplicates', 'nulls', 'outliers', ...]
report.null_columns # ['email', 'phone']
report.duplicate_row_indices # [200, 201, 202, ...]
report.schema_changes # [{'rename_map': {...}}, ...]
report.by_injector("nulls") # list[InjectionRecord]
report.summary() # human-readable string
report.to_json() # JSON string
report.to_dataframe() # tidy pandas DataFrame
Custom injectors
Register your own injector with a decorator:
from chaos_engine import ChaosEngine, BaseInjector, InjectionRecord
import pandas as pd
import numpy as np
@ChaosEngine.register("encoding_chaos")
class EncodingInjector(BaseInjector):
    name = "encoding_chaos"

    def inject(self, df, rng, config):
        col = config.get("column", "name")
        rows = self._sample_rows(df, config.get("rate", 0.02), rng)
        df[col] = df[col].astype(object)
        for i in rows:
            df.at[df.index[i], col] = "Ren\u00e9e M\u00fcller \u4e2d\u6587"  # unicode chaos
        record = InjectionRecord(
            injector=self.name,
            description=f"Injected encoding chaos into '{col}'",
            affected_rows=rows.tolist(),
            affected_columns=[col],
        )
        return df, [record]
# Now use it in a config
engine = ChaosEngine(seed=42, injectors={
    "encoding_chaos": {"enabled": True, "column": "name", "rate": 0.03},
})
Running tests
# All tests
pytest
# Unit tests only (no GE dependency)
pytest tests/test_injectors.py -v
# GE integration tests + detection matrix
pytest tests/test_ge_suite.py -v -s
Architecture
ChaosEngine.run(df)
│
├── NullInjector → random / pattern nulls
├── DupeInjector → exact / near duplicates
├── TypeInjector → type mismatches
├── OutlierInjector → statistical anomalies
├── LateInjector → timestamp shifts
└── SchemaInjector → rename / drop / add columns (always last)
│
▼
ChaosReport → audit trail of every mutation
Injectors run in a canonical order (schema drift last, since it renames columns others reference). The seeded numpy.random.Generator is threaded through every injector so the full pipeline is deterministic.
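The two ideas in that paragraph, a fixed execution order and a single shared generator, can be sketched in a few lines. This version uses the standard library's `random.Random` in place of `numpy.random.Generator` and plain dict rows in place of a DataFrame; `CANONICAL_ORDER` mirrors the diagram above, and every function name is hypothetical.

```python
import random

# Assumed order, taken from the architecture diagram.
CANONICAL_ORDER = ["nulls", "duplicates", "type_mismatch",
                   "outliers", "late_arriving", "schema_drift"]


def null_one_email(rows, rng):
    """Null the email of one randomly chosen row."""
    target = rng.randrange(len(rows))
    return [{**row, "email": None} if i == target else dict(row)
            for i, row in enumerate(rows)]


def rename_email(rows, rng):
    """Rename the 'email' key to 'mail' in every row."""
    return [{("mail" if key == "email" else key): value for key, value in row.items()}
            for row in rows]


STEPS = {"nulls": null_one_email, "schema_drift": rename_email}


def run_pipeline(rows, enabled, seed):
    rng = random.Random(seed)                            # one generator for all steps
    order = sorted(enabled, key=CANONICAL_ORDER.index)   # config order is ignored
    for name in order:
        rows = STEPS[name](rows, rng)
    return rows, order


clean = [{"id": i, "email": f"user{i}@example.com"} for i in range(5)]
first, order = run_pipeline(clean, ["schema_drift", "nulls"], seed=42)
second, _ = run_pipeline(clean, ["nulls", "schema_drift"], seed=42)
```

Sorting by canonical position means the order in which injectors appear in a config has no effect on the result, and sharing one generator means the seed alone fixes every random choice in the run.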
Stack
- Python 3.10+, pandas, numpy, PyArrow
- Great Expectations v1.x (integration tests)
- pydantic, PyYAML, click, rich
- pytest + pytest-cov (CI)
License
MIT
File details
Details for the file chaos_engine-0.1.0.tar.gz.
File metadata
- Download URL: chaos_engine-0.1.0.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 08b76ffd2a145eba0a2cf1bec327e6097d952577ad6524c6cd606d4310a8da94 |
| MD5 | d618984027c3ef82bb3826d68f387506 |
| BLAKE2b-256 | caa3ae532edee04f6d3a3709e6766f49f0d8d12e45c89bf097058c2b24bfb203 |
File details
Details for the file chaos_engine-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chaos_engine-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 621704c4f06f439435cfbe3e7d06bcbf41c3ccd3a9dfa7e9cb5db06ccfe2a66c |
| MD5 | ea49443bdb4424a56adf77ef3e173d47 |
| BLAKE2b-256 | 4f5b66543f07df58dc2d567c58bbec6e9241cbc3c24fa36858972e5f99831af9 |