AutoML anomaly detection and schema-driven data quality for ETL pipelines
adaptive_profiler
AutoML anomaly detection and schema-driven data quality checks for ETL pipelines.
Detects semantic anomalies that rule-based checks miss — stuck sensors, silent feeds, values that are numerically valid but statistically unusual. Each column gets its own model, trained on recent data, configured through a single YAML file.
Features
- Two-layer quality checking: rule-based data contract checks + ML-based anomaly detection, both declared in one YAML file
- AutoML via Optuna: searches over IForest, LOF, HBOS, COPOD, and ECOD; selects the best model per column automatically
- Per-source isolation: each (partition × column) pair trains its own model; `amsterdam` and `london` never share a model
- Configurable training window: train on the N most-recent rows rather than the full history to keep retraining fast
- Manual model override: pin a specific algorithm and hyperparameters per column to skip Optuna entirely
- Tunable flag threshold: override the binary flagging threshold per column on a validation split without retraining
- Built-in cost projection: `ScalingBenchmark` fits T(n,m,k) = α·n^β·m^δ·k^γ so you can predict overhead before committing to production
- Pipeline-safe output: failures are reported in the output DataFrame, never raised as exceptions
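The per-source isolation, training window, and pipeline-safe output described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the `models` registry, the `train` and `score` functions, the mean/standard-deviation stand-in model, and the `flag` and `error` field names are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical registry: one fitted model per (partition, column) pair.
models = {}


def train(partition_key, rows, columns, window=None):
    """Fit a mean/stdev stand-in model per column on the most recent rows."""
    recent = rows[-window:] if window else rows
    for col in columns:
        values = [r[col] for r in recent]
        models[(partition_key, col)] = (mean(values), stdev(values))


def score(partition_key, rows, columns, threshold=3.0):
    """Return long-format records; a failing column yields an error record."""
    out = []
    for col in columns:
        try:
            mu, sigma = models[(partition_key, col)]
            records = []
            for i, r in enumerate(rows):
                z = abs(r[col] - mu) / sigma
                records.append({"row": i, "column": col, "score": z,
                                "flag": int(z > threshold), "error": None})
            out.extend(records)
        except Exception as exc:  # e.g. missing model, zero variance, bad value
            out.append({"row": None, "column": col, "score": None,
                        "flag": None, "error": repr(exc)})
    return out


history = [{"temperature_2m": t} for t in [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]]
train("amsterdam", history, ["temperature_2m"], window=5)

new_rows = [{"temperature_2m": 10.2}, {"temperature_2m": 25.0}]
amsterdam_out = score("amsterdam", new_rows, ["temperature_2m"])
london_out = score("london", new_rows, ["temperature_2m"])  # no model trained
```

Because the registry key includes the partition, scoring `london` cannot fall back to the `amsterdam` model. The missing model shows up as a record with a populated `error` field rather than as an exception that would halt the pipeline.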
Install
```bash
pip install adaptive-profiler
pip install "adaptive-profiler[s3]"  # include boto3 for S3 storage
```
For development (from source):
```bash
git clone https://github.com/kooroshkz/adaptive-profiler
cd adaptive-profiler
pip install -e ".[dev]"
```
Quick start
```python
from adaptive_profiler import Profiler

# historical_df and new_df are pandas DataFrames that you supply.
profiler = Profiler.from_yaml("profiling_schema.yml")

# Train models: one per (city, column) pair
results = profiler.train(partition_key="amsterdam", df=historical_df)
for r in results:
    print(r)

# Score incoming data: returns a long-format DataFrame
predictions = profiler.score(partition_key="amsterdam", df=new_df)
print(predictions[predictions["automl_flag"] == 1])

# Rule-based checks only
violations = profiler.check_quality(df=new_df)
```
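The `automl_flag` column is a binary flag, and the feature list mentions that its threshold can be overridden on a validation split without retraining. One way such a threshold could be derived is sketched below with NumPy. The helper names `pick_threshold` and `flag`, the `flag_rate` parameter, and the quantile rule itself are assumptions for illustration; the source does not say how the library chooses the threshold.

```python
import numpy as np


def pick_threshold(validation_scores, flag_rate=0.05):
    """Threshold that flags roughly `flag_rate` of the validation scores."""
    return float(np.quantile(validation_scores, 1.0 - flag_rate))


def flag(scores, threshold):
    """Binary flags: 1 where the anomaly score exceeds the threshold."""
    return (np.asarray(scores) > threshold).astype(int)


validation_scores = np.arange(100, dtype=float)  # stand-in anomaly scores
threshold = pick_threshold(validation_scores, flag_rate=0.05)
flags = flag(validation_scores, threshold)
stricter = flag(validation_scores, pick_threshold(validation_scores, 0.01))
```

Only the cut-off changes between `flags` and `stricter`; the scores, and therefore the model that produced them, stay the same.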
Cost projection
```python
from adaptive_profiler import ScalingBenchmark

# df is a pandas DataFrame that you supply.
bench = ScalingBenchmark(df, columns=["temperature_2m", "pressure"])
bench.run(quick=True)  # ~1–2 min
bench.fit()
print(bench.report(target_n=100_000, m=6, k=25))
t = bench.predict(n=100_000, m=6, k=25)
```
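A power law of the form T(n,m,k) = α·n^β·m^δ·k^γ becomes linear after taking logarithms: log T = log α + β·log n + δ·log m + γ·log k. The sketch below fits that linear form by ordinary least squares with NumPy on synthetic, noiseless timings. It illustrates the general technique only. The functions `fit_power_law` and `predict` are hypothetical and are not the `ScalingBenchmark` internals, and n, m, k are treated as arbitrary positive workload sizes, since the source does not define them here.

```python
from itertools import product

import numpy as np


def fit_power_law(n, m, k, t):
    """Least-squares fit of log t = log(alpha) + beta*log n + delta*log m + gamma*log k."""
    n, m, k, t = (np.asarray(a, dtype=float) for a in (n, m, k, t))
    X = np.column_stack([np.ones_like(n), np.log(n), np.log(m), np.log(k)])
    coef, *_ = np.linalg.lstsq(X, np.log(t), rcond=None)
    log_alpha, beta, delta, gamma = coef
    return float(np.exp(log_alpha)), float(beta), float(delta), float(gamma)


def predict(params, n, m, k):
    """Evaluate alpha * n^beta * m^delta * k^gamma."""
    alpha, beta, delta, gamma = params
    return alpha * n**beta * m**delta * k**gamma


# Synthetic measurements generated from known parameters (no noise).
true_params = (2e-5, 1.1, 1.0, 0.9)
grid = list(product([1_000, 5_000, 20_000], [1, 2, 4], [5, 10, 20]))
n_obs, m_obs, k_obs = (np.array(col, dtype=float) for col in zip(*grid))
t_obs = predict(true_params, n_obs, m_obs, k_obs)

fitted = fit_power_law(n_obs, m_obs, k_obs, t_obs)
projected = predict(fitted, 100_000, 6, 25)
```

With real timings the fit is approximate, so the projection is best read as an order-of-magnitude estimate, especially when `n` lies far outside the benchmarked range.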
Running tests
```bash
pytest
```
Module layout
| File | Purpose |
|---|---|
| `profiler.py` | Main `Profiler` class: `train()`, `score()`, `check_quality()` |
| `schema.py` | YAML config dataclasses: `ProfilerConfig`, `ColumnConfig`, `TrainingConfig` |
| `trainer.py` | Optuna HPO loop, manual override path, `TrainingResult` |
| `models.py` | PyOD model registry and Optuna search-space definitions |
| `quality.py` | Rule-based quality checks, `check_dataframe()`, `quality_summary()` |
| `store.py` | `S3Store`, `LocalStore`, `ArtifactStore` protocol |
| `projection.py` | `ScalingBenchmark`: benchmark, fit power law, predict and report |
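The layout lists an `ArtifactStore` protocol with `S3Store` and `LocalStore` implementations. The sketch below shows how a structural protocol lets training code stay independent of the storage backend. The method names `save` and `load`, the pickle format, and the key layout are assumptions for illustration; the real protocol in `store.py` may differ.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Protocol


class ArtifactStore(Protocol):
    """Hypothetical interface: the real protocol's methods may differ."""

    def save(self, key: str, obj: Any) -> None: ...

    def load(self, key: str) -> Any: ...


class LocalStore:
    """Pickle artifacts under a root directory, one file per key."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / f"{key}.pkl"

    def save(self, key: str, obj: Any) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(obj, fh)

    def load(self, key: str) -> Any:
        with self._path(key).open("rb") as fh:
            return pickle.load(fh)


def roundtrip(store: ArtifactStore, key: str, obj: Any) -> Any:
    """Works with any object that structurally satisfies ArtifactStore."""
    store.save(key, obj)
    return store.load(key)


with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip(LocalStore(tmp), "amsterdam/temperature_2m",
                         {"threshold": 0.7})
```

An S3-backed class exposing the same two methods could be passed to `roundtrip` unchanged, which is the point of declaring the interface as a protocol instead of a base class.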