AutoML anomaly detection and schema-driven data quality for ETL pipelines

adaptive_profiler

AutoML anomaly detection and schema-driven data quality checks for ETL pipelines.

Detects semantic anomalies that rule-based checks miss — stuck sensors, silent feeds, values that are numerically valid but statistically unusual. Each column gets its own model, trained on recent data, configured through a single YAML file.

Features

  • Two-layer quality checking — rule-based data contract checks + ML-based anomaly detection, both declared in one YAML file
  • AutoML via Optuna — searches over IForest, LOF, HBOS, COPOD, and ECOD; selects the best model per column automatically
  • Per-source isolation — each (partition × column) pair trains its own model; amsterdam and london never share a model
  • Configurable training window — train on the N most-recent rows rather than the full history to keep retraining fast
  • Manual model override — pin a specific algorithm and hyperparameters per column to skip Optuna entirely
  • Tunable flag threshold — override the binary flagging threshold per column on a validation split without retraining
  • Built-in cost projection: ScalingBenchmark fits T(n,m,k) = α·n^β·m^δ·k^γ so you can predict overhead before committing to production
  • Pipeline-safe output — failures reported in the output DataFrame, never raised as exceptions
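
The isolation, training-window, and threshold features can be pictured with a small stdlib-only sketch. It is not the library's implementation: ZScoreModel, train_models, and flag are hypothetical names, and a plain z-score stands in for the PyOD detectors that Optuna would select.

```python
from statistics import mean, pstdev


class ZScoreModel:
    """Stand-in detector (hypothetical): scores a value by its absolute z-score."""

    def __init__(self, values):
        self.mu = mean(values)
        # Guard against a zero spread (e.g. a stuck sensor in the training data).
        self.sigma = pstdev(values) or 1.0

    def score(self, value):
        return abs(value - self.mu) / self.sigma


def train_models(rows_by_partition, columns, window):
    """Fit one model per (partition, column) pair on the `window` most-recent rows."""
    models = {}
    for partition, rows in rows_by_partition.items():
        recent = rows[-window:]  # rows are assumed to be ordered oldest to newest
        for column in columns:
            models[(partition, column)] = ZScoreModel([r[column] for r in recent])
    return models


def flag(models, partition, column, value, threshold=3.0):
    """Binary flag; changing `threshold` re-flags without refitting the model."""
    return int(models[(partition, column)].score(value) > threshold)


history = {
    "amsterdam": [{"temperature_2m": t} for t in (9.0, 10.0, 11.0, 10.0, 9.5, 10.5)],
    "london": [{"temperature_2m": t} for t in (14.0, 15.0, 16.0, 15.0, 14.5, 15.5)],
}
models = train_models(history, columns=["temperature_2m"], window=4)

# The same reading is judged against each city's own model.
amsterdam_flag = flag(models, "amsterdam", "temperature_2m", 15.0)
london_flag = flag(models, "london", "temperature_2m", 15.0)
```

A reading of 15.0 is far from Amsterdam's recent values but typical for London, so only the Amsterdam model flags it. Raising the threshold changes the flag without touching the fitted model.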

Install

pip install adaptive-profiler
pip install "adaptive-profiler[s3]"   # include boto3 for S3 storage

For development (from source):

git clone https://github.com/kooroshkz/adaptive-profiler
cd adaptive-profiler
pip install -e ".[dev]"

Quick start

from adaptive_profiler import Profiler

profiler = Profiler.from_yaml("profiling_schema.yml")

# Train models — one per (city, column) pair
results = profiler.train(partition_key="amsterdam", df=historical_df)
for r in results:
    print(r)

# Score incoming data — returns long-format DataFrame
predictions = profiler.score(partition_key="amsterdam", df=new_df)
print(predictions[predictions["automl_flag"] == 1])

# Rule-based checks only
violations = profiler.check_quality(df=new_df)
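
To illustrate what "failures reported in the output, never raised" can look like for long-format results, here is a minimal sketch. Only the automl_flag column name comes from the quick start above; safe_score, the other field names, and the toy scorers are assumptions for illustration, not the library's actual output schema.

```python
def safe_score(rows, scorers):
    """Return one record per (row, column); errors become data, not exceptions."""
    records = []
    for i, row in enumerate(rows):
        for column, scorer in scorers.items():
            record = {"row": i, "column": column, "automl_flag": None, "error": None}
            try:
                record["automl_flag"] = int(scorer(row[column]))
            except Exception as exc:  # missing column, wrong type, model failure
                record["error"] = f"{type(exc).__name__}: {exc}"
            records.append(record)
    return records


# Toy scorers (hypothetical) standing in for trained models.
scorers = {
    "temperature_2m": lambda v: not (-40.0 <= v <= 50.0),
    "pressure": lambda v: not (900.0 <= v <= 1100.0),
}
rows = [
    {"temperature_2m": 12.3, "pressure": 1013.0},
    {"temperature_2m": 99.0},                       # "pressure" is missing
    {"temperature_2m": "n/a", "pressure": 1200.0},  # wrong type for temperature
]

records = safe_score(rows, scorers)
flagged = [r for r in records if r["automl_flag"] == 1]
failed = [r for r in records if r["error"] is not None]
```

Because every (row, column) pair yields a record, a downstream ETL step can filter on the flag or on the error field without wrapping the call in its own try/except.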

Cost projection

from adaptive_profiler import ScalingBenchmark

bench = ScalingBenchmark(df, columns=["temperature_2m", "pressure"])
bench.run(quick=True)   # ~1–2 min
bench.fit()
print(bench.report(target_n=100_000, m=6, k=25))
t = bench.predict(n=100_000, m=6, k=25)
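
A power law of the form T(n,m,k) = α·n^β·m^δ·k^γ becomes linear after taking logarithms, so it can be fitted with ordinary least squares. The sketch below shows that idea with the standard library only. It is an assumption about how such a fit can be done, not a description of ScalingBenchmark internals; solve, fit_power_law, and predict are hypothetical helpers, and the timings are synthetic.

```python
import math


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def fit_power_law(samples):
    """Least-squares fit of log T = log α + β·log n + δ·log m + γ·log k.

    `samples` is a list of (n, m, k, seconds) measurements.
    Returns (alpha, beta, delta, gamma).
    """
    design = [[1.0, math.log(n), math.log(m), math.log(k)] for n, m, k, _ in samples]
    target = [math.log(t) for _, _, _, t in samples]
    # Normal equations: (XᵀX)·w = Xᵀy
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(4)] for i in range(4)]
    xty = [sum(row[i] * y for row, y in zip(design, target)) for i in range(4)]
    log_alpha, beta, delta, gamma = solve(xtx, xty)
    return math.exp(log_alpha), beta, delta, gamma


def predict(params, n, m, k):
    """Evaluate T(n, m, k) = α·n^β·m^δ·k^γ for fitted parameters."""
    alpha, beta, delta, gamma = params
    return alpha * n**beta * m**delta * k**gamma


# Synthetic timings generated from a known law, so the fit can be checked.
true_params = (2e-4, 1.1, 0.9, 1.0)
grid = [(n, m, k) for n in (1_000, 5_000, 20_000) for m in (1, 2, 4) for k in (5, 10, 20)]
samples = [(n, m, k, predict(true_params, n, m, k)) for n, m, k in grid]

params = fit_power_law(samples)
projected = predict(params, n=100_000, m=6, k=25)
```

With noise-free inputs the fit recovers the generating exponents, and the projection extrapolates beyond the measured grid. Real timings are noisy, so extrapolated values should be read as rough estimates.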

Running tests

pytest

Module layout

File            Purpose
profiler.py     Main Profiler class: train(), score(), check_quality()
schema.py       YAML config dataclasses: ProfilerConfig, ColumnConfig, TrainingConfig
trainer.py      Optuna HPO loop, manual override path, TrainingResult
models.py       PyOD model registry and Optuna search-space definitions
quality.py      Rule-based quality checks, check_dataframe(), quality_summary()
store.py        S3Store, LocalStore, ArtifactStore protocol
projection.py   ScalingBenchmark: benchmark, fit power law, predict, and report
