A data quality auditing library for time-series tabular data in financial and sensor domains.

tsauditor

A data-quality auditing library for time-series tabular data, with a focus on financial and sensor domains. tsauditor scans a DataFrame and returns a structured report covering structural problems, anomalies, and data leakage between features and the prediction target. Leakage detection is its core contribution.

The project grew out of a real bug in a Pakistani equity (OGDC) direction-prediction model: a same-day percentage-change feature (ChangeP) was mathematically near-identical to the target it was meant to predict. With ChangeP included, a Random Forest classifier reached 99.68% accuracy (AUC 0.9987); a Gradient Boosting classifier reached the same 99.68% accuracy (AUC 0.9967). Removing it — along with same-day Open, High, and Low, which are equally unavailable at prediction time — dropped accuracy to 69.81% (RF, AUC 0.7795) and 73.70% (GBM, AUC 0.8072) on a held-out test period (2025-01-09 to 2026-04-03). Both models still beat a 50% baseline, but the headline accuracy had been almost entirely an artifact of the leak. tsauditor exists to catch this class of mistake automatically before it reaches a model. See examples/ogdc_leakage_case for the full experiment, script, and measured results.
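The mechanism behind this kind of leak can be reproduced on synthetic data. The sketch below is illustrative only: it uses a simulated random-walk price series rather than the OGDC data, and it assumes the target is 1 when today's close is above yesterday's, which may differ from the exact definition used in the experiment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily closes: a geometric random walk (illustrative, not OGDC data).
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))), name="Close")

df = pd.DataFrame({"Close": close})
# Same-day percentage change: only known once today's close is known.
df["ChangeP"] = df["Close"].pct_change() * 100
# Assumed target definition: 1 if today's close is above yesterday's close.
df["Direction"] = (df["Close"] > df["Close"].shift(1)).astype(int)
df = df.dropna()

# The sign of the "feature" reproduces the target row by row.
agreement = ((df["ChangeP"] > 0).astype(int) == df["Direction"]).mean()
print(f"sign(ChangeP) matches Direction on {agreement:.0%} of rows")
```

Any classifier given ChangeP only has to learn a threshold at zero, which is why near-perfect accuracy in this setting says nothing about predictive skill.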

Installation

pip install tsauditor

Requires Python ≥ 3.9. Core dependencies: pandas, numpy, scipy, statsmodels, rich.

Development setup

git clone https://github.com/imann128/tsauditor.git
cd tsauditor
pip install -e ".[dev]"

Quickstart

import tsauditor as tsa

# df: a pandas DataFrame of time-series data that includes the target column ("Direction" here)
report = tsa.scan(df, target="Direction", domain="finance")

report.summary()                 # rich-formatted CLI table
report.critical                  # list[Issue] that block modeling
report.filter(module="leakage")  # programmatic filtering
report.to_json("report.json")    # structured export

scan() returns a GuardReport holding Issue dataclasses bucketed by severity (critical, warnings, info) plus dataset metadata.
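As a rough mental model of that shape, here is a minimal sketch of a report that buckets issues by severity and filters by module. The class names (SketchIssue, SketchReport) and all field names are hypothetical and chosen for illustration; they are not tsauditor's actual definitions, which live in report/summary.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchIssue:
    # Hypothetical fields; the real Issue dataclass may differ.
    code: str        # e.g. "LEK001"
    module: str      # "profiler", "anomaly", or "leakage"
    severity: str    # "critical", "warning", or "info"
    message: str


@dataclass
class SketchReport:
    # Hypothetical stand-in for GuardReport.
    issues: List[SketchIssue] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)

    @property
    def critical(self) -> List[SketchIssue]:
        """Issues severe enough to block modeling."""
        return [i for i in self.issues if i.severity == "critical"]

    def filter(self, module: str) -> List[SketchIssue]:
        """Issues raised by one module."""
        return [i for i in self.issues if i.module == module]


report = SketchReport(
    issues=[
        SketchIssue("LEK001", "leakage", "critical", "ChangeP reproduces Direction"),
        SketchIssue("PRF004", "profiler", "warning", "Duplicate timestamps"),
    ],
    metadata={"n_rows": 2},
)
print([i.code for i in report.critical])
print([i.code for i in report.filter(module="leakage")])
```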

What it checks

Module    Code    Severity  Detects
profiler  PRF001  warning   Irregular timestamp frequency
profiler  PRF002  warning   Clustered missing values
profiler  PRF003  info      Non-stationarity (Augmented Dickey-Fuller)
profiler  PRF004  warning   Duplicate timestamps
profiler  PRF005  warning   Clustered gaps
profiler  PRF006  warning   High overall missing rate
anomaly   ANO001  warning   Stuck / repeated constant values
anomaly   ANO002  warning   Point outliers (z-score + IQR)
anomaly   ANO003  warning   Contextual spikes (local rolling z-score)
leakage   LEK001  critical  Target equivalence (feature reproduces the target)
leakage   LEK002  warning   Positive-lag cross-correlation peak (future info)
leakage   LEK003  warning   Rolling-window lookahead (excess over persistence)

Leakage detection (the research core)

Leakage checks are rank-based, chosen by target type:

  • LEK001 — equivalence. Continuous targets use |Spearman ρ|; binary targets use AUC separation (max(AUC, 1−AUC)). This is deliberate: Pearson correlation against a binary 0/1 target is the point-biserial correlation, and for a roughly Gaussian feature whose sign defines the target it tops out near √(2/π) ≈ 0.798. Such a feature scores only ~0.80 and slips under a naive threshold, while AUC scores it 1.0.
  • LEK002 — cross-correlation. Flags features whose peak association with the target falls at a positive lag (the feature aligns with the target's future).
  • LEK003 — temporal lookahead. Flags features that correlate with the future target beyond what the target's own autocorrelation can explain — the signature of a forward-looking or centered window. The persistence baseline is what keeps a legitimate trailing feature from being false-flagged.
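The LEK001 argument for binary targets can be checked numerically. For a standard normal X and a target defined as 1 when X > 0, the Pearson correlation equals E|X| = √(2/π). The sketch below compares that against a rank-based AUC. It is an independent illustration using numpy and scipy, not tsauditor's implementation; the auc helper is a hypothetical name.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)      # Gaussian feature
y = (x > 0).astype(int)          # binary target defined by the feature's sign

# Point-biserial correlation is just Pearson against the 0/1 target.
r, _ = pearsonr(x, y)


def auc(score, label):
    """Rank-based AUC: Mann-Whitney U divided by n_pos * n_neg."""
    ranks = rankdata(score)
    n_pos = int(label.sum())
    n_neg = len(label) - n_pos
    u = ranks[label == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


a = auc(x, y)
separation = max(a, 1 - a)       # direction-agnostic, as in max(AUC, 1 - AUC)

print(f"Pearson r      = {r:.3f}  (theory: {np.sqrt(2 / np.pi):.3f})")
print(f"AUC separation = {separation:.3f}")
```

A threshold such as 0.9 on |r| would pass this feature, even though it determines the target exactly; the AUC separation does not have that blind spot.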

LEK002/LEK003 are WARNING-level suspicions: in pure cross-correlation a genuine strong predictor and a leak are distinguishable only by magnitude. LEK001 is CRITICAL because equivalence is near-deterministic.
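The two warning-level ideas can be sketched on a synthetic series. The code below is a simplified illustration under stated assumptions: the target is a simulated AR(1) process, lagged_corr is a hypothetical helper, the lag range and window size are arbitrary, and "excess over persistence" is taken to mean the feature's correlation with the next-step target minus the target's own lag-1 autocorrelation. tsauditor's actual statistics and thresholds may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Target: an AR(1) series, so it carries real persistence (lag-1 autocorrelation about 0.5).
noise = rng.normal(size=n)
values = np.zeros(n)
for t in range(1, n):
    values[t] = 0.5 * values[t - 1] + noise[t]
y = pd.Series(values)


def lagged_corr(feature: pd.Series, target: pd.Series, lag: int) -> float:
    """Spearman correlation of feature[t] with target[t + lag] on complete pairs.

    lag > 0 pairs the feature with the target's future.
    """
    pair = pd.concat([feature, target.shift(-lag)], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())


# LEK002-style: find the lag at which the association peaks.
leaky = y.shift(-2) + rng.normal(scale=0.1, size=n)   # built from y two steps ahead
lags = range(-5, 6)
peak_lag = max(lags, key=lambda k: abs(lagged_corr(leaky, y, k)))

# LEK003-style: correlation with the next step, minus the target's own persistence.
persistence = abs(lagged_corr(y, y, 1))
trailing = y.rolling(5).mean()                 # uses y[t-4..t] only
centered = y.rolling(5, center=True).mean()    # uses y[t-2..t+2], so it sees the future
excess_trailing = abs(lagged_corr(trailing, y, 1)) - persistence
excess_centered = abs(lagged_corr(centered, y, 1)) - persistence

print(f"peak lag for leaky feature: {peak_lag}")
print(f"excess over persistence, trailing window: {excess_trailing:+.2f}")
print(f"excess over persistence, centered window: {excess_centered:+.2f}")
```

The trailing window stays below the persistence baseline because it contains nothing the target's own past does not already carry, while the centered window exceeds it because it averages in the value being predicted.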

Architecture

tsauditor/
├── scanner.py          # scan() — orchestrates all modules into a GuardReport
├── profiler/           # structural checks: frequency, missing, stationarity
├── anomaly/            # point.py, contextual.py
├── leakage/            # equivalence.py, correlation.py, temporal.py
├── report/summary.py   # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py # input validation & DataFrame normalization

Testing

pytest -q

Contributing

Contributions are welcome. Check open issues for ideas, or look for the good first issue label. Run pytest -q before opening a PR — all 93 tests must pass, and CI will verify this across Python 3.9–3.14 on Linux, Windows, and macOS.

Status

Beta (0.1.2). Profiler, anomaly, and leakage modules are implemented and tested (93 tests passing, CI across Python 3.9–3.14 on Linux, Windows, macOS).

License

MIT — see LICENSE.
