tsauditor
A data-quality auditing library for time-series tabular data, with a focus on
financial and sensor domains. tsauditor scans a DataFrame and returns a
structured report of structural problems, anomalies, and — its core contribution —
data-leakage between features and the prediction target.
The project grew out of a real bug in a Pakistani equity (OGDC) direction-prediction
model: a same-day percentage-change feature (ChangeP) was mathematically near-identical
to the target it was meant to predict. With ChangeP included, a Random Forest
classifier reached 99.68% accuracy (AUC 0.9987); a Gradient Boosting classifier reached
the same 99.68% accuracy (AUC 0.9967). Removing it — along with same-day Open, High,
and Low, which are equally unavailable at prediction time — dropped accuracy to 69.81%
(RF, AUC 0.7795) and 73.70% (GBM, AUC 0.8072) on a held-out test period
(2025-01-09 to 2026-04-03). Both models still beat a 50% baseline, but the headline
accuracy had been almost entirely an artifact of the leak. tsauditor exists to catch
this class of mistake automatically before it reaches a model.
See examples/ogdc_leakage_case for the full experiment,
script, and measured results.
Installation
```bash
pip install tsauditor
```
Requires Python ≥ 3.9. Core dependencies: pandas, numpy, scipy, statsmodels, rich.
Development setup
```bash
git clone https://github.com/imann128/tsauditor.git
cd tsauditor
pip install -e ".[dev]"
```
Quickstart
```python
import tsauditor as tsa

report = tsa.scan(df, target="Direction", domain="finance")
report.summary()                  # rich-formatted CLI table
report.critical                   # list[Issue] that block modeling
report.filter(module="leakage")   # programmatic filtering
report.to_json("report.json")     # structured export
```
scan() returns a GuardReport holding Issue dataclasses bucketed by severity
(critical, warnings, info) plus dataset metadata.
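To make the shape of that return value concrete, here is a minimal stand-in written with plain dataclasses. It is not tsauditor's actual definition: the class names `SketchIssue` and `SketchReport`, the `message` field, and the `metadata` keys are assumptions for illustration. Only the severity buckets, the module/code pairs, and the `filter(module=...)` call pattern come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SketchIssue:
    """Hypothetical stand-in for one finding (not the library's Issue class)."""
    module: str    # e.g. "leakage"
    code: str      # e.g. "LEK001"
    severity: str  # "critical", "warning", or "info"
    message: str   # assumed free-text description


@dataclass
class SketchReport:
    """Hypothetical stand-in for a report that buckets issues by severity."""
    issues: List[SketchIssue] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)

    def _by_severity(self, severity: str) -> List[SketchIssue]:
        return [issue for issue in self.issues if issue.severity == severity]

    @property
    def critical(self) -> List[SketchIssue]:
        return self._by_severity("critical")

    @property
    def warnings(self) -> List[SketchIssue]:
        return self._by_severity("warning")

    @property
    def info(self) -> List[SketchIssue]:
        return self._by_severity("info")

    def filter(self, module: str) -> List[SketchIssue]:
        return [issue for issue in self.issues if issue.module == module]


sketch = SketchReport(
    issues=[
        SketchIssue("leakage", "LEK001", "critical", "ChangeP reproduces Direction"),
        SketchIssue("profiler", "PRF004", "warning", "duplicate timestamps"),
        SketchIssue("profiler", "PRF003", "info", "series looks non-stationary"),
    ],
    metadata={"n_rows": 1000},  # assumed key, for illustration only
)

blocking_codes = [issue.code for issue in sketch.critical]
leakage_codes = [issue.code for issue in sketch.filter(module="leakage")]
print(blocking_codes, leakage_codes)
```

In a pipeline, a non-empty `critical` bucket is the natural place to stop before training.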
What it checks
| Module | Code | Severity | Detects |
|---|---|---|---|
| profiler | PRF001 | warning | Irregular timestamp frequency |
| profiler | PRF002 | warning | Clustered missing values |
| profiler | PRF003 | info | Non-stationarity (Augmented Dickey-Fuller) |
| profiler | PRF004 | warning | Duplicate timestamps |
| profiler | PRF005 | warning | Clustered gaps |
| profiler | PRF006 | warning | High overall missing rate |
| anomaly | ANO001 | warning | Stuck / repeated constant values |
| anomaly | ANO002 | warning | Point outliers (z-score + IQR) |
| anomaly | ANO003 | warning | Contextual spikes (local rolling z-score) |
| leakage | LEK001 | critical | Target equivalence (feature reproduces the target) |
| leakage | LEK002 | warning | Positive-lag cross-correlation peak (future info) |
| leakage | LEK003 | warning | Rolling-window lookahead (excess over persistence) |
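The difference between a point outlier (ANO002) and a contextual spike (ANO003) is easiest to see on a series with two regimes. The sketch below is an illustration of the general technique, not tsauditor's implementation: the function names, the trailing 10-point window, and the synthetic data are all assumptions. It uses only the standard library so it runs without dependencies.

```python
import math
import statistics


def global_zscores(values):
    """Z-score of each point against the whole series."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def rolling_zscores(values, window=10):
    """Z-score of each point against the `window` points before it.

    Returns None where there is not enough history or the history is flat.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            scores.append(None)
            continue
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        scores.append(None if std == 0 else (value - mean) / std)
    return scores


# Two regimes (around 10, then around 100) with a small deterministic wiggle.
values = [10 + 0.5 * math.sin(i) for i in range(50)]
values += [100 + 0.5 * math.sin(i) for i in range(50, 100)]
values[30] = 30.0  # a spike that sits between the two regimes

global_z = global_zscores(values)
local_z = rolling_zscores(values, window=10)

print(f"global z at index 30: {global_z[30]:+.2f}")
print(f"local  z at index 30: {local_z[30]:+.2f}")
```

Against the whole series the value 30 is unremarkable, because it lies between the two regimes; against its own recent history it is many standard deviations out. Note that the regime change at index 50 also scores high locally, so a level shift and a spike look alike to this simple check.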
Leakage detection (the research core)
Leakage checks are rank-based, chosen by target type:
- LEK001: equivalence. Continuous targets use |Spearman ρ|; binary targets use AUC separation, max(AUC, 1−AUC). This is deliberate: Pearson against a binary 0/1 target is the point-biserial correlation, which for a roughly normal feature is capped near √(2/π) ≈ 0.798, so a feature whose sign defines the target scores only ~0.80 and slips under a naive threshold. AUC scores it 1.0.
- LEK002: cross-correlation. Flags features whose peak association with the target falls at a positive lag (the feature aligns with the target's future).
- LEK003: temporal lookahead. Flags features that correlate with the future target beyond what the target's own autocorrelation can explain, which is the signature of a forward-looking or centered window. The persistence baseline is what keeps a legitimate trailing feature from being false-flagged.
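The LEK001 argument can be checked numerically. The following is a minimal sketch, assuming a standard-normal feature whose sign defines a binary target; the helper names `pearson` and `auc` are written here for illustration and are not tsauditor functions. It uses only the standard library.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def auc(scores, labels):
    """Rank-based AUC: Mann-Whitney U / (n_pos * n_neg), average ranks on ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
change = [rng.gauss(0, 1) for _ in range(20000)]  # stand-in for a same-day change feature
direction = [1 if c > 0 else 0 for c in change]   # target defined by the feature's sign

r = pearson(change, direction)
a = auc(change, direction)
separation = max(a, 1 - a)

print(f"Pearson r      = {r:.3f}  (theory: {math.sqrt(2 / math.pi):.3f})")
print(f"AUC separation = {separation:.3f}")
```

A threshold such as |r| > 0.9 would let this feature through even though it determines the target exactly, whereas the AUC separation reaches its maximum.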
LEK002/LEK003 are WARNING-level suspicions: in pure cross-correlation a genuine strong predictor and a leak are distinguishable only by magnitude. LEK001 is CRITICAL because equivalence is near-deterministic.
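One way to read the persistence baseline is sketched below. This is an illustration under stated assumptions, not tsauditor's code: the target is a simulated AR(1) series, the horizon is one step, the baseline is the target's own lag-1 correlation, and the 0.05 margin and the helper names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def future_corr(feature, target, horizon=1):
    """|corr| between feature[t] and target[t + horizon], skipping undefined rows."""
    pairs = [
        (f, target[t + horizon])
        for t, f in enumerate(feature[: len(target) - horizon])
        if f is not None
    ]
    xs, ys = zip(*pairs)
    return abs(pearson(xs, ys))


# Simulated autocorrelated target: y[t] = 0.8 * y[t-1] + noise.
rng = random.Random(0)
n = 5000
target = [0.0]
for _ in range(n - 1):
    target.append(0.8 * target[-1] + rng.gauss(0, 1))

# Trailing window uses y[t-2..t]; centered window uses y[t-1..t+1] and so peeks ahead.
trailing = [None, None] + [sum(target[t - 2 : t + 1]) / 3 for t in range(2, n)]
centered = [None] + [sum(target[t - 1 : t + 2]) / 3 for t in range(1, n - 1)] + [None]

# Persistence baseline: how well the target's current value predicts its next value.
persistence = future_corr(target, target)
trailing_excess = future_corr(trailing, target) - persistence
centered_excess = future_corr(centered, target) - persistence

MARGIN = 0.05  # hypothetical tolerance above the baseline
flags = {"trailing": trailing_excess > MARGIN, "centered": centered_excess > MARGIN}
print(f"persistence baseline: {persistence:.3f}")
print(f"trailing excess: {trailing_excess:+.3f}, centered excess: {centered_excess:+.3f}")
print(flags)
```

Both features correlate strongly with the next-step target, so a raw correlation threshold would struggle to separate them. Only the centered window exceeds what the target's own persistence explains, which is why the comparison is made against the baseline and not against zero.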
Architecture
```
tsauditor/
├── scanner.py           # scan(): orchestrates all modules into a GuardReport
├── profiler/            # structural checks: frequency, missing, stationarity
├── anomaly/             # point.py, contextual.py
├── leakage/             # equivalence.py, correlation.py, temporal.py
├── report/summary.py    # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py  # input validation & DataFrame normalization
```
Testing
```bash
pytest -q
```
Contributing
Contributions are welcome. Check open issues
for ideas, or look for the good first issue label. Run pytest -q before opening a PR —
all 93 tests must pass, and CI will verify this across Python 3.9–3.14 on Linux, Windows, and macOS.
Status
Beta (0.1.2). Profiler, anomaly, and leakage modules are implemented and tested
(93 tests passing, CI across Python 3.9–3.14 on Linux, Windows, macOS).
License
MIT — see LICENSE.