BenchAudit
Data hygiene and similarity audits for molecular and DTI benchmarks.
BenchAudit is a lightweight pipeline for auditing molecular property and drug–target interaction benchmarks. It standardizes SMILES strings, checks split hygiene, surfaces label conflicts and activity cliffs, and can run simple baseline models. Outputs are machine‑readable summaries and drill‑down tables you can inspect or feed into other tools.
Features
- Config‑driven analysis of tabular, TDC, Polaris, and DTI datasets.
- SMILES standardization with optional REOS alerts and configurable fingerprint settings.
- Split hygiene reports: duplicates, cross‑split contamination, and nearest‑neighbor similarity.
- Conflict and activity‑cliff detection for classification and regression tasks.
- DTI extras: sequence normalization, cross-split pair conflicts, and EMBOSS stretcher alignment summaries.
- Optional simple baselines for quick performance sanity checks.
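For intuition, conflict and activity-cliff detection of this kind can be sketched in a few lines of plain Python. The tanimoto and is_activity_cliff helpers and the 0.9 / 1.0 thresholds below are illustrative assumptions, not BenchAudit's actual implementation (which operates on real molecular fingerprints):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def is_activity_cliff(fp_a, fp_b, y_a, y_b, sim_cut=0.9, delta_cut=1.0):
    """A pair is a cliff when structures are very similar but activities differ a lot."""
    return tanimoto(fp_a, fp_b) >= sim_cut and abs(y_a - y_b) >= delta_cut


# Two identical bit sets with a 2-log-unit activity gap -> flagged as a cliff.
print(is_activity_cliff({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, 7.5, 5.5))  # True
```

The same similarity function also underlies nearest-neighbor contamination checks: a test compound whose best train-set Tanimoto exceeds the cutoff counts as cross-split leakage.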
Installation
From PyPI
Install the published package:
pip install benchaudit
or with uv:
uv pip install benchaudit
From source with uv
BenchAudit uses a standard pyproject.toml. The quickest source setup is with uv:
# 1) Create a virtual environment
uv venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# 2) Install dependencies declared in pyproject.toml
uv sync
If you need optional sequence alignment support, install EMBOSS so stretcher is available (e.g., sudo apt install emboss on Debian/Ubuntu).
Automated PyPI publishing
This repo includes .github/workflows/publish-pypi.yml for automated releases.
- In PyPI, configure a Trusted Publisher for this GitHub repository and workflow file (.github/workflows/publish-pypi.yml), using environment pypi.
- Bump project.version in pyproject.toml.
- Create and push a tag vX.Y.Z matching that version (for example v0.1.1).
- GitHub Actions builds with uv build and publishes to PyPI automatically when the repository visibility is public (publishing is skipped while private).
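Before pushing a release tag, it can help to check that it matches project.version. A minimal sketch; the pyproject.toml excerpt and tag below are hypothetical:

```python
import re

# Hypothetical pyproject.toml excerpt and release tag.
pyproject = '[project]\nname = "benchaudit"\nversion = "0.1.1"\n'
tag = "v0.1.1"

# Extract the declared version and compare it to the tag.
version = re.search(r'^version\s*=\s*"([^"]+)"', pyproject, re.MULTILINE).group(1)
if tag == f"v{version}":
    print("tag matches project.version")
else:
    raise SystemExit(f"tag {tag} does not match version {version}")
```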
Detailed release and install documentation: docs/publishing_and_installation.md
References
- Package on PyPI: https://pypi.org/project/benchaudit/
- Publish workflow: .github/workflows/publish-pypi.yml
- CI workflow: .github/workflows/ci.yml
- uv docs: https://docs.astral.sh/uv/
- PyPI Trusted Publishers: https://docs.pypi.org/trusted-publishers/
Usage
The main entry point is run.py, which consumes one or more YAML configs and writes results under runs/ by default. After uv sync, you can call it via uv run python run.py ... or the installed console scripts:
- uv run benchaudit ... (primary)
- uv run bench ... (legacy alias)
# Analyze all configs in a folder
uv run python run.py --configs configs --out-root runs
# or: uv run benchaudit --configs configs --out-root runs
# Analyze a single config and train baselines
uv run python run.py --config configs/example.yml --benchmark
# or: uv run benchaudit --config configs/example.yml --benchmark
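The commands above consume YAML configs like those shipped in configs/. The fragment below is only a hypothetical illustration of the general shape; all field names are assumptions, so consult configs/example.yml for the real schema:

```yaml
# Hypothetical config sketch; field names are illustrative only.
dataset:
  source: tabular
  path: data/my_benchmark.csv
  smiles_column: smiles
  label_column: activity
  split_column: split
analysis:
  fingerprint: morgan
  similarity_threshold: 0.9
```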
Outputs per config:
- summary.json: split sizes, hygiene counts, similarity and conflict statistics.
- records.csv: per-row view with cleaned SMILES, labels, and split tags.
- conflicts.jsonl: detailed conflict rows.
- cliffs.jsonl: detailed activity cliff rows.
- sequence_alignments.jsonl: (DTI only) top alignments between splits.
- performance.json: (when --benchmark) baseline model metrics and predictions.
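Because the outputs are machine-readable, they are easy to inspect programmatically. A minimal sketch for summary.json, using hypothetical key names (the real keys may differ):

```python
import json

# Stand-in for json.load(open("runs/<config>/summary.json")); keys are assumed.
summary = json.loads("""
{
  "splits": {"train": 800, "test": 200},
  "duplicates": 12,
  "cross_split_contamination": 5
}
""")

total = sum(summary["splits"].values())
print(f"rows: {total}, duplicates: {summary['duplicates']}, "
      f"contaminated pairs: {summary['cross_split_contamination']}")
```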
Project layout
- run.py: CLI runner that loads configs, builds loaders/analyzers, and writes artifacts.
- utils/: loaders, analyzers, baseline helpers, and logging utilities.
- configs/: example YAML configurations for supported datasets.
- data/, runs/: expected data and output locations (not tracked).
Development
- Code style: keep changes simple, PEP 8-ish. Add short docstrings for public functions.
- Typing: prefer explicit, lightweight type hints when types are clear.
- Tests: run python -m unittest discover -s tests -p "test_*.py" (or pytest tests if pytest is installed).
- Test data: tiny dummy benchmark datasets live under tests/data/.
- Benchmark/analysis docs: run python scripts/generate_benchmark_analysis_class_docs.py --output docs/benchmark_and_analysis_class_reference.md to regenerate the class reference; CI enforces freshness via .github/workflows/benchmark-analysis-docs.yml.
- Optional extras: Polaris datasets require polaris-lib; sequence alignment requires pairwise-sequence-alignment and EMBOSS binaries.
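A quick way to verify the optional EMBOSS dependency is actually available on PATH before running DTI alignment analyses:

```python
import shutil

# EMBOSS's stretcher binary must be on PATH for sequence alignment summaries.
path = shutil.which("stretcher")
print(f"stretcher found at {path}" if path else "stretcher not found; install EMBOSS")
```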