BenchAudit -- data hygiene and similarity audits for molecular and DTI benchmarks.

BenchAudit

CI · Publish to PyPI · Python 3.10+ · Docs Website · License: GPL v3

BenchAudit is a lightweight pipeline for auditing molecular property and drug–target interaction benchmarks. It standardizes SMILES strings, checks split hygiene, surfaces label conflicts and activity cliffs, and can run simple baseline models. Outputs are machine‑readable summaries and drill‑down tables you can inspect or feed into other tools.

Features

  • Config‑driven analysis of tabular, TDC, Polaris, and DTI datasets.
  • SMILES standardization with optional REOS alerts and configurable fingerprint settings.
  • Split hygiene reports: duplicates, cross‑split contamination, and nearest‑neighbor similarity.
  • Conflict and activity‑cliff detection for classification and regression tasks.
  • DTI extras: sequence normalization, cross‑split pair conflicts, and EMBOSS stretcher alignment summaries.
  • Optional simple baselines for quick performance sanity checks.
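The cliff-detection idea behind the feature above can be sketched in plain Python (this is an illustration, not BenchAudit's actual implementation): an activity cliff is a pair of molecules that are structurally similar yet far apart in activity. Treating fingerprints as bit sets, Tanimoto similarity plus two thresholds is enough to show the pattern:

```python
from itertools import combinations

def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_cliffs(mols, sim_cut=0.8, delta_cut=1.0):
    """Flag pairs that are structurally similar but far apart in activity.

    `mols` is a list of (name, fingerprint_bits, activity) tuples.
    """
    cliffs = []
    for (n1, fp1, y1), (n2, fp2, y2) in combinations(mols, 2):
        sim = tanimoto(fp1, fp2)
        if sim >= sim_cut and abs(y1 - y2) >= delta_cut:
            cliffs.append((n1, n2, round(sim, 3), round(abs(y1 - y2), 3)))
    return cliffs

# Toy example: mol_a and mol_b share 8 of 10 bits (similarity 0.8) but
# differ by 2.7 activity units; mol_c is dissimilar and never flagged.
mols = [
    ("mol_a", {1, 2, 3, 4, 5, 6, 7, 8, 9}, 6.2),
    ("mol_b", {1, 2, 3, 4, 5, 6, 7, 8, 10}, 8.9),
    ("mol_c", {20, 21}, 6.3),
]
print(find_cliffs(mols))  # -> [('mol_a', 'mol_b', 0.8, 2.7)]
```

A real pipeline would compute fingerprints with a cheminformatics toolkit such as RDKit and pick cutoffs suited to the assay; the thresholds above are arbitrary.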

Installation

From source with uv

BenchAudit uses a standard pyproject.toml. The quickest source setup is with uv:

# 1) Create a virtual environment
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# 2) Install dependencies declared in pyproject.toml
uv sync

If you need optional sequence alignment support, install EMBOSS so stretcher is available (e.g., sudo apt install emboss on Debian/Ubuntu).
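Before enabling alignment summaries, you can confirm that the stretcher binary is actually on your PATH (a generic shell check, not a BenchAudit command):

```shell
# Sanity-check that EMBOSS's stretcher binary is reachable
if command -v stretcher >/dev/null 2>&1; then
    echo "stretcher available at: $(command -v stretcher)"
else
    echo "stretcher not found; install EMBOSS first" >&2
fi
```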

Usage

The main entry point is run.py, which consumes one or more YAML configs and writes results under runs/ by default. After uv sync, you can call it via uv run python run.py ... or the installed console scripts:

  • uv run benchaudit ... (primary)
  • uv run bench ... (legacy alias)

# Analyze all configs in a folder
uv run python run.py --configs configs --out-root runs
# or: uv run benchaudit --configs configs --out-root runs

# Analyze a single config and train baselines
uv run python run.py --config configs/example.yml --benchmark
# or: uv run benchaudit --config configs/example.yml --benchmark

Outputs per config:

  • summary.json: split sizes, hygiene counts, similarity and conflict statistics.
  • records.csv: per-row view with cleaned SMILES, labels, and split tags.
  • conflicts.jsonl: detailed conflict rows.
  • cliffs.jsonl: detailed activity cliff rows.
  • sequence_alignments.jsonl: (DTI only) top alignments between splits.
  • performance.json: (when --benchmark) baseline model metrics and predictions.
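All of these artifacts are plain JSON, JSON Lines, or CSV, so they are easy to post-process with the standard library. A minimal sketch of reading a run directory follows; the field names are illustrative assumptions, not BenchAudit's actual schema:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a real run directory (runs/<config-name>/); the files and
# fields written here are hypothetical examples of the output shapes.
run = Path(tempfile.mkdtemp())
(run / "summary.json").write_text(json.dumps(
    {"splits": {"train": 900, "test": 100}, "n_conflicts": 2}
))
(run / "conflicts.jsonl").write_text(
    '{"smiles": "CCO", "labels": [0, 1]}\n'
    '{"smiles": "c1ccccc1", "labels": [1, 0]}\n'
)

# summary.json holds a single JSON object with aggregate statistics
summary = json.loads((run / "summary.json").read_text())
print(summary["splits"]["test"])  # -> 100

# *.jsonl files hold one JSON object per line (one conflict/cliff each)
with (run / "conflicts.jsonl").open() as fh:
    conflicts = [json.loads(line) for line in fh]
print(len(conflicts), conflicts[0]["smiles"])  # -> 2 CCO
```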

Project layout

  • run.py: CLI runner that loads configs, builds loaders/analyzers, and writes artifacts.
  • utils/: loaders, analyzers, baseline helpers, and logging utilities.
  • configs/: example YAML configurations for supported datasets.
  • data/, runs/: expected data and output locations (not tracked).

Development

  • Tests: run python -m unittest discover -s tests -p "test_*.py" (or pytest tests if pytest is installed).
  • Test data: tiny dummy benchmark datasets live under tests/data/.
  • Docs: build with sphinx-build -W --keep-going -b html docs/source docs/_build/html (docs/source/reference/api_objects.rst provides autosummary-based API inventory).
  • Optional extras: Polaris datasets require polaris-lib; sequence alignment requires pairwise-sequence-alignment and EMBOSS binaries.

Release files

benchaudit 0.1.1 was published to PyPI via Trusted Publishing (twine 6.1.0, CPython 3.13.7; publisher workflow publish-pypi.yml on sieber-lab/benchaudit):

  • Source distribution: benchaudit-0.1.1.tar.gz (82.1 kB), SHA256 2a32d7ce1ab6072ac6ed3246a33877f777284ef1039ba8faac1b1fe3f05285b4
  • Built distribution: benchaudit-0.1.1-py3-none-any.whl (64.6 kB, Python 3), SHA256 8f5c04a2411e0d1f01ce51348099c8164fab77bfec4d3650b065706087781b43