BenchAudit

Data hygiene and similarity audits for molecular and DTI benchmarks.

BenchAudit is a lightweight pipeline for auditing molecular property and drug–target interaction benchmarks. It standardizes SMILES strings, checks split hygiene, surfaces label conflicts and activity cliffs, and can run simple baseline models. Outputs are machine‑readable summaries and drill‑down tables you can inspect or feed into other tools.
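As a flavor of the split-hygiene checks described above, here is a minimal sketch. It is not BenchAudit's API: plain strings stand in for canonicalized SMILES, and `split_hygiene` is a hypothetical helper name.

```python
def split_hygiene(train, test):
    """Return (duplicates within train, molecules appearing in both splits).

    In a real audit the inputs would be canonicalized SMILES; here we
    compare raw strings for simplicity.
    """
    seen, dups = set(), set()
    for smi in train:
        if smi in seen:
            dups.add(smi)
        seen.add(smi)
    contamination = seen & set(test)  # cross-split leakage
    return dups, contamination

train = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]
test = ["CCN", "c1ccccc1"]
dups, leak = split_hygiene(train, test)
print(dups)  # {'CCO'}
print(leak)  # {'c1ccccc1'}
```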

Features

  • Config‑driven analysis of tabular, TDC, Polaris, and DTI datasets.
  • SMILES standardization with optional REOS alerts and configurable fingerprint settings.
  • Split hygiene reports: duplicates, cross‑split contamination, and nearest‑neighbor similarity.
  • Conflict and activity‑cliff detection for classification and regression tasks.
  • DTI extras: sequence normalization, cross‑split pair conflicts, and EMBOSS stretcher alignment summaries.
  • Optional simple baselines for quick performance sanity checks.
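The conflict/activity-cliff idea can be sketched independently of the library: an activity cliff is a pair of highly similar molecules whose labels differ sharply. Assuming fingerprints represented as bit-index sets (a simplification; BenchAudit's actual fingerprint settings are configurable), Tanimoto similarity plus a label-delta threshold is enough:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    return len(a & b) / len(a | b)

def find_cliffs(mols, sim_thresh=0.8, delta_thresh=1.0):
    """Return pairs of molecule names that are similar but label-discordant.

    mols: list of (name, fingerprint_bits, label) tuples.
    """
    cliffs = []
    for i in range(len(mols)):
        for j in range(i + 1, len(mols)):
            ni, fi, yi = mols[i]
            nj, fj, yj = mols[j]
            if tanimoto(fi, fj) >= sim_thresh and abs(yi - yj) >= delta_thresh:
                cliffs.append((ni, nj))
    return cliffs

mols = [("a", {1, 2, 3, 4, 5}, 6.1),
        ("b", {1, 2, 3, 4, 5, 6}, 8.5),
        ("c", {9, 10}, 6.0)]
print(find_cliffs(mols))  # [('a', 'b')]
```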

Installation

From PyPI

Install the published package:

pip install benchaudit

or with uv:

uv pip install benchaudit

From source with uv

BenchAudit uses a standard pyproject.toml. The quickest source setup is with uv:

# 1) Create a virtual environment
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# 2) Install dependencies declared in pyproject.toml
uv sync

If you need optional sequence alignment support, install EMBOSS so stretcher is available (e.g., sudo apt install emboss on Debian/Ubuntu).
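A quick preflight check that stretcher is actually on PATH (a generic snippet, not part of BenchAudit):

```python
import shutil

def have_binary(name: str) -> bool:
    """True if an executable named `name` is discoverable on PATH."""
    return shutil.which(name) is not None

if not have_binary("stretcher"):
    print("EMBOSS stretcher not found; alignment summaries will be unavailable.")
```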

Automated PyPI publishing

This repo includes .github/workflows/publish-pypi.yml for automated releases.

  1. In PyPI, configure a Trusted Publisher for this GitHub repository and workflow file (.github/workflows/publish-pypi.yml), using environment pypi.
  2. Bump project.version in pyproject.toml.
  3. Create and push a tag vX.Y.Z matching that version (for example v0.1.1).
  4. GitHub Actions builds with uv build and publishes to PyPI automatically when the repository visibility is public (publishing is skipped while private).
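Steps 2 and 3 are easy to get out of sync; a small standalone helper (hypothetical, not included in the repo) can confirm the tag matches project.version before pushing. A real script would parse pyproject.toml with tomllib; a regex keeps the sketch dependency-free:

```python
import re

def version_matches(pyproject_text: str, tag: str) -> bool:
    """Check that a vX.Y.Z git tag matches project.version in pyproject.toml."""
    m = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    return m is not None and tag == "v" + m.group(1)

pyproject = '[project]\nname = "benchaudit"\nversion = "0.1.1"\n'
print(version_matches(pyproject, "v0.1.1"))  # True
print(version_matches(pyproject, "v0.1.0"))  # False
```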

Detailed release and install documentation: docs/publishing_and_installation.md

Usage

The main entry point is run.py, which consumes one or more YAML configs and writes results under runs/ by default. After uv sync, you can call it via uv run python run.py ... or the installed console scripts:

  • uv run benchaudit ... (primary)
  • uv run bench ... (legacy alias)

# Analyze all configs in a folder
uv run python run.py --configs configs --out-root runs
# or: uv run benchaudit --configs configs --out-root runs

# Analyze a single config and train baselines
uv run python run.py --config configs/example.yml --benchmark
# or: uv run benchaudit --config configs/example.yml --benchmark

Outputs per config:

  • summary.json: split sizes, hygiene counts, similarity and conflict statistics.
  • records.csv: per-row view with cleaned SMILES, labels, and split tags.
  • conflicts.jsonl: detailed conflict rows.
  • cliffs.jsonl: detailed activity cliff rows.
  • sequence_alignments.jsonl: (DTI only) top alignments between splits.
  • performance.json: (when --benchmark) baseline model metrics and predictions.
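The JSON/JSONL artifacts are straightforward to consume downstream. A sketch (the file names follow the list above; the keys in the demo are illustrative, not BenchAudit's actual schema):

```python
import json
import tempfile
from pathlib import Path

def load_run(run_dir):
    """Load summary.json and conflicts.jsonl from a single run directory."""
    run = Path(run_dir)
    summary = json.loads((run / "summary.json").read_text())
    with open(run / "conflicts.jsonl") as f:
        conflicts = [json.loads(line) for line in f if line.strip()]
    return summary, conflicts

# Demo against a fake run directory; "n_train" and "smiles" are made-up keys.
with tempfile.TemporaryDirectory() as d:
    Path(d, "summary.json").write_text('{"n_train": 3}')
    Path(d, "conflicts.jsonl").write_text('{"smiles": "CCO"}\n')
    summary, conflicts = load_run(d)
    print(summary["n_train"], len(conflicts))  # 3 1
```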

Project layout

  • run.py: CLI runner that loads configs, builds loaders/analyzers, and writes artifacts.
  • utils/: loaders, analyzers, baseline helpers, and logging utilities.
  • configs/: example YAML configurations for supported datasets.
  • data/, runs/: expected data and output locations (not tracked).

Development

  • Code style: keep changes simple, PEP 8-ish. Add short docstrings for public functions.
  • Typing: prefer explicit, lightweight type hints when types are clear.
  • Tests: run python -m unittest discover -s tests -p "test_*.py" (or pytest tests if pytest is installed).
  • Test data: tiny dummy benchmark datasets live under tests/data/.
  • Benchmark/analysis docs: run python scripts/generate_benchmark_analysis_class_docs.py --output docs/benchmark_and_analysis_class_reference.md to regenerate the class reference; CI enforces freshness via .github/workflows/benchmark-analysis-docs.yml.
  • Optional extras: Polaris datasets require polaris-lib; sequence alignment requires pairwise-sequence-alignment and EMBOSS binaries.

Provenance

Release 0.1.0 (sdist and wheel) was built and published by the publish-pypi.yml workflow on sieber-lab/benchaudit via PyPI Trusted Publishing, with attestation bundles attached to both files.