BitBIRCH-Lean Python package

License: GPL v3

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software. Expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
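To give a feel for the threshold scale, here is a minimal sketch that computes a pairwise Tanimoto similarity on two toy packed fingerprints. It assumes the threshold is a Tanimoto-style similarity cutoff, which this README does not state explicitly; `tanimoto_packed` is a hypothetical helper written for illustration and is not part of bblean.

```python
import numpy as np


def tanimoto_packed(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity of two packed uint8 fingerprints (illustrative helper)."""
    common = int(np.unpackbits(a & b).sum())  # bits set in both fingerprints
    either = int(np.unpackbits(a | b).sum())  # bits set in at least one
    # Convention chosen here: two empty fingerprints score 0.0
    return common / either if either else 0.0


# Two toy 8-bit fingerprints, packed into one byte each
fp_a = np.packbits(np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.uint8))
fp_b = np.packbits(np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=np.uint8))

sim = tanimoto_packed(fp_a, fp_b)
print(sim)  # 0.5 (2 shared bits out of 4 set in either)
```

Under that assumption, a pair scoring 0.5 sits above a 0.3 cutoff and below a 0.65 cutoff, which illustrates how much stricter the larger values are. The actual merge decision depends on the merge criterion and is made at the cluster level, so treat this only as intuition.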

Installation

BitBIRCH-Lean requires Python 3.11 or higher and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are most optimized for Linux / macOS. We support Windows on a best-effort basis, so some releases may not have Windows support.

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further setup is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. The CLI can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints.
  • bb plot-summary or bb plot-tsne: Analyze the clusters.

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, the files are clustered in parallel and the sub-trees are merged iteratively in intermediate rounds. For more information, run bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend consulting the openTSNE documentation for information on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
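To illustrate why step 1 recommends the packed uint8 format, here is a minimal NumPy sketch of bit packing. It assumes the packed layout is equivalent to `np.packbits` along the feature axis, which this README does not confirm; the random fingerprints are toy data.

```python
import numpy as np

# Toy binary fingerprints: 4 molecules x 2048 bits, one bit per uint8 element
rng = np.random.default_rng(0)
unpacked = rng.integers(0, 2, size=(4, 2048), dtype=np.uint8)

# Pack 8 bits into each byte along the feature axis
packed = np.packbits(unpacked, axis=1)

# Unpacking recovers the original bits exactly
restored = np.unpackbits(packed, axis=1)

print(packed.shape)                    # (4, 256)
print(unpacked.nbytes, packed.nbytes)  # 8192 1024
```

Each fingerprint shrinks by a factor of 8, which matters when an array holds millions of molecules.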

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
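As a sketch of that lookup, the snippet below maps cluster indices back to SMILES strings. The `smiles` and `clusters` values are toy stand-ins, and `smiles_per_cluster` is a hypothetical helper. It assumes the SMILES list is in the same order as the fingerprints and that no molecule was skipped during fingerprint generation.

```python
# Toy stand-ins: in practice `smiles` comes from your *.smi file and
# `clusters` from clusters.pkl (both assumed to share the same ordering)
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "c1ccncc1"]
clusters = [[2, 4], [0, 1], [3]]


def smiles_per_cluster(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[i] for i in members] for members in clusters]


grouped = smiles_per_cluster(clusters, smiles)
print(grouped[0])  # ['c1ccccc1', 'c1ccncc1']
```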

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)
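Once the clusters are saved, a quick sanity check of the size distribution needs only the standard library and NumPy. The sketch below uses a toy cluster list and a temporary file in place of the real outputs; the variable names are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-in for the list of clusters saved in the quickstart
toy_clusters = [[0, 1, 2, 3], [4, 5], [6], [7]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clusters.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_clusters, f)
    # Reload the clusters as you would in a later session
    with open(path, "rb") as f:
        loaded = pickle.load(f)

sizes = np.array([len(c) for c in loaded])
num_singletons = int((sizes == 1).sum())  # clusters with a single molecule
largest = int(sizes.max())                # size of the biggest cluster
num_molecules = int(sizes.sum())          # total molecules across all clusters

print(num_singletons, largest, num_molecules)  # 2 4 8
```

A large share of singletons can be a hint that the threshold is too strict for the chosen fingerprint kind.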

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).

  • Functions and classes that start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules that start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
