
BitBirch-Lean Python package


Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software; expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters see the best practices and parameter tuning guides.
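
The recommended ranges above can be kept in a small lookup when scripting runs over several fingerprint kinds. This is only a sketch: `RECOMMENDED_THRESHOLDS` and `starting_threshold` are hypothetical helpers that are not part of bblean, and the midpoint is just a first guess to tune from.

```python
# Hypothetical helper, NOT part of bblean: recommended threshold ranges
# copied from the note above, keyed by fingerprint kind.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def starting_threshold(kind: str) -> float:
    """Return the midpoint of the recommended range as a first guess."""
    low, high = RECOMMENDED_THRESHOLDS[kind]
    return round((low + high) / 2, 3)


print(starting_threshold("rdkit"))
print(starting_threshold("ecfp4"))
```

The resulting value could then be passed as the threshold when clustering, and adjusted after inspecting the cluster summary.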

Installation

BitBIRCH-Lean requires Python 3.11 or higher, and can be installed on Windows, Linux or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are most optimized for Linux / macOS. We support Windows on a best-effort basis, so some releases may not have Windows support.

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they will be automatically used each time BitBirch-Lean or its classes are used. No need to do anything else.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. It can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend you consult the corresponding documentation for info on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
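
To see why step 1 recommends the packed uint8 format, the sketch below compares the memory footprint of unpacked and packed binary fingerprints using plain NumPy. The data is synthetic, and using `np.packbits` with its default bit order is an assumption made for illustration; the exact layout bblean writes to disk may differ.

```python
import numpy as np

# Synthetic stand-in for 1000 binary fingerprints with 2048 bits each,
# stored one bit per byte (the "unpacked" representation)
rng = np.random.default_rng(0)
unpacked = rng.integers(0, 2, size=(1000, 2048), dtype=np.uint8)

# Pack 8 bits into each uint8 along the feature axis
# (assumption: illustrative only, not necessarily bblean's on-disk layout)
packed = np.packbits(unpacked, axis=1)

print(unpacked.shape, unpacked.nbytes)
print(packed.shape, packed.nbytes)
```

Each row shrinks from 2048 bytes to 256 bytes, an 8x reduction that adds up quickly for libraries with millions of molecules.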

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file handle is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
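
As a minimal sketch of that link, the snippet below maps cluster indices back to SMILES strings. The `smiles` and `clusters` values are small synthetic stand-ins (in practice they would come from your *.smi file and clusters.pkl), and the sketch assumes the SMILES list is in the same order used to generate the fingerprints.

```python
# Synthetic stand-ins for the lines of the original *.smi file
# and for the contents of clusters.pkl
smiles = ["CCO", "CCN", "c1ccccc1", "CCC", "c1ccncc1"]
clusters = [[0, 1, 3], [2, 4]]

# Replace each molecule index with the SMILES string at that position
clusters_as_smiles = [[smiles[i] for i in cluster] for cluster in clusters]

# Report clusters from largest to smallest
ranked = sorted(enumerate(clusters_as_smiles), key=lambda kv: -len(kv[1]))
for cluster_id, members in ranked:
    print(cluster_id, len(members), members)
```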

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)

Public Python API and Documentation

By default all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
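
If you only need the unpacked bits for inspection, it may be simpler to unpack a copy on the side and keep passing the packed array to bblean. The sketch below uses plain NumPy on a tiny synthetic array; it assumes the packed layout matches the default bit order of `np.unpackbits`, which has not been verified against bblean's own packing.

```python
import numpy as np

# Synthetic packed fingerprints: 3 molecules, 16 bits each (2 bytes per row)
packed = np.array(
    [
        [0b10100000, 0b00000001],
        [0b11111111, 0b00000000],
        [0b00000000, 0b00000000],
    ],
    dtype=np.uint8,
)

# Expand every byte into 8 separate 0/1 entries along the feature axis
unpacked = np.unpackbits(packed, axis=1)

# Count the on-bits of each fingerprint
on_bits = unpacked.sum(axis=1)
print(unpacked.shape)
print(on_bits.tolist())
```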

  • Functions and classes that start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules that start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
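
Following the examples in the list above (leading underscores on `_private_function` and `_private_module`), the convention can be checked mechanically. The `is_public` helper below is hypothetical and not part of bblean; note that it also flags dunder names such as `__init__` as private.

```python
def is_public(qualified_name: str) -> bool:
    """Return True if no component of a dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in qualified_name.split("."))


print(is_public("bblean.BitBirch"))
print(is_public("bblean._private_module.private_function"))
print(is_public("bblean.analysis._private_function"))
```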

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea or request (or open a PR directly from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
