BitBirch-Lean Python package

License: GPL v3

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software; expect minor breaking changes until we reach version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
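
As a rough illustration, the recommended ranges can be kept in a small lookup table when scripting parameter sweeps. The helper name `suggested_threshold_range` and the table are hypothetical and not part of bblean; the numbers are simply the ranges quoted above and are starting points, not guarantees.

```python
# Hypothetical helper, NOT part of bblean: it only encodes the ranges quoted above.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def suggested_threshold_range(kind: str) -> tuple[float, float]:
    """Return the (low, high) starting range for a fingerprint kind."""
    try:
        return RECOMMENDED_THRESHOLDS[kind.lower()]
    except KeyError:
        raise ValueError(f"no recommendation recorded for kind {kind!r}") from None


low, high = suggested_threshold_range("rdkit")
print(f"try thresholds between {low} and {high}")
```

Any value picked from such a table would still need the tuning described in the guides mentioned above.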

Installation

BitBIRCH-Lean requires Python 3.11 or higher and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and the C++ extensions are best optimized for Linux and macOS. We support Windows on a best-effort basis, so some releases may not include Windows support.
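
If you want to verify an environment before installing or running anything, a minimal sketch like the following may help. It checks only the Python version requirement stated above and whether a module named `bblean` can be found; the function names are illustrative and it does not inspect the C++ extensions.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info) -> bool:
    """True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def bblean_is_installed() -> bool:
    """True if a top-level module named 'bblean' can be located."""
    return importlib.util.find_spec("bblean") is not None


print("Python version OK:", python_is_supported())
print("bblean importable:", bblean_is_installed())
```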

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further setup is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. It can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend you consult the corresponding documentation for info on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
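
The three steps above can also be driven from a script. The following is a minimal sketch that only assembles the commands already shown in this section; the function names are illustrative, and the file and directory names are the example values from the text (the unique identifiers differ on every run, so in practice you would look them up after each step). By default it only prints the commands.

```python
import subprocess


def build_workflow(smiles_file, fps_file, run_dir, top=20):
    """Return the three CLI invocations described above, in order."""
    return [
        ["bb", "fps-from-smiles", smiles_file],
        ["bb", "run", fps_file],
        ["bb", "plot-summary", run_dir, "--top", str(top)],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if a step fails


commands = build_workflow(
    "examples/chembl-33-natural-products-sample.smi",
    "./packed-fps-uint8-508e53ef.npy",  # example name; the identifier varies
    "bb_run_outputs/504e40ef/",  # example name; the identifier varies
)
run_workflow(commands)  # dry run: prints the commands without executing them
```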

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file handle is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
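
A minimal sketch of that lookup is shown below. It assumes the *.smi file has no header, holds one record per line with the SMILES string as the first whitespace-separated field, and that every line produced a fingerprint (so positions line up). The helpers `read_smiles` and `clusters_to_smiles` are illustrative stand-ins, not bblean functions, and the toy data replaces the real clusters.pkl contents.

```python
def read_smiles(path):
    """Return the first whitespace-separated field of every non-empty line."""
    smiles = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                smiles.append(fields[0])
    return smiles


def clusters_to_smiles(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[i] for i in cluster] for cluster in clusters]


# Toy data standing in for the original *.smi records and clusters.pkl
toy_smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
toy_clusters = [[0, 1], [2], [3]]
for members in clusters_to_smiles(toy_clusters, toy_smiles):
    print(members)
```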

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)
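
Before plotting, it can be useful to get a quick numeric overview of the clustering. The sketch below assumes `clusters` is a list of lists of integer molecule indices, as in the clusters.pkl example earlier; `summarize_clusters` is a hypothetical helper and not part of bblean.

```python
def summarize_clusters(clusters, top=5):
    """Basic size statistics for a list of clusters (lists of indices)."""
    sizes = sorted((len(c) for c in clusters), reverse=True)
    return {
        "num_clusters": len(sizes),
        "num_molecules": sum(sizes),
        "num_singletons": sum(1 for s in sizes if s == 1),
        "largest_sizes": sizes[:top],
    }


# Toy clusters standing in for the output of a real run
toy_clusters = [[0, 1, 2], [3], [4, 5], [6]]
print(summarize_clusters(toy_clusters))
```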

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
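
To make the memory argument concrete, the sketch below packs a 2048-bit fingerprint into 256 bytes using only the standard library. It illustrates the general idea of bit packing; the bit order used here (most significant bit first) is an assumption for illustration and is not confirmed to match bblean's internal layout. The helpers `pack_bits` and `unpack_bits` are not bblean functions.

```python
def pack_bits(bits):
    """Pack 0/1 values into bytes, 8 per byte, most significant bit first."""
    if len(bits) % 8:
        raise ValueError("this sketch expects a multiple of 8 bits")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand every byte back into 8 bits."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


fingerprint = [0] * 2048
for on_bit in (0, 7, 8, 2047):
    fingerprint[on_bit] = 1

packed = pack_bits(fingerprint)
print(len(fingerprint), "bits ->", len(packed), "bytes")
assert unpack_bits(packed) == fingerprint  # packing is lossless
```

Stored one bit per byte, the same fingerprint would need 2048 bytes, so packing gives an eightfold reduction.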

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we reach version 1.0.
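
The naming rules can be expressed as a one-line check on a dotted name. This is a sketch that follows the examples given in the list (`_private_function`, `bblean._private_module`), treating any component with a leading underscore as private; `is_public` is a hypothetical helper, and `bblean.analysis._private_function` is a made-up name used only for illustration.

```python
def is_public(qualified_name: str) -> bool:
    """True if no component of a dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in qualified_name.split("."))


for name in (
    "bblean.BitBirch",
    "bblean._private_module.private_function",
    "bblean.analysis._private_function",  # made-up name, for illustration only
):
    print(name, "->", "public" if is_public(name) else "private")
```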

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
