BitBirch-Lean Python package

DOI License: GPL v3 CI Code style: black Code coverage

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBirch-Lean is currently beta software; expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
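As a small illustration of the recommendation above, the following sketch records the quoted ranges in a lookup table and checks a candidate threshold against them. The names `RECOMMENDED_THRESHOLDS` and `check_threshold` are hypothetical helpers written for this example; they are not part of bblean, and the ranges are only starting points that may need tuning for your library.

```python
# Starting ranges quoted in the note above. This table and the helper below are
# hypothetical conveniences for illustration and are not part of bblean.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def check_threshold(kind: str, threshold: float) -> bool:
    """Return True if `threshold` lies inside the recommended range for `kind`."""
    try:
        low, high = RECOMMENDED_THRESHOLDS[kind]
    except KeyError:
        raise ValueError(f"No recommendation recorded for kind {kind!r}") from None
    return low <= threshold <= high


if __name__ == "__main__":
    for kind, value in [("rdkit", 0.65), ("ecfp4", 0.65)]:
        status = "inside" if check_threshold(kind, value) else "outside"
        print(f"{kind}: threshold={value} is {status} the recommended range")
```

A check like this can be handy in scripts that switch between fingerprint kinds, where it is easy to forget to change the threshold as well.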

Installation

BitBIRCH-Lean requires Python 3.11 or higher, and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are most optimized for Linux / macOS. We support Windows on a best-effort basis; some releases may not have Windows support.
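If you are unsure whether the active interpreter meets the Python 3.11 requirement, a quick check along these lines can be run inside the conda environment or venv before installing. The helper name `python_is_supported` is hypothetical and is only meant as a sketch.

```python
import sys

MINIMUM_PYTHON = (3, 11)  # minimum version stated in the installation notes


def python_is_supported(version_info=None) -> bool:
    """Return True if the (major, minor, ...) tuple is at least 3.11."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if python_is_supported() else "too old for bblean"
    print(f"Python {current}: {verdict}")
```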

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they will be automatically used each time BitBirch-Lean or its classes are used. No need to do anything else.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI interface, bb. The CLI can be used to convert SMILES files into compact fingerprint arrays, and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags can be set to control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. See bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend you consult the corresponding documentation for info on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
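The three steps above can also be driven from a Python script. The sketch below only assembles the commands already shown (bb fps-from-smiles, bb run, and bb plot-summary with --top) and runs them with subprocess. The function names `workflow_commands` and `run_workflow` are hypothetical, and the fingerprint file and run directory must be supplied by you, because their unique identifiers are generated at run time.

```python
import shutil
import subprocess


def workflow_commands(smiles_path, fps_path, run_dir, top=20):
    """Build the argv lists for the three CLI steps described above."""
    return [
        ["bb", "fps-from-smiles", str(smiles_path)],
        ["bb", "run", str(fps_path)],
        ["bb", "plot-summary", str(run_dir), "--top", str(top)],
    ]


def run_workflow(commands):
    """Run each command in order, stopping at the first failure."""
    if shutil.which("bb") is None:
        raise RuntimeError("The 'bb' executable was not found on PATH")
    for argv in commands:
        # check=True raises CalledProcessError if a step exits with an error
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder paths copied from the examples above; replace the identifiers
    # with the ones printed by your own runs.
    commands = workflow_commands(
        "examples/chembl-33-natural-products-sample.smi",
        "./packed-fps-uint8-508e53ef.npy",
        "bb_run_outputs/504e40ef/",
    )
    for argv in commands:
        print(" ".join(argv))
```

Passing argument lists rather than a single shell string avoids quoting problems when paths contain spaces. In practice you would likely run each step separately, since the output names of one step are only known after it finishes.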

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file handle is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
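A minimal sketch of that lookup is shown below, using toy data in place of clusters.pkl and the original *.smi file. It assumes one SMILES per line with no header row, and that no molecule was skipped during fingerprint generation, so that line order matches fingerprint order. `read_smiles` and `clusters_to_smiles` are hypothetical helpers, not bblean functions.

```python
def read_smiles(path):
    """Read one SMILES per line, keeping only the first whitespace-separated token."""
    with open(path) as handle:
        return [line.split()[0] for line in handle if line.strip()]


def clusters_to_smiles(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[index] for index in cluster] for cluster in clusters]


# Toy data standing in for clusters.pkl and the original *.smi file
toy_smiles = ["CCO", "CCN", "c1ccccc1", "CCCO"]
toy_clusters = [[0, 3], [1], [2]]

for cluster_id, members in enumerate(clusters_to_smiles(toy_clusters, toy_smiles)):
    print(cluster_id, len(members), members)
```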

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)
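Once the clusters are saved, they can be reloaded later for a quick sanity check. The sketch below round-trips a toy cluster list through pickle and summarizes the cluster sizes. It assumes the clusters are a list of lists of integer molecule indices, as in the output shown earlier; `cluster_size_summary` is a hypothetical helper and is not part of bblean.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def cluster_size_summary(clusters):
    """Return the cluster count, molecule count, singleton count and largest size."""
    sizes = [len(cluster) for cluster in clusters]
    return {
        "num_clusters": len(sizes),
        "num_molecules": sum(sizes),
        "num_singletons": Counter(sizes)[1],
        "largest": max(sizes, default=0),
    }


# Toy clusters standing in for the output of tree.get_cluster_mol_ids()
toy_clusters = [[0, 4, 5], [1], [2, 3], [6]]

# Save and reload through a temporary directory, mirroring the pickle calls above
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clusters.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_clusters, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

summary = cluster_size_summary(reloaded)
print(summary)
```

A large share of singletons can be a hint that the threshold is too strict for the chosen fingerprint kind, although the right balance depends on the library.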

Public Python API and Documentation

By default all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can toggle to False in case for some reason you want to pass unpacked fingerprints (not recommended).
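To make the difference between packed and unpacked arrays concrete, the sketch below uses plain NumPy (`np.packbits` and `np.unpackbits`) on random toy fingerprints. This is a generic illustration of bit packing only; it is an assumption that bblean's own packing follows the same layout, so prefer the library's helpers for real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four toy binary fingerprints with 2048 features, one bit per uint8 element
unpacked = rng.integers(0, 2, size=(4, 2048), dtype=np.uint8)

# Pack 8 bits into each byte along the feature axis: 2048 bits -> 256 bytes
packed = np.packbits(unpacked, axis=-1)

# Unpacking recovers the original bit matrix
restored = np.unpackbits(packed, axis=-1)

print(unpacked.shape, packed.shape)
print(unpacked.nbytes, packed.nbytes)
print(np.array_equal(unpacked, restored))
```

The packed array needs one eighth of the memory of the unpacked uint8 array, which is why packed input is the default.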

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
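The naming rules above can be expressed as a short check on a dotted name. This sketch takes the leading underscore seen in the examples `_private_function` and `bblean._private_module` as the marker of a private name. `is_public` is a hypothetical helper, and `bblean.analysis._private_function` is a made-up name used only for illustration.

```python
def is_public(dotted_name: str) -> bool:
    """Return True if no component of the dotted path starts with an underscore."""
    return not any(part.startswith("_") for part in dotted_name.split("."))


for name in [
    "bblean.BitBirch",
    "bblean._private_module.private_function",
    "bblean.analysis._private_function",  # made-up name, for illustration only
]:
    print(name, "public" if is_public(name) else "private")
```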

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated, but if you have an idea to improve the BitBIRCH algorithm you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
