BitBirch-Lean Python package

DOI License: GPL v3 CI Code style: black Code coverage

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software; expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and to 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
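
The recommended ranges above can be captured in a small lookup helper. This is only a sketch: RECOMMENDED_THRESHOLDS and suggest_threshold are hypothetical names that are not part of bblean, and the midpoint it returns is merely a starting value for further tuning.

```python
# Hypothetical helper, not part of bblean: encodes the threshold ranges
# recommended above for each fingerprint kind.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def suggest_threshold(kind: str) -> float:
    """Return the midpoint of the recommended range as a starting point."""
    try:
        low, high = RECOMMENDED_THRESHOLDS[kind.lower()]
    except KeyError:
        raise ValueError(f"No recommendation recorded for kind {kind!r}") from None
    return round((low + high) / 2, 3)


print(suggest_threshold("rdkit"))
print(suggest_threshold("ecfp4"))
```

The returned value could then be passed to the CLI through --threshold or to the threshold argument of the Python API.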

Installation

BitBIRCH-Lean requires Python 3.11 or higher and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are most optimized for Linux / macOS. We support Windows on a best-effort basis; some releases may not have Windows support.

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they will be automatically used each time BitBirch-Lean or its classes are used. No need to do anything else.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient command-line interface, bb. The CLI can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints.
  • bb plot-summary or bb plot-tsne: Analyze the clusters.

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags can be set to control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization, try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a fast, parallel implementation. We recommend consulting the openTSNE documentation for the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
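
The three steps above can be scripted. The sketch below only builds the command lines, using the flags mentioned in this section; workflow_commands is a hypothetical helper, and it assumes that --out-dir sets the directory later passed as <output-path>, which you should confirm with bb run --help.

```python
import shlex


def workflow_commands(smiles_file, fps_dir, run_dir, threshold=0.3, top=20):
    """Build (but do not run) the three CLI calls of the workflow above."""
    return [
        # 1. Fingerprints are written into a dedicated directory
        ["bb", "fps-from-smiles", smiles_file, "--out-dir", fps_dir],
        # 2. bb run accepts a directory containing *.npy files
        ["bb", "run", fps_dir, "--threshold", str(threshold), "--out-dir", run_dir],
        # 3. Assumption: run_dir is the <output-path> expected by plot-summary
        ["bb", "plot-summary", run_dir, "--top", str(top), "--smiles", smiles_file],
    ]


for cmd in workflow_commands(
    "examples/chembl-33-natural-products-sample.smi", "./fps", "./run"
):
    print(shlex.join(cmd))
```

To execute the steps, each list can be passed to subprocess.run(cmd, check=True), which stops the pipeline at the first failing command.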

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Load the list of clusters; each cluster is a list of molecule indices
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
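
As a minimal sketch of that link, the snippet below replaces each index with the SMILES string at the same position. It uses toy data in place of clusters.pkl and a *.smi file, clusters_to_smiles is a hypothetical helper, and it assumes the SMILES list is in exactly the order used to generate the fingerprints, with no molecules skipped.

```python
def clusters_to_smiles(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[i] for i in cluster] for cluster in clusters]


# Toy data standing in for clusters.pkl and the lines of a *.smi file
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O"]
clusters = [[2, 3], [0, 1]]

for members in clusters_to_smiles(clusters, smiles):
    print(len(members), members)
# 2 ['c1ccccc1', 'c1ccccc1O']
# 2 ['CCO', 'CCN']
```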

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
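
A rough size estimate illustrates why packed input is the default. The sketch assumes that packing stores eight fingerprint bits per uint8 byte and that an unpacked array uses one byte per bit; fingerprint_bytes is a hypothetical helper, not a bblean function, and the library size is illustrative.

```python
def fingerprint_bytes(n_molecules: int, n_features: int = 2048, packed: bool = True) -> int:
    """Estimate the raw array size in bytes, ignoring file headers."""
    # Packed: 8 bits share one uint8 byte (rounded up if not a multiple of 8).
    # Unpacked: assume one uint8 byte per bit.
    bytes_per_fp = -(-n_features // 8) if packed else n_features
    return n_molecules * bytes_per_fp


n = 10_000_000  # illustrative library size
print(f"packed:   {fingerprint_bytes(n) / 1e9:.2f} GB")
print(f"unpacked: {fingerprint_bytes(n, packed=False) / 1e9:.2f} GB")
# packed:   2.56 GB
# unpacked: 20.48 GB
```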

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
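
The naming convention can be checked mechanically. The sketch below follows the _private_function and bblean._private_module examples above, which use a leading underscore; is_public is a hypothetical helper, not part of bblean.

```python
def is_public(qualified_name: str) -> bool:
    """Return True if no component of a dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in qualified_name.split("."))


print(is_public("bblean.BitBirch"))  # True
print(is_public("bblean._private_module.private_function"))  # False
print(is_public("bblean.analysis._private_function"))  # False
```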

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea or request (or open a PR directly from a fork if you prefer).

Currently we do not directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
