BitBirch-Lean Python package

License: GPL v3 | Code style: black

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software. Expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters see the best practices and parameter tuning guides.
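To build some intuition for what a similarity threshold means, here is a minimal sketch of Tanimoto similarity between two packed binary fingerprints. It assumes Tanimoto is the relevant measure (BitBIRCH is described in its publications as Tanimoto-based); the helper name tanimoto is hypothetical and is not part of bblean. The tree applies its merge criterion to cluster statistics, not to individual pairs, so treat this only as background.

```python
def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity of two packed bit vectors (hypothetical helper, not bblean API)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # Count bits set in both fingerprints and bits set in either one
    common = sum(bin(x & y).count("1") for x, y in zip(a, b))
    either = sum(bin(x | y).count("1") for x, y in zip(a, b))
    # Convention chosen for this sketch: two empty fingerprints score 0.0
    return common / either if either else 0.0


fp_a = bytes([0b11110000])
fp_b = bytes([0b11000000])
similarity = tanimoto(fp_a, fp_b)  # 2 shared bits / 4 bits set in either = 0.5
print(similarity)
print(similarity >= 0.3, similarity >= 0.65)
```

A pair like this one would count as similar under a threshold of 0.3 but not under 0.65, which is why a higher threshold tends to give more, tighter clusters.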

Installation

BitBIRCH-Lean requires Python 3.11 or higher, and can be installed on Windows, Linux or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are best optimized for Linux and macOS. We support Windows on a best-effort basis, and some releases may not have Windows support.
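If you want to confirm the requirements above from Python, the following sketch checks the interpreter version, whether a venv is active, and whether the package is installed. It uses only the standard library; the function name check_environment is hypothetical. Note that the venv test does not detect conda environments.

```python
import sys
from importlib import metadata


def check_environment(package="bblean", min_python=(3, 11)):
    """Report basic installation facts (illustrative helper, not part of bblean)."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "python_ok": sys.version_info >= min_python,
        # True inside a venv; conda environments are NOT detected by this test
        "in_venv": sys.prefix != sys.base_prefix,
        "package_version": version,
    }


print(check_environment())
```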

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further setup is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. The CLI can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a fast, parallel implementation. We recommend consulting the openTSNE documentation for the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
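If you prefer to drive the first two steps of this workflow from a Python script, a minimal sketch could look like the following. The helper names bb_command and run_steps are hypothetical, and the sketch assumes each flag takes its value as a separate argument (for example --threshold 0.3); check bb <command> --help before relying on it. The directory names fps and clusters are placeholders.

```python
import shlex
import subprocess


def bb_command(subcommand, *args, **flags):
    """Build an argument list for the bb CLI (illustrative helper, not bblean API)."""
    cmd = ["bb", subcommand, *map(str, args)]
    for name, value in flags.items():
        if value is None:
            continue  # skip flags that were left unset
        # out_dir=... becomes "--out-dir ..." (assumed flag syntax)
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


steps = [
    bb_command("fps-from-smiles", "examples/chembl-33-natural-products-sample.smi", out_dir="fps"),
    bb_command("run", "fps", threshold=0.3, branching=50, out_dir="clusters"),
]


def run_steps(steps, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in steps:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_steps(steps)  # dry run: only prints the commands
```

Keeping dry_run=True by default lets you inspect the exact command lines before anything is executed.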

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
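As a sketch of how the indices can be linked back to SMILES records, the example below uses toy data in place of a real clusters.pkl and *.smi file. It assumes one record per line with the SMILES string as the first whitespace-separated token, and that every line produced exactly one fingerprint; verify both for your own data. All helper names are hypothetical.

```python
def read_smiles(lines):
    """Return the first whitespace-separated token of every non-empty line."""
    return [line.split()[0] for line in lines if line.strip()]


def clusters_to_smiles(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[i] for i in cluster] for cluster in clusters]


def size_summary(clusters):
    """Basic size statistics for a list of clusters."""
    sizes = sorted((len(c) for c in clusters), reverse=True)
    return {
        "num_clusters": len(sizes),
        "largest": sizes[0] if sizes else 0,
        "singletons": sum(size == 1 for size in sizes),
    }


# Toy stand-ins for the contents of a *.smi file and of clusters.pkl
smi_lines = ["CCO ethanol", "CCN ethylamine", "c1ccccc1 benzene", "CCC propane"]
clusters = [[0, 1, 3], [2]]

smiles = read_smiles(smi_lines)
print(clusters_to_smiles(clusters, smiles))
print(size_summary(clusters))
```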

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions, consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook is adapted from a notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)

Public Python API and Documentation

By default all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
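To illustrate why packed fingerprints are preferred, here is a pure-Python sketch of bit packing: eight 0/1 features are stored in a single uint8 value, so a 2048-bit fingerprint takes 256 bytes. It assumes a most-significant-bit-first layout, as in NumPy's default packbits behavior; whether bblean uses exactly this layout is an assumption, and the helper names are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 per byte, most significant bit first."""
    if len(bits) % 8:
        raise ValueError("number of bits must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (bit & 1)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand every byte into its 8 bits."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


fingerprint = [1, 0, 0, 0, 0, 0, 0, 1] + [0] * 2040  # 2048 features
packed = pack_bits(fingerprint)
print(len(fingerprint), "features ->", len(packed), "bytes")
```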

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
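The naming convention can be expressed as a short check. This sketch assumes the convention refers to a leading underscore, as in the examples _private_function and bblean._private_module; the function name is_public is hypothetical and not part of bblean.

```python
def is_public(dotted_name: str) -> bool:
    """True if no component of a dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in dotted_name.split("."))


for name in (
    "bblean.BitBirch",
    "bblean._private_module.private_function",
    "bblean.analysis._private_function",
):
    print(name, "->", "public" if is_public(name) else "private")
```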

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation, please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
