BitBirch-Lean Python package

DOI License: GPL v3 CI Code style: black Code coverage

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBirch-Lean is currently beta software; expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters see the best practices and parameter tuning guides.
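
As a rough aid, the recommended ranges above can be written down as a small lookup and used to sanity-check a threshold before starting a long run. This is only a sketch: RECOMMENDED_THRESHOLDS and in_recommended_range are hypothetical names that are not part of bblean, and the ranges are starting points rather than hard limits.

```python
# Recommended threshold ranges quoted in the note above, keyed by fingerprint kind.
# (Hypothetical helper, not part of the bblean API.)
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def in_recommended_range(kind: str, threshold: float) -> bool:
    """Return True if `threshold` lies inside the recommended range for `kind`."""
    low, high = RECOMMENDED_THRESHOLDS[kind]
    return low <= threshold <= high


# The default combination (threshold 0.3 with ecfp4) is inside its range,
# while the same threshold with rdkit fingerprints is not.
print(in_recommended_range("ecfp4", 0.3))
print(in_recommended_range("rdkit", 0.3))
```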

Installation

BitBIRCH-Lean requires Python 3.11 or higher, and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.
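
If you are unsure whether your interpreter meets these requirements, a short standard-library check along the following lines may help. It is a sketch, not something bblean ships: check_environment is a hypothetical name, and the environment detection (a changed sys.prefix for venvs, the CONDA_PREFIX variable for conda) is a common heuristic rather than a guarantee.

```python
import importlib.util
import os
import sys


def check_environment() -> dict[str, bool]:
    """Report whether the current interpreter matches the installation notes."""
    return {
        # BitBIRCH-Lean requires Python 3.11 or higher.
        "python_ok": sys.version_info >= (3, 11),
        # A venv changes sys.prefix; conda sets CONDA_PREFIX when an env is active.
        "isolated_env": sys.prefix != sys.base_prefix or "CONDA_PREFIX" in os.environ,
        # find_spec returns None when a top-level package is not installed.
        "bblean_installed": importlib.util.find_spec("bblean") is not None,
    }


for check, passed in check_environment().items():
    print(f"{check}: {'yes' if passed else 'no'}")
```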

Memory usage and C++ extensions are most optimized for Linux / macOS. We support Windows on a best-effort basis; some releases may not have Windows support.

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they will be automatically used each time BitBirch-Lean or its classes are used. No need to do anything else.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. The CLI can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags can be set to control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. See bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend you consult the corresponding documentation for info on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
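
When the workflow above is driven from a script, it can be convenient to assemble the bb commands as argument lists first. The sketch below only uses subcommands and flags mentioned in the steps above; fps_from_smiles_cmd and run_cmd are hypothetical helper names, and the exact flag syntax should be confirmed with bb <command> --help.

```python
import shlex


def fps_from_smiles_cmd(smi_path, out_dir=None, name=None, num_parts=None):
    """Build the argument list for `bb fps-from-smiles` (step 1)."""
    cmd = ["bb", "fps-from-smiles", str(smi_path)]
    if out_dir is not None:
        cmd += ["--out-dir", str(out_dir)]
    if name is not None:
        cmd += ["--name", str(name)]
    if num_parts is not None:
        cmd += ["--num-parts", str(num_parts)]
    return cmd


def run_cmd(fps_path, threshold=None, branching=None, refine_num=None, out_dir=None):
    """Build the argument list for serial clustering with `bb run` (step 2)."""
    cmd = ["bb", "run", str(fps_path)]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if branching is not None:
        cmd += ["--branching", str(branching)]
    if refine_num is not None:
        cmd += ["--refine-num", str(refine_num)]
    if out_dir is not None:
        cmd += ["--out-dir", str(out_dir)]
    return cmd


# Print the commands instead of running them, so they can be inspected first.
print(shlex.join(fps_from_smiles_cmd("examples/chembl-33-natural-products-sample.smi")))
print(shlex.join(run_cmd("./packed-fps-uint8-508e53ef.npy", threshold=0.3, refine_num=1)))
```

To execute a command, pass the list to subprocess.run(cmd, check=True); keeping the arguments as a list avoids shell quoting problems with unusual paths.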

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Use a context manager so the file handle is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
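
Linking clusters back to SMILES could look roughly like the sketch below. It uses toy data in a temporary directory instead of a real run, and it assumes a *.smi file with one record per line, the SMILES as the first whitespace-separated token, no header line, and no molecules dropped between the file and the fingerprints. read_smiles and clusters_to_smiles are hypothetical helpers, not bblean functions.

```python
import pickle
import tempfile
from pathlib import Path


def read_smiles(path):
    """Read one SMILES per line (first whitespace-separated token), skipping blanks."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]


def clusters_to_smiles(clusters, smiles):
    """Replace each molecule index with the SMILES at that position."""
    return [[smiles[i] for i in members] for members in clusters]


# Toy stand-ins for a real *.smi file and a run directory's clusters.pkl.
with tempfile.TemporaryDirectory() as tmp:
    smi_path = Path(tmp) / "toy.smi"
    smi_path.write_text("CCO ethanol\nCCN ethylamine\nc1ccccc1 benzene\nCCCO propanol\n")
    pkl_path = Path(tmp) / "clusters.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump([[0, 3], [1], [2]], f)

    with open(pkl_path, "rb") as f:
        clusters = pickle.load(f)
    smiles = read_smiles(smi_path)

grouped = clusters_to_smiles(clusters, smiles)
print(grouped[0])
```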

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)
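
Before plotting, a quick look at the cluster size distribution is often useful. The following sketch assumes that clusters is a list of lists of molecule indices, as in the clusters.pkl example earlier; summarize_clusters is a hypothetical helper, and toy data stands in for a real clustering result.

```python
from collections import Counter


def summarize_clusters(clusters):
    """Compute simple size statistics for a list of clusters (lists of indices)."""
    sizes = sorted((len(c) for c in clusters), reverse=True)
    return {
        "num_clusters": len(sizes),
        "num_molecules": sum(sizes),
        "largest": sizes[:5],  # sizes of the (up to) five largest clusters
        "num_singletons": sum(1 for s in sizes if s == 1),
        "size_histogram": dict(Counter(sizes)),  # cluster size -> number of clusters
    }


# Toy data in place of the output of a real clustering run.
toy_clusters = [[0, 1, 2], [3], [4, 5], [6]]
summary = summarize_clusters(toy_clusters)
for key, value in summary.items():
    print(f"{key}: {value}")
```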

Public Python API and Documentation

By default all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can toggle to False in case for some reason you want to pass unpacked fingerprints (not recommended).
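
To see why packed fingerprints are preferred, the sketch below packs a small random bit matrix with NumPy and compares memory use. It assumes that "packed" means eight fingerprint bits per uint8 byte, as produced by np.packbits along the feature axis; whether bblean uses the same bit order internally is not confirmed here, so prefer bblean's own functions for real data.

```python
import numpy as np

# Four toy fingerprints with 2048 bits each, stored one bit per uint8 element.
rng = np.random.default_rng(0)
unpacked = rng.integers(0, 2, size=(4, 2048), dtype=np.uint8)

# Pack eight bits into each byte along the feature axis.
packed = np.packbits(unpacked, axis=1)

print("unpacked:", unpacked.shape, unpacked.nbytes, "bytes")
print("packed:  ", packed.shape, packed.nbytes, "bytes")

# Packing is lossless: unpacking restores the original bits.
restored = np.unpackbits(packed, axis=1)
print("round trip ok:", np.array_equal(restored, unpacked))
```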

  • Functions and classes that start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes that are in modules that start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
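
The naming convention can be checked mechanically. The sketch below follows the examples in the list above (_private_function and bblean._private_module.private_function) and treats any dotted-path component with a leading underscore as private; is_private is a hypothetical helper, not part of bblean.

```python
def is_private(qualified_name: str) -> bool:
    """Return True if any component of a dotted name starts with an underscore."""
    return any(part.startswith("_") for part in qualified_name.split("."))


for name in (
    "bblean.BitBirch",
    "bblean.analysis.cluster_analysis",
    "_private_function",
    "bblean._private_module.private_function",
):
    print(f"{name}: {'private' if is_private(name) else 'public'}")
```

Note that this simple rule also flags double-underscore names such as __init__, which may or may not be what you want.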

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute to BitBIRCH-Lean with a bug fix or with improvements to the documentation, usability, maintainability, or performance, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
