BitBirch-Lean Python package

License: GPL v3

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software; expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
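The recommended ranges above can be captured in a small lookup so that scripts do not silently fall back to the 0.3 default. This is only a sketch: `RECOMMENDED_THRESHOLDS` and `suggest_threshold` are hypothetical helpers, not part of bblean, and taking the midpoint of each range is an arbitrary starting point that you will likely need to tune.

```python
# Hypothetical helper, NOT part of bblean: encodes the ranges recommended above.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def suggest_threshold(kind: str) -> float:
    """Return the midpoint of the recommended threshold range for a fingerprint kind."""
    try:
        low, high = RECOMMENDED_THRESHOLDS[kind]
    except KeyError:
        raise ValueError(f"No recommendation for fingerprint kind {kind!r}") from None
    # Midpoint is an arbitrary starting value, chosen only for illustration
    return round((low + high) / 2, 3)


for kind in RECOMMENDED_THRESHOLDS:
    print(kind, suggest_threshold(kind))
```

The returned value can then be passed wherever a threshold is accepted, for example the `--threshold` flag of `bb run` or the `threshold` argument of `bblean.BitBirch`.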

Installation

BitBIRCH-Lean requires Python 3.11 or higher and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and the C++ extensions are best optimized for Linux and macOS. We support Windows on a best-effort basis, so some releases may not have Windows support.
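If an installation misbehaves, a quick check of the interpreter version and of whether the package is visible in the active environment can save time. The sketch below uses only the standard library; `python_is_supported` and `package_is_installed` are hypothetical helper names, and the check says nothing about whether the C++ extensions were built.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info) -> bool:
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_is_installed(name: str = "bblean") -> bool:
    """True if the named top-level package can be found (without importing it)."""
    return importlib.util.find_spec(name) is not None


print("Python OK:", python_is_supported())
print("bblean installed:", package_is_installed())
```

Running this inside the conda environment or venv you installed into confirms that you are using the interpreter you expect.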

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further setup is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient CLI, bb. The CLI can convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints.
  • bb plot-summary or bb plot-tsne: Analyze the clusters.

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The naming convention is packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed at a directory with multiple *.npy files, the files are clustered in parallel and the sub-trees are merged iteratively in intermediate rounds. For more information, run bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization, try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a fast, parallel implementation. We recommend you consult its documentation for info on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
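The --top 20 option restricts the plots to the largest clusters. If you want the same kind of selection in your own scripts, a minimal sketch follows. It assumes clusters are plain lists of molecule indices; `largest_clusters` and `coverage` are hypothetical helpers and do not reflect how bb implements --top internally.

```python
def largest_clusters(clusters, top=20):
    """Return the `top` largest clusters, biggest first (ties keep input order)."""
    return sorted(clusters, key=len, reverse=True)[:top]


def coverage(selected, clusters):
    """Fraction of all clustered molecules that fall in the selected clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) for c in selected) / total if total else 0.0


# Toy clusters of molecule indices, standing in for real clustering output
clusters = [[0, 1], [2, 3, 4, 5], [6], [7, 8, 9]]
top2 = largest_clusters(clusters, top=2)
print(top2)
# [[2, 3, 4, 5], [7, 8, 9]]
print(f"{coverage(top2, clusters):.0%} of molecules covered")
# 70% of molecules covered
```

Checking the coverage of the top clusters is a quick way to judge whether a summary plot represents most of the library or only a small part of it.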

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Load the per-cluster molecule indices; the context manager closes the file
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
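Linking indices back to SMILES can be sketched as a simple lookup. The example assumes that the SMILES list is in the same order as the fingerprints and that no molecule was dropped during fingerprint generation (if any were, the indices would no longer line up). `smiles_by_cluster` is a hypothetical helper, and the toy data stands in for a real clusters.pkl and *.smi file.

```python
def smiles_by_cluster(clusters, smiles):
    """Replace each molecule index with the SMILES string at that position."""
    return [[smiles[i] for i in cluster] for cluster in clusters]


# Toy data: four SMILES records and two clusters of indices into that list
smiles = ["CCO", "CCN", "c1ccccc1", "CCCl"]
clusters = [[0, 1, 3], [2]]

for members in smiles_by_cluster(clusters, smiles):
    print(len(members), members)
# 3 ['CCO', 'CCN', 'CCCl']
# 1 ['c1ccccc1']
```

An out-of-range index raises IndexError, which is a useful early signal that the SMILES file and the fingerprint files do not correspond.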

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
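To see what packing means in practice, the sketch below packs a binary fingerprint into bytes using only the standard library. It is an illustration, not bblean's implementation: `pack_bits` and `unpack_bits` are hypothetical helpers, and the big-endian bit order (the default used by `numpy.packbits`) is an assumption about the layout.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 bits per byte (big-endian)."""
    if len(bits) % 8:
        raise ValueError("number of bits must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte back into eight 0/1 values."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


# A 2048-bit fingerprint with a handful of "on" bits
fingerprint = [0] * 2048
for on_bit in (0, 7, 8, 2047):
    fingerprint[on_bit] = 1

packed = pack_bits(fingerprint)
print(len(fingerprint), "values ->", len(packed), "bytes")
# 2048 values -> 256 bytes
```

If an unpacked fingerprint stores one uint8 per bit, a 2048-bit fingerprint takes 2048 bytes versus 256 bytes packed, an 8x reduction. For one million molecules that is roughly 2 GB versus 256 MB.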

  • Functions and classes that start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules that start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
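The naming convention can be checked mechanically. The sketch below follows the examples given in the list (`_private_function`, `bblean._private_module.private_function`) and treats any dotted name with a component that begins with an underscore as private. `is_public` is a hypothetical helper for illustration, not a bblean function.

```python
def is_public(qualified_name: str) -> bool:
    """True if no component of a dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in qualified_name.split("."))


for name in (
    "bblean.BitBirch",
    "bblean._private_module.private_function",
    "bblean.analysis._private_function",
):
    print(f"{name}: {'public' if is_public(name) else 'private'}")
# bblean.BitBirch: public
# bblean._private_module.private_function: private
# bblean.analysis._private_function: private
```

A check like this can be useful in a linter or CI step to flag accidental imports of private names in downstream code.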

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
