
BitBirch-Lean Python package



License: GPL v3

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software. Expect minor breaking changes until we hit version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
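The recommended ranges above can be kept in a small lookup table so that scripts pick a sensible starting threshold per fingerprint kind. This is only a sketch: RECOMMENDED_THRESHOLDS and suggest_threshold are hypothetical names that are not part of bblean, and using the midpoint of a range as a first guess is an assumption, not a recommendation from the authors.

```python
# Ranges quoted in the note above. The table and helper are illustrative
# assumptions, NOT part of the bblean API.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def suggest_threshold(kind: str) -> float:
    """Return the midpoint of the recommended range as a starting point."""
    try:
        low, high = RECOMMENDED_THRESHOLDS[kind]
    except KeyError:
        raise ValueError(f"no recommendation for fingerprint kind {kind!r}") from None
    return round((low + high) / 2, 3)


print(suggest_threshold("rdkit"))
print(suggest_threshold("ecfp4"))
```

Treat the returned value as a first guess only, since the best threshold still depends on the library and fingerprint set being clustered.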

Installation

BitBIRCH-Lean requires Python 3.11 or higher and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
# Alternatively you can use 'uv pip install'
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and the C++ extensions are best optimized for Linux and macOS. We support Windows on a best-effort basis, so some releases may not have Windows support.

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further action is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient command-line interface, bb. The CLI can be used to convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints.
  • bb plot-summary or bb plot-tsne: Analyze the clusters.

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). Output files are named like packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. Run bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization, try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a fast, parallel implementation. We recommend you consult the openTSNE documentation for information on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
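To see why step 1 recommends keeping the packed uint8 format, it helps to estimate the array size for a given library. The sketch below is plain arithmetic and does not call bblean. It assumes 2048-bit fingerprints (the value used in the Python quickstart) and that an unpacked array stores one bit per uint8 element. fingerprint_bytes is a hypothetical helper name.

```python
def fingerprint_bytes(n_molecules: int, n_features: int = 2048, packed: bool = True) -> int:
    """Bytes needed to hold the fingerprints in a uint8 array.

    Assumption: unpacked arrays use one byte per bit, while packed arrays
    store eight bits per byte (rounded up to a whole byte).
    """
    bytes_per_fp = (n_features + 7) // 8 if packed else n_features
    return n_molecules * bytes_per_fp


n = 1_000_000  # a hypothetical library of one million molecules
print(fingerprint_bytes(n, packed=False) / 1e9, "GB unpacked")
print(fingerprint_bytes(n, packed=True) / 1e9, "GB packed")
```

Under these assumptions, packing reduces the footprint by a factor of eight, which is what makes libraries with many millions of molecules fit in memory.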

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Load the per-cluster molecule indices written by the run
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
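As a minimal sketch of linking cluster indices back to SMILES records, the example below builds a toy *.smi file and a toy clusters.pkl in a temporary directory and looks up the SMILES of one cluster. The helpers load_smiles_lines and smiles_for_cluster are hypothetical, not bblean functions. The sketch also assumes one SMILES per line with no header, and that no molecule was skipped during fingerprint generation, so that line order and fingerprint order coincide.

```python
import pickle
import tempfile
from pathlib import Path


def load_smiles_lines(smi_path):
    """Read the first whitespace-separated token of each non-empty line."""
    smiles = []
    for line in Path(smi_path).read_text().splitlines():
        line = line.strip()
        if line:
            smiles.append(line.split()[0])
    return smiles


def smiles_for_cluster(clusters, smiles, cluster_idx):
    """Map the molecule indices of one cluster to their SMILES strings."""
    return [smiles[i] for i in clusters[cluster_idx]]


with tempfile.TemporaryDirectory() as tmp:
    # Toy inputs standing in for the real *.smi file and clusters.pkl
    smi_path = Path(tmp) / "toy.smi"
    smi_path.write_text("CCO ethanol\nc1ccccc1 benzene\nCCN ethylamine\n")
    pkl_path = Path(tmp) / "clusters.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump([[0, 2], [1]], f)

    with open(pkl_path, "rb") as f:
        clusters = pickle.load(f)
    smiles = load_smiles_lines(smi_path)
    first_cluster = smiles_for_cluster(clusters, smiles, 0)

print(first_cluster)
```

If the fingerprints were split over several files, the same lookup applies as long as the SMILES are concatenated in the order the files were read.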

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge("tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
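To illustrate the difference between packed and unpacked fingerprints, the sketch below uses only NumPy. It assumes the packed layout is equivalent to numpy.packbits applied along the last axis, which this document does not confirm, so check the bblean documentation before mixing arrays packed this way with bblean functions. The tanimoto_packed helper is hypothetical and shows one common similarity measure for binary fingerprints computed directly on packed bytes.

```python
import numpy as np

# Two toy 16-bit fingerprints, one bit per array element (unpacked).
unpacked = np.array(
    [
        [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    ],
    dtype=np.uint8,
)

# Pack eight bits into each uint8 (assumed layout, see the note above).
packed = np.packbits(unpacked, axis=-1)
restored = np.unpackbits(packed, axis=-1)


def tanimoto_packed(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity of two packed fingerprints (hypothetical helper)."""
    common = int(np.unpackbits(a & b).sum())
    union = int(np.unpackbits(a | b).sum())
    return common / union if union else 0.0


print(unpacked.shape, "->", packed.shape)
print(tanimoto_packed(packed[0], packed[1]))
```

The packed array holds the same information in one eighth of the elements, and bitwise operations on the packed bytes avoid unpacking the full array.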

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we hit version 1.0.
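Going by the example names in the list above (_private_function and bblean._private_module.private_function), the convention can be checked mechanically on a dotted path. The is_private helper below is hypothetical and only inspects the name. It does not import bblean or consult any official list of public symbols.

```python
def is_private(dotted_name: str) -> bool:
    """Return True if any component of a dotted path starts with an underscore.

    Note: this also flags dunder names such as __init__, which may or may
    not be what you want.
    """
    return any(part.startswith("_") for part in dotted_name.split("."))


for name in (
    "bblean.BitBirch",
    "bblean._private_module.private_function",
    "_private_function",
):
    print(name, "->", "private" if is_private(name) else "public")
```

A check like this can be used in a lint step to flag accidental reliance on private names before upgrading to a new release.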

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation, please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea/request (or directly open a PR from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.
