BitBIRCH-Lean Python package

License: GPL v3. Code style: black.

Overview

BitBIRCH-Lean is a high-throughput implementation of the BitBIRCH clustering algorithm designed for very large molecular libraries.

If you find this software useful please cite the following articles:

NOTE: BitBIRCH-Lean is currently beta software. Expect minor breaking changes until we reach version 1.0.

The documentation of the developer version is a work in progress. Please let us know if you find any issues.

⚠️ Important: The default threshold is 0.3 and the default fingerprint kind is ecfp4. We recommend setting the threshold to 0.5-0.65 for rdkit fingerprints and 0.3-0.4 for ecfp4 or ecfp6 fingerprints (although you may need further tuning for your specific library / fingerprint set). For more information on tuning these parameters, see the best practices and parameter tuning guides.
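As a quick reference, the recommended ranges above can be kept in a small lookup table. This is only a sketch: `RECOMMENDED_THRESHOLDS` and `threshold_in_recommended_range` are hypothetical names that are not part of bblean, and the ranges are starting points from this README rather than hard limits.

```python
# Recommended threshold ranges quoted in this README, keyed by fingerprint kind.
# These names are illustrative only; they are not part of the bblean API.
RECOMMENDED_THRESHOLDS = {
    "rdkit": (0.5, 0.65),
    "ecfp4": (0.3, 0.4),
    "ecfp6": (0.3, 0.4),
}


def threshold_in_recommended_range(kind: str, threshold: float) -> bool:
    """Return True if `threshold` lies inside the README's range for `kind`."""
    low, high = RECOMMENDED_THRESHOLDS[kind]
    return low <= threshold <= high


# The default threshold (0.3) sits inside the ecfp4 range but below the rdkit range.
print(threshold_in_recommended_range("ecfp4", 0.3))
print(threshold_in_recommended_range("rdkit", 0.3))
```

A check like this can flag a run where the fingerprint kind was changed but the threshold was left at its default.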

Installation

BitBIRCH-Lean requires Python 3.11 or higher, and can be installed on Windows, Linux, or macOS via pip, which automatically includes the C++ extensions:

pip install bblean
bb --help

We recommend installing bblean in a conda environment or a venv.

Memory usage and C++ extensions are most optimized for Linux and macOS. We support Windows on a best-effort basis, so some releases may not have Windows support.
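If you are unsure whether an environment meets the Python 3.11 requirement stated above, a short check such as the following can be run before installing. The helper name `python_is_supported` is hypothetical and not part of bblean.

```python
import sys

# Minimum interpreter version stated in the installation notes.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print("This interpreter is older than Python 3.11; create a newer environment first.")
```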

From source

To build from source instead (editable mode):

git clone git@github.com:mqcomplab/bblean
cd bblean

conda env create --file ./environment.yaml
conda activate bblean

BITBIRCH_BUILD_CPP=1 pip install -e .

# If you want to build without the C++ extensions run this instead:
pip install -e .

bb --help

If the extensions install successfully, they are used automatically whenever BitBIRCH-Lean or its classes are used. No further action is needed.

If you run into any issues when installing the extensions, please open a GitHub issue and tag it with C++.

CLI Quickstart

BitBIRCH-Lean provides a convenient command-line interface, bb. The CLI can be used to convert SMILES files into compact fingerprint arrays and cluster them in parallel or serial mode with a single command, making it straightforward to triage collections with millions of molecules. The CLI prints a run banner with the parameters used, memory usage (when available), and elapsed timings so you can track each job at a glance.

The most important commands you need are:

  • bb fps-from-smiles: Generate fingerprints from a *.smi file.
  • bb run or bb multiround: Cluster the fingerprints
  • bb plot-summary or bb plot-tsne: Analyze the clusters

A typical workflow is as follows:

  1. Generate fingerprints from SMILES: The repository ships with a ChEMBL sample that you can use right away for testing:

    bb fps-from-smiles examples/chembl-33-natural-products-sample.smi
    

    This writes a packed fingerprint array to the current working directory (use --out-dir <dir> for a different location). The file is named like packed-fps-uint8-508e53ef.npy, where 508e53ef is a unique identifier (use --name <name> if you prefer a different name). The packed uint8 format is required for maximum memory efficiency, so keep the default --pack and --dtype values unless you have a very good reason to change them. You can optionally split the output over multiple files for parallel processing with --num-parts <num>.

  2. Cluster the fingerprints: To cluster in serial mode, point bb run at the generated array (or a directory with multiple *.npy files):

    bb run ./packed-fps-uint8-508e53ef.npy
    

    The outputs are stored in a directory such as bb_run_outputs/504e40ef/, where 504e40ef is a unique identifier (use --out-dir <dir> for a different location). Additional flags control the BitBIRCH --branching, --threshold, and merge criterion. Optionally, cluster refinement can be performed with --refine-num 1. See bb run --help for details.

    To cluster in parallel mode, use bb multiround ./file-or-dir instead. If pointed to a directory with multiple *.npy files, files will be clustered in parallel and sub-trees will be merged iteratively in intermediate rounds. For more information: bb multiround --help. Outputs are written by default to bb_multiround_outputs/<unique-id>/.

  3. Visualize the results: You can plot a summary of the largest clusters with bb plot-summary <output-path> --top 20 (largest 20 clusters). Passing the optional --smiles <path-to-file.smi> argument additionally generates a Murcko scaffold analysis. For a t-SNE visualization, try bb plot-tsne <output-path> --top 20. t-SNE plots use openTSNE as a backend, which is a parallel, extremely fast implementation. We recommend you consult the corresponding documentation for information on the available parameters. Still, expect t-SNE plots to be slow for very large datasets (more than 1M molecules).
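The three steps above can also be scripted. The sketch below only assembles the command lines, using the subcommands and flags named in this section. `build_workflow` is a hypothetical helper, and the exact form `--threshold <value>` is an assumption, so check bb run --help before relying on it. The fingerprint file and run directory names contain unique identifiers, so in practice they must be filled in after each step completes.

```python
import shlex


def build_workflow(smiles_path, fps_path, run_dir, threshold=None, top=20):
    """Assemble the three CLI invocations of the workflow as argument lists."""
    fps_cmd = ["bb", "fps-from-smiles", smiles_path]
    run_cmd = ["bb", "run", fps_path]
    if threshold is not None:
        # Assumed form of the flag; see `bb run --help` for the authoritative syntax.
        run_cmd += ["--threshold", str(threshold)]
    plot_cmd = ["bb", "plot-summary", run_dir, "--top", str(top)]
    return [fps_cmd, run_cmd, plot_cmd]


commands = build_workflow(
    "examples/chembl-33-natural-products-sample.smi",
    "./packed-fps-uint8-508e53ef.npy",
    "bb_run_outputs/504e40ef/",
    threshold=0.35,
)
for cmd in commands:
    # Print a shell-safe rendering of each command.
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the step and stop on the first failure.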

Manually exploring clustering results

Every run directory contains a raw clusters.pkl file with the molecule indices for each cluster, plus metadata in *.json files that captures the exact settings and performance characteristics. A quick Python session is all you need to get started:

import pickle

# Open the file in a context manager so it is closed after loading
with open("bb_run_outputs/504e40ef/clusters.pkl", "rb") as f:
    clusters = pickle.load(f)
clusters[:2]
# [[321, 323, 326, 328, 337, ..., 9988, 9989],
#  [5914, 5915, 5916, 5917, 5918, ..., 9990, 9991, 9992, 9993]]

The indices refer to the position of each molecule in the order they were read from the fingerprint files, making it easy to link back to your original SMILES records.
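A minimal sketch of linking cluster indices back to SMILES follows. It assumes one record per line with the SMILES string as the first whitespace-separated field, no header line, and that every line produced a fingerprint; none of this is guaranteed by the text above. `parse_smiles_lines` and `smiles_for_cluster` are hypothetical helpers, and the toy records stand in for a real *.smi file.

```python
def parse_smiles_lines(lines):
    """Return the first whitespace-separated field of each non-empty line."""
    smiles = []
    for line in lines:
        fields = line.split()
        if fields:
            smiles.append(fields[0])
    return smiles


def smiles_for_cluster(cluster, smiles):
    """Map a cluster's molecule indices to the SMILES read in the same order."""
    return [smiles[index] for index in cluster]


# Toy stand-in for the contents of a *.smi file (SMILES followed by an optional name).
toy_lines = ["CCO ethanol\n", "c1ccccc1 benzene\n", "CC(=O)O acetic-acid\n", "CCN\n"]
toy_clusters = [[0, 3], [1], [2]]

smiles = parse_smiles_lines(toy_lines)
for cluster in toy_clusters:
    print(smiles_for_cluster(cluster, smiles))
```

For a real run, pass an open file handle to `parse_smiles_lines` (or use `bblean.load_smiles`, shown in the Python quickstart) and the list loaded from clusters.pkl.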

Python Quickstart and Examples

For an example of how to use the main bblean classes and functions consult examples/bitbirch_quickstart.ipynb. The examples/dataset_splitting.ipynb notebook contains an adapted notebook by Pat Walters (Some Thoughts on Splitting Chemical Datasets). More examples will be added soon!

A quick summary:

import pickle

import matplotlib.pyplot as plt
import numpy as np

import bblean
import bblean.plotting as plotting
import bblean.analysis as analysis

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("./examples/chembl-33-natural-products-sample.smi")
fps = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

# Fit the fingerprints (by default all bblean functions take *packed* fingerprints)
# A threshold of 0.5-0.65 is good for rdkit fingerprints, a threshold of 0.3-0.4
# is better for ECFPs
tree = bblean.BitBirch(branching_factor=50, threshold=0.65, merge_criterion="diameter")
tree.fit(fps)

# Refine the tree (if needed)
tree.set_merge(merge_criterion="tolerance-diameter", tolerance=0.0)
tree.refine_inplace(fps)

# Visualize the results
clusters = tree.get_cluster_mol_ids()
ca = analysis.cluster_analysis(clusters, fps, smiles)
plotting.summary_plot(ca, title="ChEMBL Sample")
plt.show()

# Save the resulting clusters, metrics, and fps
with open("./clusters.pkl", "wb") as f:
    pickle.dump(clusters, f)
ca.dump_metrics("./metrics.csv")
np.save("./fps-packed-2048.npy", fps)
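Once the clusters are saved, a few summary numbers are often useful before plotting. The sketch below works on any list of index lists, such as the one written to clusters.pkl above. `summarize_clusters` is a hypothetical helper, not a bblean function, and the toy clusters are made up for illustration.

```python
def summarize_clusters(clusters):
    """Compute basic size statistics for a list of clusters (lists of molecule indices)."""
    sizes = [len(cluster) for cluster in clusters]
    return {
        "num_clusters": len(sizes),
        "num_molecules": sum(sizes),
        "largest": max(sizes, default=0),
        "singletons": sum(1 for size in sizes if size == 1),
    }


# Toy clusters standing in for the contents of clusters.pkl.
toy_clusters = [[0, 1, 2, 5], [3], [4, 6], [7]]
print(summarize_clusters(toy_clusters))
```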

Public Python API and Documentation

By default, all functions take packed fingerprints of dtype uint8. Many functions support an input_is_packed: bool flag, which you can set to False if you need to pass unpacked fingerprints (not recommended).
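To illustrate why packing saves memory, the sketch below packs a bit vector into bytes, eight bits per byte, using only the standard library. The bit order shown (most significant bit first, which is also the default of `numpy.packbits`) is an assumption made for illustration; this README does not state the layout bblean uses internally. `pack_bits` and `unpack_bits` are hypothetical helpers.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte back into eight 0/1 values."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


# A 2048-bit fingerprint with two bits set occupies 256 bytes when packed.
fingerprint = [0] * 2048
fingerprint[0] = 1
fingerprint[9] = 1
packed = pack_bits(fingerprint)
print(len(fingerprint), "bits ->", len(packed), "bytes")
assert unpack_bits(packed) == fingerprint
```

Storing one bit per fingerprint feature instead of one byte per feature reduces the array size by a factor of eight.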

  • Functions and classes whose names start with an underscore are considered private (such as _private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All functions and classes in modules whose names start with an underscore are also considered private (such as bblean._private_module.private_function(...)) and should not be used, since they can be removed or modified without warning.
  • All other functions and classes are part of the stable public API and can be used. However, expect minor breaking changes before we reach version 1.0.
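The naming convention above can be checked mechanically. The sketch below follows the leading-underscore pattern shown in the examples (`_private_function`, `bblean._private_module`); `is_public` is a hypothetical helper and not part of bblean.

```python
def is_public(dotted_name: str) -> bool:
    """Return True if no component of the dotted name starts with an underscore."""
    return not any(part.startswith("_") for part in dotted_name.split("."))


# Names taken from this README: one public class and the two private patterns.
print(is_public("bblean.BitBirch"))
print(is_public("bblean._private_module.private_function"))
print(is_public("bblean._private_function"))
```

Note that this simple rule also classifies dunder names such as `__init__` as private, which is stricter than the convention strictly requires.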

Contributing

If you find a bug in BitBIRCH-Lean or have an issue with the usage or documentation please open an issue in the GitHub issue tracker.

If you want to contribute a bug fix or an improvement to the documentation, usability, maintainability, or performance of BitBIRCH-Lean, please open an issue with your idea or request (or open a PR directly from a fork if you prefer).

Currently we don't directly accept PRs with new features that have not been extensively validated. If you have an idea to improve the BitBIRCH algorithm, you may want to contact the Miranda-Quintana Lab; we are open to collaborations.

To contribute, first create a fork, then clone your fork (git clone git@github.com:<user>/bblean). We recommend you install pre-commit (pre-commit install --hook-type pre-push), which will run some checks before you push to your branch. After you have finished work on your branch, open a pull request.

Download files

Source Distribution

  • bblean-0.6.0b2.tar.gz (2.9 MB): source

Built Distributions

  • bblean-0.6.0b2-cp313-cp313-win_amd64.whl (174.6 kB): CPython 3.13, Windows x86-64
  • bblean-0.6.0b2-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (199.2 kB): CPython 3.13, manylinux (glibc 2.24+ and glibc 2.28+) x86-64
  • bblean-0.6.0b2-cp313-cp313-macosx_10_13_universal2.whl (278.9 kB): CPython 3.13, macOS 10.13+ universal2 (ARM64, x86-64)
  • bblean-0.6.0b2-cp312-cp312-win_amd64.whl (174.5 kB): CPython 3.12, Windows x86-64
  • bblean-0.6.0b2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (199.2 kB): CPython 3.12, manylinux (glibc 2.24+ and glibc 2.28+) x86-64
  • bblean-0.6.0b2-cp312-cp312-macosx_10_13_universal2.whl (278.8 kB): CPython 3.12, macOS 10.13+ universal2 (ARM64, x86-64)
  • bblean-0.6.0b2-cp311-cp311-win_amd64.whl (173.2 kB): CPython 3.11, Windows x86-64
  • bblean-0.6.0b2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (197.1 kB): CPython 3.11, manylinux (glibc 2.24+ and glibc 2.28+) x86-64
  • bblean-0.6.0b2-cp311-cp311-macosx_10_9_universal2.whl (277.6 kB): CPython 3.11, macOS 10.9+ universal2 (ARM64, x86-64)

File details

Details for the file bblean-0.6.0b2.tar.gz.

File metadata

  • Download URL: bblean-0.6.0b2.tar.gz
  • Upload date:
  • Size: 2.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bblean-0.6.0b2.tar.gz
Algorithm Hash digest
SHA256 1c53145c5d451775eacb7bd6822a2bc22d32fbb3b4ecb424db30315f3a9eacff
MD5 ce8011fc8419e99bb893c8ecb05ef862
BLAKE2b-256 ea981fb7d8620649f7f9b95a9c00f87e61b4d5ee8af3e1eaa62e2ca14136131e

See more details on using hashes here.
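A downloaded file can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do it; `sha256_of_stream` and `file_matches` are hypothetical helper names, and the in-memory data at the end is only a self-check of the hashing routine, not a real download.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, expected_hex):
    """Return True if the file at `path` has the expected SHA256 hex digest."""
    with open(path, "rb") as f:
        return sha256_of_stream(f) == expected_hex.strip().lower()


# Self-check on in-memory data: chunked hashing agrees with one-shot hashing.
demo = io.BytesIO(b"bblean" * 1000)
assert sha256_of_stream(demo, chunk_size=64) == hashlib.sha256(b"bblean" * 1000).hexdigest()
```

For example, `file_matches("bblean-0.6.0b2.tar.gz", "<SHA256 digest from the table above>")` should return True for an intact download of the source distribution.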

Provenance

The following attestation bundles were made for bblean-0.6.0b2.tar.gz:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: bblean-0.6.0b2-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 174.6 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bblean-0.6.0b2-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 0d20b7c7d0443366063e2407dcd97b314f5863ae98218fdc67fd69e99166e562
MD5 0aa1331beeada84ad72c7ddd47956250
BLAKE2b-256 621c48e1ff4120ef1f43c7ba8b408631dd986fefa3612b3fb89ed68aa4b15a29

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp313-cp313-win_amd64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 9af6b6501cc4566ae30bd3e155d2ed8ea9b93e894bc9387a4f2ae94c1c7df2c9
MD5 a79c6b164c4219ecace6f2d80af8bacb
BLAKE2b-256 0c6779b9d7c697bfc72b5c80c2cc5f766b460f0830c067e362b700daa355b1fc

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp313-cp313-macosx_10_13_universal2.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp313-cp313-macosx_10_13_universal2.whl
Algorithm Hash digest
SHA256 22d873169d5b9e38260a010efbcc8cc7a66b68e54e04b990803218e15c323ac1
MD5 3845c8eda336816200f1896194b76678
BLAKE2b-256 f95df5ab3c1148c969ade3f99a2e6512ee7d4c5bef5d2a47d0dfb1db56c381a9

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp313-cp313-macosx_10_13_universal2.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: bblean-0.6.0b2-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 174.5 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bblean-0.6.0b2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 34373bd687263a4d1edae25dd45a781dcd0d8e875f3e671a3e4a96d3c753b656
MD5 4bbbff5050e22f9cc67afdb5fb0ab668
BLAKE2b-256 4df3fe667c77b6ef41d7c461b26b50be8f97a85470a54a12c9ad305d3ad27a2e

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp312-cp312-win_amd64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 422d3114f6b1e2f28479f1889d2ec495c84763d9dbd319ca4cdd83076d1a1dd7
MD5 94cc89726dcaaec42e25331ecb2e4053
BLAKE2b-256 4536215e10fe7b81cd67ca7d2e0b646c8203e50d1944b545a8a8c0401b48200c

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp312-cp312-macosx_10_13_universal2.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp312-cp312-macosx_10_13_universal2.whl
Algorithm Hash digest
SHA256 251c5a98d65c32d24510583cbc96be860f9ea41675a1709bcce0b0acad358c60
MD5 2e3b1d724e8dd608dce9db884b0af0b2
BLAKE2b-256 8ecb2eefa699b48fecf5dc783262814d2b7d7e81155d81cee790b663c51529bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp312-cp312-macosx_10_13_universal2.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: bblean-0.6.0b2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 173.2 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bblean-0.6.0b2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 9ba7263be1ea604d20d50b5abd01dd0cb3273d53e277539a154f56f617ec9ca4
MD5 c6f169f69de42672389278785d2cc909
BLAKE2b-256 f7b2cdd8317e46f56e520079158090a066df8216eac7b4ff222b9b405a16b811

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp311-cp311-win_amd64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 ddab2f55db07c8a537a730311195f1579b4c0ae00d501a8be2a9071492929d13
MD5 70a47411369136d21a9d62cc62e21c2a
BLAKE2b-256 9759032776d74a85fad71b81f77bc404bdd093176d4c9fcb213fbcce11bd4d83

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bblean-0.6.0b2-cp311-cp311-macosx_10_9_universal2.whl.

File metadata

File hashes

Hashes for bblean-0.6.0b2-cp311-cp311-macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 086bba309b1fc9ae6ebf71a394e5eeb6b8c1a3c3d20f49acd57adfb8b802a2bd
MD5 f47db398b9326752102c4bfa69db254d
BLAKE2b-256 e3e3f3e0819c4e8e716ea01fbd0ad86de694e1a073ae178fa92c1794ba6c3954

See more details on using hashes here.

Provenance

The following attestation bundles were made for bblean-0.6.0b2-cp311-cp311-macosx_10_9_universal2.whl:

Publisher: upload-to-pypi.yaml on mqcomplab/bblean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.