chemap - Mapping chemical space

Library for computing molecular fingerprint based similarities as well as dimensionality reduction based chemical space visualizations.

Installation

chemap can be installed using pip.

pip install chemap

To include UMAP computation on either CPU or GPU, choose one of the following options:

  • CPU version: pip install "chemap[cpu]"
  • GPU version (CUDA 12): pip install "chemap[gpu-cu12]"
  • GPU version (CUDA 13): pip install "chemap[gpu-cu13]"
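
Which extra is installed determines which UMAP path can be used at runtime. As a minimal sketch, the snippet below checks whether an optional module is importable. The helper `has_module` is hypothetical and not part of chemap, and the sketch assumes that the GPU extras make a module named `cuml` importable.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Assumption: installing one of the GPU extras makes `cuml` importable.
gpu_umap_available = has_module("cuml")
print("GPU UMAP backend available:", gpu_umap_available)
```

If the check reports False, the CPU installation is likely the one to use.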

Fingerprint computations

Fingerprints can be computed using generators from RDKit or scikit-fingerprints. This includes popular fingerprint types such as:

Path-based and circular fingerprints

  • RDKit fingerprints
  • Morgan fingerprints

Predefined substructure fingerprints

  • MACCS fingerprints
  • PubChem fingerprints
  • Klekota-Roth fingerprints

Topological distance based fingerprints

  • Atom pair fingerprints
  • MAP4 fingerprints

Here is a code example:

import numpy as np
import scipy.sparse as sp
from rdkit.Chem import rdFingerprintGenerator
from skfp.fingerprints import MAPFingerprint, AtomPairFingerprint

from chemap import compute_fingerprints, DatasetLoader, FingerprintConfig


ds_loader = DatasetLoader()
smiles = ds_loader.load("tests/data/smiles.csv")

# ----------------------------
# RDKit: Morgan (folded, dense)
# ----------------------------
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=4096)
X_morgan = compute_fingerprints(
    smiles,
    morgan,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=False,   # dense numpy
        invalid_policy="raise",
    ),
)
print("RDKit Morgan:", X_morgan.shape, X_morgan.dtype)

# -----------------------------------
# RDKit: RDKitFP (folded, CSR sparse)
# -----------------------------------
rdkitfp = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=4096)
X_rdkitfp_csr = compute_fingerprints(
    smiles,
    rdkitfp,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=True,    # SciPy CSR
        invalid_policy="raise",
    ),
)
assert sp.issparse(X_rdkitfp_csr)
print("RDKit RDKitFP (CSR):", X_rdkitfp_csr.shape, X_rdkitfp_csr.dtype, "nnz=", X_rdkitfp_csr.nnz)

# --------------------------------------------------
# scikit-fingerprints: MAPFingerprint (folded, dense)
# --------------------------------------------------
# MAPFingerprint is a MinHash-like fingerprint (different from MAP4 lib).
map_fp = MAPFingerprint(fp_size=4096, count=False, sparse=False)
X_map = compute_fingerprints(
    smiles,
    map_fp,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=False,
        invalid_policy="raise",
    ),
)
print("skfp MAPFingerprint:", X_map.shape, X_map.dtype)

# ----------------------------------------------------
# scikit-fingerprints: AtomPairFingerprint (folded, CSR)
# ----------------------------------------------------
atom_pair = AtomPairFingerprint(fp_size=4096, count=False, sparse=False, use_3D=False)
X_ap_csr = compute_fingerprints(
    smiles,
    atom_pair,
    config=FingerprintConfig(
        count=False,
        folded=True,
        return_csr=True,
        invalid_policy="raise",
    ),
)
assert sp.issparse(X_ap_csr)
print("skfp AtomPair (CSR):", X_ap_csr.shape, X_ap_csr.dtype, "nnz=", X_ap_csr.nnz)

# (Optional) convert CSR -> dense if you need a NumPy array downstream:
X_ap = X_ap_csr.toarray().astype(np.float32, copy=False)
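
The `folded` and `count` options used above can be pictured with a small, library-free sketch. It assumes folding by a simple modulo of integer feature IDs, which is only an illustration of the idea: RDKit, scikit-fingerprints, and chemap may hash and fold differently. The function `fold_features` is hypothetical and not part of chemap.

```python
def fold_features(feature_ids, fp_size=4096, count=False):
    """Fold integer feature IDs into a fixed-size vector.

    Illustrative assumption: the bin is chosen as `feature_id % fp_size`.
    With count=False each bin is 0/1; with count=True bins hold occurrences.
    """
    vec = [0] * fp_size
    for fid in feature_ids:
        idx = fid % fp_size
        if count:
            vec[idx] += 1
        else:
            vec[idx] = 1
    return vec


# Two distinct feature IDs (3 and 4099) collide in the same bin when folded.
features = [3, 4099, 4099]
binary_fp = fold_features(features, fp_size=4096, count=False)
count_fp = fold_features(features, fp_size=4096, count=True)
print(sum(binary_fp), count_fp[3])
```

The collision in this toy input shows why folded fingerprints lose some information compared to unfolded ones, in exchange for a fixed vector size.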

UMAP Chemical Space Visualization

chemap provides functions to compute UMAP coordinates from molecular fingerprints. Depending on your system and installation, there are two options. create_chem_space_umap_gpu uses the very fast cuml library, but it only supports "cosine" as a metric and requires folded (fixed-size) fingerprints. The alternative, create_chem_space_umap, is a numba-based variant that is still optimized but much slower than the GPU version. In return, it supports Tanimoto as a metric and can also handle unfolded fingerprints.
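
To make the difference between the two metrics concrete, here is a minimal sketch that computes Tanimoto and cosine similarity for two binary fingerprints represented as sets of "on" bit indices. The functions and the example bit sets are hypothetical illustrations and not chemap's implementation. Returning 0.0 when a fingerprint is empty is a convention chosen for this sketch.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B|."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(bits_a, bits_b):
    """Cosine similarity for binary vectors: |A & B| / sqrt(|A| * |B|)."""
    a, b = set(bits_a), set(bits_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical fingerprints given as indices of set bits.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7}

print("Tanimoto:", tanimoto(fp_a, fp_b))
print("Cosine:", cosine(fp_a, fp_b))
```

Both scores grow with the number of shared bits, but they normalize differently, so the resulting neighborhoods in a UMAP embedding can differ between the two metrics.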

Example using create_chem_space_umap:

from rdkit.Chem import rdFingerprintGenerator
from chemap.plotting import create_chem_space_umap, scatter_plot_hierarchical_labels

data_plot = create_chem_space_umap(
    data_compounds,  # dataframe with smiles and class/subclass etc. information
    col_smiles="smiles",
    inplace=False,
    x_col="x",
    y_col="y",
    fpgen=rdFingerprintGenerator.GetMorganGenerator(radius=9, fpSize=4096),
)

# Plot
fig, ax, _, _ = scatter_plot_hierarchical_labels(
    data_plot,
    x_col="x",
    y_col="y",
    superclass_col="Superclass",
    class_col="Class",
    low_superclass_thres=2500,
    low_class_thres=5000,
    max_superclass_size=10_000,
)
