CASPER

Conformer-Averaged Surface Property Encoded Representation — a tunable 3D molecular descriptor.

CASPER turns a molecule into a fixed-length feature vector by:

  1. generating ETKDG conformers (no force-field minimisation needed by default),
  2. building a van-der-Waals dot surface for each conformer,
  3. colouring each surface point by an atomic property (partial charge, logP, molar refractivity, TPSA, H-bond donor/acceptor, …),
  4. encoding the coloured surface as a property histogram (a 3D generalisation of MOE-style VSA descriptors) and/or a density-invariant spatial autocorrelation (VolSurf-flavoured), and
  5. pooling across conformers.
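
Step 2 above can be sketched in a few lines of plain Python. This is only an illustration of the dot-surface idea, not CASPER's implementation: the golden-angle (Fibonacci) point placement, the function names `fibonacci_sphere` and `dot_surface`, and the toy radii are all assumptions made for the example.

```python
import math

def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors (golden-angle spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        pts.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return pts

def dot_surface(centers, radii, density=16, probe=0.0):
    """Put `density` dots on each (radius + probe) sphere and keep the exposed ones.

    Returns a list of (xyz, atom_index) tuples.
    """
    unit = fibonacci_sphere(density)
    surface = []
    for i, (c, r) in enumerate(zip(centers, radii)):
        ri = r + probe
        for u in unit:
            p = (c[0] + ri * u[0], c[1] + ri * u[1], c[2] + ri * u[2])
            # A dot is buried if it lies inside any other atom's sphere.
            buried = any(
                j != i and math.dist(p, cj) < rj + probe
                for j, (cj, rj) in enumerate(zip(centers, radii))
            )
            if not buried:
                surface.append((p, i))
    return surface

# Toy "diatomic": two overlapping spheres 1.2 Å apart (illustrative radii only).
centers = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
radii = [1.7, 1.5]
dots = dot_surface(centers, radii, density=16)
print(len(dots), "exposed dots out of", 2 * 16)
```

Each surviving dot remembers which atom it sits on, which is what lets step 3 colour it with that atom's property.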

Unlike 2D VSA descriptors, CASPER is built on real 3D conformer surfaces, and unlike single-conformer VolSurf it averages over a conformer ensemble. Every step is exposed as a tunable parameter, and the per-conformer "bag" can be returned un-pooled for multi-instance / key-instance learning.
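
To make the "property histogram" encoding concrete, here is a minimal sketch of an area-weighted histogram over coloured surface points. It is an assumption-laden illustration rather than the library's code: the function name `surface_histogram`, the equal-width bins, and the clamping of out-of-range values are all choices made for this example.

```python
def surface_histogram(values, areas, lo, hi, n_bins):
    """Sum the surface area whose property value falls in each of n_bins equal bins."""
    width = (hi - lo) / n_bins
    hist = [0.0] * n_bins
    for v, a in zip(values, areas):
        k = int((v - lo) / width)
        # Assumption: out-of-range values are clamped into the first/last bin.
        k = min(max(k, 0), n_bins - 1)
        hist[k] += a
    return hist

# Five surface points: a property value and the area each point represents.
values = [-0.30, -0.05, 0.02, 0.10, 0.28]
areas = [1.0, 2.0, 2.0, 1.5, 0.5]
hist = surface_histogram(values, areas, lo=-0.4, hi=0.4, n_bins=4)
print(hist)
```

Because every point contributes its area to exactly one bin, the bins sum to the total surface area, and each bin's value is the area of the surface region whose property lies in that range.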

Install

pip install casper-descriptor
# with visualisation extras:
pip install "casper-descriptor[viz]"

Core dependencies: numpy, rdkit, scikit-learn.

Quick start

import casper

# one molecule -> one fixed-length vector (default config)
v = casper.featurize("CC(=O)Oc1ccccc1C(=O)O")

# tune the construction
cfg = casper.CasperConfig(
    n_confs=10,
    properties=("gasteiger", "abs_charge", "logp", "mr", "tpsa"),
    encoding=("hist", "autocorr"),
    n_bins=12, autocorr_bins=8, autocorr_max_dist=16.0,
    conf_pool=("mean", "max"),
)
v = casper.featurize("CCO", cfg)

# names carry full provenance: e.g. "mean:gasteiger|ac[0.0,2.0)A"
v, names = casper.featurize("CCO", cfg, return_names=True)

# batch, parallel across molecules
smiles_list = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]  # example input
X = casper.featurize_many(smiles_list, cfg, n_jobs=-1)

# sklearn transformer (drops into Pipeline / GridSearchCV)
from casper import CasperFeaturizer
ft = CasperFeaturizer(n_confs=10, density=16, encoding=("hist", "autocorr"))
X = ft.fit_transform(smiles_list)

Multi-instance learning (un-pooled bag)

For key-instance detection, get the per-conformer instances without pooling:

bag, conformer_ids, names = casper.featurize_bag("CCO", cfg)
# bag: (K, d) array, one CASPER vector per conformer; K varies per molecule
# conformer_ids: trace a flagged key-instance back to its 3D geometry (deterministic for a fixed seed)

bags, names = casper.featurize_bags(smiles_list, cfg)   # ragged list of (K_i, d)

casper.featurize(...) is exactly pool() applied to this bag.
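
The relationship between the bag and the pooled vector can be sketched as follows. The `pool` function here is a hypothetical stand-in written for this example (its signature, the operator-major concatenation order, and the use of the population standard deviation are assumptions); only the `"mean:"`-style name prefix is taken from the feature names shown above.

```python
import statistics

POOLERS = {
    "mean": statistics.fmean,
    "max": max,
    "min": min,
    "std": statistics.pstdev,  # assumption: population std; the library's convention is not stated
}

def pool(bag, names, conf_pool=("mean",)):
    """Collapse a (K, d) bag into one vector of length d * len(conf_pool)."""
    columns = list(zip(*bag))  # d columns, each holding K per-conformer values
    vector, pooled_names = [], []
    for op in conf_pool:
        for name, col in zip(names, columns):
            vector.append(POOLERS[op](col))
            pooled_names.append(f"{op}:{name}")
    return vector, pooled_names

# A toy bag: K=3 conformers, d=2 features.
bag = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]
names = ["gasteiger|hist[0.17,0.25)", "gasteiger|ac[0.0,2.0)A"]
vec, pooled_names = pool(bag, names, conf_pool=("mean", "max"))
print(vec)
print(pooled_names)
```

Keeping the bag un-pooled defers this step, so a multi-instance learner can weight or select conformers itself.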

Feature visualization

When a model flags a CASPER feature as important, you can see what surface region it measures — every feature name carries full provenance back to the surface.

import casper
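# autocorr_bins=8 over autocorr_max_dist=16.0 gives 2.0 Å bins, matching the "ac[2.0,4.0)A" name below
cfg = casper.CasperConfig(properties=("gasteiger",), encoding=("hist", "autocorr"),
                          autocorr_bins=8, autocorr_max_dist=16.0)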

# write a static PNG (matplotlib) and/or interactive HTML (py3Dmol)
casper.explain_feature("CC(=O)Oc1ccccc1C(=O)O", "mean:gasteiger|ac[2.0,4.0)A",
                       cfg, png="feature.png", html="feature.html")

# in a notebook, omit png/html for an inline interactive py3Dmol view
casper.explain_feature("CCO", "mean:gasteiger|hist[0.17,0.25)", cfg)

  • Histogram bins highlight the surface points whose property falls in the bin (the highlighted area provably equals the feature's value).
  • Autocorrelation bins show either a per-point contribution score (autocorr_mode="contribution", default) or the literal contributing point-pairs at that separation (autocorr_mode="pairs").

Requires the viz extra: pip install "casper-descriptor[viz]".
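
Since feature names carry their provenance, they can also be split back into parts when post-processing model importances. The parser below is a sketch inferred only from the two name shapes shown in this README (`pool:property|ac[lo,hi)A` and `pool:property|hist[lo,hi)`); the real naming grammar may have more cases, and `parse_feature_name` is a hypothetical helper, not part of the casper API.

```python
import re

# Grammar inferred from the README's examples; treat it as an assumption.
NAME_RE = re.compile(
    r"^(?P<pool>[^:]+):(?P<prop>[^|]+)\|(?P<enc>ac|hist)"
    r"\[(?P<lo>[^,]+),(?P<hi>[^)]+)\)A?$"
)

def parse_feature_name(name):
    """Split a feature name into pooling op, property, encoding and bin edges."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised feature name: {name!r}")
    return {
        "pool": m["pool"],
        "property": m["prop"],
        "encoding": m["enc"],
        "lo": float(m["lo"]),
        "hi": float(m["hi"]),
    }

ac = parse_feature_name("mean:gasteiger|ac[2.0,4.0)A")
hist = parse_feature_name("mean:gasteiger|hist[0.17,0.25)")
print(ac)
print(hist)
```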

Key parameters (CasperConfig)

parameter                          default                      what it does
n_confs                            10                           ETKDG conformers per molecule
optimize                           "none"                       "none" (raw ETKDG, fast) / "mmff" / "uff"
properties                         ("gasteiger","logp","mr")    which atomic properties colour the surface
probe                              0.0                          0.0 = VdW surface; 1.4 = water-accessible
density                            16                           surface dots per atom (knee of the accuracy/cost curve; cost is ~quadratic via autocorr)
encoding                           ("hist",)                    "hist" and/or "autocorr"
n_bins                             12                           histogram bins per property
autocorr_bins, autocorr_max_dist   8, 12.0                      distance bins / radial extent for autocorrelation
autocorr_normalize                 True                         density-invariant mean-per-bin (recommended) vs legacy area-weighted sum
autocorr_range                     None                         per-property (max_dist, n_bins) override
conf_pool                          ("mean",)                    mean/max/min/std/boltzmann, concatenated

Add your own colouring:

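import numpy as np
import casper

# [...] and (lo, hi) are placeholders: return one value per atom and supply the property's range
casper.register_property("my_prop", lambda mol: np.array([...]), (lo, hi))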

Optional: kallisto / Jazzy properties

Five extra per-atom colourings derived from Jazzy (kallisto EEQ charges) add signal complementary to the built-ins:

  • eeq: a different charge model from Gasteiger,
  • alp: a charge-dependent dynamic polarisability, rather than a per-element constant,
  • sa, sdc, sdx: continuous H-bond strengths (sa for acceptors, sdc/sdx for donors), rather than binary flags.

Importing the module registers them, after which they are used like any built-in property:

import casper.jazzy_properties        # registers: eeq, alp, sa, sdc, sdx
cfg = casper.CasperConfig(properties=("gasteiger", "eeq", "alp", "sa", "sdc", "sdx"))
v = casper.featurize("CC(=O)Nc1ccc(O)cc1", cfg)

They are computed once per molecule on CASPER's own geometry (cached), so all five cost roughly one kallisto evaluation. Requires pip install "casper-descriptor[jazzy]" (note: jazzy pins numpy<2).

Notes

  • density=16 is the default because accuracy plateaus there while cost grows roughly quadratically with the number of surface points (the autocorrelation is O(points²)): density=12 degrades generalisation, and 24 or 32 cost more for no measurable gain.
  • The normalized autocorrelation is density-invariant, so changing density does not silently rescale features.
  • Conformer count K is data-dependent (ETKDG + RMS pruning). Set prune_rms=0 for a fixed K.
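
The density-invariance note above can be illustrated with a toy autocorrelation. This is a sketch under stated assumptions, not CASPER's formula: it bins the plain product of the two point properties by pair distance and either averages per bin (`normalize=True`) or sums; the function names, the product form, and the two-patch toy surface are all invented for the example.

```python
import math
from itertools import combinations

def autocorr(points, values, max_dist, n_bins, normalize=True):
    """Bin the property products of point pairs by their separation."""
    width = max_dist / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (a, va), (b, vb) in combinations(list(zip(points, values)), 2):
        d = math.dist(a, b)
        if d < max_dist:
            k = int(d / width)
            sums[k] += va * vb
            counts[k] += 1
    if not normalize:
        return sums
    # Mean per bin: independent of how many pairs landed in the bin.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def sample(n):
    """n dots on each of two patches: value +1 near x=0, value -2 near x=5."""
    pts, vals = [], []
    for i in range(n):
        y = 0.5 * i / n
        pts += [(0.0, y, 0.0), (5.0, y, 0.0)]
        vals += [1.0, -2.0]
    return pts, vals

coarse = autocorr(*sample(4), max_dist=8.0, n_bins=4)
fine = autocorr(*sample(8), max_dist=8.0, n_bins=4)
coarse_sum = autocorr(*sample(4), max_dist=8.0, n_bins=4, normalize=False)
fine_sum = autocorr(*sample(8), max_dist=8.0, n_bins=4, normalize=False)
print(coarse, fine)          # identical at both sampling densities
print(coarse_sum, fine_sum)  # the unnormalised sums grow with the pair count
```

Doubling the dot count quadruples the number of cross-patch pairs, so the summed variant rescales while the per-bin mean stays put.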

License

MIT
