CASPER: Conformer-Averaged Surface Property Encoded Representation -- a tunable 3D molecular descriptor
CASPER turns a molecule into a fixed-length feature vector by:
- generating ETKDG conformers (no force-field minimisation needed by default),
- building a van-der-Waals dot surface for each conformer,
- colouring each surface point by an atomic property (partial charge, logP, molar refractivity, TPSA, H-bond donor/acceptor, …),
- encoding the coloured surface as a property histogram (a 3D generalisation of MOE-style VSA descriptors) and/or a density-invariant spatial autocorrelation (VolSurf-flavoured), and
- pooling across conformers.
Unlike 2D VSA descriptors, CASPER is built on real 3D conformer surfaces, and unlike single-conformer VolSurf it averages over a conformer ensemble. Every step is exposed as a tunable parameter, and the per-conformer "bag" can be returned un-pooled for multi-instance / key-instance learning.
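To make the "dot surface" step concrete, here is a minimal, dependency-free sketch of one way such a surface can be built. It is an illustration, not CASPER's actual code: the Fibonacci-lattice point placement, the function names `fibonacci_sphere` and `dot_surface`, and the equal-area-per-dot weighting are all assumptions. Only the `density` (dots per atom) and `probe` parameters mirror names from the configuration.

```python
import math


def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors (golden-spiral lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return pts


def dot_surface(centers, radii, density=16, probe=0.0):
    """Return (point, owner_atom_index, area) for every exposed dot.

    A dot placed on atom i is dropped when it lies inside the
    probe-inflated sphere of any other atom. Hypothetical helper.
    """
    unit = fibonacci_sphere(density)
    dots = []
    for i, (c, r) in enumerate(zip(centers, radii)):
        big_r = r + probe
        area = 4.0 * math.pi * big_r * big_r / density  # equal share per dot
        for u in unit:
            p = tuple(c[k] + big_r * u[k] for k in range(3))
            buried = any(
                j != i and math.dist(p, centers[j]) < radii[j] + probe
                for j in range(len(centers))
            )
            if not buried:
                dots.append((p, i, area))
    return dots
```

Because every dot remembers its owner atom, colouring the surface reduces to looking up that atom's property value. An isolated atom keeps all `density` dots, while two overlapping atoms each lose the dots buried inside the other.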
Install
pip install casper-descriptor
# with visualisation extras:
pip install "casper-descriptor[viz]"
Core dependencies: numpy, rdkit, scikit-learn.
Quick start
import casper
# one molecule -> one fixed-length vector (default config)
v = casper.featurize("CC(=O)Oc1ccccc1C(=O)O")
# tune the construction
cfg = casper.CasperConfig(
    n_confs=10,
    properties=("gasteiger", "abs_charge", "logp", "mr", "tpsa"),
    encoding=("hist", "autocorr"),
    n_bins=12, autocorr_bins=8, autocorr_max_dist=16.0,
    conf_pool=("mean", "max"),
)
v = casper.featurize("CCO", cfg)
# names carry full provenance: e.g. "mean:gasteiger|ac[0.0,2.0)A"
v, names = casper.featurize("CCO", cfg, return_names=True)
# batch, parallel across molecules
smiles_list = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]  # example input
X = casper.featurize_many(smiles_list, cfg, n_jobs=-1)
# sklearn transformer (drops into Pipeline / GridSearchCV)
from casper import CasperFeaturizer
ft = CasperFeaturizer(n_confs=10, density=16, encoding=("hist", "autocorr"))
X = ft.fit_transform(smiles_list)
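Since feature names carry their provenance, it can be handy to split them back into parts, for example to group model importances by property. The sketch below is inferred only from the two name examples shown in this README ("mean:gasteiger|ac[0.0,2.0)A" and "mean:gasteiger|hist[0.17,0.25)"); `parse_feature_name` is a hypothetical helper, not part of the package, and reading the trailing "A" as ångström is an assumption. Names in other shapes (for instance with scientific-notation bounds) would need a broader pattern.

```python
import re

# pool ":" property "|" encoding "[" lo "," hi ")" optional "A"
_NAME = re.compile(
    r"^(?P<pool>[^:]+):(?P<prop>[^|]+)\|(?P<enc>[a-z]+)"
    r"\[(?P<lo>-?\d+(?:\.\d+)?),(?P<hi>-?\d+(?:\.\d+)?)\)(?P<unit>A?)$"
)


def parse_feature_name(name):
    """Split a CASPER-style feature name into its components."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised feature name: {name!r}")
    d = m.groupdict()
    return {
        "pool": d["pool"],          # conformer pooling, e.g. "mean"
        "property": d["prop"],      # surface colouring, e.g. "gasteiger"
        "encoding": d["enc"],       # "hist" or "ac" in the examples
        "lo": float(d["lo"]),       # inclusive lower bin edge
        "hi": float(d["hi"]),       # exclusive upper bin edge
        "unit": "angstrom" if d["unit"] else None,  # assumed meaning of "A"
    }
```

For the autocorrelation example the half-open interval is a distance bin, while for the histogram example it is a range of property values.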
Multi-instance learning (un-pooled bag)
For key-instance detection, get the per-conformer instances without pooling:
bag, conformer_ids, names = casper.featurize_bag("CCO", cfg)
# bag: (K, d) array, one CASPER vector per conformer; K varies per molecule
# conformer_ids: trace a flagged key-instance back to its 3D geometry (deterministic for a fixed seed)
bags, names = casper.featurize_bags(smiles_list, cfg) # ragged list of (K_i, d)
casper.featurize(...) is exactly pool() applied to this bag.
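The relationship between a bag and the pooled vector can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the library's `pool()`: `pool_bag` is a hypothetical stand-in, the "std" pooler is taken to be the population standard deviation, and "boltzmann" is left out because it needs conformer energies that the sketch does not model.

```python
from statistics import fmean, pstdev

# Hypothetical mapping from conf_pool names to column reducers.
POOLERS = {
    "mean": fmean,
    "max": max,
    "min": min,
    "std": pstdev,  # assumption: population standard deviation
}


def pool_bag(bag, names, conf_pool=("mean",)):
    """Collapse a (K, d) bag into one vector of length d * len(conf_pool).

    bag:   list of K per-conformer vectors, each of length d
    names: d per-conformer feature names (without a pooling prefix)
    """
    if not bag:
        raise ValueError("empty bag")
    columns = list(zip(*bag))  # d columns, each holding K values
    vec, out_names = [], []
    for how in conf_pool:
        fn = POOLERS[how]
        vec.extend(fn(col) for col in columns)
        out_names.extend(f"{how}:{n}" for n in names)
    return vec, out_names
```

Each entry in `conf_pool` contributes one block of d features, concatenated in order, which is why the pooled names start with a prefix such as "mean:" or "max:". The pooled length is fixed even though K varies per molecule.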
Feature visualization
When a model flags a CASPER feature as important, you can see what surface region it measures — every feature name carries full provenance back to the surface.
import casper
cfg = casper.CasperConfig(properties=("gasteiger",), encoding=("hist", "autocorr"))
# write a static PNG (matplotlib) and/or interactive HTML (py3Dmol)
casper.explain_feature("CC(=O)Oc1ccccc1C(=O)O", "mean:gasteiger|ac[2.0,4.0)A",
                       cfg, png="feature.png", html="feature.html")
# in a notebook, omit png/html for an inline interactive py3Dmol view
casper.explain_feature("CCO", "mean:gasteiger|hist[0.17,0.25)", cfg)
- Histogram bins highlight the surface points whose property falls in the bin (the highlighted area provably equals the feature's value).
- Autocorrelation bins show either a per-point contribution score (autocorr_mode="contribution", default) or the literal contributing point-pairs at that separation (autocorr_mode="pairs").
Requires the viz extra: pip install "casper-descriptor[viz]".
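The statement that the highlighted area equals the feature's value follows directly if a histogram feature is the summed surface area of the points in its bin. The sketch below shows that idea for a single conformer. It is an assumption about the encoding rather than the package's code: `area_histogram` is a hypothetical name, bins are taken to be equal-width over a fixed (lo, hi) range, and out-of-range points are simply skipped.

```python
def area_histogram(values, areas, lo, hi, n_bins=12):
    """Sum the surface area falling into each equal-width property bin.

    values: property value at each surface point
    areas:  surface area represented by each point
    Returns (hist, members), where members[b] lists the point indices
    that contributed to bin b (the points a viewer would highlight).
    """
    width = (hi - lo) / n_bins
    hist = [0.0] * n_bins
    members = [[] for _ in range(n_bins)]
    for idx, (v, a) in enumerate(zip(values, areas)):
        if not lo <= v <= hi:
            continue  # assumption: out-of-range points are ignored
        b = min(int((v - lo) / width), n_bins - 1)  # hi lands in the last bin
        hist[b] += a
        members[b].append(idx)
    return hist, members
```

Summing `areas` over `members[b]` reproduces `hist[b]` by construction, so colouring exactly those points shows the surface region the feature measures.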
Key parameters (CasperConfig)
| parameter | default | what it does |
|---|---|---|
| n_confs | 10 | ETKDG conformers per molecule |
| optimize | "none" | "none" (raw ETKDG, fast) / "mmff" / "uff" |
| properties | ("gasteiger","logp","mr") | which atomic properties colour the surface |
| probe | 0.0 | 0.0 = VdW surface; 1.4 = water-accessible |
| density | 16 | surface dots per atom (knee of the accuracy/cost curve; cost is ~quadratic via autocorr) |
| encoding | ("hist",) | "hist" and/or "autocorr" |
| n_bins | 12 | histogram bins per property |
| autocorr_bins, autocorr_max_dist | 8, 12.0 | distance bins / radial extent for autocorrelation |
| autocorr_normalize | True | density-invariant mean-per-bin (recommended) vs legacy area-weighted sum |
| autocorr_range | None | per-property (max_dist, n_bins) override |
| conf_pool | ("mean",) | mean/max/min/std/boltzmann, concatenated |
Add your own colouring:
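import casper
import numpy as np
# the callable maps a molecule to a per-atom array; lo and hi are placeholders to fill in
casper.register_property("my_prop", lambda mol: np.array([...]), (lo, hi))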
Optional: kallisto / Jazzy properties
Five extra per-atom colourings derived from Jazzy
(kallisto EEQ charges) add signal orthogonal to the built-ins — a different charge
model (eeq), a real charge-dependent dynamic polarisability (alp, unlike a
per-element constant), and continuous H-bond strengths (sa acceptor, sdc/sdx
donor) rather than binary flags:
import casper.jazzy_properties # registers: eeq, alp, sa, sdc, sdx
cfg = casper.CasperConfig(properties=("gasteiger", "eeq", "alp", "sa", "sdc", "sdx"))
v = casper.featurize("CC(=O)Nc1ccc(O)cc1", cfg)
They are computed once per molecule on CASPER's own geometry (cached), so all five
cost roughly one kallisto evaluation. Requires pip install "casper-descriptor[jazzy]"
(note: jazzy pins numpy<2).
Notes
- density=16 is the default because surface cost is roughly quadratic (the autocorrelation is O(points²)) and accuracy plateaus there; lower (12) degrades generalisation, higher (24/32) costs more for no measurable gain.
- The normalized autocorrelation is density-invariant, so changing density does not silently rescale features.
- Conformer count K is data-dependent (ETKDG + RMS pruning). Set prune_rms=0 for a fixed K.
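The difference between a per-bin mean and a per-bin sum, and the quadratic cost, can be illustrated with a small stand-alone sketch. It rests on several assumptions: the pair term is taken to be the product of the two point properties, the "sum" mode here is a plain unweighted pair sum (the exact weighting of the legacy area-weighted mode is not described in this README), and `autocorr` is a hypothetical function, not the package's.

```python
import math
from itertools import combinations


def autocorr(points, values, n_bins=8, max_dist=12.0, normalize=True):
    """Bin the products values[i] * values[j] by inter-point distance.

    normalize=True  -> mean product per distance bin
    normalize=False -> raw sum of products per distance bin
    The pair loop visits n * (n - 1) / 2 pairs, hence the quadratic cost.
    """
    width = max_dist / n_bins
    total = [0.0] * n_bins
    count = [0] * n_bins
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if d >= max_dist:
            continue  # pairs beyond the radial extent are dropped
        b = int(d / width)
        total[b] += values[i] * values[j]
        count[b] += 1
    if not normalize:
        return total
    return [t / c if c else 0.0 for t, c in zip(total, count)]
```

With two points 3 Å apart carrying values 1.0 and -0.5, the [3.0, 4.5) bin is -0.5 in both modes. Duplicating both points, a crude stand-in for doubling the dot density, puts four such pairs in that bin: the mean stays at -0.5 while the raw sum becomes -2.0. This is the kind of silent rescaling that the mean-per-bin form avoids.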
License
MIT