General-purpose, chemistry, and taxonomic featurization of tabular data with caching.
🏔️ aspect
aspect is a lightweight featurisation layer for tabular machine-learning data.
It lets you define independent feature pipelines for different input columns, then apply them to dictionaries, Pandas data frames, local files, Hugging Face datasets, or saved dataset checkpoints.
The core idea is:
input table
├── numeric column → log / identity / custom transforms
├── categorical col → hash / one-hot / custom transforms
├── SMILES column → chemistry features, if installed
├── species column → taxonomic features, if installed
└── text column → deep embeddings, if installed
output dataset with named feature columns
Contents
- Quick start
- Installation
- Command-line interface
- Python API
- Feature specifications
- Available transforms
- Saving and loading pipelines
- Supported input and output formats
- Custom transforms
- Optional chemistry, taxonomy, and deep-learning examples
- Development
- Issues, problems, suggestions
Quick start
Python API
from aspect.data import DataPipeline
data = {
"compound_id": ["cmpd-1", "cmpd-2", "cmpd-3"],
"assay": ["MIC", "MIC", "IC50"],
"mwt": [46.07, 60.10, 78.11],
"pIC50": [5.0, 6.2, 4.8],
}
pipe = DataPipeline(
column_transforms={
"log_mwt": ("mwt", "log"),
"assay_hash": (
"assay",
{"name": "hash", "kwargs": {"ndim": 8}},
),
"target": ("pIC50", "identity"),
},
columns_to_keep=["compound_id"],
)
features = pipe(data, drop_unused_columns=True)
print(features.column_names)
print(pipe.data_out_shape)
print(features[:2])
Expected column structure:
['compound_id', 'log_mwt', 'assay_hash', 'target']
log_mwt and target are (n, 1) arrays. assay_hash is an (n, 8) array. The output is a Hugging Face Dataset, so it can be saved, sliced, converted to Pandas, or passed into a downstream PyTorch/data-loader layer.
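For example (a short sketch; to_pandas and save_to_disk are standard datasets.Dataset methods):
df = features.to_pandas()  # inspect as a Pandas data frame
features.save_to_disk("featurised")  # persist the featurised dataset to disk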
Command-line quick start
Create a small CSV:
cat > compounds.csv <<'CSV'
compound_id,assay,mwt,pIC50
cmpd-1,MIC,46.07,5.0
cmpd-2,MIC,60.10,6.2
cmpd-3,IC50,78.11,4.8
CSV
Featurise it:
aspect featurize compounds.csv \
--features mwt:log@log_mwt assay:hash@assay_hash pIC50@target \
--extras compound_id \
--output features.parquet
The feature specification means:
mwt:log@log_mwt # read mwt, apply log, write column name `log_mwt`
assay:hash@assay_hash # read assay, apply deterministic hash vector, write column name `assay_hash`
pIC50@target # read pIC50, identity transform, write column name `target`
--extras compound_id # retain compound_id without transforming it
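To sanity-check the result, read the Parquet output back (a minimal sketch; assumes Pandas with a Parquet engine such as pyarrow installed):
import pandas as pd
df = pd.read_parquet("features.parquet")
print(df.columns.tolist())  # expect ['compound_id', 'log_mwt', 'assay_hash', 'target']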
Reuse a saved feature specification
aspect serialize \
--features mwt:log@log_mwt assay:hash@assay_hash pIC50@target \
--extras compound_id \
--output feature-spec.as
Then apply the same specification later:
aspect featurize compounds.csv \
--config feature-spec.as \
--output features-from-config.parquet
Installation
Basic installation
pip install aspect-data
Chemistry features
Install chemistry support if you want SMILES featurisation via schemist:
pip install "aspect-data[chem]"
Chemprop support
Install Chemprop support if you want to prepare Chemprop-style molecular data:
pip install "aspect-data[chemprop]"
Taxonomic features
Install taxonomy support if you want species or taxon-ID features via vectome:
pip install "aspect-data[bio]"
Deep-learning features
Install transformer support if you want Hugging Face model embeddings:
pip install "aspect[deep]"
Development installation
git clone https://github.com/scbirlab/aspect.git
cd aspect
pip install -e ".[dev]"
Command-line interface
The CLI has two main commands:
aspect serialize
aspect featurize
Use help for the full option list:
aspect --help
aspect serialize --help
aspect featurize --help
aspect serialize
Create a reusable pipeline checkpoint from a feature specification.
aspect serialize \
--features mwt:log@log_mwt assay:hash@assay_hash \
--extras compound_id pIC50 \
--output feature-spec.as
aspect featurize
Apply a feature specification or saved config to a dataset.
aspect featurize compounds.csv \
--features mwt:log@log_mwt assay:hash@assay_hash \
--extras compound_id pIC50 \
--output features.parquet
or:
aspect featurize compounds.csv \
--config feature-spec.as \
--output features.parquet
Slice a dataset
This is useful for testing on large datasets.
aspect featurize large-table.csv \
--start 1000 \
--end 2000 \
--features assay:hash@assay_hash mwt:log@log_mwt \
--output slice.parquet
Cache location
Use --cache to control where intermediate Hugging Face dataset files are cached.
aspect featurize compounds.csv \
--features assay:hash@assay_hash \
--cache .aspect-cache \
--output features.parquet
Python API
The main class is DataPipeline.
from aspect.data import DataPipeline
A DataPipeline maps one or more input columns to named output feature columns.
pipe = DataPipeline(
column_transforms={
"log_mwt": ("mwt", "log"),
"assay_hash": ("assay", "hash"),
},
)
Apply it to a dictionary:
out = pipe({
"mwt": [46.07, 60.10, 78.11],
"assay": ["MIC", "MIC", "IC50"],
})
Apply it to a Pandas data frame:
import pandas as pd
from aspect.data import DataPipeline
df = pd.DataFrame({
"mwt": [46.07, 60.10, 78.11],
"assay": ["MIC", "MIC", "IC50"],
})
pipe = DataPipeline({
"log_mwt": ("mwt", "log"),
"assay_hash": ("assay", {"name": "hash", "kwargs": {"ndim": 16}}),
})
out = pipe(df)
Apply it to a local file:
pipe = DataPipeline({
"log_mwt": ("mwt", "log"),
})
out = pipe("compounds.csv")
Apply it to a Hugging Face dataset reference:
pipe = DataPipeline({
"assay_hash": ("assay", "hash"),
})
out = pipe("hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train")
Feature specifications
A feature specification defines:
- the input column;
- the transform or chain of transforms;
- the output column name.
Python feature specs
The recommended Python form is:
{
"output_column": ("input_column", "transform_name")
}
Example:
pipe = DataPipeline({
"log_mwt": ("mwt", "log"),
})
For transform arguments, use a dictionary with a nested kwargs field:
pipe = DataPipeline({
"assay_hash": (
"assay",
{"name": "hash", "kwargs": {"ndim": 32, "seed": 123}},
),
})
For chained transforms, pass a list:
pipe = DataPipeline({
"log_hashed_assay": (
"assay",
[
{"name": "hash", "kwargs": {"ndim": 16}},
"log",
],
),
})
Only use transform chains when the output of one transform is mathematically valid input for the next. For example, hash → log is usually not sensible because hash values may be negative.
CLI feature specs
CLI specs use a compact string form:
input_column:transform@output_column
Examples:
mwt:log@log_mwt
assay:hash@assay_hash
pIC50@target
Supported forms:
| Spec | Meaning |
|---|---|
| column | keep column as an extra column |
| column@new_name | identity transform from column to new_name |
| column:transform | transform column; auto-name the output |
| column:transform@new_name | transform column; write new_name |
| column:t1:t2@new_name | apply chained transforms |
For parameterised transforms, prefer the Python API, which supports structured keyword arguments cleanly through {"name": ..., "kwargs": ...}.
Available transforms
Base transforms
These are available with the standard installation.
| Transform | Input | Output | Notes |
|---|---|---|---|
| identity | any column | array | Pass-through transform |
| log | numeric | array | Natural logarithm, np.log |
| hash | string-like | fixed-length vector | Deterministic string hashing |
| one-hot | categorical | vector | Python API recommended because categories must be supplied |
Example:
from aspect.data import DataPipeline
pipe = DataPipeline({
"log_mwt": ("mwt", "log"),
"assay_hash": (
"assay",
{"name": "hash", "kwargs": {"ndim": 16}},
),
"assay_onehot": (
"assay",
{"name": "one-hot", "kwargs": {"categories": ["MIC", "IC50"]}},
),
})
Chemistry transforms
Require:
pip install "aspect[chem]"
| Transform | Input | Output |
|---|---|---|
| morgan-fingerprint | SMILES | molecular fingerprint |
| descriptors-2d | SMILES | 2D molecular descriptors |
| descriptors-3d | SMILES | 3D molecular descriptors |
Example:
from aspect.data import DataPipeline
data = {
"compound_id": ["ethanol", "ethylamine", "benzene"],
"smiles": ["CCO", "CCN", "c1ccccc1"],
}
pipe = DataPipeline(
{
"morgan": ("smiles", "morgan-fingerprint"),
"desc2d": ("smiles", "descriptors-2d"),
},
columns_to_keep=["compound_id"],
)
features = pipe(data, drop_unused_columns=True)
Taxonomic transforms
Require:
pip install "aspect[bio]"
| Transform | Input | Output |
|---|---|---|
| vectome-fingerprint | species name or taxon ID | taxonomic fingerprint |
Example:
from aspect.data import DataPipeline
data = {
"species": [
"Mycobacterium tuberculosis",
"Escherichia coli",
"Staphylococcus aureus",
],
}
pipe = DataPipeline({
"species_fp": (
"species",
{"name": "vectome-fingerprint", "kwargs": {"ndim": 128}},
),
})
features = pipe(data)
Deep embedding transforms
Require:
pip install "aspect[deep]"
| Transform | Input | Output |
|---|---|---|
| hf-bart | text | aggregated BART encoder/decoder embedding |
Example:
from aspect.data import DataPipeline
data = {
"description": [
"cell wall inhibitor",
"protein synthesis inhibitor",
"DNA gyrase inhibitor",
],
}
pipe = DataPipeline({
"text_embedding": (
"description",
{
"name": "hf-bart",
"kwargs": {
"ref": "facebook/bart-base",
"aggregator": ["mean", "max"],
},
},
),
})
features = pipe(data)
The model will be loaded through Hugging Face transformers, so the first run may download model weights.
Chemprop transform
Requires:
pip install "aspect-data[chemprop]"
chemprop-mol is intended for converting SMILES into Chemprop-style molecular datapoints, with optional labels and extra dense features.
Because Chemprop models often need special collation and graph batching, use this transform together with a downstream model adapter or data-loader layer rather than assuming the output should be horizontally stacked with ordinary numeric arrays.
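A minimal sketch (assuming chemprop-mol needs no required arguments here; labels and extra dense features would go through the {"name": ..., "kwargs": ...} form shown earlier):
from aspect.data import DataPipeline
pipe = DataPipeline({
    "mol": ("smiles", "chemprop-mol"),
})
features = pipe({"smiles": ["CCO", "CCN", "c1ccccc1"]})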
Saving and loading pipelines
DataPipeline objects can be checkpointed.
from aspect.data import DataPipeline
pipe = DataPipeline({
"log_mwt": ("mwt", "log"),
"assay_hash": (
"assay",
{"name": "hash", "kwargs": {"ndim": 16}},
),
})
pipe.save_checkpoint("feature-spec.as")
Load the checkpoint later:
from aspect.data import DataPipeline
pipe = DataPipeline().load_checkpoint("feature-spec.as")
out = pipe({
"mwt": [46.07, 60.10, 78.11],
"assay": ["MIC", "MIC", "IC50"],
})
If the pipeline has already been applied to data, the checkpoint can also store the input and output datasets unless they are skipped:
pipe.save_checkpoint(
"feature-spec-and-data.as",
skip_data_in=False,
skip_data_out=False,
)
For reusable feature specifications, it is usually cleaner to save the pipeline before applying it to a dataset, or to skip data when checkpointing.
Supported input and output formats
Python inputs
DataPipeline accepts:
| Input type | Example |
|---|---|
| dictionary / mapping | {"smiles": ["CCO", "CCN"]} |
| Pandas DataFrame | pipe(df) |
| Hugging Face Dataset | pipe(dataset) |
| Hugging Face IterableDataset | pipe(iterable_dataset) |
| local file path | pipe("data.csv") |
| Hugging Face dataset ref | pipe("hf://datasets/org/name~config:split@revision") |
Local file inputs
Supported local input extensions include:
- .csv, .csv.gz
- .tsv, .tsv.gz
- .txt, .txt.gz
- .json
- .parquet
- .arrow
- .xml
- .hf (saved Hugging Face datasets)
CLI outputs
aspect featurize --output can write:
| Extension | Output |
|---|---|
| .csv, .csv.gz | CSV |
| .tsv, .txt, optionally gzipped | delimited text |
| .json | JSON |
| .parquet | Parquet |
| .sql | SQL |
| .hf | Hugging Face dataset saved to disk |
| unknown extension | Hugging Face dataset, with .hf appended |
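A .hf output is a Hugging Face dataset saved to disk, so it can be reloaded directly in Python (a short sketch; load_from_disk is the standard datasets counterpart of saving to disk):
from datasets import load_from_disk
features = load_from_disk("features.hf")
print(features.column_names)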
Working with output columns
By default, DataPipeline keeps the original input columns and appends feature columns.
out = pipe(data)
To keep only feature outputs and selected metadata:
pipe = DataPipeline(
{
"log_mwt": ("mwt", "log"),
"assay_hash": ("assay", "hash"),
},
columns_to_keep=["compound_id"],
)
out = pipe(data, drop_unused_columns=True)
You can also provide extra columns at call time:
out = pipe(
data,
drop_unused_columns=True,
keep_extra_columns=["compound_id", "target"],
)
After running a pipeline, inspect inferred output shapes:
print(pipe.data_out_shape)
Example:
{
"compound_id": (),
"log_mwt": (1,),
"assay_hash": (256,)
}
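These inferred shapes can be used to size downstream model inputs. A short sketch (assuming NumPy) that sums the dense feature width while skipping shape-() metadata columns such as compound_id:
import numpy as np
n_dense = sum(
    int(np.prod(shape))
    for shape in pipe.data_out_shape.values()
    if shape  # empty tuples are scalar metadata columns
)
print(n_dense)  # 257 for the example above: 1 + 256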
Custom transforms
Custom transforms can be registered using register_function.
A registered transform should be a factory: it receives configuration arguments and returns a callable with signature:
fn(data, input_column) -> numpy.ndarray
Example:
import numpy as np
from aspect.data import DataPipeline
from aspect.transform.registry import register_function
@register_function("square")
def Square():
def _square(data, input_column):
x = np.asarray(data[input_column], dtype=float)
return x ** 2
return _square
pipe = DataPipeline({
"mwt_squared": ("mwt", "square"),
})
out = pipe({
"mwt": [46.07, 60.10, 78.11],
})
With arguments:
@register_function("scale")
def Scale(factor=1.0):
def _scale(data, input_column):
x = np.asarray(data[input_column], dtype=float)
return x * factor
return _scale
pipe = DataPipeline({
"scaled_mwt": (
"mwt",
{"name": "scale", "kwargs": {"factor": 0.001}},
),
})
If you save a pipeline using a custom transform, make sure the module that registers the transform is imported before loading the checkpoint.
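For example (a sketch; my_transforms is a hypothetical module containing the @register_function("scale") definition above, and the checkpoint name is illustrative):
import my_transforms  # noqa: F401 -- importing registers "scale" as a side effect
from aspect.data import DataPipeline
pipe = DataPipeline().load_checkpoint("scaled-spec.as")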
Optional chemistry, taxonomy, and deep-learning examples
Mixed ADME-style table
from aspect.data import DataPipeline
data = {
"compound_id": ["cmpd-1", "cmpd-2", "cmpd-3"],
"smiles": ["CCO", "CCN", "c1ccccc1"],
"assay": ["solubility", "clearance", "microsome"],
"species": ["M. tuberculosis", "E. coli", "S. aureus"],
"mwt": [46.07, 45.08, 78.11],
"label": [0.1, 0.4, 0.8],
}
pipe = DataPipeline(
{
"morgan": ("smiles", "morgan-fingerprint"),
"desc2d": ("smiles", "descriptors-2d"),
"assay_hash": (
"assay",
{"name": "hash", "kwargs": {"ndim": 32}},
),
"species_fp": (
"species",
{"name": "vectome-fingerprint", "kwargs": {"ndim": 64}},
),
"log_mwt": ("mwt", "log"),
"y": ("label", "identity"),
},
columns_to_keep=["compound_id"],
)
features = pipe(data, drop_unused_columns=True)
This produces separate feature columns for chemical structure, descriptors, assay identity, species identity, molecular weight, and target label.
A downstream model adapter can decide whether to:
- concatenate dense arrays (sketched after this list);
- route different columns into different neural-network towers;
- collate graph-like objects separately;
- keep labels and metadata separate from model inputs.
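As a minimal sketch of the first option (assuming the features dataset from the example above, and NumPy):
import numpy as np
dense_columns = ["morgan", "desc2d", "assay_hash", "species_fp", "log_mwt"]
arrays = []
for col in dense_columns:
    a = np.asarray(features[col], dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # promote scalar columns to (n, 1)
    arrays.append(a)
X = np.concatenate(arrays, axis=1)  # one dense (n, d) model input
y = np.asarray(features["y"], dtype=float)  # labels kept separate from inputs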
Development
Install development dependencies:
pip install -e ".[dev]"
Run tests:
pytest
Run the CLI smoke test script:
bash test/scripts/run-tests.sh
Run selected doctests:
python -m doctest aspect/data.py
python -m doctest aspect/transform/base.py
python -m doctest aspect/transform/functions.py
Issues, problems, suggestions
Please open an issue on GitHub:
https://github.com/scbirlab/aspect/issues
Documentation
(To come at ReadTheDocs.)