General-purpose, chemistry, and taxonomic featurisation of tabular data with caching.


🏔️ aspect


aspect is a lightweight featurisation layer for tabular machine-learning data.

It lets you define independent feature pipelines for different input columns, then apply them to dictionaries, Pandas data frames, local files, Hugging Face datasets, or saved dataset checkpoints.

The core idea is:

input table
  ├── numeric column  → log / identity / custom transforms
  ├── categorical col → hash / one-hot / custom transforms
  ├── SMILES column   → chemistry features, if installed
  ├── species column  → taxonomic features, if installed
  └── text column     → deep embeddings, if installed

output dataset with named feature columns



Quick start

Python API

from aspect.data import DataPipeline

data = {
    "compound_id": ["cmpd-1", "cmpd-2", "cmpd-3"],
    "assay": ["MIC", "MIC", "IC50"],
    "mwt": [46.07, 60.10, 78.11],
    "pIC50": [5.0, 6.2, 4.8],
}

pipe = DataPipeline(
    column_transforms={
        "log_mwt": ("mwt", "log"),
        "assay_hash": (
            "assay",
            {"name": "hash", "kwargs": {"ndim": 8}},
        ),
        "target": ("pIC50", "identity"),
    },
    columns_to_keep=["compound_id"],
)

features = pipe(data, drop_unused_columns=True)

print(features.column_names)
print(pipe.data_out_shape)
print(features[:2])

Expected column structure:

['compound_id', 'log_mwt', 'assay_hash', 'target']

log_mwt and target are (n, 1) arrays. assay_hash is an (n, 8) array. The output is a Hugging Face Dataset, so it can be saved, sliced, converted to Pandas, or passed into a downstream PyTorch/data-loader layer.
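
Because the output is a standard datasets.Dataset, the usual Hugging Face methods apply. For example:

features.save_to_disk("features.hf")   # persist to disk
df = features.to_pandas()              # convert to a Pandas DataFrame
first_two = features.select(range(2))  # first two rows as a new Dataset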


Command-line quick start

Create a small CSV:

cat > compounds.csv <<'CSV'
compound_id,assay,mwt,pIC50
cmpd-1,MIC,46.07,5.0
cmpd-2,MIC,60.10,6.2
cmpd-3,IC50,78.11,4.8
CSV

Featurise it:

aspect featurize compounds.csv \
    --features mwt:log@log_mwt assay:hash@assay_hash pIC50@target \
    --extras compound_id \
    --output features.parquet

The feature specification means:

mwt:log@log_mwt       # read mwt, apply log, write column name `log_mwt`
assay:hash@assay_hash # read assay, apply deterministic hash vector, write column name `assay_hash`
pIC50@target          # read pIC50, identity transform, write column name `target`
--extras compound_id  # retain compound_id without transforming it

Reuse a saved feature specification

aspect serialize \
    --features mwt:log@log_mwt assay:hash@assay_hash pIC50@target \
    --extras compound_id \
    --output feature-spec.as

Then apply the same specification later:

aspect featurize compounds.csv \
    --config feature-spec.as \
    --output features-from-config.parquet

Installation

Basic installation

pip install aspect-data

Chemistry features

Install chemistry support if you want SMILES featurisation via schemist:

pip install "aspect-data[chem]"

Chemprop support

Install Chemprop support if you want to prepare Chemprop-style molecular data:

pip install "aspect-data[chemprop]"

Taxonomic features

Install taxonomy support if you want species or taxon-ID features via vectome:

pip install "aspect-data[bio]"

Deep-learning features

Install transformer support if you want Hugging Face model embeddings:

pip install "aspect[deep]"

Development installation

git clone https://github.com/scbirlab/aspect.git
cd aspect
pip install -e ".[dev]"

Command-line interface

The CLI has two main commands:

aspect serialize
aspect featurize

Use --help for the full option list:

aspect --help
aspect serialize --help
aspect featurize --help

aspect serialize

Create a reusable pipeline checkpoint from a feature specification.

aspect serialize \
    --features mwt:log@log_mwt assay:hash@assay_hash \
    --extras compound_id pIC50 \
    --output feature-spec.as

aspect featurize

Apply a feature specification or saved config to a dataset.

aspect featurize compounds.csv \
    --features mwt:log@log_mwt assay:hash@assay_hash \
    --extras compound_id pIC50 \
    --output features.parquet

or:

aspect featurize compounds.csv \
    --config feature-spec.as \
    --output features.parquet

Slice a dataset

This is useful for testing on large datasets.

aspect featurize large-table.csv \
    --start 1000 \
    --end 2000 \
    --features assay:hash@assay_hash mwt:log@log_mwt \
    --output slice.parquet

Cache location

Use --cache to control where intermediate Hugging Face dataset files are cached.

aspect featurize compounds.csv \
    --features assay:hash@assay_hash \
    --cache .aspect-cache \
    --output features.parquet

Python API

The main class is DataPipeline.

from aspect.data import DataPipeline

A DataPipeline maps one or more input columns to named output feature columns.

pipe = DataPipeline(
    column_transforms={
        "log_mwt": ("mwt", "log"),
        "assay_hash": ("assay", "hash"),
    },
)

Apply it to a dictionary:

out = pipe({
    "mwt": [46.07, 60.10, 78.11],
    "assay": ["MIC", "MIC", "IC50"],
})

Apply it to a Pandas data frame:

import pandas as pd
from aspect.data import DataPipeline

df = pd.DataFrame({
    "mwt": [46.07, 60.10, 78.11],
    "assay": ["MIC", "MIC", "IC50"],
})

pipe = DataPipeline({
    "log_mwt": ("mwt", "log"),
    "assay_hash": ("assay", {"name": "hash", "kwargs": {"ndim": 16}}),
})

out = pipe(df)

Apply it to a local file:

pipe = DataPipeline({
    "log_mwt": ("mwt", "log"),
})

out = pipe("compounds.csv")

Apply it to a Hugging Face dataset reference:

pipe = DataPipeline({
    "assay_hash": ("assay", "hash"),
})

out = pipe("hf://datasets/scbirlab/fang-2023-biogen-adme~scaffold-split:train")

Feature specifications

A feature specification defines:

  1. the input column;
  2. the transform or chain of transforms;
  3. the output column name.

Python feature specs

The recommended Python form is:

{
    "output_column": ("input_column", "transform_name")
}

Example:

pipe = DataPipeline({
    "log_mwt": ("mwt", "log"),
})

For transform arguments, use a dictionary with a nested kwargs field:

pipe = DataPipeline({
    "assay_hash": (
        "assay",
        {"name": "hash", "kwargs": {"ndim": 32, "seed": 123}},
    ),
})

For chained transforms, pass a list:

pipe = DataPipeline({
    "log_hashed_assay": (
        "assay",
        [
            {"name": "hash", "kwargs": {"ndim": 16}},
            "log",
        ],
    ),
})

Only use transform chains when the output of one transform is mathematically valid input for the next. For example, hash → log is usually not sensible because hash values may be negative.

CLI feature specs

CLI specs use a compact string form:

input_column:transform@output_column

Examples:

mwt:log@log_mwt
assay:hash@assay_hash
pIC50@target

Supported forms:

Spec                       Meaning
column                     keep column as an extra column
column@new_name            identity transform from column to new_name
column:transform           transform column; auto-name the output
column:transform@new_name  transform column; write new_name
column:t1:t2@new_name      apply chained transforms

For parameterised transforms, prefer the Python API, which supports structured keyword arguments cleanly through {"name": ..., "kwargs": ...}.
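
For example, the CLI spec assay:hash@assay_hash applies the hash transform with its default parameters; to set ndim explicitly, use the equivalent Python spec:

pipe = DataPipeline({
    "assay_hash": (
        "assay",
        {"name": "hash", "kwargs": {"ndim": 32}},
    ),
})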


Available transforms

Base transforms

These are available with the standard installation.

Transform  Input        Output               Notes
identity   any column   array                Pass-through transform
log        numeric      array                Natural logarithm (np.log)
hash       string-like  fixed-length vector  Deterministic string hashing
one-hot    categorical  vector               Python API recommended because categories must be supplied

Example:

from aspect.data import DataPipeline

pipe = DataPipeline({
    "log_mwt": ("mwt", "log"),
    "assay_hash": (
        "assay",
        {"name": "hash", "kwargs": {"ndim": 16}},
    ),
    "assay_onehot": (
        "assay",
        {"name": "one-hot", "kwargs": {"categories": ["MIC", "IC50"]}},
    ),
})

Chemistry transforms

Requires:

pip install "aspect-data[chem]"

Transform           Input   Output
morgan-fingerprint  SMILES  molecular fingerprint
descriptors-2d      SMILES  2D molecular descriptors
descriptors-3d      SMILES  3D molecular descriptors

Example:

from aspect.data import DataPipeline

data = {
    "compound_id": ["ethanol", "ethylamine", "benzene"],
    "smiles": ["CCO", "CCN", "c1ccccc1"],
}

pipe = DataPipeline(
    {
        "morgan": ("smiles", "morgan-fingerprint"),
        "desc2d": ("smiles", "descriptors-2d"),
    },
    columns_to_keep=["compound_id"],
)

features = pipe(data, drop_unused_columns=True)

Taxonomic transforms

Requires:

pip install "aspect-data[bio]"

Transform            Input                     Output
vectome-fingerprint  species name or taxon ID  taxonomic fingerprint

Example:

from aspect.data import DataPipeline

data = {
    "species": [
        "Mycobacterium tuberculosis",
        "Escherichia coli",
        "Staphylococcus aureus",
    ],
}

pipe = DataPipeline({
    "species_fp": (
        "species",
        {"name": "vectome-fingerprint", "kwargs": {"ndim": 128}},
    ),
})

features = pipe(data)

Deep embedding transforms

Requires:

pip install "aspect-data[deep]"

Transform  Input  Output
hf-bart    text   aggregated BART encoder/decoder embedding

Example:

from aspect.data import DataPipeline

data = {
    "description": [
        "cell wall inhibitor",
        "protein synthesis inhibitor",
        "DNA gyrase inhibitor",
    ],
}

pipe = DataPipeline({
    "text_embedding": (
        "description",
        {
            "name": "hf-bart",
            "kwargs": {
                "ref": "facebook/bart-base",
                "aggregator": ["mean", "max"],
            },
        },
    ),
})

features = pipe(data)

The model will be loaded through Hugging Face transformers, so the first run may download model weights.
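
If the machine running the pipeline cannot download at run time, one option is to pre-fetch the weights with huggingface_hub directly (independent of aspect):

from huggingface_hub import snapshot_download

snapshot_download("facebook/bart-base")  # fills the local Hugging Face cache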

Chemprop transform

Requires:

pip install "aspect-data[chemprop]"

chemprop-mol is intended for converting SMILES into Chemprop-style molecular datapoints, with optional labels and extra dense features.

Because Chemprop models often need special collation and graph batching, use this transform together with a downstream model adapter or data-loader layer rather than assuming the output should be horizontally stacked with ordinary numeric arrays.
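
As a rough sketch only (assuming chemprop-mol needs no required keyword arguments; the real transform may expose options for labels and extra dense features):

from aspect.data import DataPipeline

pipe = DataPipeline({
    "mol": ("smiles", "chemprop-mol"),  # assumed defaults
})

out = pipe({"smiles": ["CCO", "CCN", "c1ccccc1"]})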


Saving and loading pipelines

DataPipeline objects can be checkpointed.

from aspect.data import DataPipeline

pipe = DataPipeline({
    "log_mwt": ("mwt", "log"),
    "assay_hash": (
        "assay",
        {"name": "hash", "kwargs": {"ndim": 16}},
    ),
})

pipe.save_checkpoint("feature-spec.as")

Load the checkpoint later:

from aspect.data import DataPipeline

pipe = DataPipeline().load_checkpoint("feature-spec.as")

out = pipe({
    "mwt": [46.07, 60.10, 78.11],
    "assay": ["MIC", "MIC", "IC50"],
})

If the pipeline has already been applied to data, the checkpoint can also store the input and output datasets unless you skip them:

pipe.save_checkpoint(
    "feature-spec-and-data.as",
    skip_data_in=False,
    skip_data_out=False,
)

For reusable feature specifications, it is usually cleaner to save the pipeline before applying it to a dataset, or to skip data when checkpointing.
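
For example, to checkpoint only the specification after the pipeline has been run:

pipe.save_checkpoint(
    "feature-spec.as",
    skip_data_in=True,
    skip_data_out=True,
)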


Supported input and output formats

Python inputs

DataPipeline accepts:

Input type                    Example
dictionary / mapping          {"smiles": ["CCO", "CCN"]}
Pandas DataFrame              pipe(df)
Hugging Face Dataset          pipe(dataset)
Hugging Face IterableDataset  pipe(iterable_dataset)
local file path               pipe("data.csv")
Hugging Face dataset ref      pipe("hf://datasets/org/name~config:split@revision")

Local file inputs

Supported local input extensions include:

  • .csv, .csv.gz
  • .tsv, .tsv.gz
  • .txt, .txt.gz
  • .json
  • .parquet
  • .arrow
  • .xml
  • .hf (saved Hugging Face datasets)

CLI outputs

aspect featurize --output can write:

Extension                        Output
.csv, .csv.gz                    CSV
.tsv, .txt (optionally gzipped)  delimited text
.json                            JSON
.parquet                         Parquet
.sql                             SQL
.hf                              Hugging Face dataset saved to disk
(unknown extension)              Hugging Face dataset, with .hf appended
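
For example, to save an on-disk Hugging Face dataset directly:

aspect featurize compounds.csv \
    --features mwt:log@log_mwt \
    --output features.hf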

Working with output columns

By default, DataPipeline keeps the original input columns and appends feature columns.

out = pipe(data)

To keep only feature outputs and selected metadata:

pipe = DataPipeline(
    {
        "log_mwt": ("mwt", "log"),
        "assay_hash": ("assay", "hash"),
    },
    columns_to_keep=["compound_id"],
)

out = pipe(data, drop_unused_columns=True)

You can also provide extra columns at call time:

out = pipe(
    data,
    drop_unused_columns=True,
    keep_extra_columns=["compound_id", "target"],
)

After running a pipeline, inspect inferred output shapes:

print(pipe.data_out_shape)

Example:

{
    "compound_id": (),
    "log_mwt": (1,),
    "assay_hash": (256,)
}

Custom transforms

Custom transforms can be registered using register_function.

A registered transform should be a factory: it receives configuration arguments and returns a callable with signature:

fn(data, input_column) -> numpy.ndarray

Example:

import numpy as np

from aspect.data import DataPipeline
from aspect.transform.registry import register_function


@register_function("square")
def Square():
    def _square(data, input_column):
        x = np.asarray(data[input_column], dtype=float)
        return x ** 2

    return _square


pipe = DataPipeline({
    "mwt_squared": ("mwt", "square"),
})

out = pipe({
    "mwt": [46.07, 60.10, 78.11],
})

With arguments:

@register_function("scale")
def Scale(factor=1.0):
    def _scale(data, input_column):
        x = np.asarray(data[input_column], dtype=float)
        return x * factor

    return _scale


pipe = DataPipeline({
    "scaled_mwt": (
        "mwt",
        {"name": "scale", "kwargs": {"factor": 0.001}},
    ),
})

If you save a pipeline using a custom transform, make sure the module that registers the transform is imported before loading the checkpoint.
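
A minimal sketch, assuming the scale transform above is registered in a module of your own called my_transforms (the module and checkpoint names here are illustrative):

import my_transforms  # noqa: F401 -- importing runs @register_function("scale")

from aspect.data import DataPipeline

pipe = DataPipeline().load_checkpoint("scaled-spec.as")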


Optional chemistry, taxonomy, and deep-learning examples

Mixed ADME-style table

from aspect.data import DataPipeline

data = {
    "compound_id": ["cmpd-1", "cmpd-2", "cmpd-3"],
    "smiles": ["CCO", "CCN", "c1ccccc1"],
    "assay": ["solubility", "clearance", "microsome"],
    "species": ["M. tuberculosis", "E. coli", "S. aureus"],
    "mwt": [46.07, 45.08, 78.11],
    "label": [0.1, 0.4, 0.8],
}

pipe = DataPipeline(
    {
        "morgan": ("smiles", "morgan-fingerprint"),
        "desc2d": ("smiles", "descriptors-2d"),
        "assay_hash": (
            "assay",
            {"name": "hash", "kwargs": {"ndim": 32}},
        ),
        "species_fp": (
            "species",
            {"name": "vectome-fingerprint", "kwargs": {"ndim": 64}},
        ),
        "log_mwt": ("mwt", "log"),
        "y": ("label", "identity"),
    },
    columns_to_keep=["compound_id"],
)

features = pipe(data, drop_unused_columns=True)

This produces separate feature columns for chemical structure, descriptors, assay identity, species identity, molecular weight, and target label.

A downstream model adapter can decide whether to:

  • concatenate dense arrays (see the sketch after this list);
  • route different columns into different neural-network towers;
  • collate graph-like objects separately;
  • keep labels and metadata separate from model inputs.
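
For the first of these, a minimal concatenation sketch with NumPy (assuming all the listed columns are dense numeric arrays with a consistent per-row shape):

import numpy as np

dense_cols = ["morgan", "desc2d", "assay_hash", "species_fp", "log_mwt"]

# Stack each feature column into a 2-D array, then join along the feature axis.
X = np.concatenate(
    [np.asarray(features[col], dtype=float) for col in dense_cols],
    axis=1,
)
y = np.asarray(features["y"], dtype=float).ravel()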

Development

Install development dependencies:

pip install -e ".[dev]"

Run tests:

pytest

Run the CLI smoke test script:

bash test/scripts/run-tests.sh

Run selected doctests:

python -m doctest aspect/data.py
python -m doctest aspect/transform/base.py
python -m doctest aspect/transform/functions.py

Issues, problems, suggestions

Please open an issue on GitHub:

https://github.com/scbirlab/aspect/issues

Documentation

(To come at ReadTheDocs.)
