Traceable, reproducible molecular descriptor-space workflows for CLI and Python.
Project description
HDDFlyzer: High-Dimensional Descriptor-based Feature Space Analyzer
Description
HDDFlyzer is a CLI-first cheminformatics toolkit with a Python API for reproducible, traceable molecular descriptor-space workflows. It generates descriptor, similarity, projection, and visualization outputs from compound collections, stores them with metadata, and supports later reconstruction, inspection, and analysis of completed runs.
HDDFlyzer can be used in two complementary ways:
- as a command-line tool, through the
hddflyzercommand; - as a Python library, for workflow execution, run reconstruction, artifact inspection, and scientific analysis over existing outputs.
The CLI and Python API are two entry points into the same traceable workflow engine.
What Is HDDFlyzer?
HDDFlyzer organizes a cheminformatics analysis around a local molecule registry and a dataset-first results folder. A typical workflow starts from a CSV of compounds, creates a canonical registry, annotates molecules, computes Tanimoto similarity and molecular descriptor tables, performs feature curation and correlation-based pruning, projects molecular spaces with PCA, t-SNE, or UMAP, and writes figures plus metadata-rich outputs.
The design goal is practical reproducibility: every major operation writes
stable files under results/<tag>/, and each step keeps enough metadata to
understand how the output was generated.
Why Use It?
Cheminformatics workflows often produce many disconnected files: descriptor tables, similarity matrices, dimensionality-reduction coordinates, figures, and metadata. HDDFlyzer keeps these products organized under a dataset-specific results folder and records how they were generated.
This makes the package useful for exploratory molecular-space analysis, teaching, workflow prototyping, and reproducible local studies where the relationship between input molecules, computational steps, and generated artifacts must remain visible.
Current Scope and Non-Goals
HDDFlyzer currently focuses on traceable molecular descriptor-space workflows for local datasets. It supports registry preparation, natural-product class annotation, similarity calculation, descriptor engineering, feature curation, dimensionality reduction, visualization, run reconstruction, and scientific views over existing outputs.
HDDFlyzer is not a docking tool, a complete drug-discovery platform, an automatic enrichment workflow, or an automatic chemical interpretation system. Its scientific helpers operate on already generated artifacts; they do not rerun the pipeline or infer chemical conclusions automatically.
Core Workflow
molecule CSV
-> canonical registry
-> NP-class annotation
-> Tanimoto similarity
-> descriptor tables
-> feature curation and pruning
-> PCA/t-SNE/UMAP projections
-> figures, metadata, manifest, and workflow summary
-> reconstructed result views from Python
All outputs for a collection are stored under one dataset-specific directory, for example:
results/aocd/
Installation
Python 3.11 or newer is recommended. RDKit is best installed from conda-forge.
After v0.1.5 is merged, tagged, and published through the manual PyPI
workflow, HDDFlyzer can be installed from PyPI:
pip install hddflyzer
For local development from a source checkout:
conda create -n hddflyzer_env python=3.11
conda activate hddflyzer_env
conda install -c conda-forge rdkit
pip install -e ".[dev]"
UMAP support is included through umap-learn in the package dependencies.
Input Data
Place local input collections in:
examples/
For example:
examples/valid_metadata_aocd.csv
Then prepare the registry with:
hddflyzer data prepare aocd
The default input directory is separate from hddflyzer/data/, which contains
package code. To use another input directory, set HDDFLYZER_DATA_DIR.
PowerShell:
$env:HDDFLYZER_DATA_DIR = "C:\path\to\csvs"
hddflyzer data prepare aocd
Unix-like shells:
HDDFLYZER_DATA_DIR=/path/to/csvs hddflyzer data prepare aocd
Quickstart
Prepare a molecule registry:
hddflyzer data prepare aocd
Run the standard workflow:
hddflyzer pipeline run aocd
Run a lighter workflow that stops before t-SNE and UMAP:
hddflyzer pipeline run aocd --skip-dimred
Execute only selected stages:
hddflyzer pipeline run aocd --stages chem.features,chem.pruning
Load the completed run from Python:
from hddflyzer.results import load_workflow_run
run = load_workflow_run("aocd")
run.workflow_contract
run.outputs(category="dimred")
CLI Workflow
Use the unified CLI:
hddflyzer <module> <subcommand> [args]
Recommended real-data sequence:
hddflyzer data prepare aocd
hddflyzer annotate npc aocd
hddflyzer chem tanimoto aocd
hddflyzer chem sample aocd
hddflyzer chem features aocd
hddflyzer chem curate-features aocd
hddflyzer chem reference-features aocd --reference
hddflyzer chem stats base aocd
hddflyzer chem stats hddf aocd
hddflyzer chem pruning aocd
Dimensionality-reduction commands include:
hddflyzer dimred pca aocd
hddflyzer dimred pca-joint aocd dianatdb
hddflyzer dimred tsne features aocd
hddflyzer dimred tsne tanimoto aocd
hddflyzer dimred tsne pruning aocd
hddflyzer dimred umap features aocd
hddflyzer dimred umap tanimoto aocd
hddflyzer dimred umap pruning aocd
Projection modes mean:
features = descriptor space from full feature table
tanimoto = structural similarity space from Morgan fingerprints
pruning = descriptor space after correlation-based feature pruning
Visualization commands include:
hddflyzer viz npc aocd
hddflyzer viz similarity tanimoto aocd
hddflyzer viz similarity fingerprints aocd
hddflyzer viz correlations hddf aocd
hddflyzer viz pca analysis aocd
hddflyzer viz tsne features aocd QED 30
hddflyzer viz umap features aocd QED
hddflyzer viz umap tanimoto aocd Pathway
Reference-dependent similarity descriptors are intentionally separate from the intrinsic feature table. They describe molecule-reference pairs, not molecules alone:
hddflyzer chem reference-features aocd --reference
Optional sampled feature calculation is available when Tanimoto sampling keeps fewer compounds than the full registry:
hddflyzer chem features aocd --sampled
Optional pickle output is explicit and not the default:
hddflyzer chem features aocd --save-pkl
Programmatic Workflow Execution
The same workflow engine can be called from Python:
from hddflyzer.pipeline import execute_workflow
execution = execute_workflow("aocd")
execution.ok
execution.stage_results
run = execution.run
For lower-level control, run_pipeline() returns list[StageResult], while
run_workflow() returns a reconstructed WorkflowRun or propagates an error if
the run cannot be reconstructed.
Main Outputs
The project uses a dataset-first results layout:
results/
<tag>/
manifest.json
workflow_summary.md
registry/
annotations/
chemistry/
features/
dimred/
figures/
For aocd, representative outputs include:
results/aocd/registry/molecules.csv
results/aocd/annotations/npclassifier/npclassifier.csv
results/aocd/chemistry/tanimoto/tanimoto_matrix.npz
results/aocd/chemistry/tanimoto/tanimoto_ids.csv
results/aocd/features/full/features.csv
results/aocd/features/curated/features_ml.csv
results/aocd/features/pruning/selected_features.txt
results/aocd/dimred/pca/pca_coordinates.csv
results/aocd/dimred/tsne/*/
results/aocd/dimred/umap/*/
results/aocd/manifest.json
results/aocd/workflow_summary.md
Each major step writes one interpretable data file plus one metadata file where
possible. Large matrices are stored in compressed .npz. Pickle output is
optional, not default.
manifest.json is the dataset-level index for a run. It records operation
history, generated files, stale file references, per-operation metadata,
current_outputs, output_categories, and a workflow_contract block that
makes the canonical registry -> chem -> dimred -> viz -> metadata/results
workflow explicit. workflow_summary.md presents the same contract as a concise
human-readable index.
Reconstructing Previous Runs
The result API reconstructs completed runs from results/<tag>/manifest.json.
It does not create new state or recalculate scientific outputs.
from hddflyzer.results import load_workflow_run, load_artifact
run = load_workflow_run("aocd")
artifact = run.artifact(kind="tanimoto_matrix", category="chem")
loaded = load_artifact(artifact)
loaded.artifact
loaded.data
loaded.metadata
Loaded result artifacts can also be selected by semantic kind:
artifact = run.artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
loaded = run.load_artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
loaded.data
loaded.metadata
Supported artifact kinds are:
molecule_registry
descriptor_table
tanimoto_matrix
projection_coordinates
figure
metadata
workflow_summary
unknown
run.artifacts(...) returns all matching artifacts. run.artifact(...) returns
exactly one artifact or raises a clear error when none or multiple candidates
match. run.load_artifact(...) selects one artifact and loads it using the same
contracts as load_artifact(...). The optional required filter can be used to
disambiguate artifacts by a path fragment, for example
required="features/full/features.csv".
Scientific Result Views
Loaded artifacts can be wrapped as lightweight scientific views:
LoadedArtifact -> DescriptorSpace
LoadedArtifact -> SimilaritySpace
LoadedArtifact -> ProjectionSpace
These views do not execute the pipeline and do not recalculate descriptors, similarity, or projections. They wrap artifacts that already exist in a reconstructed run, so Python code can work with results in domain language.
from hddflyzer.results import load_workflow_run
run = load_workflow_run("aocd")
descriptors = run.descriptor_space(
category="chem",
operation="chem.features",
)
similarity = run.similarity_space(category="chem")
projection = run.projection_space(
category="dimred",
operation="dimred.pca",
)
Each scientific space exposes molecule_ids when identifiers are available.
Spaces can be compared or aligned by molecular identity:
from hddflyzer.science import align_spaces, has_aligned_molecule_ids, shared_molecule_ids
shared = shared_molecule_ids(descriptors, projection)
aligned = has_aligned_molecule_ids(descriptors, projection)
descriptors_aligned, projection_aligned = align_spaces(descriptors, projection)
hddflyzer.science also provides small, pure analysis helpers over existing
spaces:
from hddflyzer.science import (
compare_descriptor_groups,
descriptor_projection_correlations,
projection_neighborhood_preservation,
similarity_projection_correlation,
similarity_projection_neighbor_overlap,
)
global_corr = similarity_projection_correlation(similarity, projection)
neighbor_overlap = similarity_projection_neighbor_overlap(similarity, projection, k=10)
descriptor_ranking = descriptor_projection_correlations(descriptors, projection)
local_preservation = projection_neighborhood_preservation(similarity, projection, k=10)
group_differences = compare_descriptor_groups(descriptors, labels="class_label")
These helpers use only already reconstructed artifacts. They do not generate plots, do not perform automatic clustering, do not run enrichment, and do not make automatic chemical interpretations.
Visualization from Results
Visualization inputs can be resolved from reconstructed results:
from hddflyzer.results import load_workflow_run
from hddflyzer.viz import resolve_viz_inputs, plot_hddf_scatters
run = load_workflow_run("aocd")
inputs = resolve_viz_inputs(
run,
kind="descriptor_table",
category="chem",
required="features/full/features.csv",
)
plot_hddf_scatters(inputs)
plot_hddf_scatters() also accepts a loaded descriptor-table artifact:
loaded = run.load_artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
plot_hddf_scatters(loaded)
Safety Notes
HDDFlyzer is designed as a local Python library and CLI application with defensive handling around reconstructed results:
- Pickle (
.pkl) loading is blocked by default in public loaders. - Pickle can be loaded only with
allow_pickle=True, and should be used only for trusted local files produced in a controlled environment. - Run tags are validated against empty values, path traversal, absolute paths, and path separators.
- Artifacts reconstructed from
manifest.jsonmust resolve inside the correspondingresults/<tag>/run directory. update_manifest()rejects files outsideresults/<tag>/.
These checks reduce accidental unsafe file access and unsafe deserialization. They are not a substitute for treating unknown files and manifests as untrusted input.
Project Architecture
HDDFlyzer separates command-line execution, workflow orchestration, result reconstruction, scientific views, visualization, and low-level I/O.
hddflyzer/
cli.py # command-line entry point
config/ # global settings and canonical output paths
pipeline/ # workflow execution contracts and orchestration
results/ # WorkflowRun, ResultArtifact, LoadedArtifact
science/ # scientific views, alignment, metrics, interpretation
io/ # readers, writers, manifests, path safety
data/ # canonical molecule registry logic
chem/ # descriptors, similarity, sampling, pruning, references
dimred/ # PCA, t-SNE, UMAP
viz/ # plotting layer and visualization input resolution
utils/ # shared helpers
The domain modules keep the scientific logic. The io/ layer owns canonical
read/write helpers. The pipeline/ layer exposes run_pipeline(),
run_workflow(), and execute_workflow() for programmatic execution. The
results/ layer reconstructs completed runs from manifest.json through
WorkflowRun, ResultArtifact, and LoadedArtifact.
The central CLI lives in hddflyzer/cli.py. Subpackages may still expose
module-level main() functions for direct execution, but hddflyzer is the
user-facing entry point.
HDDFlyzer follows a simple separation of responsibilities:
pure/helper functions reusable logic for tests, notebooks, and scripts
run() functions file-based workflow execution and manifest updates
main()/hddflyzer command-line parsing and process exit behavior
Descriptor Families
| Family | Description |
|---|---|
| Constitutional | Molecular weight, LogP, H-bonding, rings, rotatable bonds |
| Topological | Chi, Kappa, Balaban, Bertz, graph indices |
| Electronic | Partial charges and VSA descriptors |
| Geometrical | Currently limited; conformer-based 3D descriptors are planned |
| Hybrid | Derived ratios from base descriptors |
| HDDF | QED, lead-likeness, pharmacophore complexity, synthetic accessibility, desirability |
| Reference Similarity | Morgan, FeatMorgan, AtomPair, RDKFingerprint, Torsion, Layered, MACCS; stored under features/reference/ |
| Reference MCS | Maximum common substructure size, molecule atom counts, Tanimoto, and overlap; stored under features/reference/ |
Design Principles
HDDFlyzer follows these output conventions:
CSV = canonical tabular results
JSON = metadata, parameters, provenance, compact summaries
NPZ = compressed numeric matrices
PKL = optional cache only
PNG = visualization output
TXT = only for small human-facing operational files such as selected_features.txt
The goal is low redundancy, reproducible outputs, stable paths, and readable data products suitable for real dataset analysis.
Future Directions
HDDFlyzer currently focuses on 1D/2D descriptor tables, similarity matrices, dimensionality reduction, visualization, and reconstructed result analysis. Future versions may add explicit conformer-generation and 3D descriptor workflows, including reproducible conformer parameters, energy-minimization metadata, and clear separation between 2D descriptor tables and conformer-based 3D descriptor tables.
Development
Run tests:
pytest -q
The test suite currently covers core utilities, chemistry calculations, dimensionality-reduction helpers, configuration, persistence, orchestration, CLI routing, reconstructed results, scientific views, and visualization APIs.
Status
HDDFlyzer is pre-stable research software. The command structure is stabilizing, but output contracts may still evolve as the real-data pipeline matures.
Contributing
Contributions are welcome. Please open an issue before submitting a pull request. Follow the existing code style: NumPy-style docstrings, type hints, and SPDX license headers in all source files.
See CONTRIBUTING.md for full guidelines. Please also read our Code of Conduct.
Citation
If you use HDDFlyzer in your research, please cite it using the metadata in CITATION.cff or the format below:
Contreras-Torres, F. F. and Saldivar-González, F. I. (2026). HDDFlyzer: High-Dimensional Descriptor-based Feature Space Analyzer. https://github.com/NanoBiostructuresRG/hddflyzer
Author
Developed by Flavio F. Contreras-Torres (Tecnologico de Monterrey) and Fernanda I. Saldivar-González (Tecnologico de Monterrey) Monterrey, Mexico - June 2026
License
LGPL-3.0-or-later. See LICENSE, COPYING, and COPYING.LESSER.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file hddflyzer-0.1.5.tar.gz.
File metadata
- Download URL: hddflyzer-0.1.5.tar.gz
- Upload date:
- Size: 144.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ac2ce5f38318d1e589d31bd233d9a72f2df9b43b157c63cf5cff26b42835d561
|
|
| MD5 |
160da9e2efdc2e05957c1fcd638f6e8d
|
|
| BLAKE2b-256 |
0291c4d546554c9661c52979d756089146f45e1d06e2dd0f2aad205b4f7b54c2
|
Provenance
The following attestation bundles were made for hddflyzer-0.1.5.tar.gz:
Publisher:
publish-to-pypi.yml on NanoBiostructuresRG/hddflyzer
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
hddflyzer-0.1.5.tar.gz -
Subject digest:
ac2ce5f38318d1e589d31bd233d9a72f2df9b43b157c63cf5cff26b42835d561 - Sigstore transparency entry: 1827359995
- Sigstore integration time:
-
Permalink:
NanoBiostructuresRG/hddflyzer@e1e44b2602df0edd4baf2579a4ade9445f612a46 -
Branch / Tag:
refs/tags/v0.1.5 - Owner: https://github.com/NanoBiostructuresRG
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-to-pypi.yml@e1e44b2602df0edd4baf2579a4ade9445f612a46 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file hddflyzer-0.1.5-py3-none-any.whl.
File metadata
- Download URL: hddflyzer-0.1.5-py3-none-any.whl
- Upload date:
- Size: 142.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
81217c9110725096a3d361396efcae51e3a0cccc8fc6a95a4caa2713ba24f203
|
|
| MD5 |
b5e51a9729f250a17dd0f4d807d82bfa
|
|
| BLAKE2b-256 |
d547772ef3fdfff4433bf0e6e4217450227a19438961ff702cb772eab0495d3a
|
Provenance
The following attestation bundles were made for hddflyzer-0.1.5-py3-none-any.whl:
Publisher:
publish-to-pypi.yml on NanoBiostructuresRG/hddflyzer
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
hddflyzer-0.1.5-py3-none-any.whl -
Subject digest:
81217c9110725096a3d361396efcae51e3a0cccc8fc6a95a4caa2713ba24f203 - Sigstore transparency entry: 1827360027
- Sigstore integration time:
-
Permalink:
NanoBiostructuresRG/hddflyzer@e1e44b2602df0edd4baf2579a4ade9445f612a46 -
Branch / Tag:
refs/tags/v0.1.5 - Owner: https://github.com/NanoBiostructuresRG
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-to-pypi.yml@e1e44b2602df0edd4baf2579a4ade9445f612a46 -
Trigger Event:
workflow_dispatch
-
Statement type: