MOLRAPTOR: Molecular Learning via Rapid Processing of Topological Representations
MOLRAPTOR is a pre-stable modular pipeline for fetching, curating, and encoding molecular datasets using PubChem data and RDKit's Morgan fingerprinting algorithm, designed for cheminformatics workflows and phase 1 machine learning applications in computational drug discovery.
Project Structure
MOLRAPTOR/
├── .github/workflows/
│ ├── ci.yml
│ ├── docs.yml
│ └── publish-to-pypi.yml
├── docs/
│ ├── stylesheets/
│ │ └── extra.css
│ ├── api.md
│ ├── cli.md
│ ├── configuration.md
│ ├── index.md
│ ├── installation.md
│ ├── quickstart.md
│ └── release.md
├── examples/
│ └── example_config.yaml
├── molraptor/
│ ├── __init__.py
│ ├── cli.py
│ ├── config.py
│ ├── curate.py
│ ├── fetch.py
│ ├── fingerprint.py
│ ├── fp_integrity.py
│ ├── pipeline.py
│ ├── pubchem.py
│ ├── result_manager.py
│ ├── validators.py
│ └── version.py
├── tests/
│ ├── __init__.py
│ ├── conftest.py
│ ├── test_public_api.py
│ └── test_version.py
├── .gitignore
├── CHANGELOG.md
├── CITATION.cff
├── COPYING
├── COPYING.LESSER
├── environment.yml
├── LICENSE
├── mkdocs.yml
├── pyproject.toml
└── README.md
Project Identity
Project: MOLRAPTOR
PyPI distribution: molraptor
Import package: molraptor
CLI: molraptor
Version: 0.1.1
License: LGPL-3.0-or-later
Status: alpha / pre-stable
Documentation
The live documentation is published at:
https://nanobiostructuresrg.github.io/molraptor/
Key pages:
Installation
After PyPI publication:
python -m pip install molraptor
For local development:
git clone https://github.com/NanoBiostructuresRG/molraptor.git
cd molraptor
python -m pip install -e .
For development and documentation tools:
python -m pip install -e ".[dev]"
python -m pip install -e ".[docs]"
Quick Start
Run the pipeline with the bundled example configuration:
molraptor run --config examples/example_config.yaml
Run from Python:
from molraptor import MolraptorConfig, run
config = MolraptorConfig.load("examples/example_config.yaml")
run(config)
Scope
| MOLRAPTOR does | MOLRAPTOR does not |
|---|---|
| Fetch molecular properties from PubChem. | Train machine learning models. |
| Curate and validate chemical datasets. | Perform dimensionality reduction. |
| Generate Morgan fingerprints via RDKit. | Support non-PubChem data sources (yet). |
| Output ML-ready .npy and .csv artifacts. | Handle 3D molecular structures. |
| Log failed CIDs for reproducibility. | Support alternative fingerprint types (yet). |
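To make the "fetch molecular properties from PubChem" row concrete, the sketch below builds a PubChem PUG REST property URL for a list of CIDs. It is not taken from MOLRAPTOR's `fetch.py` or `pubchem.py`; the function name `property_url` and the default property list are assumptions chosen for illustration, and no network request is made.

```python
from urllib.parse import quote

# Public base URL of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties=("MolecularFormula", "MolecularWeight", "InChIKey")):
    """Build a PUG REST URL requesting `properties` for one or more CIDs as JSON.

    Hypothetical helper for illustration; MOLRAPTOR's own request logic may differ.
    """
    # int() rejects anything that is not a whole-number CID before it reaches the URL.
    cid_part = ",".join(str(int(cid)) for cid in cids)
    prop_part = ",".join(quote(name) for name in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


print(property_url([2244, 3672]))
```

The resulting URL can be passed to any HTTP client. Batching several CIDs into one request, as done here, reduces the number of calls made to PubChem.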
CLI
molraptor --help
molraptor run --help
molraptor --version
Common commands:
molraptor run
molraptor run --config examples/example_config.yaml
molraptor run --config examples/example_config.yaml --verbose
Public API
from molraptor import MolraptorConfig
from molraptor import validate_config
from molraptor import run
from molraptor import DataValidator
from molraptor import __version__
Modules not listed above are importable directly but are not part of the public contract and may change before 1.0.
Input Format
data/
└── dataset.csv <- CSV with PubChem CIDs and labels
Minimum required columns: PubChem CID, Label.
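As a quick pre-flight check, the following sketch tests whether a CSV header contains the two required columns. It assumes the header cells are spelled exactly `PubChem CID` and `Label`; the helper `missing_columns` is hypothetical and does not use MOLRAPTOR's `DataValidator`, whose interface is not described here.

```python
import csv
import io

# Assumed header spelling, taken from the column names listed above.
REQUIRED_COLUMNS = ("PubChem CID", "Label")


def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


sample = "PubChem CID,Label\n2244,1\n3672,0\n"
print(missing_columns(sample))                   # []
print(missing_columns("CID,Label\n2244,1\n"))    # ['PubChem CID']
```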
Outputs
artifacts/
├── morgan_fp.csv # Morgan fingerprints (human-readable)
├── morgan_db_*.npy # Morgan fingerprints (NumPy array, shape: N×size)
├── labels.npy # Target labels (NumPy array, shape: N,)
└── summary.txt # Execution report
Local inputs and generated artifacts such as data/, artifacts/, and logs/
are intentionally ignored by Git.
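A downstream script might load these artifacts roughly as follows. This is a minimal sketch that relies only on the file names and shapes listed above; the helper `load_artifacts` is hypothetical and not part of the `molraptor` public API, and it simply picks the first file matching `morgan_db_*.npy`.

```python
from pathlib import Path

import numpy as np


def load_artifacts(artifact_dir):
    """Return (fingerprints, labels) read from an artifacts directory.

    Hypothetical helper: uses the first file matching morgan_db_*.npy
    together with labels.npy, and checks that the row counts agree.
    """
    root = Path(artifact_dir)
    fp_files = sorted(root.glob("morgan_db_*.npy"))
    if not fp_files:
        raise FileNotFoundError(f"no morgan_db_*.npy file found in {root}")
    fingerprints = np.load(fp_files[0])   # expected shape: (N, size)
    labels = np.load(root / "labels.npy")  # expected shape: (N,)
    if fingerprints.shape[0] != labels.shape[0]:
        raise ValueError("fingerprints and labels have different row counts")
    return fingerprints, labels
```

After a pipeline run, `X, y = load_artifacts("artifacts")` would give a feature matrix and label vector ready to pass to a model of your choice.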
Validation
The current dev/v0.1.1 branch is expected to pass the following checks:
python -m pytest tests/ -v
mkdocs build --strict
python -m build --no-isolation
python -m twine check dist/*
molraptor --help
molraptor run --help
molraptor --version
Citation
If you use MOLRAPTOR in your research, please cite it using the metadata in CITATION.cff.
Author
Developed by Flavio F. Contreras-Torres, Tecnologico de Monterrey.
License
This project is licensed under the terms of the GNU Lesser General Public License v3.0 or later.
SPDX identifier: LGPL-3.0-or-later
File details
Details for the file molraptor-0.1.1.tar.gz.
File metadata
- Download URL: molraptor-0.1.1.tar.gz
- Upload date:
- Size: 29.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1195dd34ff4889dd403e0b8d24e834a71d0d8f45408d6bf5ca3c9c71a66738c6 |
| MD5 | c30947908c23e33941d2db309449adad |
| BLAKE2b-256 | 1838af187e7ee827bb209cf64eefd301579a9108de0e33e5581703421eab046a |
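A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. The sketch below is illustrative; the helper names `sha256_of` and `matches_published_digest` are assumptions, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published above for molraptor-0.1.1.tar.gz.
EXPECTED_SHA256 = "1195dd34ff4889dd403e0b8d24e834a71d0d8f45408d6bf5ca3c9c71a66738c6"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_digest("molraptor-0.1.1.tar.gz")` should return `True` for an intact copy of the source distribution.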
Provenance
The following attestation bundles were made for molraptor-0.1.1.tar.gz:
Publisher: publish-to-pypi.yml on NanoBiostructuresRG/molraptor
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: molraptor-0.1.1.tar.gz
- Subject digest: 1195dd34ff4889dd403e0b8d24e834a71d0d8f45408d6bf5ca3c9c71a66738c6
- Sigstore transparency entry: 1659711368
- Sigstore integration time:
- Permalink: NanoBiostructuresRG/molraptor@83063cbc257f6089270472bbb2400b102af1b21e
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/NanoBiostructuresRG
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@83063cbc257f6089270472bbb2400b102af1b21e
- Trigger Event: workflow_dispatch
File details
Details for the file molraptor-0.1.1-py3-none-any.whl.
File metadata
- Download URL: molraptor-0.1.1-py3-none-any.whl
- Upload date:
- Size: 32.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f3b21de23b3d97f6c5dabc7836577203c563983d7206174167acdd809ba18c12 |
| MD5 | f22ce1b9252927f516ba964c59d39454 |
| BLAKE2b-256 | b5b9991cf7ae24628264ccbe328e2689a814acdc95ed1b309a970feb2a61a9c4 |
Provenance
The following attestation bundles were made for molraptor-0.1.1-py3-none-any.whl:
Publisher: publish-to-pypi.yml on NanoBiostructuresRG/molraptor
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: molraptor-0.1.1-py3-none-any.whl
- Subject digest: f3b21de23b3d97f6c5dabc7836577203c563983d7206174167acdd809ba18c12
- Sigstore transparency entry: 1659711495
- Sigstore integration time:
- Permalink: NanoBiostructuresRG/molraptor@83063cbc257f6089270472bbb2400b102af1b21e
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/NanoBiostructuresRG
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@83063cbc257f6089270472bbb2400b102af1b21e
- Trigger Event: workflow_dispatch