Quantum-ready molecular toolkit. An open-source Python library that pulls PubChem records, structures them into Molecule objects with rich property sets, and enriches each with IBM Granite summaries and embeddings—ready for quantum simulations, AI pipelines, and scientific research.

🧪 robotu-molkit

🚧 This project is under active development. Expect frequent changes as we build the foundation for quantum-ready molecular discovery.

Quantum-ready molecular toolkit.
robotu-molkit is the first library to enrich PubChem molecules with AI-native context: each molecule is converted into a simulation-ready Molecule object and annotated using IBM Granite models to generate both human-readable summaries and high-dimensional embeddings—bridging chemistry, AI, and quantum workflows.


🔍 About

robotu-molkit is part of the RobotU Quantum ecosystem, and it's the first open-source toolkit to unify molecular data curation, AI enrichment, and semantic search—designed from the ground up for quantum and AI workflows.

Unlike traditional cheminformatics libraries, robotu-molkit goes beyond parsing: it integrates IBM watsonx Granite models to generate natural-language summaries and high-dimensional vector embeddings for each molecule. These AI-generated fingerprints capture not just structure, but meaning—enabling search queries like "low-toxicity CNS stimulants under 250 Da" to return relevant results instantly.

robotu-molkit ingests records from PubChem, standardizes >10 property categories (geometry, quantum, spectra, safety, solubility, etc.), and outputs clean Molecule objects with embedded context-aware vectors. Molecules can be searched semantically, compared structurally, or exported into simulation pipelines—making it ideal for researchers in quantum chemistry, drug discovery, and AI-accelerated science.

It’s the first library to:

  • Embed both summaries and molecular sections using Granite Embedding
  • Enable similarity search powered by local FAISS (Milvus vector database through watsonx coming soon)
  • Support hybrid semantic + structure-based filtering via Tanimoto + AI vectors

In short: robotu-molkit turns raw chemical records into simulation-ready, AI-searchable molecules.
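
To make the hybrid "Tanimoto + AI vectors" idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation. It assumes fingerprints are represented as sets of on-bit indices, embeddings as plain lists of floats, and that structural similarity is measured against a single query fingerprint. The names tanimoto, cosine, and hybrid_rank are hypothetical, and the toy data is illustrative only.

```python
import math
from typing import Any, Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_rank(
    query_vec: List[float],
    query_bits: Set[int],
    molecules: List[Dict[str, Any]],
    sim_threshold: float = 0.70,
) -> List[Tuple[str, float, float]]:
    """Keep molecules whose Tanimoto similarity to the query fingerprint
    meets the threshold, then sort them by cosine score (highest first)."""
    hits = []
    for mol in molecules:
        sim = tanimoto(query_bits, mol["bits"])
        if sim >= sim_threshold:
            hits.append((mol["name"], cosine(query_vec, mol["vector"]), sim))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy records (hypothetical; not real fingerprints or Granite embeddings)
toy = [
    {"name": "A", "bits": {1, 2, 3, 4}, "vector": [1.0, 0.0]},
    {"name": "B", "bits": {1, 2, 3, 5}, "vector": [0.6, 0.8]},
    {"name": "C", "bits": {7, 8, 9}, "vector": [1.0, 0.0]},
]
ranked = hybrid_rank([1.0, 0.0], {1, 2, 3, 4}, toy, sim_threshold=0.5)
for name, score, sim in ranked:
    print(f"{name} Score:{score:.3f} Tanimoto:{sim:.2f}")
```

In this sketch, C is semantically identical to the query vector but shares no fingerprint bits, so the structural threshold removes it before ranking.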

📦 Installation

git clone https://github.com/robotu-ai/robotu-molkit.git
cd robotu-molkit
pip install -e .

🛠️ CLI Usage

robotu‑molkit ships with a single entry‑point, molkit, that orchestrates each pipeline stage.

ℹ️ Run molkit --help or molkit <command> --help for full option details.

0. Configure (one‑time)

molkit config --watsonx-api-key $WATSONX_API_KEY --watsonx-project-id $WATSONX_PROJECT_ID

1. Ingest — download & parse PubChem records

molkit ingest 2244 1983 3675
molkit ingest --file path/to/cids.txt
molkit ingest 2244 1983 --concurrency 8

2. Embed — enrich with Granite summaries & vectors

molkit embed
molkit embed --fast

3. Upload — not yet implemented

The pipeline currently stops after generating watsonx_vectors.jsonl.
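
Until upload is available, the generated file can be inspected locally. The following is a minimal sketch using only the standard library. It assumes the file follows the usual JSON Lines convention (one JSON object per line) and makes no assumption about which keys the records contain. The helper names iter_jsonl and summarize are hypothetical, and the file location is a guess that may differ on your setup.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: Path) -> Dict[str, Any]:
    """Count records and collect the union of their top-level keys."""
    count = 0
    keys = set()
    for record in iter_jsonl(path):
        count += 1
        keys.update(record.keys())
    return {"records": count, "keys": sorted(keys)}


if __name__ == "__main__":
    target = Path("watsonx_vectors.jsonl")  # assumed location; adjust as needed
    if target.exists():
        print(summarize(target))
```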


🧬 Available Fields

🧾 Identifiers and Names

  • name, inchi, inchikey, smiles, cid, formula, molecular_weight

⚛️ Structure and Geometry

  • xyz, heavy_atom_count, ring_count, aromatic_ring_count, rotatable_bonds, fsp3, bertz_ct

🧪 Properties

  • hbond_donors, hbond_acceptors, tpsa, logp, logs, ghs_codes, hazard_tag, solubility_tag, spectra_tag, chem_tag

🧠 Embeddings and Metadata

  • summary, structure, ecfp, maccs

Possible solubility_tag values and their thresholds:

  • unknown solubility

    • When log‐solubility (logs) is None.
  • very soluble

    • logs > -0.5
  • soluble

    • -1.5 < logs ≤ -0.5
  • moderately soluble

    • -3.0 < logs ≤ -1.5
  • sparingly soluble

    • -4.0 < logs ≤ -3.0
  • insoluble

    • logs ≤ -4.0
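
The thresholds above can be expressed as a small function. This is a sketch derived only from the listed ranges, not the library's own code; the function name solubility_tag_from_logs is hypothetical.

```python
from typing import Optional


def solubility_tag_from_logs(logs: Optional[float]) -> str:
    """Map a log-solubility value to the tag ranges listed above."""
    if logs is None:
        return "unknown solubility"
    if logs > -0.5:
        return "very soluble"
    if logs > -1.5:  # -1.5 < logs <= -0.5
        return "soluble"
    if logs > -3.0:  # -3.0 < logs <= -1.5
        return "moderately soluble"
    if logs > -4.0:  # -4.0 < logs <= -3.0
        return "sparingly soluble"
    return "insoluble"  # logs <= -4.0


for value in (None, 0.2, -0.5, -2.0, -3.0, -4.0):
    print(value, "->", solubility_tag_from_logs(value))
```

Each upper bound is inclusive, so a value of exactly -0.5 is tagged soluble and exactly -4.0 is tagged insoluble.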

💡 Search Examples

from robotu_molkit.credentials_manager import CredentialsManager
from robotu_molkit.search.searcher import LocalSearch
from robotu_molkit.constants import DEFAULT_JSONL_FILE_ROUTE

WATSON_API_KEY = ""
WATSON_PROJECT_ID = ""
CredentialsManager.set_api_key(WATSON_API_KEY)
CredentialsManager.set_project_id(WATSON_PROJECT_ID)

searcher = LocalSearch(jsonl_path=DEFAULT_JSONL_FILE_ROUTE)
query_text = "methylxanthine derivatives with central nervous system stimulant activity"
filters = {"molecular_weight": (0, 250), "solubility_tag": "soluble"}

results1 = searcher.search_by_semantics(
    query_text=query_text, top_k=20, faiss_k=300, filters=filters
)

entries = [f"CID {m['cid']} Name:{m.get('name','<unknown>')} MW:{m.get('molecular_weight',0):.1f} Sol:{m.get('solubility_tag','')} Score:{s:.3f}" for m,s in results1]
print(f"Top {len(entries)} hits:\n" + "\n".join(entries))

SIM_THRESHOLD = 0.70  # minimum Tanimoto similarity for a hit to be kept

results2 = searcher.search_by_semantics_and_structure(
    query_text=query_text, top_k=20, faiss_k=300, filters=filters, sim_threshold=SIM_THRESHOLD
)

entries = [f"CID {m['cid']} Name:{m.get('name','<unknown>')} MW:{m.get('molecular_weight',0):.1f} Sol:{m.get('solubility_tag','')} Score:{s:.3f} Tanimoto:{sim:.2f}" for m,s,sim in results2]
print(f"Top {len(entries)} hits (Granite scaffolds, Tanimoto ≥ {SIM_THRESHOLD}):\n" + "\n".join(entries))

Parameter filters

The filters parameter of search_by_semantics and search_by_semantics_and_structure allows you to refine results based on metadata. It’s a Python dict mapping field names to conditions:

filters: Dict[str, Any] = {
    'field': condition,
    # …
}

Condition types

  • Single value (equality)
    filters = { 'solubility_tag': 'soluble' }
    Only entries where meta['solubility_tag'] == 'soluble' pass.

  • Range (tuple)
    filters = { 'molecular_weight': (100, 500) }
    Only entries where 100 <= meta['molecular_weight'] <= 500 pass.

  • List (membership)
    filters = { 'cid': [123, 456, 789] }
    Only entries where meta['cid'] is in the list pass.


Internally, filtering is done like this:

def passes(m: Dict[str, Any]) -> bool:
    for k, cond in filters.items():
        v = m.get(k)
        if isinstance(cond, tuple):
            if v is None or not (cond[0] <= v <= cond[1]):
                return False
        elif isinstance(cond, list):
            if v not in cond:
                return False
        else:
            if v != cond:
                return False
    return True

filtered = [(m, s) for m, s in hits if passes(m)][:top_k]
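
The snippet above relies on hits, filters, and top_k from its enclosing scope. As a self-contained sketch for experimenting with filter behavior, the same logic can be wrapped in a function and run against toy metadata. The wrapper name apply_filters and the sample records are hypothetical, not part of the library.

```python
from typing import Any, Dict, List, Tuple

Hit = Tuple[Dict[str, Any], float]  # (metadata, score)


def apply_filters(hits: List[Hit], filters: Dict[str, Any], top_k: int) -> List[Hit]:
    """Apply equality, range (tuple), and membership (list) conditions."""

    def passes(m: Dict[str, Any]) -> bool:
        for k, cond in filters.items():
            v = m.get(k)
            if isinstance(cond, tuple):
                # Inclusive range; a missing value never passes
                if v is None or not (cond[0] <= v <= cond[1]):
                    return False
            elif isinstance(cond, list):
                if v not in cond:
                    return False
            elif v != cond:
                return False
        return True

    return [(m, s) for m, s in hits if passes(m)][:top_k]


# Toy hits (made-up CIDs and values, for illustration only)
hits = [
    ({"cid": 1, "molecular_weight": 194.2, "solubility_tag": "soluble"}, 0.91),
    ({"cid": 2, "molecular_weight": 310.0, "solubility_tag": "soluble"}, 0.88),
    ({"cid": 3, "solubility_tag": "soluble"}, 0.80),  # no molecular_weight
    ({"cid": 4, "molecular_weight": 180.2, "solubility_tag": "insoluble"}, 0.75),
]
filters = {"molecular_weight": (0, 250), "solubility_tag": "soluble"}
kept = apply_filters(hits, filters, top_k=10)
print([m["cid"] for m, _ in kept])
```

Two behaviors are worth noting: a record that lacks a range-filtered field is rejected rather than passed through, and top_k is applied after filtering, so fewer than top_k results may come back.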

Example usage

my_filters = {
    'solubility_tag': 'soluble',
    'molecular_weight': (150, 450),
}

results = searcher.search_by_semantics(
    query_text="kinase inhibitor",
    top_k=20,
    filters=my_filters
)

📄 License

Apache 2.0 License — see LICENSE file.


RobotU Quantum — accelerating discovery through open, AI-enhanced, quantum-ready data.
