A package for processing PubChem data and managing a database of chemical compounds.
MolID
MolID is a Python toolkit and CLI for resolving, validating, and enriching chemical identifiers from PubChem: online via the API, offline via local databases, or through a configurable, ordered combination of both.
It is designed to provide robust compound lookup, CAS mapping, and caching for workflows that integrate chemical metadata into larger material-science or data-infrastructure projects.
Key Features
Flexible Search Sources
MolID now uses an ordered list of sources instead of legacy "modes". You can flexibly mix offline databases, cache, and API access using configuration options.
| Source | Description |
|---|---|
| `master` | The read-only, static master PubChem database (for NOMAD or labeling). |
| `cache` | User's local cache database with previously queried compounds. |
| `api` | Live PubChem REST API queries (optionally writing to cache). |
This replaces the old AUTO / offline-basic / online-cached modes. You can combine sources, e.g.:
- `["master"]` – strictly offline using master DB only.
- `["cache"]` – strictly offline using the local cache DB.
- `["cache", "api"]` – default hybrid mode (use cache, fall back to PubChem).
- `["master", "cache", "api"]` – prefer offline data, then API as fallback.
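The ordered-source behaviour described above can be pictured as a simple first-hit loop. The following is a minimal sketch, not MolID's actual implementation: the `resolve` function and the toy `backends` mapping are hypothetical stand-ins for the real master DB, cache DB, and PubChem API lookups.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical backend signature: takes an identifier, returns a record or None.
Lookup = Callable[[str], Optional[dict]]


def resolve(
    identifier: str, sources: List[str], backends: Dict[str, Lookup]
) -> Tuple[Optional[dict], Optional[str]]:
    """Try each source in order; return (record, source_name) for the first hit."""
    for name in sources:
        record = backends[name](identifier)
        if record is not None:
            return record, name
    return None, None


# Toy stand-ins for the real master DB, cache DB, and PubChem API.
backends = {
    "master": lambda key: None,
    "cache": lambda key: {"CID": 962} if key == "water" else None,
    "api": lambda key: {"CID": 241} if key == "benzene" else None,
}

print(resolve("water", ["cache", "api"], backends))    # found in the cache
print(resolve("benzene", ["cache", "api"], backends))  # falls through to the API
print(resolve("benzene", ["cache"], backends))         # strictly offline: no hit
```

The order of the list is the only thing that decides precedence, which is why `["cache"]` alone never touches the network in this sketch.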
Supported Identifiers
- CID, CAS, InChI, InChIKey, SMILES, MolecularFormula, and Name.
- Auto-normalization of identifiers (e.g. SMILES → InChIKey).
- Isotope-aware InChIKey generation from ASE `Atoms` objects using OpenBabel.
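To illustrate how an identifier type might be guessed from its shape, here is a small sketch. It is an assumption about one possible approach, not MolID's detection logic: the function names `is_valid_cas` and `guess_id_type` are hypothetical. It relies only on well-known facts: a standard InChIKey has 14, 10, and 1 uppercase letters separated by hyphens, and the last digit of a CAS number is a weighted checksum of the preceding digits.

```python
import re

INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(value: str) -> bool:
    """Check the CAS pattern and its check digit."""
    match = CAS_RE.match(value)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def guess_id_type(value: str) -> str:
    """Return a best-effort identifier type (hypothetical labels)."""
    value = value.strip()
    if INCHIKEY_RE.match(value):
        return "inchikey"
    if value.startswith("InChI="):
        return "inchi"
    if is_valid_cas(value):
        return "cas"
    if value.isdigit():
        return "cid"
    # SMILES, formulas, and names need chemistry-aware parsing to tell apart.
    return "name"


print(guess_id_type("QGZKDVFQNNGYKY-UHFFFAOYSA-N"))
print(guess_id_type("25322-68-3"))
```

Anything this sketch cannot classify falls back to `"name"`; distinguishing SMILES from formulas reliably would need a real parser.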
Databases
- Offline database: built from PubChem `.sdf.gz` dumps.
  - Tracks processed archives.
  - Can be updated incrementally.
- Cache database: stores API query results for faster future lookups.
  - Includes compound and CAS mapping tables.
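A cache database of this kind is commonly built on SQLite with UPSERT statements, so that repeated lookups refresh a row instead of duplicating it. The sketch below assumes a hypothetical `compound` table; MolID's real schema and column names may differ. It needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS compound (
    cid      INTEGER PRIMARY KEY,
    inchikey TEXT NOT NULL,
    formula  TEXT
);
"""

UPSERT = """
INSERT INTO compound (cid, inchikey, formula)
VALUES (:cid, :inchikey, :formula)
ON CONFLICT(cid) DO UPDATE SET
    inchikey = excluded.inchikey,
    formula  = excluded.formula;
"""


def store(conn: sqlite3.Connection, record: dict) -> None:
    """Insert a record, or overwrite the existing row with the same CID."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, record)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The second call updates the row created by the first one.
store(conn, {"cid": 962, "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "formula": None})
store(conn, {"cid": 962, "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "formula": "H2O"})

rows = conn.execute("SELECT cid, formula FROM compound").fetchall()
print(rows)
```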
CAS Enrichment
- Map PubChem CIDs to CAS numbers via PubChem xrefs.
- Automatic generic CAS detection and confidence downgrading.
- Validation of bidirectional CID↔CAS mappings.
- Supports concurrent enrichment of large datasets.
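As a rough illustration of concurrent enrichment with confidence downgrading, the sketch below fetches CAS numbers for several CIDs in parallel and marks a CAS as low-confidence when it is shared by multiple CIDs. The sharing heuristic, the `enrich` function, the threshold, and the toy data are all assumptions made for this example; MolID's actual generic-CAS rules are not described in this document.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy data with placeholder labels, not real CIDs or CAS numbers.
TOY_XREFS = {1: ["A"], 2: ["B", "G"], 3: ["G"]}


def enrich(cids, fetch, generic_threshold=2, workers=4):
    """Return {(cid, cas): confidence}, downgrading CAS values shared by many CIDs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cas_lists = list(pool.map(fetch, cids))  # map() preserves input order

    cids_by_cas = defaultdict(set)
    for cid, cas_list in zip(cids, cas_lists):
        for cas in cas_list:
            cids_by_cas[cas].add(cid)

    confidence = {}
    for cas, owners in cids_by_cas.items():
        # Assumed heuristic: a CAS claimed by several CIDs is treated as generic.
        level = "low" if len(owners) >= generic_threshold else "high"
        for cid in owners:
            confidence[(cid, cas)] = level
    return confidence


result = enrich([1, 2, 3], lambda cid: TOY_XREFS.get(cid, []))
print(result)
```

In a real run, `fetch` would be a throttled API call, so the worker count should respect PubChem's rate limits.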
Configurable Settings
- Fully managed by `pydantic.BaseSettings` and a `.env` file (`~/.molid.env`).
- Includes timeouts, retries, and throttling controls for API calls.
- Easily editable via CLI (`molid config ...`).
CLI Commands
| Command | Description |
|---|---|
| `molid config` | Manage configuration and sources |
| `molid db create` | Create new offline database |
| `molid db update` | Fetch & process PubChem archives |
| `molid db enrich-cas` | Enrich database with CAS mappings |
| `molid search` | Query molecules from any source |
Installation
Requirements
- Python ≥ 3.8
- Optional dependency: OpenBabel (for `.xyz` / ASE Atoms → InChIKey conversion)
- Optional system libs on Linux:
sudo apt install libxrender1 libxext6
Install from PyPI
pip install molid
Optional: Enable OpenBabel support
MolID can optionally use OpenBabel to convert .xyz or ASE Atoms structures into InChIKeys.
If you only search by SMILES, InChI, or InChIKey, you can skip this dependency.
To enable OpenBabel support:
pip install molid[openbabel]
# or, alternatively:
pip install openbabel-wheel
If OpenBabel is not installed and you run an .xyz or Atoms search, MolID will show:
ERROR: Missing optional dependency 'openbabel'. Install it to enable XYZ/Atoms → InChIKey conversion.
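An error like the one above is typically produced by a guarded import of the optional package. The following sketch shows the general pattern; the helper name `require_optional` is hypothetical and the code is not taken from MolID.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional module, or raise an error naming the missing dependency."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Missing optional dependency '{module_name}'. "
            f"Install it to enable {feature}."
        ) from exc


# A stdlib module always imports, so this call simply returns the module.
json_module = require_optional("json", "JSON output")

# A module that is not installed triggers the descriptive error.
try:
    require_optional("surely_not_installed_pkg", "XYZ/Atoms → InChIKey conversion")
except RuntimeError as err:
    message = str(err)
    print(message)
```

Deferring the import to the point of use keeps the base install light: users who only search by SMILES, InChI, or InChIKey never hit this path.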
Configuration
MolID reads from environment variables or ~/.molid.env.
All variables are prefixed MOLID_.
| Variable | Default | Description |
|---|---|---|
| `MOLID_MASTER_DB` | `pubchem_data_FULL.db` | Path to offline master database |
| `MOLID_CACHE_DB` | `pubchem_cache.db` | Path to API cache database |
| `MOLID_SOURCES` | `cache,api` | Ordered list of data sources (`master`, `cache`, `api`) |
| `MOLID_CACHE_WRITES` | `True` | Whether API results are written into the cache database |
| `MOLID_DOWNLOAD_FOLDER` | `~/.cache/molid/downloads` | Folder for PubChem `.sdf.gz` archives |
| `MOLID_PROCESSED_FOLDER` | `~/.local/share/molid/processed` | Folder for unpacked `.sdf` files |
| `MOLID_LOG_FILE` | `~/.local/share/molid/molid.log` | Default log file |
| `MOLID_HTTP_CONNECT_TIMEOUT` | `10` | API connection timeout (s) |
| `MOLID_HTTP_READ_TIMEOUT` | `35` | API read timeout (s) |
| `MOLID_HTTP_RETRIES` | `4` | Retry attempts |
| `MOLID_HTTP_BACKOFF` | `0.7` | Backoff factor between retries |
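To make the variables above concrete, here is a standard-library sketch of how `MOLID_`-prefixed settings could be read with the documented defaults. MolID itself uses pydantic for this; the `Settings` dataclass, the `load_settings` function, and the comma-separated parsing of `MOLID_SOURCES` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping

VALID_SOURCES = {"master", "cache", "api"}


@dataclass
class Settings:
    sources: List[str]
    cache_writes: bool
    connect_timeout: float
    read_timeout: float
    retries: int
    backoff: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read MOLID_* variables from a mapping, falling back to documented defaults."""

    def get(name: str, default: str) -> str:
        return env.get(f"MOLID_{name}", default)

    # Assumed format: a comma-separated list such as "cache,api".
    sources = [s.strip() for s in get("SOURCES", "cache,api").split(",") if s.strip()]
    unknown = set(sources) - VALID_SOURCES
    if unknown:
        raise ValueError(f"Unknown sources: {sorted(unknown)}")

    return Settings(
        sources=sources,
        cache_writes=get("CACHE_WRITES", "True").strip().lower() in {"1", "true", "yes"},
        connect_timeout=float(get("HTTP_CONNECT_TIMEOUT", "10")),
        read_timeout=float(get("HTTP_READ_TIMEOUT", "35")),
        retries=int(get("HTTP_RETRIES", "4")),
        backoff=float(get("HTTP_BACKOFF", "0.7")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
print(load_settings({}))
print(load_settings({"MOLID_SOURCES": "master, cache", "MOLID_CACHE_WRITES": "false"}))
```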
Example CLI setup
molid config set-master /data/molid/pubchem_master.db
molid config set-cache ~/.cache/molid/pubchem_cache.db
molid config set-sources cache api
molid config set-cache-writes true
molid config show
Usage
Create and Update Database
# Create empty offline DB
molid db create --db-file pubchem_data.db
# Download and ingest new PubChem SDF batches
molid db update --max-files 10
Enrich CAS mappings
molid db enrich-cas --limit 100000
Search Examples
# Search by InChIKey
molid search QGZKDVFQNNGYKY-UHFFFAOYSA-N --id-type inchikey
# Search by formula
molid search H2O --id-type molecularformula
# Auto-detect identifier type
molid search 25322-68-3
Outputs a JSON block including compound properties and data source.
Python API
from ase.build import molecule

from molid.main import run
from molid.pipeline import search_identifier

# From an ASE Atoms object (here: a water molecule built with ASE)
atoms = molecule("H2O")
results, source = run(atoms)

# From a SMILES string (benzene)
results, source = search_identifier({"smiles": "C1=CC=CC=C1"})
Additional helpers:
- `search_from_file(path)` → handles `.xyz`, `.extxyz`, `.sdf`
- `search_from_atoms(atoms)` → handles ASE `Atoms`
- `search_from_input(data)` → infers type automatically
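Automatic type inference of this sort usually comes down to a few ordered checks on the input. The sketch below shows one plausible way to do it; `infer_input_kind` and its return labels are hypothetical and do not reflect how `search_from_input` is actually implemented.

```python
from pathlib import Path

FILE_SUFFIXES = {".xyz", ".extxyz", ".sdf"}


def infer_input_kind(data) -> str:
    """Classify input as 'atoms', 'file', or 'identifier' (hypothetical labels)."""
    # Duck-typing check: ASE Atoms objects expose get_chemical_symbols().
    if hasattr(data, "get_chemical_symbols"):
        return "atoms"
    if isinstance(data, dict):
        return "identifier"
    if isinstance(data, (str, Path)) and Path(data).suffix.lower() in FILE_SUFFIXES:
        return "file"
    if isinstance(data, str):
        return "identifier"
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


print(infer_input_kind("water.xyz"))
print(infer_input_kind({"smiles": "C1=CC=CC=C1"}))
print(infer_input_kind("C1=CC=CC=C1"))
```

Checking for the `Atoms`-like method first avoids importing ASE just to run an `isinstance` test.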
Architecture Overview
molid/
├── __init__.py
├── __main__.py # Enables python -m molid CLI execution
├── main.py # High-level programmatic entrypoint (API wrapper)
├── cli.py # Command-line interface: config, DB ops, search
├── pipeline.py # Unified search orchestration (Atoms, file, identifier)
│
├── search/
│ ├── __init__.py
│ ├── service.py # Central search engine with offline/online/auto modes
│ └── db_lookup.py # SQLite lookup logic for offline and cache DBs
│
├── db/
│ ├── __init__.py
│ ├── schema.py # Centralized SQLite schema & property definitions
│ ├── db_utils.py # Database creation, initialization, UPSERT helpers
│ ├── sqlite_manager.py # Generic SQLite wrapper (queries, inserts, schema setup)
│ ├── offline_db_cli.py # CLI for managing PubChem offline archives (download, ingest, enrich)
│ └── cas_enrich.py # Parallel CAS↔CID enrichment, confidence scoring, and generic-CAS detection
│
├── pubchemproc/
│ ├── __init__.py
│ ├── pubchem.py # SDF file parsing and extraction of compound records
│ ├── fetch.py # High-level data retrieval and enrichment via PubChem REST API
│ ├── pubchem_client.py # Session management, retry policies, and endpoint resolution
│ ├── cache.py # Cache database management and store/fetch synchronization
│ └── file_handler.py # File utilities for .gz, .sdf, and MD5 validation
│
├── utils/
│ ├── __init__.py
│ ├── formula.py # Formula parsing and Hill-system canonicalization
│ ├── conversion.py # SMILES/InChI/XYZ conversion and isotope tagging via OpenBabel
│ ├── identifiers.py # Identifier normalization and type coercion
│ ├── ftp_utils.py # FTP/HTTP logic for downloading PubChem archives
│ ├── disk_utils.py # Disk space validation utilities
│ └── settings.py # Pydantic-based configuration loader & persistence
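The Hill-system canonicalization mentioned for `utils/formula.py` follows a simple rule: if carbon is present, list C first, then H, then all other elements alphabetically; without carbon, list every element alphabetically. The sketch below applies that rule to flat formulas. It is an independent illustration (the `hill_formula` function is hypothetical) and does not handle parentheses, charges, or isotopes.

```python
import re
from collections import Counter

# One element symbol (uppercase letter plus optional lowercase) and an optional count.
TOKEN_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def hill_formula(formula: str) -> str:
    """Return a flat molecular formula rewritten in Hill order."""
    counts = Counter()
    for symbol, number in TOKEN_RE.findall(formula):
        counts[symbol] += int(number) if number else 1

    if "C" in counts:
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)

    # Omit the count when it is 1, as in conventional formula notation.
    return "".join(f"{s}{counts[s] if counts[s] > 1 else ''}" for s in order)


print(hill_formula("OH2"))
print(hill_formula("CH3COOH"))
```

Canonicalizing before lookup means that `OH2` and `H2O` hit the same database key.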
Development & Testing
pytest -v
black .
flake8 .
Optional integration test:
molid search --id-type smiles C
License
MolID is released under the Apache License 2.0. See the LICENSE file for full details.