A comprehensive CLI tool for RDKit cheminformatics operations
rdkit-cli
A comprehensive, high-performance CLI tool wrapping RDKit functionality for cheminformatics workflows.
Features
- 29 Command Categories: align, conformers, convert, deduplicate, depict, descriptors, diversity, enumerate, filter, fingerprints, fragment, info, mcs, merge, mmp, props, protonate, reactions, rgroup, rings, rmsd, sample, sascorer, scaffold, similarity, split, standardize, stats, validate
- Multiple Input/Output Formats: CSV, TSV, SMI, SDF, Parquet
- Parallel Processing: Efficient multi-core support via ProcessPoolExecutor
- Ninja-style Progress: Real-time progress display with speed and ETA
Installation
pip install rdkit-cli
Or with uv:
uv add rdkit-cli
Quick Start
# Compute molecular descriptors
rdkit-cli descriptors compute -i molecules.csv -o descriptors.csv -d MolWt,MolLogP,TPSA
# Generate fingerprints
rdkit-cli fingerprints compute -i molecules.csv -o fingerprints.csv --type morgan
# Filter by drug-likeness
rdkit-cli filter druglike -i molecules.csv -o filtered.csv --rule lipinski
# Standardize molecules
rdkit-cli standardize -i molecules.csv -o standardized.csv --cleanup --neutralize
# Similarity search
rdkit-cli similarity search -i library.csv -o hits.csv --query "c1ccccc1" --threshold 0.7
Commands
descriptors
Compute molecular descriptors.
# List available descriptors
rdkit-cli descriptors list
rdkit-cli descriptors list --all
# Compute specific descriptors
rdkit-cli descriptors compute -i input.csv -o output.csv -d MolWt,MolLogP,TPSA
# Compute all descriptors
rdkit-cli descriptors compute -i input.csv -o output.csv --all
fingerprints
Generate molecular fingerprints.
# List available fingerprint types
rdkit-cli fingerprints list
# Compute Morgan fingerprints (default)
rdkit-cli fingerprints compute -i input.csv -o output.csv --type morgan
# With options
rdkit-cli fingerprints compute -i input.csv -o output.csv \
--type morgan --radius 3 --bits 4096 --use-chirality
Supported types: morgan, maccs, rdkit, atompair, torsion, pattern
filter
Filter molecules by various criteria.
# Substructure filter
rdkit-cli filter substructure -i input.csv -o output.csv --smarts "c1ccccc1"
rdkit-cli filter substructure -i input.csv -o output.csv --smarts "c1ccccc1" --exclude
# Property filter
rdkit-cli filter property -i input.csv -o output.csv --rule "MolWt < 500"
# Drug-likeness filters
rdkit-cli filter druglike -i input.csv -o output.csv --rule lipinski
rdkit-cli filter druglike -i input.csv -o output.csv --rule veber
rdkit-cli filter druglike -i input.csv -o output.csv --rule ghose
# PAINS filter
rdkit-cli filter pains -i input.csv -o output.csv
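A property rule compares one descriptor column against a threshold, and a drug-likeness rule is a small set of such comparisons. The Python sketch below illustrates that idea on rows that already carry descriptor values. It is not rdkit-cli's parser: the function names, the accepted operators, and the "at most how many violations" policy are assumptions for this example, and the values in the last row are invented.

```python
import operator
import re

# Comparison symbols accepted by this sketch (an assumption, not the CLI's grammar).
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}
RULE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")

# Classical rule-of-five limits, written with RDKit descriptor names.
LIPINSKI = ["MolWt <= 500", "MolLogP <= 5", "NumHDonors <= 5", "NumHAcceptors <= 10"]


def parse_rule(rule):
    """Split a rule such as 'MolWt < 500' into (column, comparison, threshold)."""
    match = RULE_RE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse rule: {rule!r}")
    name, symbol, number = match.groups()
    return name, OPS[symbol], float(number)


def passes(row, rule):
    """Return True if the row's value for the rule's column satisfies the rule."""
    name, compare, threshold = parse_rule(rule)
    return compare(float(row[name]), threshold)


def filter_rows(rows, rule):
    """Keep only the rows that satisfy the rule."""
    return [row for row in rows if passes(row, rule)]


def lipinski_violations(row):
    """Count how many of the four rule-of-five limits the row exceeds."""
    return sum(not passes(row, rule) for rule in LIPINSKI)


rows = [
    {"smiles": "CCO", "MolWt": "46.07"},
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "MolWt": "180.16"},
]
light = filter_rows(rows, "MolWt < 100")

# Invented descriptor values for a hypothetical molecule.
example = {"MolWt": "320.4", "MolLogP": "6.2", "NumHDonors": "2", "NumHAcceptors": "5"}
n_violations = lipinski_violations(example)
```

Here `light` keeps only the ethanol row, and `example` breaks one limit (MolLogP above 5). How many violations a given filter tolerates is a policy choice that this sketch leaves to the caller.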
convert
Convert between molecular file formats.
# Auto-detect formats from extensions
rdkit-cli convert -i molecules.csv -o molecules.sdf
# Explicit format specification
rdkit-cli convert -i molecules.csv -o molecules.smi --out-format smi
Supported formats: csv, tsv, smi, sdf, parquet
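Choosing a format from the file extension, with an explicit override, can be pictured as a small lookup. The sketch below is only an illustration in Python: the `detect_format` function and its error behaviour are assumptions for this example, and only the extension list comes from this README.

```python
from pathlib import Path

# Extension-to-format table, taken from the formats listed in this README.
FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".smi": "smi",
    ".sdf": "sdf",
    ".parquet": "parquet",
}


def detect_format(path, explicit=None):
    """Return the format name, preferring an explicit choice over the extension."""
    if explicit is not None:
        fmt = explicit.lower()
        if fmt not in FORMATS.values():
            raise ValueError(f"unsupported format: {explicit!r}")
        return fmt
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"cannot infer format from {path!r}")
    return FORMATS[suffix]
```

For example, `detect_format("molecules.sdf")` returns `"sdf"`, while `detect_format("out.txt", explicit="smi")` returns `"smi"` regardless of the extension.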
standardize
Standardize and canonicalize molecules.
# Basic standardization
rdkit-cli standardize -i input.csv -o output.csv
# With options
rdkit-cli standardize -i input.csv -o output.csv \
--cleanup --neutralize --fragment-parent
similarity
Compute molecular similarity.
# Similarity search
rdkit-cli similarity search -i library.csv -o hits.csv \
--query "CCO" --threshold 0.7
# Similarity matrix
rdkit-cli similarity matrix -i molecules.csv -o matrix.csv \
--metric tanimoto
# Clustering
rdkit-cli similarity cluster -i molecules.csv -o clustered.csv \
--cutoff 0.5
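For readers unfamiliar with the Tanimoto metric and the meaning of `--threshold`, the following Python sketch computes the coefficient on fingerprints represented as sets of on-bit indices and keeps the library entries at or above a threshold. It is a plain-Python illustration, not the code rdkit-cli runs; the function names, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are assumptions of this sketch.

```python
def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / union


def search(library, query, threshold):
    """Return (name, similarity) pairs with similarity >= threshold, best first."""
    hits = [(name, tanimoto(bits, query)) for name, bits in library.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints; real ones would come from a fingerprint generator.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 9},
    "mol_c": {7, 8},
}
query = {1, 2, 3, 4}
hits = search(library, query, threshold=0.7)
```

With these toy sets, `mol_a` scores 1.0, `mol_b` scores 3/5 = 0.6, and `mol_c` scores 0.0, so only `mol_a` passes a 0.7 threshold.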
conformers
Generate and optimize 3D conformers.
# Generate conformers
rdkit-cli conformers generate -i input.csv -o output.sdf --num 10
# Optimize conformers
rdkit-cli conformers optimize -i input.sdf -o optimized.sdf --force-field mmff
reactions
Apply chemical reactions and transformations.
# SMIRKS transformation
rdkit-cli reactions transform -i input.csv -o output.csv \
--smirks "[OH:1]>>[O-:1]"
# Reaction enumeration
rdkit-cli reactions enumerate -i reactants.csv -o products.csv \
--template "reaction.rxn"
scaffold
Extract molecular scaffolds.
# Murcko scaffolds
rdkit-cli scaffold murcko -i input.csv -o scaffolds.csv
# Generic scaffolds
rdkit-cli scaffold murcko -i input.csv -o scaffolds.csv --generic
# Scaffold decomposition
rdkit-cli scaffold decompose -i input.csv -o decomposed.csv
enumerate
Enumerate molecular variants.
# Stereoisomers
rdkit-cli enumerate stereoisomers -i input.csv -o isomers.csv --max-isomers 32
# Tautomers
rdkit-cli enumerate tautomers -i input.csv -o tautomers.csv --max-tautomers 50
# Canonical tautomer
rdkit-cli enumerate canonical-tautomer -i input.csv -o canonical.csv
fragment
Fragment molecules.
# BRICS fragmentation
rdkit-cli fragment brics -i input.csv -o fragments.csv
# RECAP fragmentation
rdkit-cli fragment recap -i input.csv -o fragments.csv
# Functional group extraction
rdkit-cli fragment functional-groups -i input.csv -o groups.csv
# Fragment frequency analysis
rdkit-cli fragment analyze -i fragments.csv -o analysis.csv
diversity
Analyze and select diverse molecules.
# Pick diverse subset
rdkit-cli diversity pick -i input.csv -o diverse.csv -k 100
# Analyze diversity
rdkit-cli diversity analyze -i input.csv
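One common way to pick a diverse subset of size k is greedy MaxMin selection: start from one molecule and repeatedly add the candidate whose nearest already-picked neighbour is farthest away. This README does not state which algorithm `diversity pick` uses, so the Python sketch below is only an illustration of the general technique; the function names, the starting index, and the set-based fingerprints are assumptions.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bit indices."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def maxmin_pick(fingerprints, k, first=0):
    """Greedy MaxMin selection; returns the indices of the k picked items."""
    if not 0 < k <= len(fingerprints):
        raise ValueError("k must be between 1 and the number of fingerprints")
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < k:
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        # Update nearest-picked distances with the newly added item.
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[best]))
    return picked


# Toy fingerprints as sets of on-bit indices.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
subset = maxmin_pick(fps, k=3)
```

Starting from index 0, the sketch first adds index 2 (no shared bits, distance 1.0) and then index 1 (distance 0.5 to its nearest picked neighbour), leaving out the near-duplicate at index 3.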
mcs
Find Maximum Common Substructure.
# Find MCS across molecules
rdkit-cli mcs -i molecules.csv -o mcs_result.csv
# With options
rdkit-cli mcs -i molecules.csv -o mcs_result.csv \
--timeout 60 --atom-compare elements
depict
Generate molecular depictions.
# Single molecule
rdkit-cli depict single --smiles "c1ccccc1" -o benzene.svg
# Batch depiction
rdkit-cli depict batch -i molecules.csv -o images/ -f svg
# Grid image
rdkit-cli depict grid -i molecules.csv -o grid.svg --mols-per-row 4
stats
Calculate dataset statistics.
# Basic statistics
rdkit-cli stats -i molecules.csv -o stats.json --format json
# Specific properties
rdkit-cli stats -i molecules.csv -p MolWt,LogP,TPSA
# List available properties
rdkit-cli stats -i molecules.csv --list-properties
split
Split files into smaller chunks.
# Split into N files
rdkit-cli split -i large.csv -o chunks/ -c 10
# Split by chunk size
rdkit-cli split -i large.csv -o chunks/ -s 1000
# With custom prefix
rdkit-cli split -i large.csv -o chunks/ -c 5 --prefix molecules
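The two modes differ in what is fixed: `-c` fixes the number of output files, `-s` fixes the number of rows per file. The Python sketch below shows the arithmetic behind each mode on an in-memory list. How rdkit-cli actually distributes leftover rows is not documented here, so the "sizes differ by at most one" policy and the function names are assumptions.

```python
def split_by_count(rows, n_chunks):
    """Split rows into n_chunks contiguous parts whose sizes differ by at most one."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more row
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def split_by_size(rows, chunk_size):
    """Split rows into parts of chunk_size rows; the last part may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


rows = list(range(10))
by_count = split_by_count(rows, 3)
by_size = split_by_size(rows, 4)
```

For ten rows, splitting into three chunks gives sizes 4, 3, and 3, while a chunk size of four gives sizes 4, 4, and 2.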
sample
Randomly sample molecules.
# Sample by count
rdkit-cli sample -i molecules.csv -o sample.csv -k 100 --seed 42
# Sample by fraction
rdkit-cli sample -i molecules.csv -o sample.csv -f 0.1
# Memory-efficient streaming (reservoir sampling)
rdkit-cli sample -i huge.csv -o sample.csv -k 1000 --stream
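Reservoir sampling draws k items uniformly from a stream of unknown length while holding only k items in memory, which is why it suits files too large to load. The following Python sketch shows the classic single-pass variant (often called Algorithm R). It illustrates the technique named above rather than rdkit-cli's own code; the function name and signature are assumptions.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number index+1 replaces a random slot with probability k/(index+1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(iter(range(1000)), k=10, seed=42)
```

If the stream holds fewer than k items, the sketch simply returns all of them. Passing the same seed reproduces the same sample.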
deduplicate
Remove duplicate molecules.
# Deduplicate by canonical SMILES (default)
rdkit-cli deduplicate -i molecules.csv -o unique.csv
# Deduplicate by InChIKey
rdkit-cli deduplicate -i molecules.csv -o unique.csv -b inchikey
# Deduplicate by scaffold
rdkit-cli deduplicate -i molecules.csv -o unique.csv -b scaffold
# Keep last occurrence instead of first
rdkit-cli deduplicate -i molecules.csv -o unique.csv --keep last
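Whatever key is used (canonical SMILES, InChIKey, or scaffold), deduplication reduces to grouping rows by that key and keeping one row per group. The Python sketch below shows the first/last choice on rows that already carry a precomputed key column. The `deduplicate` function, the column name `canonical_smiles`, and the output ordering are assumptions made for this illustration.

```python
def deduplicate(rows, key, keep="first"):
    """Keep one row per key value: the first or the last occurrence."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen = {}
    for row in rows:
        value = key(row)
        # With keep="last", later rows overwrite earlier ones for the same key.
        if keep == "last" or value not in chosen:
            chosen[value] = row
    # dicts preserve insertion order, so groups appear in order of first sighting.
    return list(chosen.values())


rows = [
    {"id": "a", "canonical_smiles": "CCO"},
    {"id": "b", "canonical_smiles": "c1ccccc1"},
    {"id": "c", "canonical_smiles": "CCO"},
]
by_smiles = lambda row: row["canonical_smiles"]
first_kept = deduplicate(rows, key=by_smiles)
last_kept = deduplicate(rows, key=by_smiles, keep="last")
```

Here `first_kept` contains rows `a` and `b`, while `last_kept` contains rows `c` and `b`. Comparing raw input strings is not enough in practice, because the same molecule can be written as several different SMILES; the key must be a canonical form.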
validate
Validate molecular structures.
# Basic validation
rdkit-cli validate -i molecules.csv -o validated.csv
# Output only valid molecules
rdkit-cli validate -i molecules.csv -o valid.csv --valid-only
# With constraints
rdkit-cli validate -i molecules.csv -o validated.csv \
--max-atoms 100 --max-rings 8
# Check allowed elements
rdkit-cli validate -i molecules.csv -o validated.csv \
--allowed-elements C,H,N,O,S,F,Cl
# Check stereo and show summary
rdkit-cli validate -i molecules.csv -o validated.csv \
--check-stereo --summary
info
Quick molecule information from SMILES.
# Basic info
rdkit-cli info "CCO"
# JSON output
rdkit-cli info "c1ccccc1" --json
# Shows: formula, MW, LogP, TPSA, stereocenters, Lipinski violations, InChI/InChIKey
merge
Combine multiple molecule files.
# Merge two files
rdkit-cli merge -i file1.csv file2.csv -o merged.csv
# Merge with deduplication
rdkit-cli merge -i file1.csv file2.csv -o merged.csv --dedupe
# Track source file
rdkit-cli merge -i file1.csv file2.csv -o merged.csv --source-column source
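Merging files with different columns and recording where each row came from can be sketched as follows. This Python example is an assumption-laden illustration, not rdkit-cli's behaviour: the `merge_tables` function, the use of an empty string for missing values, and the column ordering are choices made for this sketch.

```python
def merge_tables(tables, source_column=None):
    """Concatenate row dicts from several sources over the union of their columns.

    tables maps a source name (for example a file name) to a list of row dicts.
    """
    columns = []
    for rows in tables.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    if source_column and source_column not in columns:
        columns.append(source_column)

    merged = []
    for source, rows in tables.items():
        for row in rows:
            # Missing values are filled with an empty string in this sketch.
            out = {name: row.get(name, "") for name in columns}
            if source_column:
                out[source_column] = source
            merged.append(out)
    return columns, merged


tables = {
    "file1.csv": [{"smiles": "CCO", "name": "ethanol"}],
    "file2.csv": [{"smiles": "c1ccccc1", "MolWt": "78.11"}],
}
columns, merged = merge_tables(tables, source_column="source")
```

The result has the columns `smiles`, `name`, `MolWt`, and `source`, with each row tagged by the file it came from.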
sascorer
Calculate synthetic accessibility and drug-likeness scores.
# SA Score only (default)
rdkit-cli sascorer -i molecules.csv -o scores.csv
# Include QED score
rdkit-cli sascorer -i molecules.csv -o scores.csv --qed
# Include Natural Product-likeness score
rdkit-cli sascorer -i molecules.csv -o scores.csv --npc
# All scores
rdkit-cli sascorer -i molecules.csv -o scores.csv --qed --npc
rgroup
R-group decomposition around a core structure.
# Decompose around benzene core
rdkit-cli rgroup -i molecules.csv -o decomposed.csv --core "c1ccc([*:1])cc1"
# Multiple attachment points
rdkit-cli rgroup -i molecules.csv -o decomposed.csv \
--core "c1ccc([*:1])cc([*:2])1"
rings
Ring system analysis.
# Extract ring systems
rdkit-cli rings extract -i molecules.csv -o rings.csv
# Ring information (counts, sizes, aromaticity)
rdkit-cli rings info -i molecules.csv -o ring_info.csv
# Frequency analysis
rdkit-cli rings frequency -i molecules.csv -o ring_freq.csv
align
3D molecular alignment.
# Align to reference structure (MCS-based)
rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf
# Open3DAlign method
rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf --method o3a
rmsd
RMSD calculations between 3D structures.
# Compare to reference
rdkit-cli rmsd compare -i molecules.sdf -o results.csv -r reference.sdf
# Pairwise RMSD matrix
rdkit-cli rmsd matrix -i molecules.sdf -o matrix.csv
# Conformer RMSD analysis
rdkit-cli rmsd conformers -i multi_conf.sdf -o conf_rmsd.csv
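As a reminder of what the number means, RMSD over N matched atoms is the square root of the mean squared distance between corresponding positions. The Python sketch below computes exactly that for two coordinate lists in the same atom order. It deliberately performs no superposition and ignores molecular symmetry, both of which a real tool may handle; the function name and the toy coordinates are assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates: the second set is the first shifted by 1 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(reference, shifted)
```

Every atom moves by exactly 1, so `value` is 1.0. A rigid shift like this is precisely what an alignment step would remove before measuring RMSD.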
mmp
Matched Molecular Pairs analysis.
# Fragment molecules for MMP
rdkit-cli mmp fragment -i molecules.csv -o fragments.csv
# Find matched pairs
rdkit-cli mmp pairs -i fragments.csv -o pairs.csv
# Apply MMP transformation
rdkit-cli mmp transform -i molecules.csv -o transformed.csv \
-t "[c:1][CH3]>>[c:1][NH2]"
protonate
Protonation state enumeration.
# Enumerate at physiological pH
rdkit-cli protonate -i molecules.csv -o protonated.csv --ph 7.4
# Neutralize charged molecules
rdkit-cli protonate -i molecules.csv -o neutral.csv --neutralize
# Enumerate all states
rdkit-cli protonate -i molecules.csv -o states.csv --enumerate-all
props
Property column operations.
# Add a column
rdkit-cli props add -i molecules.csv -o output.csv -c series -v "series_A"
# Rename a column
rdkit-cli props rename -i molecules.csv -o output.csv --from name --to mol_name
# Drop columns
rdkit-cli props drop -i molecules.csv -o output.csv -c col1,col2
# Keep only specific columns
rdkit-cli props keep -i molecules.csv -o output.csv -c smiles,name,MolWt
# List columns
rdkit-cli props list -i molecules.csv
Global Options
| Option | Description |
|---|---|
| -n, --ncpu N | Number of CPUs (-1 = all, default: -1) |
| -i, --input FILE | Input file |
| -o, --output FILE | Output file |
| --smiles-column COL | SMILES column name (default: "smiles") |
| --name-column COL | Name column (optional) |
| --no-header | Input has no header row |
| -q, --quiet | Suppress progress output |
| -V, --version | Show version |
| -h, --help | Show help |
Input/Output Formats
| Format | Extension | Notes |
|---|---|---|
| CSV | .csv | Comma-separated, with header |
| TSV | .tsv | Tab-separated, with header |
| SMI | .smi | SMILES format, space-separated |
| SDF | .sdf | Structure-Data File |
| Parquet | .parquet | Apache Parquet format |
Examples
Cheminformatics Pipeline
# 1. Validate and filter input
rdkit-cli validate -i raw.csv -o validated.csv --valid-only
# 2. Deduplicate
rdkit-cli deduplicate -i validated.csv -o unique.csv -b inchikey
# 3. Standardize molecules
rdkit-cli standardize -i unique.csv -o std.csv --cleanup --neutralize
# 4. Filter by drug-likeness
rdkit-cli filter druglike -i std.csv -o druglike.csv --rule lipinski
# 5. Compute descriptors
rdkit-cli descriptors compute -i druglike.csv -o desc.csv -d MolWt,MolLogP,TPSA,HBD,HBA
# 6. Get dataset statistics
rdkit-cli stats -i druglike.csv -o stats.json --format json
# 7. Select diverse subset
rdkit-cli diversity pick -i druglike.csv -o diverse.csv -k 500
# 8. Generate depictions
rdkit-cli depict grid -i diverse.csv -o library.svg --mols-per-row 10
Similarity Screening
# Search for similar compounds
rdkit-cli similarity search -i library.csv -o hits.csv \
--query "CC(=O)Oc1ccccc1C(=O)O" \
--threshold 0.6 \
--type morgan
# Cluster results
rdkit-cli similarity cluster -i hits.csv -o clustered.csv --cutoff 0.4
Scaffold Analysis
# Extract scaffolds
rdkit-cli scaffold murcko -i library.csv -o scaffolds.csv
# Analyze scaffold diversity
rdkit-cli diversity analyze -i scaffolds.csv --smiles-column scaffold
Large Dataset Processing
# Sample from a huge dataset
rdkit-cli sample -i huge_library.csv -o sample.csv -k 10000 --stream
# Split for parallel processing
rdkit-cli split -i library.csv -o batches/ -c 10
# Process batches in parallel (using xargs)
ls batches/*.csv | xargs -P 4 -I {} rdkit-cli descriptors compute -i {} -o {}.desc.csv -d MolWt,MolLogP
Development
# Clone repository
git clone https://github.com/vitruves/rdkit-cli
cd rdkit-cli
# Install with dev dependencies
uv sync --dev
# Run tests
uv run pytest
# Run with coverage
uv run pytest --cov=rdkit_cli
License
Apache 2.0