Chemistry-aware Polars enabled by OpenEye
OEPolars
Deep integration of OpenEye objects into Polars DataFrames with native support for molecules and design units.
OEPolars extends Polars with custom extension types that store OpenEye OEMol and OEDesignUnit objects as first-class DataFrame column types. This enables seamless interoperability between OpenEye's cheminformatics capabilities and Polars' high-performance data analysis workflows, including lazy evaluation for large-scale datasets.
Table of Contents
- Installation
- Quick Start
- Basic Usage
- Writing Data
- Parquet Serialization
- API Reference
- Examples
- Development
- License
Installation
Requirements
| Package | Version |
|---|---|
| Python | 3.11+ |
| polars | 1.37.1+ |
| OpenEye Toolkits | 2023.1.0+ |
OpenEye Toolkits License
The OpenEye Toolkits require a commercial license, but free licenses are available for academic and non-profit institutions. Visit OpenEye Scientific to request an academic license.
Install from PyPI
pip install oepolars
Development Installation
git clone https://github.com/scott-arne/oepolars.git
cd oepolars
pip install -e ".[dev]"
Quick Start
import oepolars as oepl
import polars as pl
from openeye import oechem
# Load molecule data from one of several formats (each call returns a new DataFrame)
df = oepl.read_sdf("molecules.sdf")
df = oepl.read_oeb("molecules.oeb.gz")
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")
# Use Polars normally with molecules
df = df.with_columns(
    num_oxygens=df["Molecule"].map_elements(
        lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()),
        return_dtype=pl.Int64
    )
)
# Generate SMILES strings
smiles = df["Molecule"].chem.to_smiles()
# Filter invalid molecules
df_valid = df.chem.filter_valid("Molecule")
# Write to file
df.chem.write_sdf("output.sdf", molecule_column="Molecule")
Basic Usage
Reading Molecular Data
OEPolars provides readers for all major chemical file formats supported by the OpenEye Toolkits:
import oepolars as oepl
# SDF files - molecules with SD data as columns
df = oepl.read_sdf("molecules.sdf")
# OEB files (binary format, supports conformers)
df = oepl.read_oeb("molecules.oeb.gz")
# SMILES files
df = oepl.read_smi("molecules.smi")
# CSV files with SMILES column
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")
# OERecord databases
df = oepl.read_oedb("records.oedb")
# Design unit files (protein-ligand complexes)
df = oepl.read_oedu("complexes.oedu")
# Parquet files with serialized molecules
df = oepl.read_parquet("molecules.parquet", molecule_columns=["Molecule"])
Lazy Reading with Scanners
OEPolars provides lazy scanners for query optimization on large datasets:
import oepolars as oepl
import polars as pl
# Lazy reading - operations are optimized before execution
lf = oepl.scan_sdf("large_dataset.sdf")
lf = oepl.scan_oeb("large_dataset.oeb.gz")
lf = oepl.scan_smi("large_dataset.smi")
lf = oepl.scan_molecule_csv("large_dataset.csv", smiles_column="SMILES")
lf = oepl.scan_oedb("records.oedb")
lf = oepl.scan_oedu("complexes.oedu")
lf = oepl.scan_parquet("molecules.parquet", molecule_columns=["Molecule"])
# Apply filters before collecting
result = (
    lf
    .filter(pl.col("MolWt") > 300)
    .select(["Molecule", "Title", "MolWt"])
    .collect()
)
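The benefit of filtering before collecting can be illustrated without Polars at all. The following is a minimal plain-Python sketch using generators, not a description of how OEPolars or Polars is implemented. The names scan_records, lazy_filter, and collect, as well as the record values, are hypothetical and invented for illustration.

```python
from typing import Iterable, Iterator


def scan_records(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy scanner: yields one record at a time."""
    for i in range(n):
        yield {"Title": f"mol_{i}", "MolWt": 100.0 + 50.0 * i}


def lazy_filter(records: Iterable[dict], min_molwt: float) -> Iterator[dict]:
    """Describe a filter step; nothing is read until the result is consumed."""
    return (r for r in records if r["MolWt"] > min_molwt)


def collect(records: Iterable[dict]) -> list[dict]:
    """Materialize the pipeline, loosely analogous to calling .collect()."""
    return list(records)


# Build the pipeline first; records are only produced inside collect().
pipeline = lazy_filter(scan_records(10), min_molwt=300)
result = collect(pipeline)
print([r["Title"] for r in result])
```

Only the records that pass the filter are ever held in the final list, which is the property that makes the scan-then-filter pattern attractive for large files.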
Working with Molecules
Once loaded, molecules are stored as MoleculeType columns. Standard Polars operations work seamlessly:
import polars as pl
from openeye import oechem
# Standard Polars operations
filtered_df = df.filter(pl.col("MolWt") > 200)
sorted_df = df.sort("Title")
# Apply OpenEye functions directly
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oechem.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
)
# Use the .chem accessor for molecular operations
smiles_series = df["Molecule"].chem.to_smiles()
copies = df["Molecule"].chem.copy_molecules()
# Substructure searching with SMARTS
has_carboxylic_acid = df["Molecule"].chem.substructure_search("C(=O)O")
df_acids = df.filter(has_carboxylic_acid)
Design Units
Work with protein-ligand complexes stored as DesignUnitType:
# Read design unit file
df = oepl.read_oedu("protein_ligand_complexes.oedu")
# Extract components using .chem accessor
df = df.with_columns(
    Ligand=df["Design_Unit"].chem.get_ligands(),
    Protein=df["Design_Unit"].chem.get_proteins(),
)
# Analyze components
df = df.with_columns(
    ligand_mw=pl.col("Ligand").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64)
)
# Deep copy design units
df = df.with_columns(
    DU_copy=df["Design_Unit"].chem.copy_design_units()
)
Data Quality and Filtering
OEPolars provides methods to check and filter molecule validity:
# Check which molecules are valid
validity = df["Molecule"].chem.is_valid()
print(f"Valid molecules: {validity.sum()}")
print(f"Invalid molecules: {(~validity).sum()}")
# Filter to keep only valid molecules
df_valid = df.chem.filter_valid("Molecule")
# Filter multiple columns at once
df_valid = df.chem.filter_valid(["Molecule", "Product"])
# Add validity as a column for inspection
df = df.with_columns(
    is_valid=df["Molecule"].chem.is_valid()
)
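As a rough mental model of the multi-column filter, the sketch below keeps a row only when every listed column is non-null. It uses plain Python lists rather than OEPolars. The rule that all listed columns must be valid and the helper names validity_mask and filter_valid_rows are assumptions made for illustration, and the strings stand in for molecule objects.

```python
def validity_mask(values: list) -> list[bool]:
    """Return True for each entry that is not None (treated here as 'valid')."""
    return [v is not None for v in values]


def filter_valid_rows(table: dict[str, list], columns: list[str]) -> dict[str, list]:
    """Keep rows where every column in `columns` is non-null (assumed semantics)."""
    masks = [validity_mask(table[c]) for c in columns]
    # A row survives only if it is valid in all of the checked columns.
    keep = [all(flags) for flags in zip(*masks)]
    return {
        name: [v for v, k in zip(vals, keep) if k]
        for name, vals in table.items()
    }


# Hypothetical three-row table; None marks a molecule that failed to parse.
table = {
    "Molecule": ["CCO", None, "c1ccccc1"],
    "Product": ["CC=O", "CC", None],
    "Title": ["a", "b", "c"],
}
filtered = filter_valid_rows(table, ["Molecule", "Product"])
print(filtered)
```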
Writing Data
Export DataFrames to various molecular file formats using the .chem accessor:
# Write to SDF (columns become SD tags)
df.chem.write_sdf(
    "output.sdf",
    molecule_column="Molecule",
    title_column="Name",
    sd_columns=["Activity", "MW"]  # Include as SD tags
)
# Write to SMILES file
df.chem.write_smi(
    "output.smi",
    molecule_column="Molecule",
    title_column="Name"
)
# Write to OEB format
df.chem.write_oeb(
    "output.oeb",
    molecule_column="Molecule",
    title_column="Name"
)
# Write to CSV (molecules as SMILES strings)
df.chem.write_molecule_csv(
    "output.csv",
    molecule_column="Molecule",
    smiles_column="SMILES"
)
# Write to OERecord database
df.chem.write_oedb(
    "output.oedb",
    molecule_column="Molecule"
)
# Write design units
df.chem.write_oedu(
    "output.oedu",
    design_unit_column="Design_Unit"
)
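For readers unfamiliar with SD tags, the sketch below shows roughly how a row's column values could appear as data items in an SDF record: a header line of the form "> <tag>", the value, then a blank line. In OEPolars the actual writing is handled through the OpenEye Toolkits, so this is only an illustration of the file layout. format_sd_data is a hypothetical helper, skipping None values is an assumption, and the row values are made up.

```python
def format_sd_data(row: dict, sd_columns: list[str]) -> str:
    """Render selected columns of one row as SD data items (illustrative only)."""
    lines: list[str] = []
    for tag in sd_columns:
        value = row.get(tag)
        if value is None:
            continue  # assumption: missing values produce no data item
        lines.append(f"> <{tag}>")  # data header naming the tag
        lines.append(str(value))    # the value on its own line
        lines.append("")            # blank line terminates the data item
    return "\n".join(lines) + ("\n" if lines else "")


# Made-up row; only the columns named in sd_columns are emitted.
row = {"Name": "cmpd_1", "Activity": 5.2, "MW": 46.07}
block = format_sd_data(row, ["Activity", "MW"])
print(block)
```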
Parquet Serialization
OEPolars supports Parquet format with automatic molecule serialization, enabling efficient storage and retrieval of molecular data:
# Write to Parquet (molecules serialized to binary OEB format)
df.chem.write_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Optional: auto-detects MoleculeType columns
)
# Read from Parquet (reconstruct molecules from binary)
df = oepl.read_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Required: specify which columns contain molecules
)
# Lazy reading from Parquet
lf = oepl.scan_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]
)
API Reference
File Readers
read_sdf()
Read SD (Structure Data) files into a DataFrame.
oepl.read_sdf(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Path to SDF file |
| flavor | int | OEIFlavor_SDF_Default | OpenEye SDF reader flavor |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str, None | "Title" | Name of title column (None to skip) |
| sd_data | bool | True | Read SD data into columns |
| usecols | str, list | None | SD tags to read (None for all) |
| numeric_columns | str, list | None | Columns to convert to numeric |
read_oeb()
Read OpenEye Binary (OEB) files into a DataFrame.
oepl.read_oeb(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)
Parameters are the same as for read_sdf().
read_smi()
Read SMILES files into a DataFrame.
oepl.read_smi(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title"
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Path to SMILES file |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str | "Title" | Name of title column |
read_molecule_csv()
Read CSV files with molecule columns.
oepl.read_molecule_csv(
    filepath,
    smiles_column,
    *,
    molecule_column="Molecule",
    drop_smiles=False,
    **csv_kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Path to CSV file |
| smiles_column | str | required | Column containing SMILES strings |
| molecule_column | str | "Molecule" | Name of new molecule column |
| drop_smiles | bool | False | Drop original SMILES column |
| **csv_kwargs | | | Additional arguments passed to pl.read_csv() |
read_oedb()
Read OpenEye Database (OERecord) files into a DataFrame.
oepl.read_oedb(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)
Parameters are the same as for read_sdf(), except that there is no flavor parameter.
read_oedu()
Read Design Unit files into a DataFrame.
oepl.read_oedu(
    filepath,
    *,
    design_unit_column="Design_Unit",
    title_column="Title"
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Path to OEDU file |
| design_unit_column | str | "Design_Unit" | Name of design unit column |
| title_column | str | "Title" | Name of title column |
read_parquet()
Read Parquet files with molecule column reconstruction.
oepl.read_parquet(
    filepath,
    molecule_columns,
    **parquet_kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Path to Parquet file |
| molecule_columns | str, list | required | Column(s) containing serialized molecules |
| **parquet_kwargs | | | Additional arguments passed to pl.read_parquet() |
File Scanners (Lazy)
All scanners return pl.LazyFrame for query optimization. Parameters match their eager counterparts.
| Scanner | Description |
|---|---|
| scan_sdf() | Lazy reading of SDF files |
| scan_oeb() | Lazy reading of OEB files |
| scan_smi() | Lazy reading of SMILES files |
| scan_molecule_csv() | Lazy reading of CSV with SMILES |
| scan_oedb() | Lazy reading of OEDB files |
| scan_oedu() | Lazy reading of OEDU files |
| scan_parquet() | Lazy reading of Parquet files |
DataFrame Accessor Methods (df.chem.*)
Access these methods via df.chem.<method>():
as_molecule()
Convert column(s) to MoleculeType.
df.chem.as_molecule(
    columns,
    *,
    molecule_format=None
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| columns | str, list | required | Column name(s) to convert |
| molecule_format | str, int | None | Format for parsing (default: SMILES) |
filter_valid()
Filter rows to keep only those with valid molecules.
df.chem.filter_valid(columns)
| Parameter | Type | Default | Description |
|---|---|---|---|
| columns | str, list | required | MoleculeType column(s) to check |
detect_molecule_columns()
Auto-detect and convert molecule columns based on content.
df.chem.detect_molecule_columns(*, sample_size=25)
write_sdf()
Write DataFrame to SDF file.
df.chem.write_sdf(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None,
    flavor=None
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Output file path |
| molecule_column | str | required | Column with molecules |
| title_column | str | None | Column for titles |
| sd_columns | str, list | None | Columns to include as SD tags |
| flavor | int | None | OpenEye output flavor |
write_smi()
Write DataFrame to SMILES file.
df.chem.write_smi(
    filepath,
    molecule_column,
    *,
    title_column=None,
    flavor=None
)
write_oeb()
Write DataFrame to OEB file.
df.chem.write_oeb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)
write_molecule_csv()
Write DataFrame to CSV with molecules as SMILES strings.
df.chem.write_molecule_csv(
    filepath,
    molecule_column,
    *,
    smiles_column="smiles",
    smiles_flavor=oechem.OESMILESFlag_ISOMERIC,
    drop_molecule=True,
    **csv_kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| smiles_column | str | "smiles" | Name of SMILES column in output |
| smiles_flavor | int | OESMILESFlag_ISOMERIC | SMILES generation flavor |
| drop_molecule | bool | True | Drop molecule column from output |
| **csv_kwargs | | | Additional arguments passed to CSV writer |
write_oedb()
Write DataFrame to OERecord database.
df.chem.write_oedb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)
write_oedu()
Write DataFrame to Design Unit file.
df.chem.write_oedu(
    filepath,
    design_unit_column,
    *,
    title_column=None
)
write_parquet()
Write DataFrame to Parquet with molecule serialization.
df.chem.write_parquet(
    filepath,
    molecule_columns=None,
    **parquet_kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath | str, Path | required | Output file path |
| molecule_columns | str, list | None | Columns to serialize (None auto-detects) |
| **parquet_kwargs | | | Additional arguments passed to write_parquet() |
LazyFrame Accessor Methods (lf.chem.*)
Access these methods via lf.chem.<method>(). Most operations require .collect() first.
| Method | Returns | Description |
|---|---|---|
| has_molecule_columns() | bool | Check if LazyFrame has molecule columns |
| molecule_columns() | list[str] | Get names of molecule columns |
Note: Operations like to_smiles(), substructure_search(), filter_valid(), and as_molecule() raise LazyOperationError on LazyFrames. Use .collect() first or apply these on eager DataFrames.
Series Accessor Methods (series.chem.*)
Access these methods via series.chem.<method>():
Molecule Methods
| Method | Returns | Description |
|---|---|---|
| copy_molecules() | Series[MoleculeType] | Deep copy all molecules |
| is_valid() | Series[bool] | Boolean mask of valid (non-null) molecules |
| to_smiles(flavor=OESMILESFlag_ISOMERIC) | Series[str] | Convert to SMILES strings |
| substructure_search(pattern, adjustH=False) | Series[bool] | Substructure search with SMARTS pattern |
Design Unit Methods
| Method | Returns | Description |
|---|---|---|
| copy_design_units() | Series[DesignUnitType] | Deep copy all design units |
| get_ligands(clear_titles=False) | Series[MoleculeType] | Extract ligand molecules |
| get_proteins(clear_titles=False) | Series[MoleculeType] | Extract protein molecules |
| get_components(mask) | Series[MoleculeType] | Extract components by mask |
Extension Types
OEPolars provides three custom Polars extension types:
| Type Class | Extension Name | Underlying Type | Description |
|---|---|---|---|
| MoleculeType | "molecule" | oechem.OEMol | Stores molecular structures |
| DesignUnitType | "design_unit" | oechem.OEDesignUnit | Stores protein-ligand complexes |
| DisplayType | "display" | oedepict.OE2DMolDisplay | Stores 2D molecular depictions |
Examples
Comprehensive Jupyter notebooks are available in the examples/ directory:
- 01_getting_started.ipynb - Basic usage, molecular calculations, data manipulation, validity checking
- 02_advanced_features.ipynb - File I/O, design units, data quality filtering, performance optimization, ML integration
Example: Complete Workflow
import oepolars as oepl
import polars as pl
from openeye import oechem
# 1. Load data
df = oepl.read_sdf("molecules.sdf")
# 2. Filter invalid molecules
df = df.chem.filter_valid("Molecule")
# 3. Calculate properties
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oechem.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
    HBA=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondAcceptor()),
        return_dtype=pl.Int64
    ),
)
# 4. Filter by Lipinski's Rule of Five
df_druglike = df.filter(
    (pl.col("MW") <= 500) &
    (pl.col("LogP") <= 5) &
    (pl.col("HBD") <= 5) &
    (pl.col("HBA") <= 10)
)
# 5. Substructure search for carboxylic acids
has_acid = df_druglike["Molecule"].chem.substructure_search("C(=O)O")
df_acids = df_druglike.filter(has_acid)
# 6. Export results
df_acids.chem.write_sdf(
    "druglike_acids.sdf",
    molecule_column="Molecule",
    title_column="Title",
    sd_columns=["MW", "LogP"]
)
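Step 4 of the workflow can also be read as a plain predicate. The sketch below restates the same four thresholds in standalone Python, which can be handy for unit-testing the cutoffs without any toolkit installed. Descriptors and passes_rule_of_five are hypothetical names, and the numbers passed in are made-up inputs, not computed properties of real compounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Hypothetical container for the four properties used in the filter."""
    mw: float    # molecular weight
    logp: float  # calculated logP
    hbd: int     # hydrogen-bond donor count
    hba: int     # hydrogen-bond acceptor count


def passes_rule_of_five(d: Descriptors) -> bool:
    """Apply the same thresholds as the workflow filter; all four must hold."""
    return d.mw <= 500 and d.logp <= 5 and d.hbd <= 5 and d.hba <= 10


# Made-up descriptor values chosen to fall on either side of the cutoffs.
small = Descriptors(mw=180.2, logp=1.2, hbd=1, hba=4)
large = Descriptors(mw=650.0, logp=6.1, hbd=2, hba=8)
print(passes_rule_of_five(small), passes_rule_of_five(large))
```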
Development
Running Tests
invoke test
# or
pytest
Building Package
invoke build
# or
python -m build
Project Structure
oepolars/
├── oepolars/
│   ├── __init__.py          # Public API exports
│   ├── exceptions.py        # Custom exception hierarchy
│   ├── util.py              # Utility functions
│   ├── types/
│   │   ├── __init__.py
│   │   ├── molecule.py      # MoleculeType extension
│   │   ├── design_unit.py   # DesignUnitType extension
│   │   └── display.py       # DisplayType extension
│   ├── io/
│   │   ├── __init__.py
│   │   ├── readers.py       # Eager file readers
│   │   └── scanners.py      # Lazy file scanners
│   └── namespaces/
│       ├── __init__.py
│       ├── dataframe.py     # DataFrame.chem accessor
│       ├── lazyframe.py     # LazyFrame.chem accessor
│       └── series.py        # Series.chem accessor
├── tests/                   # Test suite
├── examples/                # Jupyter notebooks
└── pyproject.toml           # Project configuration
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Scott Arne Johnson
- Email: scott.arne.johnson@gmail.com
Related Projects
- OEPandas - Sister project providing OpenEye integration with Pandas DataFrames
- OpenEye Toolkits - The underlying cheminformatics toolkit
- Polars - Lightning-fast DataFrame library that OEPolars extends