Chemistry-aware Polars enabled by OpenEye

OEPolars

Deep integration of OpenEye objects into Polars DataFrames with native support for molecules and design units.

OEPolars extends Polars with custom extension types that store OpenEye OEMol and OEDesignUnit objects as first-class DataFrame column types. This enables seamless interoperability between OpenEye's cheminformatics capabilities and Polars' high-performance data analysis workflows, including lazy evaluation for large-scale datasets.

Installation
Quick Start
Basic Usage
Writing Data
Parquet Serialization
API Reference
Examples
Development
License

Installation

Requirements

Package	Version
Python	3.11+
polars	1.37.1+
OpenEye Toolkits	2023.1.0+
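
The minimum versions above can be checked programmatically before importing OEPolars. The following is a minimal sketch using only the standard library. The distribution names `polars` and `openeye-toolkits` are assumptions about how the packages are published, and the simple parser stops at the first non-numeric part, so pre-release builds may compare lower than expected.

```python
import sys
from importlib import metadata
from typing import Optional

# Minimum versions taken from the requirements table.
# The distribution names are assumptions; adjust them to your environment.
MINIMUMS = {
    "polars": (1, 37, 1),
    "openeye-toolkits": (2023, 1, 0),
}


def parse_version(text: str) -> tuple:
    """Turn '1.37.1' into (1, 37, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements() -> dict:
    """Map each requirement to True/False, or None if it is not installed."""
    report: dict = {"python": sys.version_info >= (3, 11)}
    for name, minimum in MINIMUMS.items():
        try:
            installed: Optional[tuple] = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = None
        else:
            report[name] = installed >= minimum
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {ok}")
```

A value of `None` in the report means the distribution was not found under the assumed name, not necessarily that the package is missing.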

OpenEye Toolkits License

OpenEye Toolkits requires a commercial license. However, free licenses are available for academic and non-profit institutions. Visit OpenEye Scientific to request an academic license.

Install from PyPI

pip install oepolars

Development Installation

git clone https://github.com/scott-arne/oepolars.git
cd oepolars
pip install -e ".[dev]"

Quick Start

import oepolars as oepl
import polars as pl
from openeye import oechem

# Load molecule data from various formats
df = oepl.read_sdf("molecules.sdf")
df = oepl.read_oeb("molecules.oeb.gz")
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")

# Use Polars normally with molecules
df = df.with_columns(
    num_oxygens=df["Molecule"].map_elements(
        lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()),
        return_dtype=pl.Int64
    )
)

# Generate SMILES strings
smiles = df["Molecule"].chem.to_smiles()

# Filter invalid molecules
df_valid = df.chem.filter_valid("Molecule")

# Write to file
df.chem.write_sdf("output.sdf", molecule_column="Molecule")

Basic Usage

Reading Molecular Data

OEPolars provides readers for all major chemical file formats supported by the OpenEye Toolkits:

import oepolars as oepl

# SDF files - molecules with SD data as columns
df = oepl.read_sdf("molecules.sdf")

# OEB files (binary format, supports conformers)
df = oepl.read_oeb("molecules.oeb.gz")

# SMILES files
df = oepl.read_smi("molecules.smi")

# CSV files with SMILES column
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")

# OERecord databases
df = oepl.read_oedb("records.oedb")

# Design unit files (protein-ligand complexes)
df = oepl.read_oedu("complexes.oedu")

# Parquet files with serialized molecules
df = oepl.read_parquet("molecules.parquet", molecule_columns=["Molecule"])

Lazy Reading with Scanners

OEPolars provides lazy scanners for query optimization on large datasets:

import oepolars as oepl
import polars as pl

# Lazy reading - operations are optimized before execution
lf = oepl.scan_sdf("large_dataset.sdf")
lf = oepl.scan_oeb("large_dataset.oeb.gz")
lf = oepl.scan_smi("large_dataset.smi")
lf = oepl.scan_molecule_csv("large_dataset.csv", smiles_column="SMILES")
lf = oepl.scan_oedb("records.oedb")
lf = oepl.scan_oedu("complexes.oedu")
lf = oepl.scan_parquet("molecules.parquet", molecule_columns=["Molecule"])

# Apply filters before collecting
result = (
    lf
    .filter(pl.col("MolWt") > 300)
    .select(["Molecule", "Title", "MolWt"])
    .collect()
)
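
The benefit of filtering before collecting can be illustrated without any chemistry libraries. The following is a plain-Python analogy, not a description of how Polars or OEPolars implement scanning: records are produced one at a time, and the filter runs before anything is materialized into a list. The `scan_records` generator and its made-up `MolWt` values are hypothetical stand-ins for a lazy scanner.

```python
from typing import Iterable, Iterator


def scan_records(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy scanner: yields one record at a time."""
    for i in range(n):
        yield {"Title": f"mol-{i}", "MolWt": 100.0 + 50.0 * i}


def heavier_than(records: Iterable[dict], cutoff: float) -> Iterator[dict]:
    """Lazily keep only the records whose MolWt exceeds the cutoff."""
    return (r for r in records if r["MolWt"] > cutoff)


# Nothing is read until list() consumes the pipeline, which plays the
# role of .collect() in this analogy.
result = list(heavier_than(scan_records(10), 300.0))
```

Only the records that pass the filter are ever held in memory at once, which is the same reason to place `.filter()` and `.select()` ahead of `.collect()` in a lazy query.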

Working with Molecules

Once loaded, molecules are stored as MoleculeType columns. Standard Polars operations work seamlessly:

import polars as pl
from openeye import oechem

# Standard Polars operations
filtered_df = df.filter(pl.col("MolWt") > 200)
sorted_df = df.sort("Title")

# Apply OpenEye functions directly
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oechem.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
)

# Use the .chem accessor for molecular operations
smiles_series = df["Molecule"].chem.to_smiles()
copies = df["Molecule"].chem.copy_molecules()

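# Substructure searching with SMARTS. Note that "C(=O)O" also matches esters;
# a stricter pattern such as "C(=O)[OH]" targets protonated acids only.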
has_carboxylic_acid = df["Molecule"].chem.substructure_search("C(=O)O")
df_acids = df.filter(has_carboxylic_acid)

Design Units

Work with protein-ligand complexes stored as DesignUnitType:

import oepolars as oepl
import polars as pl
from openeye import oechem

# Read design unit file
df = oepl.read_oedu("protein_ligand_complexes.oedu")

# Extract components using .chem accessor
df = df.with_columns(
    Ligand=df["Design_Unit"].chem.get_ligands(),
    Protein=df["Design_Unit"].chem.get_proteins(),
)

# Analyze components
df = df.with_columns(
    ligand_mw=pl.col("Ligand").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64)
)

# Deep copy design units
df = df.with_columns(
    DU_copy=df["Design_Unit"].chem.copy_design_units()
)

Data Quality and Filtering

OEPolars provides methods to check and filter molecule validity:

# Check which molecules are valid
validity = df["Molecule"].chem.is_valid()
print(f"Valid molecules: {validity.sum()}")
print(f"Invalid molecules: {(~validity).sum()}")

# Filter to keep only valid molecules
df_valid = df.chem.filter_valid("Molecule")

# Filter multiple columns at once
df_valid = df.chem.filter_valid(["Molecule", "Product"])

# Add validity as a column for inspection
df = df.with_columns(
    is_valid=df["Molecule"].chem.is_valid()
)

Writing Data

Export DataFrames to various molecular file formats using the .chem accessor:

# Write to SDF (columns become SD tags)
df.chem.write_sdf(
    "output.sdf",
    molecule_column="Molecule",
    title_column="Name",
    sd_columns=["Activity", "MW"]  # Include as SD tags
)

# Write to SMILES file
df.chem.write_smi(
    "output.smi",
    molecule_column="Molecule",
    title_column="Name"
)

# Write to OEB format
df.chem.write_oeb(
    "output.oeb",
    molecule_column="Molecule",
    title_column="Name"
)

# Write to CSV (molecules as SMILES strings)
df.chem.write_molecule_csv(
    "output.csv",
    molecule_column="Molecule",
    smiles_column="SMILES"
)

# Write to OERecord database
df.chem.write_oedb(
    "output.oedb",
    molecule_column="Molecule"
)

# Write design units
df.chem.write_oedu(
    "output.oedu",
    design_unit_column="Design_Unit"
)

Parquet Serialization

OEPolars supports Parquet format with automatic molecule serialization, enabling efficient storage and retrieval of molecular data:

# Write to Parquet (molecules serialized to binary OEB format)
df.chem.write_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Optional: auto-detects MoleculeType columns
)

# Read from Parquet (reconstruct molecules from binary)
df = oepl.read_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Required: specify which columns contain molecules
)

# Lazy reading from Parquet
lf = oepl.scan_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]
)

API Reference

File Readers

`read_sdf()`

Read SD (Structure Data) files into a DataFrame.

oepl.read_sdf(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Path to SDF file
`flavor`	int	`OEIFlavor_SDF_Default`	OpenEye SDF reader flavor
`molecule_column`	str	`"Molecule"`	Name of molecule column
`title_column`	str, None	`"Title"`	Name of title column (None to skip)
`sd_data`	bool	`True`	Read SD data into columns
`usecols`	str, list	`None`	SD tags to read (None for all)
`numeric_columns`	str, list	`None`	Columns to convert to numeric
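
SD data fields are stored as text in SDF files, which is why a `numeric_columns` option is useful. As a rough illustration of what such a conversion involves, the sketch below coerces tag values in plain Python. It assumes that empty or unparseable values should become missing, and it is not OEPolars' actual implementation; `to_numeric` is a hypothetical helper name.

```python
import math
from typing import Optional


def to_numeric(value: Optional[str]) -> Optional[float]:
    """Parse an SD tag value as a float; return None if missing or unparseable."""
    if value is None:
        return None
    try:
        number = float(value.strip())
    except ValueError:
        return None
    # Treat a literal "nan" as missing as well.
    return None if math.isnan(number) else number


raw_activity = ["7.2", " 6.85 ", "", "n/a", None]
activity = [to_numeric(v) for v in raw_activity]
```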

`read_oeb()`

Read OpenEye Binary (OEB) files into a DataFrame.

oepl.read_oeb(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

Parameters are the same as for read_sdf().

`read_smi()`

Read SMILES files into a DataFrame.

oepl.read_smi(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title"
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Path to SMILES file
`molecule_column`	str	`"Molecule"`	Name of molecule column
`title_column`	str	`"Title"`	Name of title column

`read_molecule_csv()`

Read CSV files with molecule columns.

oepl.read_molecule_csv(
    filepath,
    smiles_column,
    *,
    molecule_column="Molecule",
    drop_smiles=False,
    **csv_kwargs
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Path to CSV file
`smiles_column`	str	required	Column containing SMILES strings
`molecule_column`	str	`"Molecule"`	Name of new molecule column
`drop_smiles`	bool	`False`	Drop original SMILES column
`**csv_kwargs`			Additional arguments passed to `pl.read_csv()`

`read_oedb()`

Read OpenEye Database (OERecord) files into a DataFrame.

oepl.read_oedb(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

Parameters are the same as for read_sdf(), except that read_oedb() takes no flavor argument.

`read_oedu()`

Read Design Unit files into a DataFrame.

oepl.read_oedu(
    filepath,
    *,
    design_unit_column="Design_Unit",
    title_column="Title"
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Path to OEDU file
`design_unit_column`	str	`"Design_Unit"`	Name of design unit column
`title_column`	str	`"Title"`	Name of title column

`read_parquet()`

Read Parquet files with molecule column reconstruction.

oepl.read_parquet(
    filepath,
    molecule_columns,
    **parquet_kwargs
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Path to Parquet file
`molecule_columns`	str, list	required	Column(s) containing serialized molecules
`**parquet_kwargs`			Additional arguments passed to `pl.read_parquet()`

File Scanners (Lazy)

All scanners return a pl.LazyFrame so that queries can be optimized before execution. Parameters match their eager counterparts.

Scanner	Description
`scan_sdf()`	Lazy reading of SDF files
`scan_oeb()`	Lazy reading of OEB files
`scan_smi()`	Lazy reading of SMILES files
`scan_molecule_csv()`	Lazy reading of CSV with SMILES
`scan_oedb()`	Lazy reading of OEDB files
`scan_oedu()`	Lazy reading of OEDU files
`scan_parquet()`	Lazy reading of Parquet files

DataFrame Accessor Methods (`df.chem.*`)

Access these methods via df.chem.<method>():

`as_molecule()`

Convert column(s) to MoleculeType.

df.chem.as_molecule(
    columns,
    *,
    molecule_format=None
)

Parameter	Type	Default	Description
`columns`	str, list	required	Column name(s) to convert
`molecule_format`	str, int	`None`	Format for parsing (default: SMILES)

`filter_valid()`

Filter rows to keep only those with valid molecules.

df.chem.filter_valid(columns)

Parameter	Type	Default	Description
`columns`	str, list	required	MoleculeType column(s) to check

`detect_molecule_columns()`

Auto-detect and convert molecule columns based on content.

df.chem.detect_molecule_columns(*, sample_size=25)

`write_sdf()`

Write DataFrame to SDF file.

df.chem.write_sdf(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None,
    flavor=None
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Output file path
`molecule_column`	str	required	Column with molecules
`title_column`	str	`None`	Column for titles
`sd_columns`	str, list	`None`	Columns to include as SD tags
`flavor`	int	`None`	OpenEye output flavor

`write_smi()`

Write DataFrame to SMILES file.

df.chem.write_smi(
    filepath,
    molecule_column,
    *,
    title_column=None,
    flavor=None
)

`write_oeb()`

Write DataFrame to OEB file.

df.chem.write_oeb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)

`write_molecule_csv()`

Write DataFrame to CSV with molecules as SMILES strings.

df.chem.write_molecule_csv(
    filepath,
    molecule_column,
    *,
    smiles_column="smiles",
    smiles_flavor=oechem.OESMILESFlag_ISOMERIC,
    drop_molecule=True,
    **csv_kwargs
)

Parameter	Type	Default	Description
`smiles_column`	str	`"smiles"`	Name of SMILES column in output
`smiles_flavor`	int	`OESMILESFlag_ISOMERIC`	SMILES generation flavor
`drop_molecule`	bool	`True`	Drop molecule column from output
`**csv_kwargs`			Additional arguments passed to CSV writer

`write_oedb()`

Write DataFrame to OERecord database.

df.chem.write_oedb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)

`write_oedu()`

Write DataFrame to Design Unit file.

df.chem.write_oedu(
    filepath,
    design_unit_column,
    *,
    title_column=None
)

`write_parquet()`

Write DataFrame to Parquet with molecule serialization.

df.chem.write_parquet(
    filepath,
    molecule_columns=None,
    **parquet_kwargs
)

Parameter	Type	Default	Description
`filepath`	str, Path	required	Output file path
`molecule_columns`	str, list	`None`	Columns to serialize (None auto-detects)
`**parquet_kwargs`			Additional arguments passed to `write_parquet()`

LazyFrame Accessor Methods (`lf.chem.*`)

Access these methods via lf.chem.<method>(). Most operations require .collect() first.

Method	Returns	Description
`has_molecule_columns()`	`bool`	Check if LazyFrame has molecule columns
`molecule_columns()`	`list[str]`	Get names of molecule columns

Note: Operations like to_smiles(), substructure_search(), filter_valid(), and as_molecule() raise LazyOperationError on LazyFrames. Use .collect() first or apply these on eager DataFrames.
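
One way to respect this restriction in helper functions is to materialize lazy frames before touching chemistry methods. The sketch below uses duck typing (any object with a callable `collect` attribute is treated as lazy) so that it runs without Polars installed. `ensure_eager` and the two stub classes are hypothetical names used only for illustration.

```python
from typing import Any


def ensure_eager(frame: Any) -> Any:
    """Return an eager frame, calling .collect() when the object provides it."""
    collect = getattr(frame, "collect", None)
    return collect() if callable(collect) else frame


# Stand-ins for pl.DataFrame and pl.LazyFrame so the sketch runs on its own.
class FakeEager:
    def __init__(self, rows):
        self.rows = rows


class FakeLazy:
    def __init__(self, rows):
        self._rows = rows

    def collect(self):
        return FakeEager(self._rows)


eager = ensure_eager(FakeLazy(["CCO", "c1ccccc1"]))
same = ensure_eager(eager)  # already eager, returned unchanged
```

With real frames, a call such as `ensure_eager(lf).chem.filter_valid("Molecule")` would then operate on a DataFrame. An explicit `isinstance(frame, pl.LazyFrame)` check is the stricter alternative when Polars is available.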

Series Accessor Methods (`series.chem.*`)

Access these methods via series.chem.<method>():

Molecule Methods

Method	Returns	Description
`copy_molecules()`	`Series[MoleculeType]`	Deep copy all molecules
`is_valid()`	`Series[bool]`	Boolean mask of valid (non-null) molecules
`to_smiles(flavor=OESMILESFlag_ISOMERIC)`	`Series[str]`	Convert to SMILES strings
`substructure_search(pattern, adjustH=False)`	`Series[bool]`	Substructure search with SMARTS pattern

Design Unit Methods

Method	Returns	Description
`copy_design_units()`	`Series[DesignUnitType]`	Deep copy all design units
`get_ligands(clear_titles=False)`	`Series[MoleculeType]`	Extract ligand molecules
`get_proteins(clear_titles=False)`	`Series[MoleculeType]`	Extract protein molecules
`get_components(mask)`	`Series[MoleculeType]`	Extract components by mask

Extension Types

OEPolars provides three custom Polars extension types:

Type Class	Extension Name	Underlying Type	Description
`MoleculeType`	`"molecule"`	`oechem.OEMol`	Stores molecular structures
`DesignUnitType`	`"design_unit"`	`oechem.OEDesignUnit`	Stores protein-ligand complexes
`DisplayType`	`"display"`	`oedepict.OE2DMolDisplay`	Stores 2D molecular depictions

Examples

Comprehensive Jupyter notebooks are available in the examples/ directory:

01_getting_started.ipynb - Basic usage, molecular calculations, data manipulation, validity checking
02_advanced_features.ipynb - File I/O, design units, data quality filtering, performance optimization, ML integration

Example: Complete Workflow

import oepolars as oepl
import polars as pl
from openeye import oechem

# 1. Load data
df = oepl.read_sdf("molecules.sdf")

# 2. Filter invalid molecules
df = df.chem.filter_valid("Molecule")

# 3. Calculate properties
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oechem.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
    HBA=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondAcceptor()),
        return_dtype=pl.Int64
    ),
)

# 4. Filter by Lipinski's Rule of Five
df_druglike = df.filter(
    (pl.col("MW") <= 500) &
    (pl.col("LogP") <= 5) &
    (pl.col("HBD") <= 5) &
    (pl.col("HBA") <= 10)
)

# 5. Substructure search for carboxylic acids (this SMARTS also matches esters)
has_acid = df_druglike["Molecule"].chem.substructure_search("C(=O)O")
df_acids = df_druglike.filter(has_acid)

# 6. Export results
df_acids.chem.write_sdf(
    "druglike_acids.sdf",
    molecule_column="Molecule",
    title_column="Title",
    sd_columns=["MW", "LogP"]
)
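
The thresholds in step 4 can also be expressed as a small, library-free predicate, which is convenient for unit testing the cutoffs separately from the DataFrame code. This is a sketch: the `Properties` class, the `passes_rule_of_five` function, and the example values are hypothetical and were not computed with OpenEye.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    mw: float    # molecular weight
    logp: float  # calculated logP
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def passes_rule_of_five(p: Properties) -> bool:
    """Same thresholds as the filter in step 4: all four criteria must hold."""
    return p.mw <= 500 and p.logp <= 5 and p.hbd <= 5 and p.hba <= 10


# Illustrative values only.
small_molecule = Properties(mw=180.16, logp=1.2, hbd=1, hba=4)
too_heavy = Properties(mw=650.0, logp=3.0, hbd=2, hba=8)
at_the_limits = Properties(mw=500.0, logp=5.0, hbd=5, hba=10)
```

Like the workflow's filter, this predicate requires all four criteria to hold, which is stricter than formulations of the rule that tolerate a single violation.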

Development

Running Tests

invoke test
# or
pytest

Building Package

invoke build
# or
python -m build

Project Structure

oepolars/
├── oepolars/
│   ├── __init__.py              # Public API exports
│   ├── exceptions.py            # Custom exception hierarchy
│   ├── util.py                  # Utility functions
│   ├── types/
│   │   ├── __init__.py
│   │   ├── molecule.py          # MoleculeType extension
│   │   ├── design_unit.py       # DesignUnitType extension
│   │   └── display.py           # DisplayType extension
│   ├── io/
│   │   ├── __init__.py
│   │   ├── readers.py           # Eager file readers
│   │   └── scanners.py          # Lazy file scanners
│   └── namespaces/
│       ├── __init__.py
│       ├── dataframe.py         # DataFrame.chem accessor
│       ├── lazyframe.py         # LazyFrame.chem accessor
│       └── series.py            # Series.chem accessor
├── tests/                       # Test suite
├── examples/                    # Jupyter notebooks
└── pyproject.toml              # Project configuration

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Scott Arne Johnson

Email: scott.arne.johnson@gmail.com

Related Projects

OEPandas - Sister project providing OpenEye integration with Pandas DataFrames
OpenEye Toolkits - The underlying cheminformatics toolkit
Polars - Lightning-fast DataFrame library that OEPolars extends


Version 0.3.0 (May 7, 2026)
