Chemistry-aware Polars enabled by OpenEye

OEPolars

Python 3.11+ | OpenEye Toolkits | Polars 1.37+ | License: MIT

Deep integration of OpenEye objects into Polars DataFrames with native support for molecules and design units.

OEPolars extends Polars with custom extension types that store OpenEye OEMol and OEDesignUnit objects as first-class DataFrame column types. This enables seamless interoperability between OpenEye's cheminformatics capabilities and Polars' high-performance data analysis workflows, including lazy evaluation for large-scale datasets.



Installation

Requirements

| Package | Version |
| --- | --- |
| Python | 3.11+ |
| polars | 1.37.1+ |
| OpenEye Toolkits | 2023.1.0+ |
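
One way to confirm that an environment meets these minimums is to compare the installed versions against the table. The sketch below uses only the standard library. The distribution names `polars` and `OpenEye-toolkits` are assumptions about how the packages are published, and `parse_version` and `check_environment` are illustrative helpers, not part of OEPolars.

```python
import sys
from importlib import metadata

# Minimum versions from the requirements table. The distribution names
# are assumptions; adjust them if your environment uses different ones.
MINIMUMS = {
    "polars": (1, 37, 1),
    "OpenEye-toolkits": (2023, 1, 0),
}


def parse_version(text):
    """Turn '1.37.1' into (1, 37, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment():
    """Return a list of readable problems; an empty list means nothing was found."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    for dist, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(dist))
        except metadata.PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        if installed < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{dist} {wanted}+ required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the file as a script prints one line per unmet requirement and nothing when every check passes.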

OpenEye Toolkits License

The OpenEye Toolkits require a commercial license. Free licenses are available for academic and non-profit institutions; visit OpenEye Scientific to request an academic license.

Install from PyPI

pip install oepolars

Development Installation

git clone https://github.com/scott-arne/oepolars.git
cd oepolars
pip install -e ".[dev]"

Quick Start

import oepolars as oepl
import polars as pl
from openeye import oechem

# Load molecule data from various formats
df = oepl.read_sdf("molecules.sdf")
df = oepl.read_oeb("molecules.oeb.gz")
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")

# Use Polars normally with molecules
df = df.with_columns(
    num_oxygens=df["Molecule"].map_elements(
        lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()),
        return_dtype=pl.Int64
    )
)

# Generate SMILES strings
smiles = df["Molecule"].chem.to_smiles()

# Filter invalid molecules
df_valid = df.chem.filter_valid("Molecule")

# Write to file
df.chem.write_sdf("output.sdf", molecule_column="Molecule")

Basic Usage

Reading Molecular Data

OEPolars provides readers for all major chemical file formats supported by the OpenEye Toolkits:

import oepolars as oepl

# SDF files - molecules with SD data as columns
df = oepl.read_sdf("molecules.sdf")

# OEB files (binary format, supports conformers)
df = oepl.read_oeb("molecules.oeb.gz")

# SMILES files
df = oepl.read_smi("molecules.smi")

# CSV files with SMILES column
df = oepl.read_molecule_csv("data.csv", smiles_column="SMILES")

# OERecord databases
df = oepl.read_oedb("records.oedb")

# Design unit files (protein-ligand complexes)
df = oepl.read_oedu("complexes.oedu")

# Parquet files with serialized molecules
df = oepl.read_parquet("molecules.parquet", molecule_columns=["Molecule"])
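
When input files arrive in mixed formats, the reader can be chosen from the file extension. The following is a minimal sketch, not an OEPolars feature: `READERS` and `reader_name_for` are hypothetical names, the extensions are the ones used in the examples above, and treating a trailing `.gz` as a compression marker for every format is an assumption.

```python
from pathlib import Path

# File extensions used in the examples above, mapped to reader names.
READERS = {
    ".sdf": "read_sdf",
    ".oeb": "read_oeb",
    ".smi": "read_smi",
    ".csv": "read_molecule_csv",
    ".oedb": "read_oedb",
    ".oedu": "read_oedu",
    ".parquet": "read_parquet",
}


def reader_name_for(path):
    """Return the name of the reader matching the file extension of `path`."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Assumption: a trailing .gz marks a compressed file, as in "molecules.oeb.gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in READERS:
        raise ValueError(f"No reader known for {path!r}")
    return READERS[suffixes[-1]]


print(reader_name_for("molecules.oeb.gz"))  # read_oeb
print(reader_name_for("complexes.oedu"))    # read_oedu
```

The returned name can then be looked up on the package, for example `getattr(oepl, reader_name_for(path))(path)`. Note that `read_molecule_csv()` and `read_parquet()` each take a required second argument, so those two need it passed explicitly.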

Lazy Reading with Scanners

OEPolars provides lazy scanners for query optimization on large datasets:

import oepolars as oepl
import polars as pl

# Lazy reading - operations are optimized before execution
lf = oepl.scan_sdf("large_dataset.sdf")
lf = oepl.scan_oeb("large_dataset.oeb.gz")
lf = oepl.scan_smi("large_dataset.smi")
lf = oepl.scan_molecule_csv("large_dataset.csv", smiles_column="SMILES")
lf = oepl.scan_oedb("records.oedb")
lf = oepl.scan_oedu("complexes.oedu")
lf = oepl.scan_parquet("molecules.parquet", molecule_columns=["Molecule"])

# Apply filters before collecting
result = (
    lf
    .filter(pl.col("MolWt") > 300)
    .select(["Molecule", "Title", "MolWt"])
    .collect()
)

Working with Molecules

Once loaded, molecules are stored as MoleculeType columns. Standard Polars operations work seamlessly:

import polars as pl
from openeye import oechem, oemolprop

# Standard Polars operations
filtered_df = df.filter(pl.col("MolWt") > 200)
sorted_df = df.sort("Title")

# Apply OpenEye functions directly
# (OEGetXLogP belongs to MolProp TK, so it is imported from oemolprop)
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oemolprop.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
)

# Use the .chem accessor for molecular operations
smiles_series = df["Molecule"].chem.to_smiles()
copies = df["Molecule"].chem.copy_molecules()

# Substructure searching with SMARTS
# ([OX2H1] requires a hydroxyl oxygen, so esters and carboxylates are not matched)
has_carboxylic_acid = df["Molecule"].chem.substructure_search("[CX3](=O)[OX2H1]")
df_acids = df.filter(has_carboxylic_acid)

Design Units

Work with protein-ligand complexes stored as DesignUnitType:

import oepolars as oepl
import polars as pl
from openeye import oechem

# Read design unit file
df = oepl.read_oedu("protein_ligand_complexes.oedu")

# Extract components using .chem accessor
df = df.with_columns(
    Ligand=df["Design_Unit"].chem.get_ligands(),
    Protein=df["Design_Unit"].chem.get_proteins(),
)

# Analyze components
df = df.with_columns(
    ligand_mw=pl.col("Ligand").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64)
)

# Deep copy design units
df = df.with_columns(
    DU_copy=df["Design_Unit"].chem.copy_design_units()
)

Data Quality and Filtering

OEPolars provides methods to check and filter molecule validity:

# Check which molecules are valid
validity = df["Molecule"].chem.is_valid()
print(f"Valid molecules: {validity.sum()}")
print(f"Invalid molecules: {(~validity).sum()}")

# Filter to keep only valid molecules
df_valid = df.chem.filter_valid("Molecule")

# Filter multiple columns at once
df_valid = df.chem.filter_valid(["Molecule", "Product"])

# Add validity as a column for inspection
df = df.with_columns(
    is_valid=df["Molecule"].chem.is_valid()
)

Writing Data

Export DataFrames to various molecular file formats using the .chem accessor:

# Write to SDF (columns become SD tags)
df.chem.write_sdf(
    "output.sdf",
    molecule_column="Molecule",
    title_column="Name",
    sd_columns=["Activity", "MW"]  # Include as SD tags
)

# Write to SMILES file
df.chem.write_smi(
    "output.smi",
    molecule_column="Molecule",
    title_column="Name"
)

# Write to OEB format
df.chem.write_oeb(
    "output.oeb",
    molecule_column="Molecule",
    title_column="Name"
)

# Write to CSV (molecules as SMILES strings)
df.chem.write_molecule_csv(
    "output.csv",
    molecule_column="Molecule",
    smiles_column="SMILES"
)

# Write to OERecord database
df.chem.write_oedb(
    "output.oedb",
    molecule_column="Molecule"
)

# Write design units
df.chem.write_oedu(
    "output.oedu",
    design_unit_column="Design_Unit"
)

Parquet Serialization

OEPolars supports Parquet format with automatic molecule serialization, enabling efficient storage and retrieval of molecular data:

# Write to Parquet (molecules serialized to binary OEB format)
df.chem.write_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Optional: auto-detects MoleculeType columns
)

# Read from Parquet (reconstruct molecules from binary)
df = oepl.read_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]  # Required: specify which columns contain molecules
)

# Lazy reading from Parquet
lf = oepl.scan_parquet(
    "molecules.parquet",
    molecule_columns=["Molecule"]
)

API Reference

File Readers

read_sdf()

Read SD (Structure Data) files into a DataFrame.

oepl.read_sdf(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Path to SDF file |
| flavor | int | OEIFlavor_SDF_Default | OpenEye SDF reader flavor |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str, None | "Title" | Name of title column (None to skip) |
| sd_data | bool | True | Read SD data into columns |
| usecols | str, list | None | SD tags to read (None for all) |
| numeric_columns | str, list | None | Columns to convert to numeric |
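
SD tag values are stored as text in SDF files, which is why a conversion step such as `numeric_columns` is useful. The sketch below shows the kind of coercion involved. It is not the OEPolars implementation: `to_numeric` is a hypothetical helper, and mapping unparseable values to None is an assumption about reasonable behaviour rather than documented library behaviour.

```python
def to_numeric(values):
    """Convert SD tag strings to floats; missing or unparseable entries become None."""
    result = []
    for value in values:
        if value is None:
            result.append(None)
            continue
        try:
            result.append(float(value.strip()))
        except ValueError:
            result.append(None)
    return result


print(to_numeric(["312.4", " 7.1 ", "n/a", None]))  # [312.4, 7.1, None, None]
```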

read_oeb()

Read OpenEye Binary (OEB) files into a DataFrame.

oepl.read_oeb(
    filepath,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

Parameters are the same as for read_sdf().

read_smi()

Read SMILES files into a DataFrame.

oepl.read_smi(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title"
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Path to SMILES file |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str | "Title" | Name of title column |

read_molecule_csv()

Read CSV files with molecule columns.

oepl.read_molecule_csv(
    filepath,
    smiles_column,
    *,
    molecule_column="Molecule",
    drop_smiles=False,
    **csv_kwargs
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Path to CSV file |
| smiles_column | str | required | Column containing SMILES strings |
| molecule_column | str | "Molecule" | Name of new molecule column |
| drop_smiles | bool | False | Drop original SMILES column |
| **csv_kwargs | | | Additional arguments passed to pl.read_csv() |

read_oedb()

Read OpenEye Database (OERecord) files into a DataFrame.

oepl.read_oedb(
    filepath,
    *,
    molecule_column="Molecule",
    title_column="Title",
    sd_data=True,
    usecols=None,
    numeric_columns=None
)

Parameters are the same as for read_sdf(), except that there is no flavor parameter.

read_oedu()

Read Design Unit files into a DataFrame.

oepl.read_oedu(
    filepath,
    *,
    design_unit_column="Design_Unit",
    title_column="Title"
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Path to OEDU file |
| design_unit_column | str | "Design_Unit" | Name of design unit column |
| title_column | str | "Title" | Name of title column |

read_parquet()

Read Parquet files with molecule column reconstruction.

oepl.read_parquet(
    filepath,
    molecule_columns,
    **parquet_kwargs
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Path to Parquet file |
| molecule_columns | str, list | required | Column(s) containing serialized molecules |
| **parquet_kwargs | | | Additional arguments passed to pl.read_parquet() |

File Scanners (Lazy)

All scanners return pl.LazyFrame for query optimization. Parameters match their eager counterparts.

| Scanner | Description |
| --- | --- |
| scan_sdf() | Lazy reading of SDF files |
| scan_oeb() | Lazy reading of OEB files |
| scan_smi() | Lazy reading of SMILES files |
| scan_molecule_csv() | Lazy reading of CSV with SMILES |
| scan_oedb() | Lazy reading of OEDB files |
| scan_oedu() | Lazy reading of OEDU files |
| scan_parquet() | Lazy reading of Parquet files |

DataFrame Accessor Methods (df.chem.*)

Access these methods via df.chem.<method>():

as_molecule()

Convert column(s) to MoleculeType.

df.chem.as_molecule(
    columns,
    *,
    molecule_format=None
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| columns | str, list | required | Column name(s) to convert |
| molecule_format | str, int | None | Format for parsing (default: SMILES) |

filter_valid()

Filter rows to keep only those with valid molecules.

df.chem.filter_valid(columns)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| columns | str, list | required | MoleculeType column(s) to check |

detect_molecule_columns()

Auto-detect and convert molecule columns based on content.

df.chem.detect_molecule_columns(*, sample_size=25)

write_sdf()

Write DataFrame to SDF file.

df.chem.write_sdf(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None,
    flavor=None
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Output file path |
| molecule_column | str | required | Column with molecules |
| title_column | str | None | Column for titles |
| sd_columns | str, list | None | Columns to include as SD tags |
| flavor | int | None | OpenEye output flavor |

write_smi()

Write DataFrame to SMILES file.

df.chem.write_smi(
    filepath,
    molecule_column,
    *,
    title_column=None,
    flavor=None
)

write_oeb()

Write DataFrame to OEB file.

df.chem.write_oeb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)

write_molecule_csv()

Write DataFrame to CSV with molecules as SMILES strings.

df.chem.write_molecule_csv(
    filepath,
    molecule_column,
    *,
    smiles_column="smiles",
    smiles_flavor=oechem.OESMILESFlag_ISOMERIC,
    drop_molecule=True,
    **csv_kwargs
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| smiles_column | str | "smiles" | Name of SMILES column in output |
| smiles_flavor | int | OESMILESFlag_ISOMERIC | SMILES generation flavor |
| drop_molecule | bool | True | Drop molecule column from output |
| **csv_kwargs | | | Additional arguments passed to CSV writer |

write_oedb()

Write DataFrame to OERecord database.

df.chem.write_oedb(
    filepath,
    molecule_column,
    *,
    title_column=None,
    sd_columns=None
)

write_oedu()

Write DataFrame to Design Unit file.

df.chem.write_oedu(
    filepath,
    design_unit_column,
    *,
    title_column=None
)

write_parquet()

Write DataFrame to Parquet with molecule serialization.

df.chem.write_parquet(
    filepath,
    molecule_columns=None,
    **parquet_kwargs
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filepath | str, Path | required | Output file path |
| molecule_columns | str, list | None | Columns to serialize (None auto-detects) |
| **parquet_kwargs | | | Additional arguments passed to write_parquet() |

LazyFrame Accessor Methods (lf.chem.*)

Access these methods via lf.chem.<method>(). Most operations require .collect() first.

| Method | Returns | Description |
| --- | --- | --- |
| has_molecule_columns() | bool | Check if LazyFrame has molecule columns |
| molecule_columns() | list[str] | Get names of molecule columns |

Note: Operations like to_smiles(), substructure_search(), filter_valid(), and as_molecule() raise LazyOperationError on LazyFrames. Use .collect() first or apply these on eager DataFrames.
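
Code that may receive either a LazyFrame or a DataFrame can normalize its input before calling the eager-only operations. The following is a minimal sketch, not part of OEPolars: `ensure_eager` is a hypothetical helper that assumes only lazy frames expose a `collect()` method, and `FakeLazyFrame` is a stand-in so the example runs without Polars installed.

```python
def ensure_eager(frame):
    """Return an eager frame, calling .collect() when the object provides it."""
    collect = getattr(frame, "collect", None)
    return collect() if callable(collect) else frame


class FakeLazyFrame:
    """Stand-in for a LazyFrame so the sketch runs without Polars installed."""

    def __init__(self, rows):
        self._rows = rows

    def collect(self):
        return list(self._rows)


eager = ensure_eager(FakeLazyFrame([{"Title": "aspirin"}]))
print(eager)                         # [{'Title': 'aspirin'}]
print(ensure_eager(eager) is eager)  # True
```

With real frames the same pattern would read `df = ensure_eager(lf)` followed by, for example, `df["Molecule"].chem.to_smiles()`.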


Series Accessor Methods (series.chem.*)

Access these methods via series.chem.<method>():

Molecule Methods

| Method | Returns | Description |
| --- | --- | --- |
| copy_molecules() | Series[MoleculeType] | Deep copy all molecules |
| is_valid() | Series[bool] | Boolean mask of valid (non-null) molecules |
| to_smiles(flavor=OESMILESFlag_ISOMERIC) | Series[str] | Convert to SMILES strings |
| substructure_search(pattern, adjustH=False) | Series[bool] | Substructure search with SMARTS pattern |

Design Unit Methods

| Method | Returns | Description |
| --- | --- | --- |
| copy_design_units() | Series[DesignUnitType] | Deep copy all design units |
| get_ligands(clear_titles=False) | Series[MoleculeType] | Extract ligand molecules |
| get_proteins(clear_titles=False) | Series[MoleculeType] | Extract protein molecules |
| get_components(mask) | Series[MoleculeType] | Extract components by mask |

Extension Types

OEPolars provides three custom Polars extension types:

| Type Class | Extension Name | Underlying Type | Description |
| --- | --- | --- | --- |
| MoleculeType | "molecule" | oechem.OEMol | Stores molecular structures |
| DesignUnitType | "design_unit" | oechem.OEDesignUnit | Stores protein-ligand complexes |
| DisplayType | "display" | oedepict.OE2DMolDisplay | Stores 2D molecular depictions |

Examples

Comprehensive Jupyter notebooks are available in the examples/ directory:

  • 01_getting_started.ipynb - Basic usage, molecular calculations, data manipulation, validity checking
  • 02_advanced_features.ipynb - File I/O, design units, data quality filtering, performance optimization, ML integration

Example: Complete Workflow

import oepolars as oepl
import polars as pl
from openeye import oechem, oemolprop

# 1. Load data
df = oepl.read_sdf("molecules.sdf")

# 2. Filter invalid molecules
df = df.chem.filter_valid("Molecule")

# 3. Calculate properties
# (OEGetXLogP belongs to MolProp TK, so it is imported from oemolprop)
df = df.with_columns(
    MW=pl.col("Molecule").map_elements(oechem.OECalculateMolecularWeight, return_dtype=pl.Float64),
    LogP=pl.col("Molecule").map_elements(oemolprop.OEGetXLogP, return_dtype=pl.Float64),
    HBD=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()),
        return_dtype=pl.Int64
    ),
    HBA=pl.col("Molecule").map_elements(
        lambda m: oechem.OECount(m, oechem.OEIsHBondAcceptor()),
        return_dtype=pl.Int64
    ),
)

# 4. Filter by Lipinski's Rule of Five
df_druglike = df.filter(
    (pl.col("MW") <= 500) &
    (pl.col("LogP") <= 5) &
    (pl.col("HBD") <= 5) &
    (pl.col("HBA") <= 10)
)

# 5. Substructure search for carboxylic acids
# ([OX2H1] requires a hydroxyl oxygen, so esters and carboxylates are not matched)
has_acid = df_druglike["Molecule"].chem.substructure_search("[CX3](=O)[OX2H1]")
df_acids = df_druglike.filter(has_acid)

# 6. Export results
df_acids.chem.write_sdf(
    "druglike_acids.sdf",
    molecule_column="Molecule",
    title_column="Title",
    sd_columns=["MW", "LogP"]
)
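
Step 4 above rejects any molecule that exceeds even one threshold. The Rule of Five is also commonly applied with a tolerance of one violation. The plain-Python sketch below counts violations on precomputed properties so that either policy can be chosen. `Properties`, `lipinski_violations`, and `is_druglike` are hypothetical names, and the sample values are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mw: float   # molecular weight
    logp: float  # predicted logP
    hbd: int    # hydrogen-bond donors
    hba: int    # hydrogen-bond acceptors


def lipinski_violations(p):
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([p.mw > 500, p.logp > 5, p.hbd > 5, p.hba > 10])


def is_druglike(p, max_violations=0):
    """max_violations=0 mirrors the strict filter used in step 4."""
    return lipinski_violations(p) <= max_violations


# Illustrative values only, not measured data.
candidate = Properties(mw=520.0, logp=3.2, hbd=2, hba=6)
print(lipinski_violations(candidate))            # 1
print(is_druglike(candidate))                    # False
print(is_druglike(candidate, max_violations=1))  # True
```

In a DataFrame the same count could be obtained by summing the four boolean expressions before filtering.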

Development

Running Tests

invoke test
# or
pytest

Building Package

invoke build
# or
python -m build

Project Structure

oepolars/
├── oepolars/
│   ├── __init__.py              # Public API exports
│   ├── exceptions.py            # Custom exception hierarchy
│   ├── util.py                  # Utility functions
│   ├── types/
│   │   ├── __init__.py
│   │   ├── molecule.py          # MoleculeType extension
│   │   ├── design_unit.py       # DesignUnitType extension
│   │   └── display.py           # DisplayType extension
│   ├── io/
│   │   ├── __init__.py
│   │   ├── readers.py           # Eager file readers
│   │   └── scanners.py          # Lazy file scanners
│   └── namespaces/
│       ├── __init__.py
│       ├── dataframe.py         # DataFrame.chem accessor
│       ├── lazyframe.py         # LazyFrame.chem accessor
│       └── series.py            # Series.chem accessor
├── tests/                       # Test suite
├── examples/                    # Jupyter notebooks
└── pyproject.toml              # Project configuration

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author

Scott Arne Johnson


Related Projects

  • OEPandas - Sister project providing OpenEye integration with Pandas DataFrames
  • OpenEye Toolkits - The underlying cheminformatics toolkit
  • Polars - Lightning-fast DataFrame library that OEPolars extends
