Native handling of OpenEye objects in Pandas

OEPandas

Deep integration of OpenEye objects into Pandas DataFrames with native support for molecules and design units.

OEPandas extends Pandas with custom extension arrays that store OpenEye OEMol and OEDesignUnit objects as first-class DataFrame column types. This enables seamless interoperability between OpenEye's cheminformatics capabilities and Pandas' data analysis workflows.

Installation
Quick Start
Basic Usage
Writing Data
API Reference
Examples
Development
License

Installation

Requirements

Package	Version
Python	3.11+
pandas	2.2.0+
numpy	2.0.0+
OpenEye Toolkits	2025.2.1+

OpenEye Toolkits License

The OpenEye Toolkits require a commercial license, but free licenses are available to academic and non-profit institutions. Visit OpenEye Scientific to request an academic license.

Install from PyPI

pip install oepandas

Development Installation

git clone https://github.com/scott-arne/oepandas.git
cd oepandas
pip install -e ".[dev]"

Quick Start

import oepandas as oepd
from openeye import oechem

# Load molecule data from one of several formats. Each call returns a new
# DataFrame, so use the reader that matches your file.
df = oepd.read_sdf("molecules.sdf")
# df = oepd.read_oeb("molecules.oeb.gz")
# df = oepd.read_molecule_csv("data.csv", molecule_columns="SMILES")

# Use pandas normally with molecules
df["num_oxygens"] = df.Molecule.apply(lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()))

# Generate SMILES strings
smiles = df.Molecule.chem.to_smiles()

# Filter invalid molecules
df_valid = df.chem.filter_valid("Molecule")

# Write the valid molecules to file
df_valid.chem.to_sdf("output.sdf", primary_molecule_column="Molecule")

Basic Usage

Reading Molecular Data

OEPandas provides readers for the following chemical file formats supported by the OpenEye Toolkits:

import oepandas as oepd

# SDF files - molecules with SD data as columns
df = oepd.read_sdf("molecules.sdf")

# OEB files (binary format, supports conformers)
df = oepd.read_oeb("molecules.oeb.gz")

# SMILES files
df = oepd.read_smi("molecules.smi")

# CSV files with SMILES column
df = oepd.read_molecule_csv("data.csv", molecule_columns="SMILES")

# OERecord databases
df = oepd.read_oedb("records.oedb")

# Design unit files (protein-ligand complexes)
df = oepd.read_oedu("complexes.oedu")

Working with Molecules

Once loaded, molecules are stored as MoleculeDtype columns. Standard pandas operations work seamlessly:

from openeye import oechem, oemolprop

# Standard pandas operations (assumes the file supplied a numeric MolWt column)
filtered_df = df[df.MolWt > 200]
sorted_df = df.sort_values("Title")

# Apply OpenEye functions directly
df["MW"] = df.Molecule.apply(oechem.OECalculateMolecularWeight)
df["LogP"] = df.Molecule.apply(oemolprop.OEGetXLogP)  # XLogP is part of the MolProp toolkit
df["HBD"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()))

# Use the .chem accessor for molecular operations
df["SMILES"] = df.Molecule.chem.to_smiles()
df["MolCopy"] = df.Molecule.chem.copy_molecules()

# Substructure searching with SMARTS ([OH] excludes esters, which "C(=O)O" would also match)
has_carboxylic_acid = df.Molecule.chem.subsearch("C(=O)[OH]")
df_acids = df[has_carboxylic_acid]

Design Units

Work with protein-ligand complexes stored as DesignUnitDtype:

import oepandas as oepd
from openeye import oechem

# Read design unit file
df = oepd.read_oedu("protein_ligand_complexes.oedu")

# Extract components using .chem accessor
df["Ligand"] = df.Design_Unit.chem.get_ligands()
df["Protein"] = df.Design_Unit.chem.get_proteins()

# Analyze components
df["ligand_mw"] = df.Ligand.apply(oechem.OECalculateMolecularWeight)

# Deep copy design units
df["DU_copy"] = df.Design_Unit.chem.copy_design_units()

Data Quality and Filtering

OEPandas provides methods to check and filter molecule validity:

# Check which molecules are valid
validity = df.Molecule.chem.is_valid()
print(f"Valid molecules: {validity.sum()}")
print(f"Invalid molecules: {(~validity).sum()}")

# Filter to keep only valid molecules
df_valid = df.chem.filter_valid("Molecule")

# Filter multiple columns at once
df_valid = df.chem.filter_valid(["Molecule", "Product"])

# Add validity as a column for inspection
df["is_valid"] = df.Molecule.chem.is_valid()

Writing Data

Export DataFrames to various molecular file formats using the .chem accessor:

# Write to SDF (columns become SD tags)
df.chem.to_sdf(
    "output.sdf",
    primary_molecule_column="Molecule",
    title_column="Name",
    columns=["Activity", "MW"]  # Include as SD tags
)

# Write to SMILES file
df.chem.to_smi(
    "output.smi",
    primary_molecule_column="Molecule",
    title_column="Name"
)

# Write to CSV (molecules as SMILES strings)
df.chem.to_molecule_csv(
    "output.csv",
    molecule_format="smiles"
)

# Write to OERecord database
df.chem.to_oedb(
    "output.oedb",
    primary_molecule_column="Molecule"
)

API Reference

File Readers

`read_sdf()`

Read SD (Structure Data) files into a DataFrame.

oepd.read_sdf(
    filepath_or_buffer,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    add_smiles=None,
    molecule_columns=None,
    usecols=None,
    numeric=None,
    conformer_test="default",
    read_sd_data=True
)

Parameter	Type	Default	Description
`filepath_or_buffer`	str, Path, buffer	required	Path to SDF file or readable buffer
`flavor`	int	`OEIFlavor_SDF_Default`	OpenEye SDF reader flavor
`molecule_column`	str	`"Molecule"`	Name of molecule column
`title_column`	str, None	`"Title"`	Name of title column (None to skip)
`add_smiles`	bool, str, list	`None`	Add SMILES column(s)
`molecule_columns`	str, list	`None`	Additional columns to convert to molecules
`usecols`	str, list	`None`	SD tags to read (None for all)
`numeric`	str, list, dict	`None`	Columns to convert to numeric
`conformer_test`	str	`"default"`	Conformer combining strategy: "default", "absolute", "absolute_canonical", "isomeric", "omega"
`read_sd_data`	bool	`True`	Read SD data into columns

`read_oeb()`

Read OpenEye Binary (OEB) files into a DataFrame.

oepd.read_oeb(
    filepath_or_buffer,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    add_smiles=None,
    molecule_columns=None,
    read_generic_data=True,
    read_sd_data=True,
    usecols=None,
    numeric=None,
    conformer_test="default",
    combine_tags="prefix",
    sd_prefix="SD Tag: ",
    generic_prefix="Generic Tag: "
)

Parameter	Type	Default	Description
`filepath_or_buffer`	str, Path, buffer	required	Path to OEB file or readable buffer
`read_generic_data`	bool	`True`	Read generic data
`read_sd_data`	bool	`True`	Read SD data
`combine_tags`	str	`"prefix"`	Tag conflict resolution: "prefix", "prefer_sd", "prefer_generic"
`sd_prefix`	str	`"SD Tag: "`	Prefix for SD data columns
`generic_prefix`	str	`"Generic Tag: "`	Prefix for generic data columns

All other parameters behave as in read_sdf().
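
The three `combine_tags` options are easier to compare side by side. The following is a minimal, library-independent sketch of one plausible reading of them. The `merge_tags` helper is hypothetical, and whether OEPandas prefixes every column or only the conflicting ones is an assumption made here for illustration.

```python
def merge_tags(sd, generic, combine_tags="prefix",
               sd_prefix="SD Tag: ", generic_prefix="Generic Tag: "):
    """Hypothetical helper: merge SD and generic data into one column dict."""
    if combine_tags == "prefix":
        # Assumption: every column gets its source prefix, so names never collide.
        merged = {f"{sd_prefix}{k}": v for k, v in sd.items()}
        merged.update({f"{generic_prefix}{k}": v for k, v in generic.items()})
        return merged
    if combine_tags == "prefer_sd":
        # SD values win when both sources define the same tag.
        return {**generic, **sd}
    if combine_tags == "prefer_generic":
        # Generic values win when both sources define the same tag.
        return {**sd, **generic}
    raise ValueError(f"unknown combine_tags value: {combine_tags!r}")


sd = {"Activity": "5.2", "Source": "vendor"}
generic = {"Activity": 5.25}

print(merge_tags(sd, generic))
print(merge_tags(sd, generic, combine_tags="prefer_sd"))
print(merge_tags(sd, generic, combine_tags="prefer_generic"))
```

With "prefix" both values survive under distinct column names; the two "prefer" modes keep a single `Activity` column and differ only in which source wins.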

`read_smi()`

Read SMILES files into a DataFrame.

oepd.read_smi(
    filepath_or_buffer,
    *,
    cx=False,
    flavor=None,
    add_smiles=False,
    add_inchi_key=False,
    molecule_column="Molecule",
    title_column="Title",
    smiles_column_name="SMILES",
    inchi_key_column_name="InChI Key"
)

Parameter	Type	Default	Description
`filepath_or_buffer`	str, Path	required	Path to SMILES file
`cx`	bool	`False`	Read CXSMILES format
`flavor`	int	`None`	SMILES flavor
`add_smiles`	bool	`False`	Include re-canonicalized SMILES column
`add_inchi_key`	bool	`False`	Include InChI Key column
`molecule_column`	str	`"Molecule"`	Name of molecule column
`title_column`	str	`"Title"`	Name of title column
`smiles_column_name`	str	`"SMILES"`	Name of SMILES column
`inchi_key_column_name`	str	`"InChI Key"`	Name of InChI Key column

`read_molecule_csv()`

Read CSV files with molecule columns.

oepd.read_molecule_csv(
    filepath_or_buffer,
    molecule_columns,
    *,
    add_smiles=None,
    **kwargs
)

Parameter	Type	Default	Description
`filepath_or_buffer`	str, Path, buffer	required	Path to CSV file
`molecule_columns`	str, dict, "detect"	required	Column(s) containing molecules, or "detect" for auto-detection
`add_smiles`	bool, str, list	`None`	Add SMILES column(s)
`**kwargs`			Additional arguments passed to `pd.read_csv()`

`read_oedb()`

Read OpenEye Database (OERecord) files into a DataFrame.

oepd.read_oedb(
    fp,
    *,
    usecols=None,
    int_na=None
)

Parameter	Type	Default	Description
`fp`	str, Path	required	Path to OEDB file
`usecols`	str, list	`None`	Columns to read (None for all)
`int_na`	int	`None`	Value substituted for missing integers (None stores them as float NaN)
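
The `int_na` option exists because a NumPy-backed integer column cannot hold a missing value. A small sketch of the trade-off follows; `fill_missing_ints` is a hypothetical helper that mirrors the documented behavior, not the reader itself.

```python
import math


def fill_missing_ints(values, int_na=None):
    """Hypothetical helper mirroring the documented int_na behavior."""
    if int_na is None:
        # No sentinel: fall back to floats so missing entries can be NaN.
        return [math.nan if v is None else float(v) for v in values]
    # Sentinel given: every entry stays an integer.
    return [int_na if v is None else int(v) for v in values]


raw = [3, None, 7]
as_float = fill_missing_ints(raw)
as_int = fill_missing_ints(raw, int_na=-1)
print(as_float)
print(as_int)
```

A sentinel keeps the column integer-typed, at the cost of choosing a value that can never occur in real data.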

`read_oedu()`

Read Design Unit files into a DataFrame.

oepd.read_oedu(
    filepath_or_buffer,
    *,
    design_unit_column="Design_Unit",
    title_column="Title",
    generic_data=True
)

Parameter	Type	Default	Description
`filepath_or_buffer`	str, Path	required	Path to OEDU file
`design_unit_column`	str	`"Design_Unit"`	Name of design unit column
`title_column`	str	`"Title"`	Name of title column
`generic_data`	bool	`True`	Read generic data into columns

DataFrame Accessor Methods (`df.chem.*`)

Access these methods via df.chem.<method>():

`as_molecule()`

Convert column(s) to MoleculeDtype.

df.chem.as_molecule(
    columns,
    *,
    molecule_format=None,
    inplace=False
)

Parameter	Type	Default	Description
`columns`	str, list	required	Column name(s) to convert
`molecule_format`	str, int	`None`	Format for parsing (default: SMILES)
`inplace`	bool	`False`	Modify DataFrame in place

`as_design_unit()`

Convert column(s) to DesignUnitDtype.

df.chem.as_design_unit(columns, *, inplace=False)

`filter_valid()`

Filter rows to keep only those with valid molecules.

df.chem.filter_valid(columns, *, inplace=False)

Parameter	Type	Default	Description
`columns`	str, list	required	MoleculeDtype column(s) to check
`inplace`	bool	`False`	Modify DataFrame in place

`detect_molecule_columns()`

Auto-detect and convert molecule columns based on predominant type.

df.chem.detect_molecule_columns(*, sample_size=25)

`to_sdf()`

Write DataFrame to SDF file.

df.chem.to_sdf(
    fp,
    primary_molecule_column,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_tag="index",
    secondary_molecules_as="smiles",
    secondary_molecule_flavor=None,
    gzip=False
)

Parameter	Type	Default	Description
`fp`	str, Path	required	Output file path
`primary_molecule_column`	str	required	Column with molecules
`title_column`	str	`None`	Column for titles
`columns`	str, list	`None`	Columns to include as SD tags (None for all)
`index`	bool	`True`	Include index as SD tag
`index_tag`	str	`"index"`	Name of index SD tag
`secondary_molecules_as`	str, int	`"smiles"`	Encoding for other molecule columns
`gzip`	bool	`False`	Gzip compress output
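
For readers unfamiliar with the SDF layout: each SD tag follows the molfile block as a `> <TAG>` header line, a value line, and a blank line, and every record ends with `$$$$`. The sketch below renders only that data section. `format_sd_data` is a hypothetical helper for illustration and says nothing about how OEPandas writes files internally.

```python
def format_sd_data(row, columns, index=None, index_tag="index"):
    """Hypothetical helper: render the SD data section of one SDF record."""
    lines = []
    if index is not None:
        # Mirrors the documented index / index_tag options.
        lines += [f"> <{index_tag}>", str(index), ""]
    for name in columns:
        # Each data item: header line, value line, blank separator line.
        lines += [f"> <{name}>", str(row[name]), ""]
    lines.append("$$$$")  # record terminator
    return "\n".join(lines) + "\n"


row = {"Activity": 5.2, "MW": 46.07}
sd_block = format_sd_data(row, ["Activity", "MW"], index=0)
print(sd_block, end="")
```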

`to_smi()`

Write DataFrame to SMILES file.

df.chem.to_smi(
    fp,
    primary_molecule_column,
    *,
    flavor=None,
    molecule_format=oechem.OEFormat_SMI,
    title_column=None,
    gzip=False
)
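
A SMILES file is conventionally one molecule per line: the SMILES string, whitespace, then an optional title. The sketch below writes that layout with the standard library only, to show what the `title_column` and `gzip` options conceptually control. `write_smi` is a hypothetical stand-in, not the OEPandas implementation.

```python
import gzip
import os
import tempfile


def write_smi(path, records, use_gzip=False):
    """Hypothetical helper: write (smiles, title) pairs, one per line."""
    opener = gzip.open if use_gzip else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for smiles, title in records:
            handle.write(f"{smiles} {title}\n")


records = [("CCO", "ethanol"), ("c1ccccc1", "benzene")]
path = os.path.join(tempfile.mkdtemp(), "output.smi.gz")
write_smi(path, records, use_gzip=True)

# Read the compressed file back to confirm the layout.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```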

`to_molecule_csv()`

Write DataFrame to CSV with molecules as strings.

df.chem.to_molecule_csv(
    fp,
    *,
    molecule_format="smiles",
    flavor=None,
    gzip=False,
    b64encode=False,
    columns=None,
    index=True,
    sep=',',
    **kwargs
)

Parameter	Type	Default	Description
`molecule_format`	str, int	`"smiles"`	Output format for molecules
`b64encode`	bool	`False`	Base64 encode molecule strings
`**kwargs`			Additional arguments passed to pandas CSV writer
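
Base64 encoding matters mostly for multi-line formats such as SDF, whose line breaks are awkward inside a CSV cell. Below is a stdlib-only sketch of the round trip. The molfile text is a placeholder, and how OEPandas combines `gzip` with `b64encode` internally is not covered here.

```python
import base64
import csv
import io

# Placeholder multi-line payload standing in for an SDF record.
mol_text = "ethanol\n  placeholder molfile block\n$$$$\n"

# Encode to a single-line ASCII string that is safe inside a CSV cell.
encoded = base64.b64encode(mol_text.encode("utf-8")).decode("ascii")

# Write and re-read a one-row CSV to confirm the cell survives intact.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Molecule"])
writer.writerow(["ethanol", encoded])

buffer.seek(0)
rows = list(csv.reader(buffer))
decoded = base64.b64decode(rows[1][1]).decode("utf-8")
print(decoded == mol_text)
```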

`to_oedb()`

Write DataFrame to OERecord database.

df.chem.to_oedb(
    fp,
    primary_molecule_column=None,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_label="index",
    sample_size=25,
    safe=True
)

Parameter	Type	Default	Description
`primary_molecule_column`	str	`None`	Molecule column (None creates OERecord, not OEMolRecord)
`sample_size`	int	`25`	Sample size for type detection
`safe`	bool	`True`	Type check before writing

Series Accessor Methods (`series.chem.*`)

Access these methods via series.chem.<method>():

Properties

Property	Returns	Description
`metadata`	`dict`	Access metadata dictionary for attaching arbitrary data to a series

Molecule Methods

Method	Returns	Description
`copy_molecules()`	`Series[MoleculeDtype]`	Deep copy all molecules
`is_valid()`	`Series[bool]`	Boolean mask of valid molecules
`as_molecule(molecule_format=None)`	`Series[MoleculeDtype]`	Convert series to molecules
`to_molecule(molecule_format=None)`	`Series[MoleculeDtype]`	Convert from strings to molecules
`to_molecule_bytes(molecule_format=OEFormat_SMI, flavor=None, gzip=False)`	`Series[bytes]`	Convert to byte strings
`to_molecule_strings(molecule_format="smiles", flavor=None, gzip=False, b64encode=False)`	`Series[str]`	Convert to string representations
`to_smiles(flavor=OESMILESFlag_ISOMERIC)`	`Series[str]`	Convert to SMILES strings
`subsearch(pattern, adjustH=False)`	`Series[bool]`	Substructure search with SMARTS pattern

Design Unit Methods

Method	Returns	Description
`copy_design_units()`	`Series[DesignUnitDtype]`	Deep copy all design units
`get_ligands(clear_titles=False)`	`Series[MoleculeDtype]`	Extract ligand molecules
`get_proteins(clear_titles=False)`	`Series[MoleculeDtype]`	Extract protein molecules
`get_components(mask)`	`Series[MoleculeDtype]`	Extract components by mask
`as_design_unit()`	`Series[DesignUnitDtype]`	Convert series to design units

Extension Arrays and Dtypes

OEPandas provides three custom Pandas extension types:

Array Class	Dtype Class	Underlying Type	Description
`MoleculeArray`	`MoleculeDtype`	`oechem.OEMol`	Stores molecular structures
`DesignUnitArray`	`DesignUnitDtype`	`oechem.OEDesignUnit`	Stores protein-ligand complexes
`DisplayArray`	`DisplayDtype`	`oedepict.OE2DMolDisplay`	Stores 2D molecular depictions

MoleculeArray Class Methods

from openeye import oechem
from oepandas.arrays.molecule import MoleculeArray  # module path as listed under Project Structure

# Create from file
arr = MoleculeArray.read_sdf("file.sdf")
arr = MoleculeArray.read_oeb("file.oeb")
arr = MoleculeArray.read_smi("file.smi")

# Create from sequences
arr = MoleculeArray._from_sequence(["CCO", "c1ccccc1"])
arr = MoleculeArray._from_sequence_of_strings(["CCO", "c1ccccc1"])

# Conversion methods
smiles = arr.to_smiles(flavor=oechem.OESMILESFlag_ISOMERIC)
strings = arr.to_molecule_strings(molecule_format="sdf")
bytes_arr = arr.to_molecule_bytes(molecule_format=oechem.OEFormat_OEB)

# Substructure searching
matches = arr.subsearch("c1ccccc1")

# Utility methods
arr.deepcopy()        # Deep copy
arr.valid()           # Boolean mask of valid molecules
arr.isna()            # Boolean mask of None values
arr.dropna()          # Remove None values
arr.fillna(value)     # Fill None values

DesignUnitArray Class Methods

# Create from file
arr = DesignUnitArray.read_oedu("file.oedu")

# Extract components (returns MoleculeArray)
ligands = arr.get_ligands(clear_titles=False)
proteins = arr.get_proteins(clear_titles=False)
components = arr.get_components(mask)

# Utility methods
arr.deepcopy()        # Deep copy
arr.valid()           # Boolean mask of valid design units

Examples

Comprehensive Jupyter notebooks are available in the examples/ directory:

01_getting_started.ipynb - Basic usage, molecular calculations, data manipulation, validity checking
02_advanced_features.ipynb - File I/O, design units, data quality filtering, performance optimization, ML integration

Example: Complete Workflow

import oepandas as oepd
from openeye import oechem, oemolprop

# 1. Load data
df = oepd.read_sdf("molecules.sdf", add_smiles=True)

# 2. Filter invalid molecules
df = df.chem.filter_valid("Molecule")

# 3. Calculate properties (XLogP comes from the MolProp toolkit)
df["MW"] = df.Molecule.apply(oechem.OECalculateMolecularWeight)
df["LogP"] = df.Molecule.apply(oemolprop.OEGetXLogP)
df["HBD"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()))
df["HBA"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondAcceptor()))

# 4. Filter by Lipinski's Rule of Five
lipinski = (
    (df.MW <= 500) &
    (df.LogP <= 5) &
    (df.HBD <= 5) &
    (df.HBA <= 10)
)
df_druglike = df[lipinski]

# 5. Substructure search for carboxylic acids ([OH] keeps esters out of the match)
has_acid = df_druglike.Molecule.chem.subsearch("C(=O)[OH]")
df_acids = df_druglike[has_acid]

# 6. Export results
df_acids.chem.to_sdf(
    "druglike_acids.sdf",
    primary_molecule_column="Molecule",
    title_column="Title",
    columns=["MW", "LogP"]
)
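
The Lipinski mask in step 4 is plain boolean logic, so it can be sanity-checked without any toolkit. Here is a minimal sketch using hand-entered, illustrative property values (not computed by OpenEye) and a hypothetical `passes_rule_of_five` helper:

```python
def passes_rule_of_five(mw, logp, hbd, hba):
    """Same thresholds as the DataFrame mask in the workflow above."""
    return mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10


# Illustrative property values; not toolkit output.
candidates = {
    "small_polar": {"mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4},
    "too_heavy": {"mw": 812.0, "logp": 3.1, "hbd": 4, "hba": 9},
    "too_greasy": {"mw": 410.5, "logp": 6.4, "hbd": 0, "hba": 2},
}

druglike = [name for name, props in candidates.items()
            if passes_rule_of_five(**props)]
print(druglike)
```

Only the first entry satisfies all four conditions; the other two each fail exactly one threshold, which is a convenient way to test each clause of the mask in isolation.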

Development

Running Tests

invoke test
# or
pytest

Building Package

invoke build
# or
python -m build

Project Structure

oepandas/
├── oepandas/
│   ├── __init__.py              # Public API exports
│   ├── pandas_extensions.py     # Readers, writers, accessors
│   ├── util.py                  # Utility functions
│   ├── exception.py             # Custom exceptions
│   └── arrays/
│       ├── __init__.py
│       ├── base.py              # OEExtensionArray base class
│       ├── molecule.py          # MoleculeArray, MoleculeDtype
│       ├── design_unit.py       # DesignUnitArray, DesignUnitDtype
│       └── display.py           # DisplayArray, DisplayDtype
├── tests/                       # Test suite
├── examples/                    # Jupyter notebooks
└── pyproject.toml               # Project configuration

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Scott Arne Johnson

Email: scott.arne.johnson@gmail.com

Related Projects

OpenEye Toolkits - The underlying cheminformatics toolkit
Pandas - Data analysis library that OEPandas extends
