Native handling of OpenEye objects in Pandas
OEPandas
Deep integration of OpenEye objects into Pandas DataFrames with native support for molecules and design units.
OEPandas extends Pandas with custom extension arrays that store OpenEye OEMol and OEDesignUnit objects as first-class DataFrame column types. This enables seamless interoperability between OpenEye's cheminformatics capabilities and Pandas' data analysis workflows.
Installation
Requirements
| Package | Version |
|---|---|
| Python | 3.11+ |
| pandas | 2.2.0+ |
| numpy | 2.0.0+ |
| OpenEye Toolkits | 2025.2.1+ |
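The minimums in the table can be checked in a setup script before anything is imported. The following is a minimal sketch, assuming plain dotted numeric version strings with the same number of components on both sides; `parse_version` and `meets_minimum` are hypothetical helpers and not part of OEPandas.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2.2.0' into a tuple of ints.

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Minimums copied from the requirements table.
MINIMUMS = {
    "pandas": "2.2.0",
    "numpy": "2.0.0",
    "OpenEye Toolkits": "2025.2.1",
}
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.10.0" sorts before "2.9.9", which would give the wrong answer.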
OpenEye Toolkits License
The OpenEye Toolkits require a commercial license; free licenses are available for academic and non-profit institutions. Visit OpenEye Scientific to request an academic license.
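A common way to point the toolkits at a license file is the `OE_LICENSE` environment variable. The sketch below only checks that the variable is set and names an existing file; it does not validate the license itself, and `find_license_file` is a hypothetical helper, not an OEPandas or OpenEye function.

```python
import os
from pathlib import Path


def find_license_file(environ=None):
    """Return the path named by OE_LICENSE if that file exists, else None.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    value = env.get("OE_LICENSE")
    if not value:
        return None
    path = Path(value).expanduser()
    return path if path.is_file() else None
```

Running a check like this before `import oepandas` can turn a confusing toolkit error into a clear message about a missing license.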
Install from PyPI
pip install oepandas
Development Installation
git clone https://github.com/scott-arne/oepandas.git
cd oepandas
pip install -e ".[dev]"
Quick Start
import oepandas as oepd
from openeye import oechem
# Load molecule data from various formats
df = oepd.read_sdf("molecules.sdf")
df = oepd.read_oeb("molecules.oeb.gz")
df = oepd.read_molecule_csv("data.csv", molecule_columns="SMILES")
# Use pandas normally with molecules
df["num_oxygens"] = df.Molecule.apply(lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()))
# Generate SMILES strings
smiles = df.Molecule.chem.to_smiles()
# Filter invalid molecules
df_valid = df.chem.filter_valid("Molecule")
# Write to file
df.chem.to_sdf("output.sdf", primary_molecule_column="Molecule")
Basic Usage
Reading Molecular Data
OEPandas provides readers for SDF, OEB, SMILES, CSV, OERecord database, and design unit files:
import oepandas as oepd
# SDF files - molecules with SD data as columns
df = oepd.read_sdf("molecules.sdf")
# OEB files (binary format, supports conformers)
df = oepd.read_oeb("molecules.oeb.gz")
# SMILES files
df = oepd.read_smi("molecules.smi")
# CSV files with SMILES column
df = oepd.read_molecule_csv("data.csv", molecule_columns="SMILES")
# OERecord databases
df = oepd.read_oedb("records.oedb")
# Design unit files (protein-ligand complexes)
df = oepd.read_oedu("complexes.oedu")
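For SDF input, each SD data item in a record ends up as a DataFrame column. As a rough illustration of what that data looks like on disk, the sketch below pulls the `> <tag>` items out of one record using only the standard library. `parse_sd_tags` is hypothetical: OEPandas reads tags through the OpenEye toolkit, not through code like this, and the mol block in the sample is elided.

```python
def parse_sd_tags(record: str) -> dict[str, str]:
    """Collect the `> <tag>` data items from one SD record into a dict."""
    tags: dict[str, str] = {}
    current = None          # name of the data item being read, if any
    values: list[str] = []  # value lines collected for that item
    for line in record.splitlines():
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # A header such as "> <Activity>" starts a new data item.
            current = line[line.index("<") + 1 : line.rindex(">")]
            values = []
        elif current is not None:
            if line.strip() == "" or line.startswith("$$$$"):
                # A blank line (or the record terminator) ends the item.
                tags[current] = "\n".join(values)
                current = None
            else:
                values.append(line)
    if current is not None:
        tags[current] = "\n".join(values)
    return tags


RECORD = """ethanol
  (mol block omitted)
M  END
> <Activity>
5.2

> <Name>
ethanol

$$$$
"""
```

Note that every value comes back as text (`"5.2"`, not `5.2`), which is why the SDF reader offers a `numeric` option for converting selected columns.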
Working with Molecules
Once loaded, molecules are stored as MoleculeDtype columns. Standard pandas operations work seamlessly:
from openeye import oechem, oemolprop
# Standard pandas operations (MolWt is assumed to be a numeric column in the data)
filtered_df = df[df.MolWt > 200]
sorted_df = df.sort_values("Title")
# Apply OpenEye functions directly
df["MW"] = df.Molecule.apply(oechem.OECalculateMolecularWeight)
# XLogP is provided by the OEMolProp toolkit rather than OEChem
df["LogP"] = df.Molecule.apply(oemolprop.OEGetXLogP)
df["HBD"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()))
# Use the .chem accessor for molecular operations
df["SMILES"] = df.Molecule.chem.to_smiles()
df["MolCopy"] = df.Molecule.chem.copy_molecules()
# Substructure searching with SMARTS
has_carboxylic_acid = df.Molecule.chem.subsearch("C(=O)O")
df_acids = df[has_carboxylic_acid]
Design Units
Work with protein-ligand complexes stored as DesignUnitDtype:
# Read design unit file
df = oepd.read_oedu("protein_ligand_complexes.oedu")
# Extract components using .chem accessor
df["Ligand"] = df.Design_Unit.chem.get_ligands()
df["Protein"] = df.Design_Unit.chem.get_proteins()
# Analyze components
df["ligand_mw"] = df.Ligand.apply(oechem.OECalculateMolecularWeight)
# Deep copy design units
df["DU_copy"] = df.Design_Unit.chem.copy_design_units()
Data Quality and Filtering
OEPandas provides methods to check and filter molecule validity:
# Check which molecules are valid
validity = df.Molecule.chem.is_valid()
print(f"Valid molecules: {validity.sum()}")
print(f"Invalid molecules: {(~validity).sum()}")
# Filter to keep only valid molecules
df_valid = df.chem.filter_valid("Molecule")
# Filter multiple columns at once
df_valid = df.chem.filter_valid(["Molecule", "Product"])
# Add validity as a column for inspection
df["is_valid"] = df.Molecule.chem.is_valid()
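The masking and filtering above can be pictured with plain Python lists. This is a sketch of the semantics only, under two assumptions that the source does not spell out: a missing value counts as invalid, and when several columns are given a row is kept only if every one of them is valid. The molecules are stood in for by strings, and `validity_mask` and `filter_valid_rows` are hypothetical names.

```python
def validity_mask(values, is_valid):
    """Return one boolean per value; None is treated as invalid."""
    return [value is not None and is_valid(value) for value in values]


def filter_valid_rows(rows, columns, is_valid):
    """Keep rows (dicts) whose listed columns all hold valid values.

    `columns` may be a single name or a list of names.
    """
    if isinstance(columns, str):
        columns = [columns]
    return [
        row
        for row in rows
        if all(row.get(col) is not None and is_valid(row[col]) for col in columns)
    ]
```

With string stand-ins, `bool` works as the validity test (an empty string plays the role of an empty molecule); with real data the test would be whatever the `is_valid()` accessor applies.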
Writing Data
Export DataFrames to various molecular file formats using the .chem accessor:
# Write to SDF (columns become SD tags)
df.chem.to_sdf(
    "output.sdf",
    primary_molecule_column="Molecule",
    title_column="Name",
    columns=["Activity", "MW"]  # Include as SD tags
)
# Write to SMILES file
df.chem.to_smi(
    "output.smi",
    primary_molecule_column="Molecule",
    title_column="Name"
)
# Write to CSV (molecules as SMILES strings)
df.chem.to_molecule_csv(
    "output.csv",
    molecule_format="smiles"
)
# Write to OERecord database
df.chem.to_oedb(
    "output.oedb",
    primary_molecule_column="Molecule"
)
API Reference
File Readers
read_sdf()
Read SD (Structure Data) files into a DataFrame.
oepd.read_sdf(
    filepath_or_buffer,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    add_smiles=None,
    molecule_columns=None,
    usecols=None,
    numeric=None,
    conformer_test="default",
    read_sd_data=True
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str, Path, buffer | required | Path to SDF file or readable buffer |
| flavor | int | OEIFlavor_SDF_Default | OpenEye SDF reader flavor |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str, None | "Title" | Name of title column (None to skip) |
| add_smiles | bool, str, list | None | Add SMILES column(s) |
| molecule_columns | str, list | None | Additional columns to convert to molecules |
| usecols | str, list | None | SD tags to read (None for all) |
| numeric | str, list, dict | None | Columns to convert to numeric |
| conformer_test | str | "default" | Conformer combining strategy: "default", "absolute", "absolute_canonical", "isomeric", "omega" |
| read_sd_data | bool | True | Read SD data into columns |
read_oeb()
Read OpenEye Binary (OEB) files into a DataFrame.
oepd.read_oeb(
    filepath_or_buffer,
    *,
    flavor=oechem.OEIFlavor_SDF_Default,
    molecule_column="Molecule",
    title_column="Title",
    add_smiles=None,
    molecule_columns=None,
    read_generic_data=True,
    read_sd_data=True,
    usecols=None,
    numeric=None,
    conformer_test="default",
    combine_tags="prefix",
    sd_prefix="SD Tag: ",
    generic_prefix="Generic Tag: "
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str, Path, buffer | required | Path to OEB file or readable buffer |
| read_generic_data | bool | True | Read generic data |
| read_sd_data | bool | True | Read SD data |
| combine_tags | str | "prefix" | Tag conflict resolution: "prefix", "prefer_sd", "prefer_generic" |
| sd_prefix | str | "SD Tag: " | Prefix for SD data columns |
| generic_prefix | str | "Generic Tag: " | Prefix for generic data columns |

All other parameters are the same as for read_sdf().
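The three `combine_tags` strategies decide what happens when an SD tag and a generic tag share a name. The sketch below shows one plausible reading using plain dicts. It assumes that "prefix" renames only the colliding names and leaves unique names alone; the library may well prefix every tag, so treat `combine_tag_dicts` as a hypothetical illustration and not as the OEPandas implementation.

```python
def combine_tag_dicts(
    sd,
    generic,
    strategy="prefix",
    sd_prefix="SD Tag: ",
    generic_prefix="Generic Tag: ",
):
    """Merge SD data and generic data for one molecule into a single dict."""
    if strategy == "prefer_sd":
        # On a name collision the SD value wins.
        return {**generic, **sd}
    if strategy == "prefer_generic":
        # On a name collision the generic value wins.
        return {**sd, **generic}
    if strategy == "prefix":
        shared = sd.keys() & generic.keys()
        combined = {}
        for name, value in sd.items():
            combined[sd_prefix + name if name in shared else name] = value
        for name, value in generic.items():
            combined[generic_prefix + name if name in shared else name] = value
        return combined
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Prefixing keeps both values at the cost of longer column names, while the two "prefer" strategies keep the names short but silently drop one of the conflicting values.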
read_smi()
Read SMILES files into a DataFrame.
oepd.read_smi(
    filepath_or_buffer,
    *,
    cx=False,
    flavor=None,
    add_smiles=False,
    add_inchi_key=False,
    molecule_column="Molecule",
    title_column="Title",
    smiles_column_name="SMILES",
    inchi_key_column_name="InChI Key"
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str, Path | required | Path to SMILES file |
| cx | bool | False | Read CXSMILES format |
| flavor | int | None | SMILES flavor |
| add_smiles | bool | False | Include re-canonicalized SMILES column |
| add_inchi_key | bool | False | Include InChI Key column |
| molecule_column | str | "Molecule" | Name of molecule column |
| title_column | str | "Title" | Name of title column |
| smiles_column_name | str | "SMILES" | Name of SMILES column |
| inchi_key_column_name | str | "InChI Key" | Name of InChI Key column |
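A plain SMILES file usually holds one molecule per line: the SMILES string, then whitespace, then an optional title. The sketch below splits such a line into the two pieces that would feed the molecule and title columns. It is a stdlib illustration only: `parse_smi_line` is a hypothetical helper, it performs no chemical parsing, and it does not handle the CXSMILES extensions that `cx=True` refers to.

```python
def parse_smi_line(line: str):
    """Split one .smi line into (smiles, title); return None for blank lines."""
    stripped = line.strip()
    if not stripped:
        return None
    # Split on the first run of whitespace so titles may contain spaces.
    parts = stripped.split(None, 1)
    smiles = parts[0]
    title = parts[1].strip() if len(parts) > 1 else ""
    return smiles, title
```

Splitting only once is the important detail: a title such as "benzene ring" stays intact instead of being cut at its internal space.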
read_molecule_csv()
Read CSV files with molecule columns.
oepd.read_molecule_csv(
    filepath_or_buffer,
    molecule_columns,
    *,
    add_smiles=None,
    **kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str, Path, buffer | required | Path to CSV file |
| molecule_columns | str, dict, "detect" | required | Column(s) containing molecules, or "detect" for auto-detection |
| add_smiles | bool, str, list | None | Add SMILES column(s) |
| \*\*kwargs | | | Additional arguments passed to pd.read_csv() |
read_oedb()
Read OpenEye Database (OERecord) files into a DataFrame.
oepd.read_oedb(
    fp,
    *,
    usecols=None,
    int_na=None
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| fp | str, Path | required | Path to OEDB file |
| usecols | str, list | None | Columns to read (None for all) |
| int_na | int | None | Value for integer NaN (None uses float NaN) |
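The `int_na` option exists because an integer column has no native way to represent a missing value. The sketch below shows the trade-off as described in the table: with a sentinel the column stays integer, and without one the values become floats so that NaN can mark the gaps. `fill_integer_column` is a hypothetical stdlib helper, and the exact behavior inside OEPandas is an assumption based on the parameter description.

```python
import math


def fill_integer_column(values, int_na=None):
    """Resolve missing entries (None) in a column of integers.

    With `int_na` set, missing entries become that sentinel and the column
    stays integer. With `int_na=None`, every entry is converted to float and
    missing entries become NaN.
    """
    if int_na is None:
        return [math.nan if v is None else float(v) for v in values]
    return [int_na if v is None else int(v) for v in values]
```

A sentinel such as -1 keeps integer semantics but must be a value that can never occur in the real data; the float fallback is safe but changes the column type.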
read_oedu()
Read Design Unit files into a DataFrame.
oepd.read_oedu(
    filepath_or_buffer,
    *,
    design_unit_column="Design_Unit",
    title_column="Title",
    generic_data=True
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str, Path | required | Path to OEDU file |
| design_unit_column | str | "Design_Unit" | Name of design unit column |
| title_column | str | "Title" | Name of title column |
| generic_data | bool | True | Read generic data into columns |
DataFrame Accessor Methods (df.chem.*)
Access these methods via df.chem.<method>():
as_molecule()
Convert column(s) to MoleculeDtype.
df.chem.as_molecule(
    columns,
    *,
    molecule_format=None,
    inplace=False
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| columns | str, list | required | Column name(s) to convert |
| molecule_format | str, int | None | Format for parsing (default: SMILES) |
| inplace | bool | False | Modify DataFrame in place |
as_design_unit()
Convert column(s) to DesignUnitDtype.
df.chem.as_design_unit(columns, *, inplace=False)
filter_valid()
Filter rows to keep only those with valid molecules.
df.chem.filter_valid(columns, *, inplace=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| columns | str, list | required | MoleculeDtype column(s) to check |
| inplace | bool | False | Modify DataFrame in place |
detect_molecule_columns()
Auto-detect and convert molecule columns based on predominant type.
df.chem.detect_molecule_columns(*, sample_size=25)
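Detection "based on predominant type" can be read as a majority vote over a small sample of each column. The sketch below illustrates that idea with plain dicts of lists. It makes several assumptions: the sample is the first `sample_size` non-missing values, "predominant" means more than half, and the test for "looks like a molecule" is supplied by the caller. The character-set check used here is a deliberately crude stand-in for real SMILES parsing, and all names are hypothetical.

```python
# Characters that may appear in simple SMILES strings (crude stand-in only).
SMILES_CHARS = set("BCNOPSFIHcnopsrl[]()=#@+-/\\%.0123456789")


def looks_like_smiles(value) -> bool:
    """Very rough test: a non-empty string made only of SMILES characters."""
    return isinstance(value, str) and value != "" and set(value) <= SMILES_CHARS


def is_predominantly(values, predicate, sample_size=25) -> bool:
    """True when more than half of the sampled non-missing values match."""
    sample = [v for v in values if v is not None][:sample_size]
    if not sample:
        return False
    hits = sum(1 for v in sample if predicate(v))
    return hits > len(sample) / 2


def detect_columns(table, predicate=looks_like_smiles, sample_size=25):
    """Return the names of columns whose sampled values mostly match."""
    return [
        name
        for name, values in table.items()
        if is_predominantly(values, predicate, sample_size)
    ]
```

Sampling keeps detection cheap on large tables, with the usual caveat that a column whose first rows are unrepresentative can be misclassified.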
to_sdf()
Write DataFrame to SDF file.
df.chem.to_sdf(
    fp,
    primary_molecule_column,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_tag="index",
    secondary_molecules_as="smiles",
    secondary_molecule_flavor=None,
    gzip=False
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| fp | str, Path | required | Output file path |
| primary_molecule_column | str | required | Column with molecules |
| title_column | str | None | Column for titles |
| columns | str, list | None | Columns to include as SD tags (None for all) |
| index | bool | True | Include index as SD tag |
| index_tag | str | "index" | Name of index SD tag |
| secondary_molecules_as | str, int | "smiles" | Encoding for other molecule columns |
| gzip | bool | False | Gzip compress output |
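To make the `columns`, `index`, and `index_tag` options concrete, the sketch below formats the data block that would follow a molecule in an SD record. It uses only the standard library and is a hypothetical illustration: `format_sd_data` is not an OEPandas function, and placing the index tag first is an assumption.

```python
def format_sd_data(row, columns=None, index=None, index_tag="index"):
    """Render selected values of one row (a dict) as SD data items.

    `columns=None` writes every key in `row`; `index=None` omits the index tag.
    """
    names = list(row) if columns is None else list(columns)
    lines = []
    if index is not None:
        lines += [f"> <{index_tag}>", str(index), ""]
    for name in names:
        # Each item is a header line, the value, and a terminating blank line.
        lines += [f"> <{name}>", str(row[name]), ""]
    return "\n".join(lines) + "\n" if lines else ""
```

Every value is written as text, so a round trip through SDF loses type information unless numeric columns are converted again on read.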
to_smi()
Write DataFrame to SMILES file.
df.chem.to_smi(
    fp,
    primary_molecule_column,
    *,
    flavor=None,
    molecule_format=oechem.OEFormat_SMI,
    title_column=None,
    gzip=False
)
to_molecule_csv()
Write DataFrame to CSV with molecules as strings.
df.chem.to_molecule_csv(
    fp,
    *,
    molecule_format="smiles",
    flavor=None,
    gzip=False,
    b64encode=False,
    columns=None,
    index=True,
    sep=',',
    **kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| molecule_format | str, int | "smiles" | Output format for molecules |
| b64encode | bool | False | Base64 encode molecule strings |
| \*\*kwargs | | | Additional arguments passed to pandas CSV writer |
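Base64 encoding is useful when the chosen molecule format spans several lines, because the encoded text fits in a single CSV cell. The sketch below shows the round trip with the standard library. The function names are hypothetical, and applying gzip before Base64 is an assumption about ordering, not a statement of what OEPandas does internally.

```python
import base64
import gzip


def encode_molecule_text(text: str, *, compress: bool = False) -> str:
    """Encode a molecule string so it is safe to store in one CSV cell."""
    data = text.encode("utf-8")
    if compress:
        data = gzip.compress(data)
    # b64encode produces ASCII with no embedded newlines.
    return base64.b64encode(data).decode("ascii")


def decode_molecule_text(cell: str, *, compressed: bool = False) -> str:
    """Reverse encode_molecule_text; `compressed` must match `compress`."""
    data = base64.b64decode(cell)
    if compressed:
        data = gzip.decompress(data)
    return data.decode("utf-8")
```

The reader has to know whether compression was used, so the two flags would normally be recorded alongside the file or fixed by convention.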
to_oedb()
Write DataFrame to OERecord database.
df.chem.to_oedb(
    fp,
    primary_molecule_column=None,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_label="index",
    sample_size=25,
    safe=True
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| primary_molecule_column | str | None | Molecule column (None creates OERecord, not OEMolRecord) |
| sample_size | int | 25 | Sample size for type detection |
| safe | bool | True | Type check before writing |
Series Accessor Methods (series.chem.*)
Access these methods via series.chem.<method>():
Properties
| Property | Returns | Description |
|---|---|---|
| metadata | dict | Access metadata dictionary for attaching arbitrary data to a series |
Molecule Methods
| Method | Returns | Description |
|---|---|---|
| copy_molecules() | Series[MoleculeDtype] | Deep copy all molecules |
| is_valid() | Series[bool] | Boolean mask of valid molecules |
| as_molecule(molecule_format=None) | Series[MoleculeDtype] | Convert series to molecules |
| to_molecule(molecule_format=None) | Series[MoleculeDtype] | Convert from strings to molecules |
| to_molecule_bytes(molecule_format=OEFormat_SMI, flavor=None, gzip=False) | Series[bytes] | Convert to byte strings |
| to_molecule_strings(molecule_format="smiles", flavor=None, gzip=False, b64encode=False) | Series[str] | Convert to string representations |
| to_smiles(flavor=OESMILESFlag_ISOMERIC) | Series[str] | Convert to SMILES strings |
| subsearch(pattern, adjustH=False) | Series[bool] | Substructure search with SMARTS pattern |
Design Unit Methods
| Method | Returns | Description |
|---|---|---|
| copy_design_units() | Series[DesignUnitDtype] | Deep copy all design units |
| get_ligands(clear_titles=False) | Series[MoleculeDtype] | Extract ligand molecules |
| get_proteins(clear_titles=False) | Series[MoleculeDtype] | Extract protein molecules |
| get_components(mask) | Series[MoleculeDtype] | Extract components by mask |
| as_design_unit() | Series[DesignUnitDtype] | Convert series to design units |
Extension Arrays and Dtypes
OEPandas provides three custom Pandas extension types:
| Array Class | Dtype Class | Underlying Type | Description |
|---|---|---|---|
| MoleculeArray | MoleculeDtype | oechem.OEMol | Stores molecular structures |
| DesignUnitArray | DesignUnitDtype | oechem.OEDesignUnit | Stores protein-ligand complexes |
| DisplayArray | DisplayDtype | oedepict.OE2DMolDisplay | Stores 2D molecular depictions |
MoleculeArray Class Methods
# Create from file
arr = MoleculeArray.read_sdf("file.sdf")
arr = MoleculeArray.read_oeb("file.oeb")
arr = MoleculeArray.read_smi("file.smi")
# Create from sequences
arr = MoleculeArray._from_sequence(["CCO", "c1ccccc1"])
arr = MoleculeArray._from_sequence_of_strings(["CCO", "c1ccccc1"])
# Conversion methods
smiles = arr.to_smiles(flavor=OESMILESFlag_ISOMERIC)
strings = arr.to_molecule_strings(molecule_format="sdf")
bytes_arr = arr.to_molecule_bytes(molecule_format=OEFormat_OEB)
# Substructure searching
matches = arr.subsearch("c1ccccc1")
# Utility methods
arr.deepcopy() # Deep copy
arr.valid() # Boolean mask of valid molecules
arr.isna() # Boolean mask of None values
arr.dropna() # Remove None values
arr.fillna(value) # Fill None values
DesignUnitArray Class Methods
# Create from file
arr = DesignUnitArray.read_oedu("file.oedu")
# Extract components (returns MoleculeArray)
ligands = arr.get_ligands(clear_titles=False)
proteins = arr.get_proteins(clear_titles=False)
components = arr.get_components(mask)
# Utility methods
arr.deepcopy() # Deep copy
arr.valid() # Boolean mask of valid design units
Examples
Comprehensive Jupyter notebooks are available in the examples/ directory:
- 01_getting_started.ipynb - Basic usage, molecular calculations, data manipulation, validity checking
- 02_advanced_features.ipynb - File I/O, design units, data quality filtering, performance optimization, ML integration
Example: Complete Workflow
import oepandas as oepd
from openeye import oechem, oemolprop
# 1. Load data
df = oepd.read_sdf("molecules.sdf", add_smiles=True)
# 2. Filter invalid molecules
df = df.chem.filter_valid("Molecule")
# 3. Calculate properties (XLogP comes from the OEMolProp toolkit)
df["MW"] = df.Molecule.apply(oechem.OECalculateMolecularWeight)
df["LogP"] = df.Molecule.apply(oemolprop.OEGetXLogP)
df["HBD"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondDonor()))
df["HBA"] = df.Molecule.apply(lambda m: oechem.OECount(m, oechem.OEIsHBondAcceptor()))
# 4. Filter by Lipinski's Rule of Five
lipinski = (
    (df.MW <= 500) &
    (df.LogP <= 5) &
    (df.HBD <= 5) &
    (df.HBA <= 10)
)
df_druglike = df[lipinski]
# 5. Substructure search for carboxylic acids
has_acid = df_druglike.Molecule.chem.subsearch("C(=O)O")
df_acids = df_druglike[has_acid]
# 6. Export results
df_acids.chem.to_sdf(
    "druglike_acids.sdf",
    primary_molecule_column="Molecule",
    title_column="Title",
    columns=["MW", "LogP"]
)
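The mask in step 4 keeps only molecules that satisfy all four criteria. A common variation counts violations instead, so that a threshold other than zero can be chosen. The sketch below does this for plain numbers using the same cutoffs as the workflow; `lipinski_violations` is a hypothetical helper, and the property values in any call are supplied by the caller.

```python
def lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count how many Rule of Five criteria a molecule breaks."""
    checks = [
        mw > 500,    # molecular weight
        logp > 5,    # lipophilicity
        hbd > 5,     # hydrogen-bond donors
        hba > 10,    # hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mw, logp, hbd, hba, max_violations=0) -> bool:
    """True when the violation count does not exceed `max_violations`."""
    return lipinski_violations(mw, logp, hbd, hba) <= max_violations
```

With `max_violations=0` this matches the strict mask in the workflow; applied row by row it could populate a violation-count column that is then filtered like any other numeric column.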
Development
Running Tests
invoke test
# or
pytest
Building Package
invoke build
# or
python -m build
Project Structure
oepandas/
├── oepandas/
│   ├── __init__.py              # Public API exports
│   ├── pandas_extensions.py     # Readers, writers, accessors
│   ├── util.py                  # Utility functions
│   ├── exception.py             # Custom exceptions
│   └── arrays/
│       ├── __init__.py
│       ├── base.py              # OEExtensionArray base class
│       ├── molecule.py          # MoleculeArray, MoleculeDtype
│       ├── design_unit.py       # DesignUnitArray, DesignUnitDtype
│       └── display.py           # DisplayArray, DisplayDtype
├── tests/                       # Test suite
├── examples/                    # Jupyter notebooks
└── pyproject.toml               # Project configuration
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Scott Arne Johnson
- Email: scott.arne.johnson@gmail.com
Related Projects
- OpenEye Toolkits - The underlying cheminformatics toolkit
- Pandas - Data analysis library that OEPandas extends