A synthetic data generator for drug discovery machine learning

synthbiodata

A Python package for generating synthetic drug discovery data that mimics real-world scenarios using realistic molecular descriptors and target properties.

[!WARNING]

This package generates synthetic data for testing and educational purposes only.

The data produced does not represent real biological or chemical measurements and should not be used for clinical, regulatory, or production applications.

Features

Generate synthetic molecular descriptors with realistic ranges (MW, LogP, TPSA, HBD, HBA, etc.)
Simulate target protein families and their properties (GPCR, Kinase, Protease, etc.)
Create chemical fingerprints as binary features
Calculate binding probabilities based on molecular properties
Generate ADME (Absorption, Distribution, Metabolism, Excretion) data
Support for both balanced and imbalanced datasets
Configurable data generation parameters
Polars DataFrame output for efficient data manipulation

Installation

pip install synthbiodata

or with my favourite package manager, uv:

uv pip install synthbiodata

Quick Start

from synthbiodata import generate_sample_data

# Generate molecular descriptor data with default configuration
df = generate_sample_data(data_type="molecular-descriptors")
print(f"Generated {len(df)} samples with {len(df.columns)} features")

# Generate ADME data
df_adme = generate_sample_data(data_type="adme")
print(f"Generated {len(df_adme)} samples with {len(df_adme.columns)} features")

For more control over the data generation process, you can use the configuration system:

# Import the factory functions
from synthbiodata import create_config, generate_sample_data

# For even more control, you can import specific configuration classes
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig

# Create a custom configuration for molecular descriptors
config = create_config(
    data_type="molecular-descriptors",
    n_samples=1000,
    positive_ratio=0.1,
    imbalanced=True,
    random_state=42
)

# Generate data
df = generate_sample_data(config=config)

# Print results
print(f"Total samples: {len(df)}")
print(f"Features: {len(df.columns) - 1}")  # Exclude target column
print(f"Positive ratio: {df['binds_target'].mean():.1%}")
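
To make the positive_ratio setting concrete, the following standard-library sketch shows one way a fixed fraction of positive labels could be assigned. The helper make_labels is hypothetical and is not part of synthbiodata; the package's own labelling logic may differ (for example, it may derive labels from binding probabilities).

```python
import random


def make_labels(n_samples: int, positive_ratio: float, seed: int = 42) -> list[int]:
    """Hypothetical helper: shuffled 0/1 labels with a fixed share of positives."""
    if not 0.0 <= positive_ratio <= 1.0:
        raise ValueError("positive_ratio must be between 0 and 1")
    # Number of positives is the requested share of the dataset, rounded.
    n_pos = round(n_samples * positive_ratio)
    labels = [1] * n_pos + [0] * (n_samples - n_pos)
    # Shuffle with a local generator so the result depends only on the seed.
    random.Random(seed).shuffle(labels)
    return labels


labels = make_labels(n_samples=1000, positive_ratio=0.1)
print(f"Total samples: {len(labels)}")
print(f"Positive ratio: {sum(labels) / len(labels):.1%}")
```

With n_samples=1000 and positive_ratio=0.1, exactly 100 labels are positive, which is the kind of imbalance the configuration above requests.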

Reproducible Data Generation

To generate reproducible data, use the random_state parameter:

# Generate first dataset with seed
df1 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321
)

# Generate second dataset with same seed - will be identical
df2 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321
)

# Verify datasets are identical (Polars DataFrames are compared with .equals())
assert df1.equals(df2)

# Different seed produces different data
df3 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=123
)

The random_state parameter ensures:

Consistent data generation across runs
Reproducible results for testing and validation
Easy comparison of model performance
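
As a rough illustration of why a fixed seed gives identical output, the sketch below draws values from a seeded generator using only the standard library. The function draw_logp and its mean and standard deviation are assumptions made for this example; they do not describe how synthbiodata handles random_state internally.

```python
import random


def draw_logp(n: int, seed: int) -> list[float]:
    """Hypothetical helper: draw n LogP-like values from a seeded generator."""
    # A local Random instance keeps the draws independent of global state.
    rng = random.Random(seed)
    return [rng.gauss(2.5, 1.0) for _ in range(n)]


first = draw_logp(5, seed=321)
second = draw_logp(5, seed=321)
other = draw_logp(5, seed=123)

print("Same seed, same values:", first == second)
print("Different seed, same values:", first == other)
```

Using a local generator rather than the module-level functions means other code that calls random cannot disturb the sequence.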

Data Types

Molecular Descriptors

Generate synthetic molecular data with features like:

Molecular weight, LogP, TPSA
Hydrogen bond donors/acceptors
Rotatable bonds, aromatic rings
Chemical fingerprints
Target protein families (GPCR, Kinase, Protease, etc.)
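
The sketch below shows what sampling such descriptors within bounded ranges might look like, using only the standard library. Every name, range, mean, and standard deviation here is an illustrative assumption and is not taken from synthbiodata's defaults.

```python
import random


def clipped_gauss(rng: random.Random, mean: float, std: float,
                  low: float, high: float) -> float:
    """Draw from a normal distribution and clip the value into [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)


def sample_molecule(rng: random.Random) -> dict:
    """Hypothetical sampler: one synthetic molecule as a plain dict."""
    return {
        "mw": clipped_gauss(rng, 350.0, 100.0, 150.0, 800.0),   # molecular weight
        "logp": clipped_gauss(rng, 2.5, 1.5, -2.0, 7.0),
        "tpsa": clipped_gauss(rng, 80.0, 30.0, 0.0, 200.0),
        "hbd": rng.randint(0, 5),                                # H-bond donors
        "hba": rng.randint(0, 10),                               # H-bond acceptors
        "rotatable_bonds": rng.randint(0, 12),
        "aromatic_rings": rng.randint(0, 4),
        "target_family": rng.choice(["GPCR", "Kinase", "Protease"]),
        # Binary fingerprint; 8 bits keeps the example short.
        "fingerprint": [rng.randint(0, 1) for _ in range(8)],
    }


rng = random.Random(42)
molecules = [sample_molecule(rng) for _ in range(100)]
print(f"Generated {len(molecules)} molecules with {len(molecules[0])} fields each")
```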

ADME Data

Generate ADME (Absorption, Distribution, Metabolism, Excretion) data with:

Absorption percentages
Plasma protein binding
Clearance rates and half-life
Bioavailability predictions
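
Clearance and half-life are linked in basic pharmacokinetics. Under a one-compartment model with first-order elimination, half-life equals ln(2) times the volume of distribution divided by clearance. The sketch below applies that textbook relation; the volume of distribution input and the function name are assumptions for illustration, and the README does not state whether synthbiodata derives half-life this way.

```python
import math


def half_life_hours(volume_l: float, clearance_l_per_h: float) -> float:
    """One-compartment half-life: t1/2 = ln(2) * Vd / CL.

    volume_l is the volume of distribution in litres and
    clearance_l_per_h is the clearance in litres per hour.
    """
    if volume_l <= 0 or clearance_l_per_h <= 0:
        raise ValueError("volume and clearance must be positive")
    return math.log(2) * volume_l / clearance_l_per_h


t_half = half_life_hours(volume_l=50.0, clearance_l_per_h=5.0)
print(f"Half-life: {t_half:.2f} h")
```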

Package Structure

The package is organized into several modules:

Configuration System (v1)

The configuration system uses versioned schemas for better compatibility:

from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig

Configuration options include:

BaseConfig: Common parameters
- Sample size and random seed
- Positive ratio for classification
- Train/validation/test splits
- Imbalanced dataset settings
MolecularConfig: Molecular descriptor settings
- Molecular weight ranges (min, max, mean, std)
- LogP and TPSA parameters
- Target protein families and probabilities
- Chemical fingerprint options
ADMEConfig: ADME-specific parameters
- Absorption and bioavailability settings
- Plasma protein binding ranges
- Clearance and half-life parameters
- Renal clearance ratios
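
To show the kind of parameter validation a base configuration like this could perform, here is a minimal standard-library sketch. SketchConfig and all of its field names and defaults are hypothetical; they are not the actual BaseConfig schema, which should be checked in the package source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a base config; not synthbiodata's schema."""
    n_samples: int = 1000
    positive_ratio: float = 0.5
    train_ratio: float = 0.7
    val_ratio: float = 0.15
    test_ratio: float = 0.15
    random_state: int = 42

    def __post_init__(self) -> None:
        # Reject values that cannot describe a usable dataset.
        if self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
        if not 0.0 < self.positive_ratio < 1.0:
            raise ValueError("positive_ratio must be strictly between 0 and 1")
        # The three splits must cover the whole dataset.
        total = self.train_ratio + self.val_ratio + self.test_ratio
        if abs(total - 1.0) > 1e-9:
            raise ValueError("train/val/test ratios must sum to 1")


config = SketchConfig(n_samples=500, positive_ratio=0.1)
print(config)

try:
    SketchConfig(positive_ratio=1.5)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Validating at construction time means an invalid setting fails immediately, before any data is generated.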

Data Generation

The package provides factory functions for easy data generation:

from synthbiodata import create_config, generate_sample_data

Key features:

Type-safe configuration creation
Automatic validation of parameters
Support for both balanced and imbalanced datasets
Reproducible data generation with random seeds
Efficient data output using Polars DataFrames

This version

0.0.1a0 pre-release

Sep 8, 2025
