
A synthetic data generator for drug discovery machine learning


synthbiodata


A Python package for generating synthetic drug discovery data that mimics real-world scenarios using realistic molecular descriptors and target properties.

[!WARNING]

This package generates synthetic data for testing and educational purposes only.

The data produced does not represent real biological or chemical measurements and should not be used for clinical, regulatory, or production applications.

Features

  • Generate synthetic molecular descriptors with realistic ranges (MW, LogP, TPSA, HBD, HBA, etc.)
  • Simulate target protein families and their properties (GPCR, Kinase, Protease, etc.)
  • Create chemical fingerprints as binary features
  • Calculate binding probabilities based on molecular properties
  • Generate ADME (Absorption, Distribution, Metabolism, Excretion) data
  • Support for both balanced and imbalanced datasets
  • Configurable data generation parameters
  • Polars DataFrame output for efficient data manipulation

Installation

pip install synthbiodata

or with my favourite package manager, uv:

uv pip install synthbiodata

Quick Start

from synthbiodata import generate_sample_data

# Generate molecular descriptor data with default configuration
df = generate_sample_data(data_type="molecular-descriptors")
print(f"Generated {len(df)} samples with {len(df.columns)} features")

# Generate ADME data
df_adme = generate_sample_data(data_type="adme")
print(f"Generated {len(df_adme)} samples with {len(df_adme.columns)} features")

For more control over the data generation process, you can use the configuration system:

# Import the factory functions
from synthbiodata import create_config, generate_sample_data

# For even more control, you can import specific configuration classes
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig

# Create a custom configuration for molecular descriptors
config = create_config(
    data_type="molecular-descriptors",
    n_samples=1000,
    positive_ratio=0.1,
    imbalanced=True,
    random_state=42
)

# Generate data
df = generate_sample_data(config=config)

# Print results
print(f"Total samples: {len(df)}")
print(f"Features: {len(df.columns) - 1}")  # Exclude target column
print(f"Positive ratio: {df['binds_target'].mean():.1%}")
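
To make the meaning of positive_ratio concrete, here is a minimal standard-library sketch of how a label column with a fixed share of positives could be built. It does not call synthbiodata and is not the package's implementation; make_labels is a hypothetical helper introduced only for illustration.

```python
import random


def make_labels(n_samples: int, positive_ratio: float, seed: int = 42) -> list[int]:
    """Return a shuffled 0/1 label list with a fixed share of positives.

    Hypothetical helper for illustration; not part of synthbiodata.
    """
    if not 0.0 <= positive_ratio <= 1.0:
        raise ValueError("positive_ratio must be between 0 and 1")
    n_positive = round(n_samples * positive_ratio)
    labels = [1] * n_positive + [0] * (n_samples - n_positive)
    random.Random(seed).shuffle(labels)  # local RNG, global state untouched
    return labels


labels = make_labels(n_samples=1000, positive_ratio=0.1)
print(f"Positive ratio: {sum(labels) / len(labels):.1%}")  # Positive ratio: 10.0%
```

With positive_ratio=0.1, one in ten samples is a binder, which is the kind of class imbalance a classifier trained on screening data has to cope with.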

Reproducible Data Generation

To generate reproducible data, use the random_state parameter:

# Generate first dataset with seed
df1 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321
)

# Generate second dataset with same seed - will be identical
df2 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321
)

# Verify datasets are identical (DataFrame.equals compares two Polars DataFrames)
assert df1.equals(df2)

# Different seed produces different data
df3 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=123
)

The random_state parameter ensures:

  • Consistent data generation across runs
  • Reproducible results for testing and validation
  • Easy comparison of model performance
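
The idea behind seeding can be shown without the package. The sketch below uses only Python's random module; sample_values is a hypothetical function, and how synthbiodata seeds its own generator internally is not specified here.

```python
import random


def sample_values(n: int, seed: int) -> list[float]:
    """Draw n Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, not the global one
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


run_a = sample_values(5, seed=321)
run_b = sample_values(5, seed=321)
run_c = sample_values(5, seed=123)

assert run_a == run_b  # same seed, same sequence
assert run_a != run_c  # different seed, different sequence
```

Using a dedicated generator object instead of the module-level functions keeps the result independent of any other code that touches the global random state.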

Data Types

Molecular Descriptors

Generate synthetic molecular data with features like:

  • Molecular weight, LogP, TPSA
  • Hydrogen bond donors/acceptors
  • Rotatable bonds, aromatic rings
  • Chemical fingerprints
  • Target protein families (GPCR, Kinase, Protease, etc.)
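
As a rough illustration of what such features look like, the following sketch draws a few descriptors from clipped Gaussian distributions and derives a toy binding probability from them. Everything in it is an assumption made for the example: the parameter values, the scoring formula, and the names DESCRIPTOR_PARAMS, sample_descriptor, and binding_probability are hypothetical and are not taken from synthbiodata.

```python
import math
import random

# Illustrative (mean, std, min, max) settings; assumed values, not package defaults.
DESCRIPTOR_PARAMS = {
    "molecular_weight": (350.0, 100.0, 150.0, 600.0),
    "logp": (2.5, 1.5, -2.0, 6.0),
    "tpsa": (80.0, 30.0, 0.0, 200.0),
}


def sample_descriptor(rng, mean, std, low, high):
    """Draw from a Gaussian and clip the value into [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)


def binding_probability(row):
    """Toy logistic score that favours mid-range LogP and moderate weight."""
    score = (
        1.0
        - abs(row["logp"] - 2.5) / 2.0
        - abs(row["molecular_weight"] - 350.0) / 200.0
    )
    return 1.0 / (1.0 + math.exp(-score))


rng = random.Random(42)
rows = []
for _ in range(5):
    row = {
        name: sample_descriptor(rng, *params)
        for name, params in DESCRIPTOR_PARAMS.items()
    }
    row["binding_probability"] = binding_probability(row)
    rows.append(row)

for row in rows:
    print({key: round(value, 2) for key, value in row.items()})
```

Clipping keeps every value inside a plausible range at the cost of piling some mass onto the bounds; a truncated distribution would avoid that but needs more code.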

ADME Data

Generate ADME (Absorption, Distribution, Metabolism, Excretion) data with:

  • Absorption percentages
  • Plasma protein binding
  • Clearance rates and half-life
  • Bioavailability predictions
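
These quantities are related by standard pharmacokinetic formulas, which is useful when sanity-checking generated values. The sketch below uses the one-compartment relation t1/2 = ln(2) * Vd / CL and the free fraction implied by plasma protein binding. The volume of distribution is an assumption of this example (it is not listed among the package's outputs), and both function names are hypothetical.

```python
import math


def half_life_hours(volume_of_distribution_l: float, clearance_l_per_h: float) -> float:
    """Elimination half-life for a one-compartment model: t1/2 = ln(2) * Vd / CL."""
    if clearance_l_per_h <= 0:
        raise ValueError("clearance must be positive")
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


def fraction_unbound(plasma_protein_binding_pct: float) -> float:
    """Free fraction of drug in plasma, given percent bound to plasma proteins."""
    return 1.0 - plasma_protein_binding_pct / 100.0


print(f"{half_life_hours(50.0, 5.0):.2f} h")  # 6.93 h
print(f"{fraction_unbound(95.0):.2f}")  # 0.05
```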

Package Structure

The package is organized into several modules:

Configuration System (v1)

The configuration system uses versioned schemas for better compatibility:

from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig

Configuration options include:

  • BaseConfig: Common parameters

    • Sample size and random seed
    • Positive ratio for classification
    • Train/validation/test splits
    • Imbalanced dataset settings
  • MolecularConfig: Molecular descriptor settings

    • Molecular weight ranges (min, max, mean, std)
    • LogP and TPSA parameters
    • Target protein families and probabilities
    • Chemical fingerprint options
  • ADMEConfig: ADME-specific parameters

    • Absorption and bioavailability settings
    • Plasma protein binding ranges
    • Clearance and half-life parameters
    • Renal clearance ratios
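
To show what validated, type-annotated configuration can look like, here is a small sketch built on the standard-library dataclasses module. SketchConfig and its field names (including test_size and val_size) are hypothetical stand-ins; the actual fields and validation rules of BaseConfig, MolecularConfig, and ADMEConfig may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a base configuration; not synthbiodata's class."""

    n_samples: int = 1000
    positive_ratio: float = 0.5
    test_size: float = 0.2
    val_size: float = 0.2
    random_state: int = 42

    def __post_init__(self) -> None:
        # Reject values that would make data generation meaningless.
        if self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
        if not 0.0 < self.positive_ratio < 1.0:
            raise ValueError("positive_ratio must be strictly between 0 and 1")
        if self.test_size < 0 or self.val_size < 0:
            raise ValueError("split sizes must not be negative")
        if self.test_size + self.val_size >= 1.0:
            raise ValueError("test_size and val_size must leave room for training data")

    @property
    def train_size(self) -> float:
        """Share of samples left for training after the other two splits."""
        return 1.0 - self.test_size - self.val_size


config = SketchConfig(n_samples=500, positive_ratio=0.1)
print(config)
```

Validating in __post_init__ means an invalid configuration fails at construction time, before any data is generated.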

Data Generation

The package provides factory functions for easy data generation:

from synthbiodata import create_config, generate_sample_data

Key features:

  • Type-safe configuration creation
  • Automatic validation of parameters
  • Support for both balanced and imbalanced datasets
  • Reproducible data generation with random seeds
  • Efficient data output using Polars DataFrames

