A synthetic data generator for drug discovery machine learning
synthbiodata
A Python package for generating synthetic drug discovery data that mimics real-world scenarios using realistic molecular descriptors and target properties.
[!WARNING]
This package generates synthetic data for testing and educational purposes only.
The data produced does not represent real biological or chemical measurements and should not be used for clinical, regulatory, or production applications.
Features
- Generate synthetic molecular descriptors with realistic ranges (MW, LogP, TPSA, HBD, HBA, etc.)
- Simulate target protein families and their properties (GPCR, Kinase, Protease, etc.)
- Create chemical fingerprints as binary features
- Calculate binding probabilities based on molecular properties
- Generate ADME (Absorption, Distribution, Metabolism, Excretion) data
- Support for both balanced and imbalanced datasets
- Configurable data generation parameters
- Polars DataFrame output for efficient data manipulation
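To make the idea of property-based binding probabilities concrete, here is a minimal sketch using only the standard library. The scoring rule, thresholds, and the function name `binding_probability` are assumptions chosen for illustration; they are not taken from the package, which may compute probabilities differently.

```python
import math


def binding_probability(mw: float, logp: float, tpsa: float, hbd: int, hba: int) -> float:
    """Hypothetical rule: reward drug-like property ranges, then squash with a logistic."""
    score = -2.0  # baseline log-odds, so most compounds count as non-binders
    if 300 <= mw <= 500:
        score += 1.0
    if 1.0 <= logp <= 4.0:
        score += 1.0
    if tpsa <= 140:
        score += 0.5
    if hbd <= 5 and hba <= 10:
        score += 0.5
    # The logistic function maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + math.exp(-score))


drug_like = binding_probability(mw=420.0, logp=2.5, tpsa=90.0, hbd=2, hba=6)
non_drug_like = binding_probability(mw=850.0, logp=7.2, tpsa=210.0, hbd=8, hba=14)
print(f"drug-like: {drug_like:.3f}, non-drug-like: {non_drug_like:.3f}")
```

A compound that falls inside every illustrative range gets a higher probability than one that falls outside all of them, which is the qualitative behaviour the feature describes.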
Installation
pip install synthbiodata
or with my favourite package manager, uv:
uv pip install synthbiodata
Quick Start
from synthbiodata import generate_sample_data
# Generate molecular descriptor data with default configuration
df = generate_sample_data(data_type="molecular-descriptors")
print(f"Generated {len(df)} samples with {len(df.columns)} features")
# Generate ADME data
df_adme = generate_sample_data(data_type="adme")
print(f"Generated {len(df_adme)} samples with {len(df_adme.columns)} features")
For more control over the data generation process, you can use the configuration system:
# Import the factory functions
from synthbiodata import create_config, generate_sample_data
# For even more control, you can import specific configuration classes
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig
# Create a custom configuration for molecular descriptors
config = create_config(
    data_type="molecular-descriptors",
    n_samples=1000,
    positive_ratio=0.1,
    imbalanced=True,
    random_state=42,
)
# Generate data
df = generate_sample_data(config=config)
# Print results
print(f"Total samples: {len(df)}")
print(f"Features: {len(df.columns) - 1}") # Exclude target column
print(f"Positive ratio: {df['binds_target'].mean():.1%}")
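One way to picture how a positive ratio could translate into an imbalanced target column is to label the highest-scoring samples as positive. The sketch below is a guess at the general idea, not the package's method; `assign_labels` is a hypothetical helper and the scores are plain random numbers.

```python
import random


def assign_labels(probabilities, positive_ratio):
    """Label the top `positive_ratio` fraction of samples (by score) as 1, the rest as 0."""
    n_positive = round(len(probabilities) * positive_ratio)
    # Indices sorted from highest to lowest score
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    positives = set(ranked[:n_positive])
    return [1 if i in positives else 0 for i in range(len(probabilities))]


rng = random.Random(42)
probs = [rng.random() for _ in range(1000)]  # stand-in for binding probabilities
labels = assign_labels(probs, positive_ratio=0.1)
observed_ratio = sum(labels) / len(labels)
print(f"Positive ratio: {observed_ratio:.1%}")
```

Ranking gives an exact class ratio; sampling each label from its own probability would instead give a ratio that varies from run to run.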
Reproducible Data Generation
To generate reproducible data, use the random_state parameter:
# Generate first dataset with seed
df1 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321,
)
# Generate second dataset with same seed - will be identical
df2 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321,
)
# Verify datasets are identical (Polars DataFrames are compared with .equals())
assert df1.equals(df2)
# Different seed produces different data
df3 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=123,
)
The random_state parameter ensures:
- Consistent data generation across runs
- Reproducible results for testing and validation
- Easy comparison of model performance
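The mechanism behind seeded reproducibility can be illustrated without the package. This sketch uses Python's standard `random.Random`; the package may well use a different generator internally, and `sample_weights` is a hypothetical helper.

```python
import random


def sample_weights(n: int, seed: int) -> list[float]:
    """Draw n illustrative molecular weights from a private, seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.gauss(350.0, 75.0) for _ in range(n)]


run_a = sample_weights(5, seed=321)
run_b = sample_weights(5, seed=321)
run_c = sample_weights(5, seed=123)

assert run_a == run_b  # same seed, same sequence
assert run_a != run_c  # different seed, different sequence
```

Keeping the generator local to the function means two calls with the same seed cannot interfere with each other, or with any other code that uses random numbers.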
Data Types
Molecular Descriptors
Generate synthetic molecular data with features like:
- Molecular weight, LogP, TPSA
- Hydrogen bond donors/acceptors
- Rotatable bonds, aromatic rings
- Chemical fingerprints
- Target protein families (GPCR, Kinase, Protease, etc.)
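As a rough picture of what one synthetic row might contain, the sketch below draws clipped Gaussian descriptors, integer hydrogen-bond counts, a target family, and binary fingerprint bits. All column names, ranges, and probabilities are illustrative assumptions; the package's actual schema and distributions may differ.

```python
import random

# Hypothetical (mean, std, min, max) per descriptor; values are illustrative only
DESCRIPTOR_SPECS = {
    "molecular_weight": (350.0, 100.0, 150.0, 800.0),
    "logp": (2.5, 1.5, -2.0, 7.0),
    "tpsa": (80.0, 30.0, 0.0, 200.0),
}
TARGET_FAMILIES = ["GPCR", "Kinase", "Protease"]


def sample_molecule(rng: random.Random, n_fingerprint_bits: int = 8) -> dict:
    """Build one synthetic row as a plain dictionary."""
    row = {}
    for name, (mean, std, low, high) in DESCRIPTOR_SPECS.items():
        # Gaussian draw, clipped so values stay inside the allowed range
        row[name] = min(max(rng.gauss(mean, std), low), high)
    row["hbd"] = rng.randint(0, 5)   # hydrogen bond donors
    row["hba"] = rng.randint(0, 10)  # hydrogen bond acceptors
    row["target_family"] = rng.choice(TARGET_FAMILIES)
    for i in range(n_fingerprint_bits):
        row[f"fp_{i}"] = int(rng.random() < 0.3)  # each bit is set with probability 0.3
    return row


rng = random.Random(42)
molecules = [sample_molecule(rng) for _ in range(100)]
print(molecules[0])
```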
ADME Data
Generate ADME (Absorption, Distribution, Metabolism, Excretion) data with:
- Absorption percentages
- Plasma protein binding
- Clearance rates and half-life
- Bioavailability predictions
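Clearance and half-life are linked by a standard one-compartment pharmacokinetic relation, t½ = ln(2) · Vd / CL. The sketch below applies it and also converts plasma protein binding into an unbound fraction. Volume of distribution (Vd) is not listed above and is introduced here as an assumption; whether the package derives half-life this way is not stated.

```python
import math


def half_life_hours(clearance_l_per_h: float, volume_of_distribution_l: float) -> float:
    """Elimination half-life in hours for a one-compartment model."""
    if clearance_l_per_h <= 0:
        raise ValueError("clearance must be positive")
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


def fraction_unbound(plasma_protein_binding_pct: float) -> float:
    """Fraction of drug not bound to plasma proteins (0 to 1)."""
    return 1.0 - plasma_protein_binding_pct / 100.0


t_half = half_life_hours(clearance_l_per_h=5.0, volume_of_distribution_l=50.0)
fu = fraction_unbound(95.0)
print(f"half-life: {t_half:.2f} h, fraction unbound: {fu:.2f}")
```

With Vd = 50 L and CL = 5 L/h, the half-life is 10 · ln(2), roughly 6.93 hours.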
Package Structure
The package is organized into several modules:
Configuration System (v1)
The configuration system uses versioned schemas for better compatibility:
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig
Configuration options include:
- BaseConfig: Common parameters
  - Sample size and random seed
  - Positive ratio for classification
  - Train/validation/test splits
  - Imbalanced dataset settings
- MolecularConfig: Molecular descriptor settings
  - Molecular weight ranges (min, max, mean, std)
  - LogP and TPSA parameters
  - Target protein families and probabilities
  - Chemical fingerprint options
- ADMEConfig: ADME-specific parameters
  - Absorption and bioavailability settings
  - Plasma protein binding ranges
  - Clearance and half-life parameters
  - Renal clearance ratios
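The kind of parameter validation a configuration object can perform is sketched below with a standard-library dataclass. `SketchConfig` and its fields `test_size` and `val_size` are hypothetical stand-ins; they do not reproduce the package's BaseConfig, whose actual field names and validation rules are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a base configuration; not the package's schema."""

    n_samples: int = 1000
    positive_ratio: float = 0.5
    test_size: float = 0.2
    val_size: float = 0.2
    random_state: int = 42

    def __post_init__(self):
        # Reject invalid values at construction time rather than during generation
        if self.n_samples <= 0:
            raise ValueError("n_samples must be positive")
        if not 0.0 < self.positive_ratio < 1.0:
            raise ValueError("positive_ratio must be strictly between 0 and 1")
        if self.test_size + self.val_size >= 1.0:
            raise ValueError("test_size + val_size must leave room for training data")


config = SketchConfig(n_samples=500, positive_ratio=0.1)

try:
    SketchConfig(positive_ratio=1.5)
except ValueError as exc:
    error_message = str(exc)
    print(f"Rejected: {error_message}")
```

Validating in the constructor means a bad value fails immediately, with a message that names the offending parameter.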
Data Generation
The package provides factory functions for easy data generation:
from synthbiodata import create_config, generate_sample_data
Key features:
- Type-safe configuration creation
- Automatic validation of parameters
- Support for both balanced and imbalanced datasets
- Reproducible data generation with random seeds
- Efficient data output using Polars DataFrames