Evaluation metrics for comparing real vs synthetic single-cell RNA data (scGFT)

These details have not been verified by PyPI

Project links

Project description

ScGFT_Evaluator — Full Documentation (Python)

Evaluation library for comparing real single-cell data against synthetic data generated via spectral perturbation (scGFT). It includes 9 statistical metrics covering variance structure, co-expression networks, differential expression, global distribution, and sparsity.

[!NOTE] All metrics that receive matrices expect Genes × Cells format (rows = genes, columns = cells). AnnData objects store Cells × Genes, so the code transposes internally.

Dependencies

import numpy as np
import pandas as pd
from scipy import stats
from scipy.cluster import hierarchy
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn.metrics import adjusted_rand_score
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

Additionally, input objects use anndata.AnnData (not a direct import, but that structure is expected).

Individual Metrics

1. variance(mat_real, mat_sint) → `float`

Measures whether gene-level variance is preserved between real and synthetic data.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real expression matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic expression matrix

Method: Computes the sample variance (ddof=1) for each gene, then calculates Kendall's Tau between both variance vectors.

Interpretation: τ ∈ [-1, 1]. Values close to 1 indicate that the gene variability ranking is preserved.

2. networks(mat_real, mat_sint, k_modules=10) → `float`

Evaluates whether the gene co-expression structure is preserved.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real expression matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic expression matrix
`k_modules`	`int`	Number of modules (clusters) to form (default: 10)

Method:

Computes the Pearson correlation matrix between genes
Converts to distance: d = 1 - cor
Applies hierarchical clustering (Ward) and cuts into k_modules clusters
Compares assignments using Adjusted Rand Index (ARI)

Interpretation: ARI ∈ [-1, 1]. ARI = 1 → identical clusters. ARI ≈ 0 → random assignments.

3. pairs(mat_real, mat_sint, n_pairs=1000, gene_names=None) → `float`

Overlap of the most correlated gene pairs.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real expression matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic expression matrix
`n_pairs`	`int`	How many top pairs to extract (default: 1000)
`gene_names`	`list[str]` or `None`	Gene names; if `None`, uses indices

Method: For each matrix, extracts the n_pairs with the highest |correlation| and computes the Simpson Index (intersection / min(|A|, |B|)).

Interpretation: Simpson ∈ [0, 1]. High values → the same gene pairs are highly correlated in both matrices.

4. _limma_approx(mat, covars_df) → `np.ndarray` (indices)

Private helper method. Approximation to limma (R) using gene-wise OLS.

Parameter	Type	Description
`mat`	`np.ndarray` (G×C)	Expression matrix
`covars_df`	`pd.DataFrame`	Design DataFrame. First column: `Group` (0/1). Additional columns: covariates

Method: Fits an OLS model per gene, extracts the p-value for the group coefficient, applies FDR (Benjamini-Hochberg) correction, and returns the indices of genes with p_adj < 0.05.

5. limma(adata_real, adata_sint, genes_top, col_group, group_a, group_b, covariates=None) → `dict`

Differential expression between two groups.

Parameter	Type	Description
`adata_real`	`AnnData`	Real data
`adata_sint`	`AnnData`	Synthetic data
`genes_top`	`list[str]`	List of genes to analyze
`col_group`	`str`	Column in `.obs` with group labels
`group_a`	`str`	Reference group label
`group_b`	`str`	Target group label
`covariates`	`list[str]` or `None`	Additional `.obs` columns for the model

Returns:

{'simpson': float, 'n_real': int, 'n_sint': int}

simpson: Simpson index between the DEG sets from real and synthetic data
n_real / n_sint: Number of DEGs found in each

6. mmd(pca_real, pca_sint, sigma=None, n_perms=100, sample_size=500) → `dict`

Maximum Mean Discrepancy with RBF kernel and permutation test.

Parameter	Type	Description
`pca_real`	`np.ndarray` (C×PCs)	PCA embedding of real data
`pca_sint`	`np.ndarray` (C×PCs)	PCA embedding of synthetic data
`sigma`	`float` or `None`	Kernel bandwidth; if `None`, uses the median of distances
`n_perms`	`int`	Number of permutations for the test (default: 100)
`sample_size`	`int`	Subsampling size (default: 500)

Returns:

{'distance': float, 'mmd2': float, 'sigma': float, 'p_value': float}

Interpretation: distance ≈ 0 and p_value > 0.05 → distributions are statistically indistinguishable.

7. sparsity(mat_real, mat_sint, threshold=0.0) → `dict`

Compares sparsity (percentage of zeros or near-zero values).

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic matrix
`threshold`	`float`	Values ≤ threshold are considered "zero" (default: 0.0)

Returns:

{'pct_zeros_real': float, 'pct_zeros_sint': float, 'absolute_difference': float}

8. jaccard_networks(mat_real, mat_sint, cor_threshold=0.5, n_random=100, gene_names=None) → `float`

Gene-gene network similarity using density-corrected Jaccard index.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic matrix
`cor_threshold`	`float`
`n_random`	`int`	Permutations for density correction
`gene_names`	`list[str]` or `None`	Gene names

Method: Builds a co-expression graph for each matrix (edge = |cor| ≥ threshold), computes Jaccard, then subtracts the expected Jaccard under random node permutations.

Interpretation: Values > 0 indicate more overlap than expected by chance. Can be negative.

9. mad(mat_real, mat_sint) → `float`

Mean Absolute Difference cell by cell.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic matrix (same dimensions)

Requirement: Matrices must have exactly the same dimensions (paired, gene by gene and cell by cell).

Interpretation: The lower, the more similar. Only meaningful when synthetic cells are aligned with real ones (e.g., same order).

10. mean_correlation(mat_real, mat_sint, genes_deg=None, gene_names=None) → `dict`

Correlation between mean expression profiles per gene.

Parameter	Type	Description
`mat_real`	`np.ndarray` (G×C)	Real matrix
`mat_sint`	`np.ndarray` (G×C)	Synthetic matrix
`genes_deg`	`set[str]` or `None`	If provided, filters to these genes only
`gene_names`	`list[str]` or `None`	Names corresponding to the rows

Returns:

{'R2': float, 'Pearson': float, 'Spearman': float, 'N_genes': int}

Orchestrator Function: run_all()

ScGFT_Evaluator.run_all(
    adata_real, adata_sint, genes_top,
    col_group, group_a, group_b,
    covariates=None, sparsity_threshold=1e-4
)

Runs all metrics and returns a single-row pd.DataFrame with all columns. Per-group metrics use dynamic column names based on the values of group_a and group_b.

Prerequisites:

AnnData objects must have PCA computed in adata.obsm['X_pca']
The col_group column must exist in adata.obs

Detailed Use Case

Scenario

You have a single-cell RNA dataset with oligodendrocytes. You have generated synthetic data with scGFT by applying a spectral perturbation. You want to evaluate how similar the synthetic data is to the real data.

Step 1: Prepare the Data

import scanpy as sc
from scGFT_Evaluator import ScGFT_Evaluator

# ── Load data ──
adata_real = sc.read_h5ad("oligodendrocytes_real.h5ad")
adata_sint = sc.read_h5ad("oligodendrocytes_synthetic.h5ad")

# ── Check the group column ──
print(adata_real.obs["diagnosis"].value_counts())
# Control        1200
# Parkinson's     800

Step 2: Select Genes of Interest

# Highly variable genes or those used in the perturbation
sc.pp.highly_variable_genes(adata_real, n_top_genes=2000)
genes_top = adata_real.var_names[adata_real.var['highly_variable']].tolist()

# Or use a predefined list
# genes_top = ["MOBP", "PLP1", "MBP", "OLIG1", "OLIG2", ...]

Step 3: Compute PCA (if not already done)

if 'X_pca' not in adata_real.obsm:
    sc.pp.pca(adata_real, n_comps=50)
    
if 'X_pca' not in adata_sint.obsm:
    sc.pp.pca(adata_sint, n_comps=50)

Step 4: Run the Full Evaluation

results = ScGFT_Evaluator.run_all(
    adata_real=adata_real,
    adata_sint=adata_sint,
    genes_top=genes_top,
    col_group="diagnosis",             # column with group labels
    group_a="Control",                 # reference group
    group_b="Parkinson's",             # target group
    covariates=["age", "pmi_hours"],   # covariates for the limma model
    sparsity_threshold=1e-4            # sparsity threshold
)

print(results.T)  # transpose for better readability

Step 5: Interpret the Results

Tau_Control              0.9234   ← Variance preserved in Control ✓
ARI_Control              0.6512   ← Co-expression partially preserved
Jaccard_Control          0.1823   ← Gene-gene network with moderate overlap
Pairs_Control            0.7145   ← Good correlated pairs
Tau_Parkinsons           0.8901   ← Variance preserved in Parkinson's ✓
ARI_Parkinsons           0.5634   ← Co-expression slightly worse
Jaccard_Parkinsons       0.1456   ← Lower network overlap
Pairs_Parkinsons         0.6823   ← Acceptable pairs
MAD                      0.0312   ← Low absolute difference ✓
R2_means                 0.9876   ← Very similar mean profile ✓
Pearson_means            0.9938   
Sparsity_diff            1.2340   ← Sparsity difference ~1.2% ✓
Limma_Simpson            0.4500   ← 45% of DEGs shared
Limma_n_real             234      
Limma_n_sint             189      
MMD_distance             0.0045   ← Distributions very close ✓
MMD_pvalue               0.3200   ← Not significant (p>0.05) ✓

Step 6 (Optional): Run Individual Metrics

# Variance only for a subgroup
mask_ctrl = (adata_real.obs["diagnosis"] == "Control").values
mat_ctrl_real = adata_real.X[mask_ctrl][:, idx_genes].T
mat_ctrl_sint = adata_sint.X[mask_ctrl_sint][:, idx_genes].T

if hasattr(mat_ctrl_real, "toarray"):
    mat_ctrl_real = mat_ctrl_real.toarray()

tau = ScGFT_Evaluator.variance(mat_ctrl_real, mat_ctrl_sint)
print(f"Kendall's Tau (Control): {tau:.4f}")

Step 7 (Optional): With a Completely Different Dataset

# Breast cancer dataset
adata_real_cancer = sc.read_h5ad("breast_cancer_real.h5ad")
adata_sint_cancer = sc.read_h5ad("breast_cancer_synth.h5ad")

results_cancer = ScGFT_Evaluator.run_all(
    adata_real_cancer, adata_sint_cancer, genes_cancer,
    col_group="tumor_stage",
    group_a="Normal",
    group_b="Invasive",
    covariates=["patient_age", "tumor_grade"]
)
# Columns will be: Tau_Normal, ARI_Normal, ..., Tau_Invasive, ARI_Invasive, ...

Dependency Troubleshooting

`ModuleNotFoundError: No module named 'h5py'`

scanpy depends on anndata, which in turn requires h5py. If pip reports it as installed but Python cannot import it, force a reinstall:

pip install --force-reinstall h5py

[!WARNING] Reinstalling h5py may pull numpy up to version 2.x, which breaks packages compiled with numpy 1.x (pandas, pyarrow, torch, etc.), producing errors such as:
AttributeError: _ARRAY_API not found
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x

Solution: pin numpy < 2 after reinstalling h5py:

pip install "numpy==1.26.4" "h5py>=3.7,<4"

`OSError: Could not find/load shared object file 'llvmlite.dll'`

Occurs when llvmlite (a dependency of numba, used by scanpy) is incompatible with the installed numpy version. Solution:

pip install "numpy==1.26.4" "numba==0.60.0" "llvmlite==0.43.0" "h5py>=3.7,<4"

Verified Compatible Versions

Package	Version	Notes
`numpy`	1.26.4	Required < 2.0 by `torch`, `monai`, `synthcity`, etc.
`h5py`	3.16.0	Compatible with numpy 1.26.4
`numba`	0.60.0	Compatible with numpy 1.26.4
`llvmlite`	0.43.0	Required by numba 0.60.0
`scanpy`	1.11.5	Works with the above versions

[!TIP] To verify that all dependencies import correctly:
import h5py, numpy, numba, scanpy
print(h5py.__version__, numpy.__version__, numba.__version__)

Metrics Summary and Ranges

Metric	Range	Ideal	What it evaluates
Tau	[-1, 1]	→ 1	Gene variance ranking
ARI	[-1, 1]	→ 1	Co-expression modules
Pairs (Simpson)	[0, 1]	→ 1	Top correlated pairs
Corrected Jaccard	(-∞, 1]	> 0	Gene-gene network (edges)
MAD	[0, ∞)	→ 0	Mean absolute difference
R² means	[0, 1]	→ 1	Mean expression profile
Sparsity diff	[0, 100]	→ 0	Difference in % zeros
Limma Simpson	[0, 1]	→ 1	Shared DEGs
MMD	[0, ∞)	→ 0	Distributional distance
MMD p-value	[0, 1]	> 0.05	Distribution equality test

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Apr 15, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scgft_evaluator-0.1.0.tar.gz (19.0 kB view details)

Uploaded Apr 15, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

scgft_evaluator-0.1.0-py3-none-any.whl (13.3 kB view details)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file scgft_evaluator-0.1.0.tar.gz.

File metadata

Download URL: scgft_evaluator-0.1.0.tar.gz
Upload date: Apr 15, 2026
Size: 19.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for scgft_evaluator-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`374ca81e21c8c491eb6c5121fcfe5a8af65f63f11adb8fa2eb9903f12c51b424`
MD5	`05a59e90e8f620618e783e2232930d76`
BLAKE2b-256	`cf490a6554172a87e0da78dad914382d279f9bffad9df42fed723703e7206c08`

See more details on using hashes here.

File details

Details for the file scgft_evaluator-0.1.0-py3-none-any.whl.

File metadata

Download URL: scgft_evaluator-0.1.0-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 13.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for scgft_evaluator-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c36e92c786b28d7777f8c2b58f1f2dd04f0bd02fa365f146d26de729c0b63fb`
MD5	`1d8ad56f24c2993b095d36417b1f0e8c`
BLAKE2b-256	`9875632285c0a3bb180584e1526b9e388c246ee83d136451efeb958f3c481a7c`

See more details on using hashes here.

scgft-evaluator 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ScGFT_Evaluator — Full Documentation (Python)

Dependencies

Individual Metrics

1. variance(mat_real, mat_sint) → float

2. networks(mat_real, mat_sint, k_modules=10) → float

3. pairs(mat_real, mat_sint, n_pairs=1000, gene_names=None) → float

4. _limma_approx(mat, covars_df) → np.ndarray (indices)

5. limma(adata_real, adata_sint, genes_top, col_group, group_a, group_b, covariates=None) → dict

6. mmd(pca_real, pca_sint, sigma=None, n_perms=100, sample_size=500) → dict

7. sparsity(mat_real, mat_sint, threshold=0.0) → dict

8. jaccard_networks(mat_real, mat_sint, cor_threshold=0.5, n_random=100, gene_names=None) → float

9. mad(mat_real, mat_sint) → float

10. mean_correlation(mat_real, mat_sint, genes_deg=None, gene_names=None) → dict

Orchestrator Function: run_all()

Detailed Use Case

Scenario

Step 1: Prepare the Data

Step 2: Select Genes of Interest

Step 3: Compute PCA (if not already done)

Step 4: Run the Full Evaluation

Step 5: Interpret the Results

Step 6 (Optional): Run Individual Metrics

Step 7 (Optional): With a Completely Different Dataset

Dependency Troubleshooting

ModuleNotFoundError: No module named 'h5py'

OSError: Could not find/load shared object file 'llvmlite.dll'

Verified Compatible Versions

Metrics Summary and Ranges

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. variance(mat_real, mat_sint) → `float`

2. networks(mat_real, mat_sint, k_modules=10) → `float`

3. pairs(mat_real, mat_sint, n_pairs=1000, gene_names=None) → `float`

4. _limma_approx(mat, covars_df) → `np.ndarray` (indices)

5. limma(adata_real, adata_sint, genes_top, col_group, group_a, group_b, covariates=None) → `dict`

6. mmd(pca_real, pca_sint, sigma=None, n_perms=100, sample_size=500) → `dict`

7. sparsity(mat_real, mat_sint, threshold=0.0) → `dict`

8. jaccard_networks(mat_real, mat_sint, cor_threshold=0.5, n_random=100, gene_names=None) → `float`

9. mad(mat_real, mat_sint) → `float`

10. mean_correlation(mat_real, mat_sint, genes_deg=None, gene_names=None) → `dict`

`ModuleNotFoundError: No module named 'h5py'`

`OSError: Could not find/load shared object file 'llvmlite.dll'`