PRISM (Proteomics Reference-Integrated Signal Modeling) - RT-aware normalization for Skyline proteomics data

Skyline-PRISM

Python 3.10+ · MIT License

PRISM (Proteomics Reference-Integrated Signal Modeling) is a Python package for retention time-aware normalization of LC-MS proteomics data exported from Skyline, with robust protein quantification via several selectable rollup methods.

Key Features

  • Robust quantification with Tukey median polish: Applies to both transition→peptide and peptide→protein rollups, and handles outliers automatically without requiring them to be identified in advance
  • Reference-anchored ComBat batch correction: Full implementation of empirical Bayes batch correction with automatic QC evaluation using reference/QC samples
  • Dual-control validation: Uses intra-experiment QC samples to validate that corrections work without overfitting
  • Sample outlier detection: Automatic detection of samples with abnormally low signal (failed injections, degradation) with optional exclusion
  • Flexible protein inference: Multiple strategies for handling shared peptides (all_groups, unique_only, razor)
  • Two-arm normalization pipeline: Separate paths for peptide-level and protein-level output, with batch correction applied at the appropriate level
  • Local RT and global peptide normalization: Normalizes each sample to the median LOESS fit of signal abundance across retention time; evaluated using QC and reference samples

Installation

pip install skyline-prism

Or for development:

git clone https://github.com/maccoss/skyline-prism
cd skyline-prism
pip install -e ".[dev,viz]"

Input File Formats

PRISM accepts Skyline reports in three formats:

| Format | Extensions | Notes |
|---|---|---|
| Parquet | .parquet | Fastest I/O, smaller files (recommended) |
| CSV | .csv | Standard Skyline export format |
| TSV | .tsv, .txt | Tab-separated variant |

Skyline 24.1+ can export reports directly to parquet via File > Export > Report (select parquet format). Parquet is 2-10x faster to read and 5-10x smaller than CSV equivalents.

Multiple input files can be provided in a single run (from different batches or plates):

# Multiple parquet reports (one per plate/batch)
prism run -i plate1.parquet plate2.parquet plate3.parquet -o output/ -c config.yaml

# Mixed formats from different instruments/labs
prism run -i site1.csv site2.parquet site3.tsv -o output/ -c config.yaml

Column names are matched automatically across Skyline export variants: with spaces (Fragment Ion), with underscores (Fragment_Ion), or space-free (FragmentIon, the invariant export format).
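
The matching rule can be sketched as a small helper (hypothetical code to illustrate the idea, not PRISM's internal API): collapse spaces, underscores, and case before comparing names.

```python
import re

def canonical(name: str) -> str:
    """Collapse spaces, underscores, and case so all export variants compare equal."""
    return re.sub(r"[ _]", "", name).lower()

def match_columns(columns, wanted):
    """Map each wanted logical name to the actual column present in the report."""
    lookup = {canonical(c): c for c in columns}
    return {w: lookup[canonical(w)] for w in wanted if canonical(w) in lookup}

# All three Skyline naming variants resolve to the same logical column
cols = ["Fragment Ion", "Peptide_Modified_Sequence", "ProteinName"]
print(match_columns(cols, ["FragmentIon", "Peptide Modified Sequence", "Protein Name"]))
```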

Quick Start

Run the full pipeline

# Single parquet report (recommended format)
prism run -i skyline_report.parquet -o output_dir/ -c config.yaml -m sample_metadata.csv

# Single CSV report
prism run -i skyline_report.csv -o output_dir/ -c config.yaml -m sample_metadata.csv

# Multiple reports (merged and batch-corrected together)
prism run -i plate1.parquet plate2.parquet plate3.parquet -o output/ -c config.yaml -m metadata.csv

# Multiple metadata files are automatically merged
prism run -i data.parquet -o output/ -c config.yaml -m batch1_meta.csv batch2_meta.csv

This produces:

  • corrected_peptides.parquet - Peptide-level quantities (normalized, batch-corrected)
  • corrected_proteins.parquet - Protein-level quantities (normalized, batch-corrected)
  • peptides_rollup.parquet - Raw peptide abundances from transition rollup (before normalization)
  • proteins_raw.parquet - Raw protein abundances from peptide rollup (before normalization)
  • protein_groups.csv - Protein group definitions
  • peptide_residuals.parquet - Median polish residuals (for outlier analysis)
  • metadata.json - Processing metadata and provenance
  • prism_run_YYYYMMDD_HHMMSS.log - Detailed log file with all processing steps and timings
  • qc_report.html - HTML QC report with embedded diagnostic plots
  • qc_plots/ - Directory with QC plots (if save_plots: true)

Scale: Output files contain linear-scale abundances (raw peak area units). The pipeline operates on log2 scale internally but back-transforms to linear before writing output.
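
The scale convention amounts to a simple round trip (a sketch of the convention only, not PRISM internals): take log2 for processing, then exponentiate before writing.

```python
import numpy as np

linear = np.array([1.0e6, 2.5e6, 4.0e6])   # raw peak areas (linear scale)
log2_vals = np.log2(linear)                # pipeline works on this scale internally

# ... normalization / batch correction happen on log2 scale ...
corrected_log2 = log2_vals - 0.5           # e.g. an illustrative 0.5-log2 shift

back = 2.0 ** corrected_log2               # back-transform before writing output
print(back)                                # still in peak-area units
```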

Key output columns in protein/peptide parquet files:

  • abundance - Linear-scale abundance (normalized, batch-corrected)
  • abundance_raw - Original linear-scale abundance from Skyline
  • uncertainty - Propagated uncertainty (log2 scale, for downstream analysis)
  • cv_peptides - CV across peptides (protein-level quality metric)
  • n_peptides - Number of peptides used in rollup
  • qc_flag - QC warnings (e.g., low_peptide_count(n), single_peptide_in_sample)

Protein Metadata (parsimony “leading_” semantics)

Protein groups are formed by parsimony. Each group has a canonical representative (“leading protein”) used to label group-level metadata. The following metadata columns appear in corrected_proteins.parquet:

  • protein_group: Protein group ID (PG####)
  • leading_protein: Representative accession for the leading protein
  • leading_name: Representative protein name (from Skyline Protein)
  • leading_uniprot_id: UniProt accession for the leading protein (from Skyline Protein Accession)
  • leading_gene_name: Gene symbol for the leading protein (from Skyline Protein Gene)
  • leading_description: Full protein description for the leading protein (from Skyline Protein)

The leading_ prefix clarifies that these fields describe the group's canonical representative, avoiding ambiguity with member/subsumed proteins. Member/subsumed lists are available in protein_groups.csv.

See SPECIFICATION.md for complete column definitions.

Reproducibility: Re-run from provenance

The metadata.json output contains the complete processing configuration, enabling reproducible re-runs:

# Re-run with exact same parameters on new data
prism run -i new_data.csv -o output2/ --from-provenance output1/metadata.json

# Override specific settings while keeping others from provenance
prism run -i new_data.csv -o output2/ --from-provenance output1/metadata.json -c overrides.yaml

Regenerate QC reports

To regenerate QC reports from an existing PRISM output directory (useful after visualization updates):

prism qc -d output_dir/

This reads the processed parquet files and regenerates the QC report without re-running the full pipeline.

Generate a configuration template

Generate an annotated configuration file template:

# Full template with all options documented
prism config-template -o config.yaml

# Minimal template with only common options
prism config-template --minimal -o config.yaml

Merge multiple Skyline reports

The prism merge command merges multiple reports into a single sorted parquet file without running the full pipeline. Accepts CSV, TSV, and parquet inputs in any combination:

# Merge CSV reports
prism merge report1.csv report2.csv -o unified_data.parquet -m sample_metadata.csv

# Merge parquet reports (faster)
prism merge plate1.parquet plate2.parquet plate3.parquet -o unified_data.parquet

# Mixed formats
prism merge site1.csv site2.parquet -o unified_data.parquet -m metadata.csv

Note: prism run also accepts multiple input files directly and performs the merge internally, so prism merge is mainly useful when you want the merged parquet for downstream use outside PRISM.

Compare rollup methods

Compare peptide CVs between two PRISM runs using different rollup methods. This is useful for evaluating whether library-assisted rollup is helping to reduce interference for specific peptides.

Workflow:

  1. Run PRISM twice with different rollup methods:
# Run 1: Sum method (baseline)
prism run -i data.parquet -o output_sum/ -c config_sum.yaml

# Run 2: Library-assist method
prism run -i data.parquet -o output_lib/ -c config_lib.yaml
  2. Compare the results:
prism compare -1 output_sum/ -2 output_lib/ -o comparison_report.html

Options:

| Flag | Description | Default |
|---|---|---|
| -1, --run1 | First output directory (baseline, e.g., sum method) | Required |
| -2, --run2 | Second output directory (comparison, e.g., library_assist) | Required |
| -o, --output | Output HTML report path | comparison_report.html |
| -s, --sample-type | Sample type to analyze (reference, qc, or all) | reference |
| -n, --top-n | Number of top peptides to show in detail | 100 |
| -r, --ranking | How to rank peptides: most_improved, most_worsened, highest_cv, largest_difference | most_improved |
| --save-plots | Save individual PNG plot files | False |

Example with all options:

prism compare \
    -1 output_sum/ \
    -2 output_lib/ \
    -o comparison.html \
    --sample-type reference \
    --top-n 50 \
    --ranking most_improved \
    --save-plots

Required files in each output directory:

  • corrected_peptides.parquet - Used for CV calculations
  • merged_data.parquet - Required in run2 for library fitting visualization
  • metadata.json - Method names and sample metadata

Comparison report contents:

  • Summary statistics comparing CV distributions between methods
  • Top N peptides ranked by CV improvement
  • For each peptide: detailed library fitting visualization showing:
    • Raw transition intensities
    • Library-scaled predictions
    • Outlier detection and removal steps
    • R² values for each sample
    • Final abundance comparison

This helps identify peptides where library-assisted rollup successfully removes interfered transitions and improves quantification precision.

QC report contents

The QC report includes:

  • Intensity distribution plots: Before/after normalization comparison
  • PCA analysis: Visualize batch effects and sample clustering across processing stages
  • Control sample correlation: Heatmaps showing reproducibility of reference and QC samples
  • CV distributions: Technical variation in control samples before/after normalization
  • RT correction QC: If RT-dependent normalization is enabled, shows residuals for reference (fitted) and QC (held-out validation) samples

All plots are saved as PNGs in output_dir/qc_plots/ and embedded in the HTML report.

Processing Pipeline

The pipeline follows a two-arm design with batch correction applied at the reporting level:

Stage 1: Merge reports (CSV/TSV/parquet, streaming)
        │
        ▼
Stage 2: Transition → Peptide rollup (Tukey median polish)
        │
        ▼
Stage 2b: Peptide Global Normalization (median or VSN)
        │  [Optional: RT-aware correction - disabled by default]
        ▼
Stage 2c: Peptide ComBat Batch Correction
        │
        ├─────────────────────────────────┐
        │                                 │
        ▼                                 ▼
Stage 3: Protein Parsimony      PEPTIDE OUTPUT ARM
        │                       (corrected_peptides.parquet)
        ▼
Stage 4: Peptide → Protein Rollup (median polish)
        │
        ▼
Stage 4b: Protein Global Normalization (median)
        │
        ▼
Stage 4c: Protein ComBat Batch Correction
        │
        ▼
Stage 5: Output Generation
        │
        ▼
    PROTEIN OUTPUT ARM
(corrected_proteins.parquet)
        │
        ▼
Stage 5b: QC Report Generation (HTML + plots)

Key Implementation Notes:

  • Streaming CSV merge handles large datasets (~47GB tested) without loading into memory
  • Merge-and-sort in single pass using DuckDB for efficient multi-file processing
  • Vectorized library-assisted rollup using median polish (~10x speedup on large datasets)
  • Pre-sorted optimization skips redundant sorting when data is already sorted
  • Both peptide and protein outputs receive independent batch correction
  • Automatic log files saved to output directory with timestamp for reproducibility
  • RT correction disabled by default - search engine calibration may not generalize between samples
  • Protein parsimony applied before protein rollup to avoid double-counting shared peptides

Experimental Design Requirements

For optimal results, include these controls in each batch:

| Control Type | Description | Replicates/Batch |
|---|---|---|
| Inter-experiment reference | Commercial pool (e.g., Golden West CSF) | 1-8 |
| Intra-experiment QC | Pooled experimental samples | 1-8 |

Note: In 96-well plate formats, controls are typically placed once per row (8 replicates per batch). Smaller experiments may have as few as 1 replicate per batch.

Plus internal QCs in all samples:

  • Protein QC: Yeast enolase (16 ng/μg sample)
  • Peptide QC: PRTC (30-150 fmol/injection)

Configuration

Generate a configuration template with all options documented:

prism config-template -o config.yaml           # Full template with all options
prism config-template --minimal -o config.yaml  # Minimal template

Key settings:

# Sample type detection patterns (for automatic sample type assignment)
sample_annotations:
  reference_pattern:  # Inter-experiment reference (e.g., commercial pools)
    - "-Pool_"
    - "-Pool"
  qc_pattern:       # Intra-experiment QC (e.g., pooled experimental samples)
    - "-Carl_"
    - "-Carl"
  # Everything else is treated as experimental

# Transition to peptide rollup (if using transition-level data)
transition_rollup:
  enabled: true
  method: "adaptive"  # adaptive (recommended), median_polish, or sum
  min_transitions: 3
  learn_adaptive_weights: true  # Learn optimal weights from reference samples (default when method=adaptive)

# Sample outlier detection (one-sided, low signal only)
sample_outlier_detection:
  enabled: true
  action: "report"  # "report" to log only, "exclude" to remove from analysis
  method: "iqr"     # IQR-based on LINEAR scale (not log)
  iqr_multiplier: 1.5

# Global normalization (applied after peptide rollup)
global_normalization:
  method: "median"  # median (recommended) or vsn

# RT correction (optional, applied together with global normalization)
rt_correction:
  enabled: false  # DISABLED BY DEFAULT - search engine RT calibration may not generalize
  method: "spline"
  spline_df: 5
  per_batch: true

# Batch correction (applied at both peptide and protein levels)
batch_correction:
  enabled: true
  method: "combat"  # Full empirical Bayes implementation

# Protein parsimony (applied before protein rollup)
parsimony:
  enabled: true
  fasta_path: "/path/to/database.fasta"  # FASTA used in search
  shared_peptide_handling: "all_groups"  # recommended

# Protein rollup (peptide → protein aggregation)
protein_rollup:
  method: "sum"  # default; alternatives: median_polish, topn, ibaq, maxlfq
  min_peptides: 3  # Minimum peptides for method application (below uses sum)

# QC report generation
qc_report:
  enabled: true
  save_plots: true    # Save individual PNG files
  embed_plots: true   # Embed plots in HTML report

Automatic Batch Estimation

If batch information is not provided in the metadata file, PRISM can automatically estimate batch assignments from acquisition times. Three estimation methods are available:

  • auto (default): Try gap detection first, fall back to fixed if no gaps found
  • fixed: Divide samples evenly into N batches by acquisition time order
  • gap: Only use gap detection (skip batch correction if no gaps found)

batch_estimation:
  method: "auto"           # "auto", "fixed", or "gap"
  n_batches: null          # Number of batches for "fixed" mode (or auto fallback)
  gap_iqr_multiplier: 1.5  # For gap detection: lower = more sensitive

Example (divide into 5 artificial batches for drift correction):

batch_estimation:
  method: "fixed"
  n_batches: 5

To enable time-based batch estimation, include Result File > Acquired Time in your Skyline report. See SPECIFICATION.md for details.
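
Gap detection can be sketched as follows (an illustration of the idea, not PRISM's implementation): sort samples by acquisition time and start a new batch wherever the gap between consecutive runs exceeds Q3 + gap_iqr_multiplier × IQR of all gaps.

```python
import numpy as np

def estimate_batches_by_gap(acquired_times, iqr_multiplier=1.5):
    """Split samples into batches wherever the gap between consecutive
    acquisition times is unusually large (gap > Q3 + multiplier*IQR)."""
    times = np.asarray(acquired_times, dtype=float)
    order = np.argsort(times)
    gaps = np.diff(times[order])
    q1, q3 = np.percentile(gaps, [25, 75])
    threshold = q3 + iqr_multiplier * (q3 - q1)
    # batch index increments after each unusually large gap
    batch_sorted = np.concatenate([[0], np.cumsum(gaps > threshold)])
    batches = np.empty_like(batch_sorted)
    batches[order] = batch_sorted      # map back to original sample order
    return batches

# Two plates run ~22 hours apart (times in hours)
times = [0.0, 0.5, 1.0, 1.5, 24.0, 24.5, 25.0]
print(estimate_batches_by_gap(times))  # → [0 0 0 0 1 1 1]
```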

Sample Outlier Detection

PRISM automatically detects samples with abnormally low signal intensity, which may indicate:

  • Failed injections
  • Sample degradation
  • Instrument issues
  • Empty wells

Key features:

  • One-sided detection: Only flags samples with too little signal (not high outliers)
  • Linear scale statistics: Detection uses linear scale (not log2) to avoid scale compression
  • Two detection methods:
    • iqr: Flag samples below Q1 - 1.5×IQR (default)
    • fold_median: Flag samples with median < X% of overall median
  • Configurable action: Report only or exclude from analysis

sample_outlier_detection:
  enabled: true
  action: "report"     # "report" or "exclude"
  method: "iqr"        # "iqr" or "fold_median"
  iqr_multiplier: 1.5  # Samples below Q1 - 1.5*IQR flagged

Example log output:

Sample outlier detection (IQR method, linear scale):
  Q1=1234567, Q3=2345678, IQR=1111111
  Lower bound: 567890 (Q1 - 1.5*IQR)
  Overall median: 1789012
  OUTLIER: Sample_042 - median=24 (0.0% of overall median)
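
The one-sided rule is easy to sketch (illustrative only; flag_low_signal_samples is a hypothetical helper, not PRISM's API):

```python
import numpy as np

def flag_low_signal_samples(sample_medians, iqr_multiplier=1.5):
    """One-sided IQR detection on LINEAR-scale per-sample medians.
    Only samples *below* Q1 - multiplier*IQR are flagged (low signal only)."""
    names = list(sample_medians)
    values = np.array([sample_medians[n] for n in names], dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    lower = q1 - iqr_multiplier * (q3 - q1)
    return [n for n, v in zip(names, values) if v < lower]

medians = {"S1": 1.8e6, "S2": 2.0e6, "S3": 1.9e6, "S4": 2.1e6, "S5": 24.0}
print(flag_low_signal_samples(medians))  # → ['S5']
```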

Tukey Median Polish (Default Rollup Method)

Both transition→peptide and peptide→protein rollups use Tukey median polish by default. This robust method:

  • Automatically handles outliers: Interfered transitions or problematic peptides are downweighted through the median operation
  • No pre-filtering required: Unlike quality-weighted methods, doesn't require explicit quality metrics
  • Produces interpretable effects: Separates row effects from column effects (sample abundance)
  • Handles missing values naturally: Works with incomplete matrices

What the row effects represent:

| Rollup Stage | Row Effects Represent | Column Effects Represent |
|---|---|---|
| Transition → Peptide | Transition interference (co-eluting analytes affecting specific transitions) | Peptide abundance per sample |
| Peptide → Protein | Peptide ionization efficiency (some peptides ionize better than others) | Protein abundance per sample |
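
A minimal median polish looks like this (a sketch of the algorithm, not PRISM's exact implementation): alternately sweep row and column medians into the effects until the residuals stabilize.

```python
import numpy as np

def median_polish(x, max_iter=10, tol=1e-6):
    """Minimal Tukey median polish: x ≈ overall + row_effect + col_effect + residual.
    NaNs (missing values) are ignored by the medians."""
    resid = np.array(x, dtype=float)
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        rmed = np.nanmedian(resid, axis=1)   # sweep row medians into row effects
        row += rmed
        resid -= rmed[:, None]
        cmed = np.nanmedian(resid, axis=0)   # sweep column medians into column effects
        col += cmed
        resid -= cmed[None, :]
        if max(np.max(np.abs(rmed)), np.max(np.abs(cmed))) < tol:
            break
    overall = np.median(row)                 # pull the common level out of the rows
    row -= overall
    return overall, row, col, resid

# 3 transitions x 4 samples (log2 scale); one interfered cell (+5)
x = np.array([[10.0, 11.0, 12.0, 13.0],
              [ 9.0, 10.0, 11.0, 12.0],
              [11.0, 12.0, 13.0, 19.0]])
overall, row, col, resid = median_polish(x)
# col holds per-sample abundance effects; the +5 interference is isolated in resid[2, 3]
```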

Alternative Rollup Methods

While median polish is recommended, these alternative methods are available:

Transition → Peptide Rollup (when using transition-level data):

| Method | Description | Use Case |
|---|---|---|
| sum | Simple sum of intensities | Default, fast, straightforward |
| consensus | Inverse-variance weighted sum | Down-weights transitions that deviate from the cross-sample pattern |
| median_polish | Tukey median polish | Robust to outliers, produces residuals for QC |
| adaptive | Learned weighted average | When quality metrics (intensity, m/z, ShapeCorrelation) are available |
| topn | Top N by correlation/intensity | When a subset of transitions are known to be reliable |
| library_assist | Spectral library-based rollup | When you have a high-quality spectral library and want to detect/exclude interfered transitions |

Peptide → Protein Rollup:

| Method | Description | Use Case |
|---|---|---|
| median_polish | Tukey median polish (default) | General use, robust to outliers |
| topn | Mean of top N most intense peptides | Simple, when top peptides are reliable |
| ibaq | iBAQ (intensity / theoretical peptide count) | Absolute quantification across proteins |
| maxlfq | MaxLFQ (MaxQuant-style) | When peptide ratios are more reliable than absolute values |
| sum | Simple sum of intensities | Fast but sensitive to outliers |

min_peptides Threshold:

The min_peptides setting controls the minimum number of peptides required for full-confidence method application:

protein_rollup:
  method: "median_polish"
  min_peptides: 3  # Default: 3

| Peptide Count | Method Used | low_confidence flag |
|---|---|---|
| 1 | Direct peptide value | True |
| < min_peptides | Sum (linear space) | True |
| ≥ min_peptides | Configured method | False |

Proteins below the threshold are still quantified using sum in linear space (consistent with median_polish output scale) and flagged with low_confidence=True.
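
The fallback rules amount to a small dispatch (an illustrative sketch; choose_rollup is hypothetical, not a PRISM function):

```python
def choose_rollup(n_peptides, min_peptides=3, configured="median_polish"):
    """Mirror the fallback rules above: below the threshold, quantify anyway
    but flag low confidence. Returns (method_used, low_confidence)."""
    if n_peptides == 1:
        return "direct_peptide_value", True
    if n_peptides < min_peptides:
        return "sum_linear", True
    return configured, False

print(choose_rollup(1))   # → ('direct_peptide_value', True)
print(choose_rollup(2))   # → ('sum_linear', True)
print(choose_rollup(5))   # → ('median_polish', False)
```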

iBAQ (Intensity-Based Absolute Quantification)

iBAQ normalizes protein abundances by the number of theoretical peptides, enabling comparison of absolute abundance across proteins:

protein_rollup:
  method: "ibaq"
  ibaq:
    fasta_path: "/path/to/database.fasta"  # Required for iBAQ
    enzyme: "trypsin"                       # Must match search settings
    missed_cleavages: 0                     # Typically 0 for counting

from skyline_prism.fasta import get_theoretical_peptide_counts

# Calculate theoretical peptide counts for iBAQ
counts = get_theoretical_peptide_counts(
    "/path/to/database.fasta",
    enzyme="trypsin",
    missed_cleavages=0,
)
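
For intuition, theoretical peptide counting can be sketched with a simple tryptic digest rule (cleave after K/R unless followed by P) and a quantifiable length window. The specific rules here (length 7-30, no missed cleavages) are common conventions assumed for illustration, not necessarily PRISM's exact settings:

```python
import re

def theoretical_peptide_count(sequence, min_len=7, max_len=30):
    """Count fully tryptic peptides (cleave after K/R, not before P)
    whose length falls in the quantifiable range."""
    # zero-width split after K or R unless the next residue is P
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return sum(min_len <= len(p) <= max_len for p in peptides)

# Only the 17-residue peptide falls in the 7-30 window
print(theoretical_peptide_count("MKWVTFISLLFLFSSAYSRGVFRRDAHK"))  # → 1
```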

Adaptive Rollup with Learned Weights

For transition→peptide rollup, the adaptive method learns optimal weighting parameters from reference samples to minimize CV:

transition_rollup:
  method: "adaptive"
  learn_adaptive_weights: true
  adaptive_rollup:
    beta_mz: 0.0                  # Starting value (optimized automatically)
    beta_shape_corr_outlier: 0.0  # Starting value (optimized automatically)
    shape_corr_low_threshold: 0.7 # Threshold for "low" shape correlation
    min_improvement_pct: 0.1      # Required improvement over sum to use adaptive weights

The weight formula:

w_t = exp(beta_mz * normalized_mz + beta_shape_corr_outlier * outlier_frac)

Where:

  • normalized_mz: m/z normalized to [0,1] range
  • outlier_frac: Fraction of samples where shape correlation < threshold (indicates interference)

Key insight: When all betas are 0, all weights equal 1 (simple sum baseline). The optimizer only uses learned weights if they improve CV.

What gets optimized:

  • beta_mz: Higher m/z fragments may have better signal (positive = favor high m/z)
  • beta_shape_corr_outlier: Transitions with frequent interference (low shape correlation) should be down-weighted (negative = penalize)

The peptide abundance calculation:

Peptide_abundance = log2(Σ weight_t × intensity_t)

The transition intensities are the VALUES being summed. The learned weights adjust how much each transition contributes based on its quality metrics (m/z and interference level).
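
Putting the two formulas together (a sketch; adaptive_weights is a hypothetical helper that mirrors the weight formula above):

```python
import numpy as np

def adaptive_weights(norm_mz, outlier_frac, beta_mz=0.0, beta_shape_corr_outlier=0.0):
    """w_t = exp(beta_mz * normalized_mz + beta_shape_corr_outlier * outlier_frac)."""
    return np.exp(beta_mz * np.asarray(norm_mz)
                  + beta_shape_corr_outlier * np.asarray(outlier_frac))

intensity = np.array([1000.0, 800.0, 600.0])
norm_mz = np.array([0.2, 0.5, 0.9])          # m/z scaled to [0, 1]
outlier_frac = np.array([0.0, 0.05, 0.60])   # 3rd transition frequently interfered

# All betas at 0 -> every weight is 1 -> plain-sum baseline
w0 = adaptive_weights(norm_mz, outlier_frac)
baseline = np.log2((w0 * intensity).sum())

# Learned betas: penalize frequent low shape correlation, slightly favor high m/z
w = adaptive_weights(norm_mz, outlier_frac, beta_mz=0.3, beta_shape_corr_outlier=-2.0)
peptide_abundance = np.log2((w * intensity).sum())
```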

Learning process:

  1. Parameters are optimized on reference samples by minimizing median CV
  2. Results are validated on QC samples (held-out) to prevent overfitting
  3. Automatic fallback to simple sum if adaptive doesn't improve CV by min_improvement_pct

Consensus Rollup with Cross-Sample Weighting

The consensus method uses cross-sample consistency to weight transitions. It assumes that all transitions from a peptide should show the same fold-change pattern across samples; transitions that deviate from this consensus are down-weighted.

transition_rollup:
  method: "consensus"
  min_transitions: 3

Algorithm:

  1. Model: log2(T_ij) = α_i (transition offset) + β_j (sample abundance) + ε_ij
  2. Estimate initial β using row-centered medians (robust to outliers)
  3. Calculate residuals: ε_ij = T_ij - α_i - β_j
  4. Weight transitions: w_i = 1 / (variance(ε_i) + regularization)
  5. Return weighted sum in linear space

Key insight: Transitions with high cross-sample residual variance are likely interfered. This method requires no external data (no spectral library or shape correlation) - it learns weights purely from the data itself.

When to use: Best for DDA data or when quality metrics like shape correlation are not available.
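
The five steps above can be sketched directly (illustrative only; weights are normalized here so the result is a weighted average, which may differ from PRISM's exact scaling):

```python
import numpy as np

def consensus_rollup(T, reg=0.01):
    """Consensus rollup sketch. T: transitions x samples matrix, linear scale."""
    L = np.log2(T)
    alpha = np.median(L, axis=1)              # step 1-2: transition offsets
    centered = L - alpha[:, None]
    beta = np.median(centered, axis=0)        # robust per-sample abundance estimate
    resid = centered - beta[None, :]          # step 3: residuals
    w = 1.0 / (resid.var(axis=1) + reg)       # step 4: inverse-variance weights
    w /= w.sum()
    return (w[:, None] * T).sum(axis=0)       # step 5: weighted combination, linear space

# Transition 3 deviates from the shared fold-change pattern (interference in sample 2)
T = np.array([[100.0, 200.0, 400.0],
              [ 50.0, 100.0, 200.0],
              [ 80.0, 800.0, 320.0]])
abundance = consensus_rollup(T)   # interfered transition contributes almost nothing
```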

Library-Assisted Rollup with Spectral Library

For transition→peptide rollup, the library_assist method uses a spectral library to detect and exclude interfered transitions. This is particularly effective when co-eluting peptides contribute signal to specific fragments.

transition_rollup:
  method: "library_assist"
  library_assist:
    library_path: "/path/to/library.blib"  # or .tsv (Carafe format)
    fitting_method: "median_polish"  # RECOMMENDED (or "least_squares")
    mz_tolerance: 0.02               # m/z tolerance for matching (Da)
    min_matched_fragments: 3         # Minimum fragments for valid fit
    outlier_threshold: 1.0           # 1.0 = obs > 2x predicted is outlier

Two fitting methods available:

| Method | Description | When to Use |
|---|---|---|
| median_polish | Uses the MEDIAN to estimate scale; robust to 1-2 outliers automatically | Recommended for most data |
| least_squares | Classic OLS fitting; more sensitive to outliers | Very clean data |

Algorithm (median_polish, default):

  1. Match observed transitions to library fragments by m/z
  2. Model: log(observed) = log(library) + scale_factor + noise
  3. Estimate scale via MEDIAN of log-ratios (robust to up to 50% outliers)
  4. Flag transitions with HIGH positive residuals as interfered (signal > expected)
  5. Remove worst outlier and refit until convergence
  6. Final abundance = exp(scale_factor) * sum(ALL library intensities)

Key insight: Only high residuals indicate interference - because interference can only add signal, never remove it. Low or negative residuals are valid (low abundance or noise).
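
A sketch of the median fit with one-sided outlier removal (assuming log2 units, consistent with the outlier_threshold comment that 1.0 means observed > 2× predicted; not PRISM's exact code):

```python
import numpy as np

def library_scale_fit(observed, library, outlier_threshold=1.0, max_remove=2):
    """Estimate one scale factor via the median of log2-ratios, iteratively
    dropping transitions whose signal is far ABOVE prediction (interference
    can only add signal, so only high residuals are flagged)."""
    obs = np.asarray(observed, dtype=float)
    lib = np.asarray(library, dtype=float)
    keep = np.ones(len(obs), dtype=bool)
    scale = 0.0
    for _ in range(max_remove + 1):
        scale = np.median(np.log2(obs[keep]) - np.log2(lib[keep]))  # robust scale
        resid = np.log2(obs) - np.log2(lib) - scale
        high = keep & (resid > outlier_threshold)
        if not high.any():
            break
        keep[np.where(resid == resid[high].max())[0][0]] = False    # drop worst outlier
    # final abundance rescales the FULL library, including excluded fragments
    return (2.0 ** scale) * lib.sum(), keep

library = np.array([1.0, 0.5, 0.25, 0.1])
observed = 2000.0 * library
observed[1] *= 8.0                      # interference adds 8x signal to one fragment
abundance, kept = library_scale_fit(observed, library)
```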

Supported library formats:

  • BLIB (Skyline): SQLite-based format with zlib-compressed peak arrays
  • Carafe TSV (DIA-NN): Tab-separated format with fragment annotations

Performance: The implementation uses vectorized least squares (BLAS matrix operations) to process all samples in parallel, providing ~10x speedup on large datasets.

Note: For very large cohorts (hundreds to thousands of samples), consider directLFQ which uses a different "intensity trace" algorithm with linear runtime scaling. Our maxlfq implementation uses the original pairwise ratio approach which scales quadratically with sample count.
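
For intuition, the pairwise-ratio approach can be sketched as follows (an illustration of why cost grows with the number of sample pairs, not PRISM's implementation): compute the median peptide log-ratio for every sample pair, then reconcile the ratios by least squares.

```python
import numpy as np

def maxlfq_pairwise(L):
    """Pairwise-ratio MaxLFQ sketch. L: peptides x samples log2 matrix (NaN = missing).
    One equation per sample pair -> quadratic growth in sample count."""
    n = L.shape[1]
    rows, rhs = [], []
    for j in range(n):
        for k in range(j + 1, n):
            d = L[:, k] - L[:, j]
            d = d[~np.isnan(d)]
            if len(d) == 0:
                continue
            r = np.zeros(n)
            r[k], r[j] = 1.0, -1.0
            rows.append(r)
            rhs.append(np.median(d))            # robust pairwise log-ratio
    # anchor the rank-deficient system: abundances sum to the observed mean level
    rows.append(np.ones(n))
    rhs.append(float(np.sum(np.nanmean(L, axis=0))))
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol

L = np.array([[10.0, 11.0, 12.0],
              [ 8.0,  9.0, 10.0],
              [ 9.0, np.nan, 11.0]])
print(maxlfq_pairwise(L))
```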

ComBat Batch Correction

PRISM includes a full implementation of the ComBat algorithm (Johnson et al. 2007) for empirical Bayes batch correction:

from skyline_prism import combat, combat_from_long, combat_with_reference_samples

# For wide-format data (features × samples)
corrected = combat(data, batch, covar_mod=covariates)

# For long-format data (as used in PRISM pipeline)
corrected_df = combat_from_long(
    data, 
    feature_col='peptide_modified',
    sample_col='replicate_name',
    batch_col='batch',
    abundance_col='abundance'
)

# With automatic evaluation using reference/QC samples
result = combat_with_reference_samples(
    data,
    sample_type_col='sample_type',
    # ... other parameters
)

Key features:

  • Parametric and non-parametric prior options
  • Reference batch support (adjust other batches to match reference)
  • Mean-only correction option (for unequal batch sizes)
  • Automatic evaluation using reference and QC sample CVs

Residual Analysis for Proteoform Discovery

Median polish produces residuals that capture deviations from the additive model. Following Plubell et al. 2022, peptides with large residuals should not be automatically discarded - they may indicate biologically interesting variation:

  • Proteoform differences: Different forms of the same protein (splice variants, truncations)
  • Post-translational modifications: PTMs affecting specific peptides
  • Protein processing: Cleavage products (e.g., amyloid-beta from APP)
  • Technical outliers: Interference or poor peak picking

Accessing residuals:

from skyline_prism import rollup_to_proteins, extract_peptide_residuals

# Protein rollup returns median polish results and topn results
protein_df, polish_results, topn_results = rollup_to_proteins(data, protein_groups)

# Extract residuals in long format for output
peptide_residuals = extract_peptide_residuals(polish_results)

# Columns include:
# - protein_group_id, peptide, replicate_name
# - residual: raw residual for each peptide/sample
# - residual_mad: robust measure of peptide's overall deviation
# - residual_max_abs: maximum deviation across samples

# Find peptides that consistently deviate
outliers = peptide_residuals.groupby(['protein_group_id', 'peptide']).agg({
    'residual_mad': 'first'
}).reset_index()
outliers = outliers[outliers['residual_mad'] > 0.5]  # Threshold of your choice

Core Rollup API:

For advanced use cases, the rollup_protein_matrix function provides direct access to the core protein rollup logic:

from skyline_prism import rollup_protein_matrix
import pandas as pd

# Create a peptide x sample matrix (log2 scale)
matrix = pd.DataFrame({
    'Sample1': [12.5, 14.2, 13.1],
    'Sample2': [12.8, 14.5, 13.4],
}, index=['PeptideA', 'PeptideB', 'PeptideC'])

# Roll up to protein-level abundance
result = rollup_protein_matrix(
    matrix,
    method='median_polish',
    min_peptides=3,
)

# Result contains:
# - result.abundances: pd.Series of sample abundances (log2)
# - result.residuals: dict of peptide -> sample -> residual
# - result.polish_result: MedianPolishResult (for median_polish method)
# - result.topn_result: TopNResult (for topn method)

For transition-level residuals:

from skyline_prism import rollup_transitions_to_peptides, extract_transition_residuals

# Use median_polish method for transition rollup
result = rollup_transitions_to_peptides(data, method='median_polish')

# Extract transition residuals
transition_residuals = extract_transition_residuals(result)

Shared Peptide Handling

Three strategies available:

  • all_groups (default, recommended): Apply shared peptides to ALL protein groups. Acknowledges proteoform complexity; avoids assumptions based on FASTA annotations.

  • unique_only: Only use peptides unique to a single protein group. Most conservative.

  • razor: Assign shared peptides to group with most peptides (MaxQuant-style).

Python API

from skyline_prism import (
    # Data I/O
    load_skyline_report,
    merge_skyline_reports,
    load_sample_metadata,
    
    # Normalization
    normalize_pipeline,
    rt_correction_from_reference,
    
    # Rollup
    tukey_median_polish,
    rollup_protein_matrix,  # Core protein rollup function
    rollup_to_proteins,
    rollup_transitions_to_peptides,
    
    # Batch correction
    combat,
    combat_from_long,
    combat_with_reference_samples,
    
    # Protein inference
    compute_protein_groups,
    
    # Validation
    validate_correction,
    generate_qc_report,
    
    # Visualization
    plot_intensity_distribution,
    plot_normalization_comparison,
    plot_pca,
    plot_comparative_pca,
    plot_control_correlation_heatmap,
    plot_cv_distribution,
    plot_comparative_cv,
    plot_sample_correlation_matrix,
)

Data Visualization

PRISM provides visualization functions for QC assessment and normalization evaluation. Install visualization dependencies with:

pip install skyline-prism[viz]

Intensity Distribution

Compare sample intensity distributions before/after normalization:

from skyline_prism import plot_intensity_distribution, plot_normalization_comparison

# Box plot of intensity distributions
plot_intensity_distribution(
    data,
    sample_types={"Sample_1": "reference", "Sample_2": "qc", ...},
    title="Intensity Distribution"
)

# Side-by-side comparison of normalization effect
plot_normalization_comparison(
    data_before=raw_data,
    data_after=normalized_data,
    title="Normalization Effect"
)

PCA Analysis

Visualize batch effects and normalization impact with PCA:

from skyline_prism import plot_pca, plot_comparative_pca

# Single PCA plot with sample grouping
fig, pca_df = plot_pca(
    data,
    sample_groups={"Sample_1": "Batch_A", "Sample_2": "Batch_B", ...}
)

# Comparative PCA: Original → Normalized → Batch Corrected
plot_comparative_pca(
    data_original=raw_data,
    data_normalized=normalized_data,
    data_batch_corrected=corrected_data,  # Optional
    sample_groups=sample_batches,
    figsize=(18, 6)
)

Control Sample Correlation

Assess reproducibility using correlation heatmaps for control samples:

from skyline_prism import plot_control_correlation_heatmap, plot_sample_correlation_matrix

# Correlation heatmap for reference and QC samples only
fig, corr_matrix = plot_control_correlation_heatmap(
    data,
    sample_type_col="sample_type",
    control_types=["reference", "qc"],
    method="pearson"
)

# Full sample correlation matrix
fig, corr_matrix = plot_sample_correlation_matrix(
    data,
    sample_types=sample_type_dict
)

CV Distribution

Evaluate precision using CV distributions for control samples:

from skyline_prism import plot_cv_distribution, plot_comparative_cv

# CV distribution histogram for controls
fig, cv_data = plot_cv_distribution(
    data,
    sample_type_col="sample_type",
    control_types=["reference", "qc"],
    cv_threshold=20.0
)

# Compare CV before/after normalization
plot_comparative_cv(
    data_before=raw_data,
    data_after=normalized_data,
    sample_type_col="sample_type",
    control_type="reference"
)

RT Correction Quality Assessment

Visualize RT-dependent correction effectiveness. This is critical for validating that corrections learned from reference samples generalize to held-out QC samples:

from skyline_prism import plot_rt_correction_comparison, plot_rt_correction_per_sample

# 2×2 comparison showing reference (fitted) vs QC (held-out validation)
# before and after RT correction
fig, axes = plot_rt_correction_comparison(
    data_before=data_before_correction,
    data_after=data_after_correction,
    sample_type_col="sample_type",
    reference_mean=reference_mean_df,  # Mean abundance per peptide from reference
    figsize=(14, 10)
)

# Per-sample before/after comparison
fig, axes = plot_rt_correction_per_sample(
    data_before=data_before_correction,
    data_after=data_after_correction,
    sample_type_col="sample_type",
    reference_mean=reference_mean_df,
    samples_per_type=3,  # Number of samples to show per type
    figsize=(16, 12)
)

The RT correction plots help assess:

  • Whether the spline model captures RT-dependent variation in reference samples
  • Whether corrections generalize to QC samples (held-out validation)
  • Whether any samples have unusually large residuals after correction

All visualization functions support:

  • show_plot=False to return the figure object for further customization
  • save_path="/path/to/figure.png" to save directly to file

Citation

If you use Skyline-PRISM, please cite:

Tsantilas KA et al. "A framework for quality control in quantitative proteomics." J Proteome Res. 2024. DOI: 10.1021/acs.jproteome.4c00363

Related Projects

  • Skyline - Targeted mass spectrometry environment
  • Panorama - Repository for Skyline documents

License

MIT License - see LICENSE file.

