Proteomics analysis toolkit for mass spectrometry data

Proteomics Analysis Toolkit

A Python toolkit for analyzing mass spectrometry-based proteomics data, supporting both Skyline CSV and PRISM parquet workflows.

Features

Core Analysis Modules

  • data_import: Load Skyline CSV or PRISM parquet data, handle batch suffixes, manage sample metadata
  • preprocessing: Protein identifier parsing, sample classification, data quality assessment
  • normalization: Seven normalization methods (median, VSN, quantile, MAD, z-score, RLR, LOESS)
  • statistical_analysis: Differential protein analysis — t-tests, Wilcoxon, Mann-Whitney, mixed-effects models
  • visualization: Publication-ready plots — volcano, PCA, box plots, heatmaps, correlation, trajectories
  • enrichment: Gene set enrichment via Enrichr API
  • temporal_clustering: K-means or hierarchical clustering of temporal protein trends
  • validation: Metadata/data consistency checking with diagnostic reports
  • export: Standardized result export with timestamped configs

Installation

# Install from PyPI
pip install proteomics-toolkit

# With XGBoost support (for classification module)
pip install "proteomics-toolkit[xgboost]"  # quotes keep shells such as zsh from expanding the brackets

# Install from GitHub (latest development version)
pip install git+https://github.com/uw-maccosslab/proteomics-toolkit.git

# For development (editable install from local clone)
git clone https://github.com/uw-maccosslab/proteomics-toolkit.git
cd proteomics-toolkit
pip install -e .

Quick Start

PRISM Workflow (recommended for batch-corrected data)

import proteomics_toolkit as ptk
import pandas as pd

# 1. Load PRISM data
protein_data, metadata, sample_cols = ptk.load_prism_data(
    'PRISM-Output/corrected_proteins.parquet',
    'PRISM-Output/sample_metadata.csv',
)

# 2. Map batch-suffixed column names to short replicate IDs
col_map = ptk.strip_batch_suffix(sample_cols)  # {full_col: short_name}
short_to_col = {v: k for k, v in col_map.items()}

# 3. Build sample metadata dict (keys = full PRISM column names)
meta_dict = {}
for _, row in metadata.iterrows():
    full_col = short_to_col.get(row['Replicate'])
    if full_col:
        meta_dict[full_col] = row.to_dict()

# 4. Filter low-confidence proteins
protein_data_filtered = protein_data[~protein_data['low_confidence']].copy()

# 5. Build annotation + sample data for stats
annot = protein_data_filtered[[
    'leading_protein', 'leading_description', 'leading_gene_name',
    'leading_uniprot_id', 'leading_name'
]].copy()
annot.columns = ['Protein', 'Description', 'Protein Gene', 'UniProt_Accession', 'UniProt_Entry_Name']
data = pd.concat([annot.reset_index(drop=True),
                   protein_data_filtered[sample_cols].reset_index(drop=True)], axis=1)
data.index = data['Protein']  # accession as index

# 6. Statistical analysis
config = ptk.StatisticalConfig()
config.analysis_type = 'unpaired'
config.statistical_test_method = 'welch_t'
config.group_column = 'Group'
config.group_labels = ['Control', 'Treatment']  # [reference, study]
config.correction_method = 'fdr_bh'
config.p_value_threshold = 0.05
config.fold_change_threshold = 1.0
config.log_transform_before_stats = True
config.validate()

results = ptk.run_comprehensive_statistical_analysis(
    data, meta_dict, config, protein_annotations=annot
)

# 7. Visualization
ptk.plot_volcano(results, fc_threshold=1.0, gene_column='Protein Gene', label_top_n=15)
ptk.display_analysis_summary(results, config)

# 8. Enrichment
enrich_config = ptk.EnrichmentConfig(
    enrichr_libraries=['GO_Biological_Process_2023', 'KEGG_2021_Human'],
    pvalue_cutoff=0.05,
)
enrich = ptk.run_differential_enrichment(
    results, gene_column='Protein Gene', logfc_column='logFC',
    pvalue_column='adj.P.Val', config=enrich_config,
)
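
The thresholds set in step 6 (`p_value_threshold = 0.05`, `fold_change_threshold = 1.0`) are the kind of cutoffs a volcano plot uses to separate hits from background. The sketch below shows one common way to apply them; it is not the toolkit's implementation. `classify_hit` is a hypothetical helper, the rows are made-up values, and `logFC` is assumed to be a log2 fold change of study over reference.

```python
def classify_hit(log_fc, adj_p, p_threshold=0.05, fc_threshold=1.0):
    """Label one protein as 'up', 'down', or 'ns' (not significant).

    Hypothetical helper: assumes log_fc is on a log2 scale and adj_p is
    a multiple-testing-adjusted p-value.
    """
    if adj_p >= p_threshold or abs(log_fc) < fc_threshold:
        return "ns"
    return "up" if log_fc > 0 else "down"


# Toy rows shaped like the documented output columns (values are made up).
rows = [
    {"Protein": "P1", "logFC": 1.8, "adj.P.Val": 0.001},
    {"Protein": "P2", "logFC": -1.2, "adj.P.Val": 0.03},
    {"Protein": "P3", "logFC": 2.5, "adj.P.Val": 0.20},
    {"Protein": "P4", "logFC": 0.4, "adj.P.Val": 0.001},
]
labels = {r["Protein"]: classify_hit(r["logFC"], r["adj.P.Val"]) for r in rows}
print(labels)
```

A protein has to pass both cutoffs to count as a hit: P3 has a large fold change but fails the p-value cutoff, and P4 is significant but changes too little.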

Skyline CSV Workflow

import proteomics_toolkit as ptk

# 1. Load data
protein_data, metadata, peptide_data = ptk.load_skyline_data(
    protein_file='protein_quant.csv',
    metadata_file='metadata.csv',
)

# 2. Process sample names
sample_columns = ptk.data_import.identify_sample_columns(protein_data, metadata)
cleaned_names = ptk.clean_sample_names(sample_columns)

# 3. Parse annotations and filter
processed_data = ptk.parse_protein_identifiers(protein_data)

# 4. Normalize (skip for PRISM — already normalized)
normalized = ptk.median_normalize(processed_data, sample_columns=list(cleaned_names.values()))

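# 5. QC plots
# Note: `sample_metadata` is not created above. It is assumed here to be a
# per-sample metadata dict built from `metadata`, as in step 3 of the PRISM workflow.
ptk.plot_box_plot(normalized, list(cleaned_names.values()), sample_metadata)
ptk.plot_pca(normalized, list(cleaned_names.values()), sample_metadata)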

Statistical Analysis

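Every statistical analysis follows the same pattern: create a StatisticalConfig, set its attributes, call config.validate(), then pass the config to run_comprehensive_statistical_analysis().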

Unpaired comparison (two independent groups)

config = ptk.StatisticalConfig()
config.analysis_type = 'unpaired'
config.statistical_test_method = 'welch_t'  # or 'mann_whitney'
config.group_column = 'Group'
config.group_labels = ['Control', 'Treatment']
config.log_transform_before_stats = 'auto'
config.validate()

results = ptk.run_comprehensive_statistical_analysis(
    data, sample_metadata, config, protein_annotations=annot
)

Paired comparison (before/after per subject)

config = ptk.StatisticalConfig()
config.analysis_type = 'paired'
config.statistical_test_method = 'paired_t'
config.subject_column = 'Subject'
config.paired_column = 'Condition'
config.paired_label1 = 'Before'
config.paired_label2 = 'After'
config.group_column = 'Condition'
config.group_labels = ['Before', 'After']
config.validate()
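
A paired analysis compares each subject with itself, so the test works on per-subject differences rather than on the two groups as a whole. The sketch below computes the paired t statistic with the standard library only. It illustrates the general technique and is not the toolkit's code; `paired_t_statistic` is a hypothetical helper, and the values are made-up log2 abundances.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (mean difference, t statistic) for matched samples.

    Both lists must be ordered by subject, so before[i] and after[i]
    belong to the same subject. Values are assumed to be log2 abundances.
    """
    if len(before) != len(after):
        raise ValueError("paired data need one 'before' and one 'after' value per subject")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err


# Made-up log2 abundances for one protein in four subjects.
before = [10.0, 11.0, 9.5, 10.5]
after = [11.0, 12.5, 10.0, 12.5]
mean_diff, t_stat = paired_t_statistic(before, after)
print(f"mean log2 difference = {mean_diff:.2f}, t = {t_stat:.3f}")
```

Because the differences are taken within subjects, between-subject variation in baseline abundance drops out of the standard error.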

Mixed-effects model (repeated measures)

config = ptk.StatisticalConfig()
config.analysis_type = 'paired'
config.statistical_test_method = 'mixed_effects'
config.subject_column = 'Subject'
config.paired_column = 'Visit'
config.paired_label1 = 'Baseline'
config.paired_label2 = 'Follow-up'
config.group_column = 'Treatment'
config.group_labels = ['Placebo', 'Drug']
config.interaction_terms = ['Treatment', 'Visit']
config.validate()

Output columns: Protein, logFC, P.Value, adj.P.Val, AveExpr, t, Protein Gene, Description, UniProt_Accession, Gene
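
The `adj.P.Val` column holds p-values after multiple-testing correction. With `correction_method = 'fdr_bh'`, the name suggests the Benjamini-Hochberg procedure (this is how statsmodels labels it). The plain-Python sketch below shows what that procedure computes. `benjamini_hochberg` is a hypothetical helper written for illustration, and how the toolkit performs the correction internally is not stated here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank 1 = smallest p-value
        # Scale by n / rank, then enforce monotonicity from the top down.
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up raw p-values
adj = benjamini_hochberg(raw)
print([round(a, 4) for a in adj])
```

Adjusted values are never smaller than the raw ones, and the ordering of proteins by significance is preserved.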

Enrichment

Enrichment results use these column names (not the Enrichr web-UI names):

Column          Description
Term            Pathway / GO term name
P_Value         Unadjusted p-value
Adj_P_Value     BH-adjusted p-value
Z_Score         Enrichr z-score
Combined_Score  log(p) × z, used for ranking
Genes           Semicolon-separated gene list
N_Genes         Number of overlapping genes
Library         Source Enrichr library
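
To make the derived columns concrete, the sketch below recomputes `Combined_Score` and `N_Genes` from a single result row. It assumes the natural logarithm for log(p), and `summarize_term` is a hypothetical helper; the row values are made up and do not come from a real Enrichr query.

```python
import math


def summarize_term(row):
    """Derive Combined_Score and N_Genes from one enrichment result row."""
    genes = [g for g in row["Genes"].split(";") if g]
    return {
        "Term": row["Term"],
        # Assumption: natural log; p < 1 and z < 0 give a positive score.
        "Combined_Score": math.log(row["P_Value"]) * row["Z_Score"],
        "N_Genes": len(genes),
    }


row = {"Term": "example term", "P_Value": 0.001, "Z_Score": -2.0,
       "Genes": "TP53;BRCA1;EGFR"}
summary = summarize_term(row)
print(summary)
```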

Dependencies

  • pandas >= 1.3.0
  • numpy >= 1.21.0
  • scipy >= 1.7.0
  • matplotlib >= 3.4.0
  • seaborn >= 0.11.0
  • scikit-learn >= 1.0.0
  • statsmodels >= 0.12.0
  • requests >= 2.25.0 (for Enrichr API)
  • pyarrow >= 8.0.0 (for PRISM parquet files)

Module Reference

data_import.py

  • load_skyline_data() — Load Skyline protein/peptide CSVs + metadata
  • load_prism_data() — Load PRISM parquet + metadata
  • identify_sample_columns() — Auto-detect sample columns
  • clean_sample_names() — Remove common prefixes/suffixes
  • detect_batch_suffix() — Detect PRISM __@__ batch suffix
  • strip_batch_suffix() — Map batch-suffixed names → short names
  • create_sample_column_mapping() — Map data columns to metadata sample names
  • match_samples_to_metadata() — Link samples to metadata rows
  • BATCH_SUFFIX_DELIMITER — Constant: "__@__"
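
The batch-suffix helpers revolve around the `"__@__"` delimiter. The sketch below shows the kind of mapping `strip_batch_suffix()` is described as returning (`{full_col: short_name}`), assuming the batch label is everything after the delimiter. `strip_suffix_sketch` is a hypothetical stand-in, not the library function, and the column names are invented.

```python
BATCH_SUFFIX_DELIMITER = "__@__"  # value quoted from the list above


def strip_suffix_sketch(columns, delimiter=BATCH_SUFFIX_DELIMITER):
    """Map each full column name to the part before the batch suffix.

    Hypothetical stand-in for illustration; columns without the
    delimiter map to themselves.
    """
    return {col: col.split(delimiter, 1)[0] for col in columns}


cols = ["Sample_01__@__batchA", "Sample_02__@__batchB", "Pool_1"]
col_map = strip_suffix_sketch(cols)
# Reverse lookup, as used in step 2 of the PRISM workflow.
short_to_col = {short: full for full, short in col_map.items()}
print(col_map)
```

Note that the reverse lookup keeps only one entry per short name, so it is only safe when replicate IDs are unique across batches.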

preprocessing.py

  • parse_protein_identifiers() — Extract UniProt accessions and databases
  • parse_gene_and_description() — Parse gene names from descriptions
  • classify_samples() — Classify samples into groups / controls with color assignment
  • apply_systematic_color_scheme() — Generate consistent group colors
  • create_standard_data_structure() — Build standard 5-column annotation + sample layout
  • assess_data_completeness() — Evaluate missing data patterns
  • filter_proteins_by_completeness() — Remove proteins below detection threshold
  • calculate_group_colors() — Generate group color mapping
  • identify_annotation_columns() — Auto-detect annotation vs sample columns
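
Completeness filtering, as named in the list above, can be pictured as follows: compute the fraction of samples in which each protein has a value, then drop proteins below a cutoff. This is a minimal sketch under that assumption. `completeness`, `filter_by_completeness`, and the 0.5 default are hypothetical, and the intensities are made up.

```python
import math


def completeness(values):
    """Fraction of values that are present (not None and not NaN)."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return len(present) / len(values)


def filter_by_completeness(table, min_fraction=0.5):
    """Keep proteins detected in at least min_fraction of samples."""
    return {prot: vals for prot, vals in table.items()
            if completeness(vals) >= min_fraction}


nan = float("nan")
table = {
    "P1": [10.2, 10.5, 9.9, 10.1],   # complete
    "P2": [8.1, nan, nan, 8.4],      # 50% present
    "P3": [nan, nan, nan, 7.7],      # 25% present
}
kept = filter_by_completeness(table, min_fraction=0.5)
print(sorted(kept))
```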

normalization.py

  • median_normalize() — Median-based normalization (preserves original scale)
  • vsn_normalize() — Variance Stabilizing Normalization (arcsinh-transformed)
  • quantile_normalize() — Force identical distributions
  • mad_normalize() — Median absolute deviation normalization
  • z_score_normalize() — Standardize to mean=0, sd=1
  • rlr_normalize() — Robust linear regression (log2-transformed)
  • loess_normalize() — LOESS intensity-dependent (log2-transformed)
  • handle_negative_values() — Handle negative values from VSN
  • analyze_negative_values() — Analyze negative value patterns
  • calculate_normalization_stats() — Evaluate normalization effectiveness
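
As an illustration of the simplest method in this list, the sketch below rescales each sample so that all sample medians match a common target. It assumes "preserves original scale" means multiplying by a per-sample factor rather than log-transforming; the toolkit's `median_normalize()` may differ in its details. `median_normalize_sketch` is a hypothetical name and the intensities are made up.

```python
import statistics


def median_normalize_sketch(samples):
    """Scale each sample so that all sample medians become equal.

    `samples` maps sample name -> list of raw (non-log) intensities.
    The common target is the median of the per-sample medians, which
    keeps values on roughly the original intensity scale.
    """
    medians = {name: statistics.median(vals) for name, vals in samples.items()}
    target = statistics.median(medians.values())
    return {name: [v * target / medians[name] for v in vals]
            for name, vals in samples.items()}


samples = {
    "S1": [100.0, 200.0, 300.0],
    "S2": [200.0, 400.0, 600.0],   # same profile, loaded at 2x
    "S3": [50.0, 100.0, 150.0],    # same profile, loaded at 0.5x
}
normalized = median_normalize_sketch(samples)
print(normalized)
```

In this toy case the three samples differ only by loading amount, so they become identical after normalization.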

statistical_analysis.py

  • StatisticalConfig — Configuration class (zero-arg constructor, set attributes individually)
  • run_comprehensive_statistical_analysis() — Main analysis entry point
  • display_analysis_summary() — Print/return summary of results
  • run_statistical_analysis() — Backward-compatible wrapper

visualization.py

  • plot_box_plot() — Sample intensity distributions by group
  • plot_volcano() — Volcano plot with labeled top hits
  • plot_pca() — PCA with group coloring, optional log-transform
  • plot_comparative_pca() — Compare PCA across normalization methods
  • plot_normalization_comparison() — Before/after normalization QC
  • plot_sample_correlation_heatmap() — Full correlation matrix
  • plot_sample_correlation_triangular_heatmap() — Lower-triangle correlation
  • plot_control_correlation() — Control sample correlation with optional clustering
  • plot_control_correlation_analysis() — Multi-panel control QC
  • plot_control_group_correlation_analysis() — Group-wise control QC
  • plot_individual_control_pool_analysis() — Individual control analysis
  • plot_control_cv_distribution() — CV distribution for control samples
  • plot_grouped_heatmap() — Heatmap for any grouped data
  • plot_grouped_trajectories() — Line plots for temporal/dose-response data
  • plot_protein_profile() — Single protein expression profile
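
The control CV plot summarizes a per-protein coefficient of variation across control replicates. A minimal sketch of that quantity follows, assuming CV is computed as 100 × sd / mean on raw (non-log) intensities. `percent_cv` is a hypothetical helper and the intensities are made up.

```python
import statistics


def percent_cv(values):
    """Coefficient of variation in percent: 100 * sd / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Made-up raw intensities for two proteins across three control replicates.
controls = {
    "P1": [90.0, 100.0, 110.0],
    "P2": [50.0, 100.0, 150.0],
}
cvs = {prot: percent_cv(vals) for prot, vals in controls.items()}
median_cv = statistics.median(cvs.values())
print(cvs, median_cv)
```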

enrichment.py

  • EnrichmentConfig — Configuration dataclass (libraries, thresholds, API settings)
  • query_enrichr() — Query Enrichr API with a gene list
  • parse_enrichr_results() — Parse raw results into a tidy DataFrame
  • run_enrichment_analysis() — Complete enrichment on a gene list
  • run_enrichment_by_group() — Enrichment for each group in a DataFrame
  • run_differential_enrichment() — Split by up/down-regulated, run enrichment on each
  • plot_enrichment_barplot() — Horizontal bar plot by Combined Score
  • plot_enrichment_comparison() — Dot plot comparing enrichment across groups
  • get_available_libraries() — List common Enrichr libraries
  • merge_enrichment_results() — Merge multiple enrichment DataFrames

temporal_clustering.py

  • TemporalClusteringConfig — Configuration dataclass
  • run_temporal_analysis() — Complete pipeline: clustering → visualization → enrichment
  • calculate_temporal_means() — Mean abundance per timepoint across subjects
  • cluster_temporal_trends() — K-means or hierarchical clustering
  • name_clusters_by_pattern() — Assign descriptive cluster names
  • classify_trend_pattern() — Classify individual protein trends
  • merge_with_statistics() — Merge temporal data with statistical results
  • filter_significant_proteins() — Filter to significant proteins
  • run_enrichment_by_cluster() — Enrichment per cluster
  • plot_cluster_heatmap() — Cluster-organized heatmap
  • plot_cluster_parallel_coordinates() — Parallel coordinate plots
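
Before clustering, each protein is reduced to one value per timepoint. The sketch below shows that averaging step for a single protein, following the description of `calculate_temporal_means()` above. The tuple layout and the name `temporal_means_sketch` are assumptions made for illustration, and the abundances are made up.

```python
from collections import defaultdict
from statistics import mean


def temporal_means_sketch(measurements):
    """Average one protein's abundance per timepoint across subjects.

    `measurements` is a list of (subject, timepoint, abundance) tuples.
    Returns {timepoint: mean abundance}, ordered by timepoint.
    """
    by_time = defaultdict(list)
    for _subject, timepoint, abundance in measurements:
        by_time[timepoint].append(abundance)
    return {t: mean(by_time[t]) for t in sorted(by_time)}


measurements = [
    ("S1", 0, 10.0), ("S2", 0, 12.0),
    ("S1", 1, 11.0), ("S2", 1, 13.0),
    ("S1", 2, 14.0), ("S2", 2, 16.0),
]
profile = temporal_means_sketch(measurements)
print(profile)
```

The resulting per-protein profiles are what a clustering step would then group by shape.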

validation.py

  • validate_metadata_data_consistency() — Check metadata matches data columns
  • enhanced_sample_processing() — Sample processing with validation
  • generate_sample_matching_diagnostic_report() — Detailed mismatch diagnostics
  • SampleMatchingError — Exception for sample matching failures
  • ControlSampleError — Exception for control sample configuration issues
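
At its core, a metadata/data consistency check compares two sets of sample names. The sketch below shows that comparison in isolation. `compare_samples` and the report keys are hypothetical and do not reflect the structure returned by the toolkit's validation functions.

```python
def compare_samples(data_columns, metadata_samples):
    """Report sample names that appear on only one side."""
    data_set, meta_set = set(data_columns), set(metadata_samples)
    return {
        "matched": sorted(data_set & meta_set),
        "missing_from_metadata": sorted(data_set - meta_set),
        "missing_from_data": sorted(meta_set - data_set),
    }


report = compare_samples(
    data_columns=["S1", "S2", "S3"],
    metadata_samples=["S1", "S2", "S4"],
)
print(report)
```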

export.py

  • export_complete_analysis() — Full export: data + config + results
  • export_analysis_results() — Export normalized data + differential results
  • export_timestamped_config() — Save analysis config with timestamp
  • create_config_dict_from_notebook_vars() — Build config dict from notebook variables
  • export_significant_proteins_summary() — Export significant results summary
  • export_results() — General-purpose result export
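
Saving the configuration next to the results makes an analysis easier to reproduce later. The sketch below writes a config dict to a timestamped JSON file using only the standard library. The file naming scheme and the helper `save_timestamped_config` are assumptions for illustration, not the format used by `export_timestamped_config()`.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_timestamped_config(config, out_dir, prefix="analysis_config"):
    """Write config as JSON to <out_dir>/<prefix>_<YYYYmmdd_HHMMSS>.json."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


config = {"correction_method": "fdr_bh", "p_value_threshold": 0.05}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_timestamped_config(config, tmp)
    restored = json.loads(saved.read_text())  # round-trip check
print(saved.name)
```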

See Also

  • Usage Guide -- Detailed recipe book with usage patterns
  • CLAUDE.md — Project conventions and data prep patterns
