# Proteomics Analysis Toolkit

A Python toolkit for analyzing mass spectrometry-based proteomics data, supporting both Skyline CSV and PRISM parquet workflows.
## Features

### Core Analysis Modules
- `data_import`: Load Skyline CSV or PRISM parquet data, handle batch suffixes, manage sample metadata
- `preprocessing`: Protein identifier parsing, sample classification, data quality assessment
- `normalization`: Seven normalization methods (median, VSN, quantile, MAD, z-score, RLR, LOESS)
- `statistical_analysis`: Differential protein analysis — t-tests, Wilcoxon, Mann-Whitney, mixed-effects models
- `visualization`: Publication-ready plots — volcano, PCA, box plots, heatmaps, correlation, trajectories
- `enrichment`: Gene set enrichment via Enrichr API
- `temporal_clustering`: K-means clustering of temporal protein trends
- `validation`: Metadata/data consistency checking with diagnostic reports
- `export`: Standardized result export with timestamped configs
## Installation

```bash
# Install from PyPI
pip install proteomics-toolkit

# With XGBoost support (for the classification module)
pip install "proteomics-toolkit[xgboost]"

# Install from GitHub (latest development version)
pip install git+https://github.com/uw-maccosslab/proteomics-toolkit.git

# For development (editable install from a local clone)
git clone https://github.com/uw-maccosslab/proteomics-toolkit.git
cd proteomics-toolkit
pip install -e .
```
## Quick Start

### PRISM Workflow (recommended for batch-corrected data)

```python
import proteomics_toolkit as ptk
import pandas as pd

# 1. Load PRISM data
protein_data, metadata, sample_cols = ptk.load_prism_data(
    'PRISM-Output/corrected_proteins.parquet',
    'PRISM-Output/sample_metadata.csv',
)

# 2. Map batch-suffixed column names to short replicate IDs
col_map = ptk.strip_batch_suffix(sample_cols)  # {full_col: short_name}
short_to_col = {v: k for k, v in col_map.items()}

# 3. Build sample metadata dict (keys = full PRISM column names)
meta_dict = {}
for _, row in metadata.iterrows():
    full_col = short_to_col.get(row['Replicate'])
    if full_col:
        meta_dict[full_col] = row.to_dict()

# 4. Filter low-confidence proteins
protein_data_filtered = protein_data[~protein_data['low_confidence']].copy()

# 5. Build annotation + sample data for stats
annot = protein_data_filtered[[
    'leading_protein', 'leading_description', 'leading_gene_name',
    'leading_uniprot_id', 'leading_name'
]].copy()
annot.columns = ['Protein', 'Description', 'Protein Gene', 'UniProt_Accession', 'UniProt_Entry_Name']
data = pd.concat([annot.reset_index(drop=True),
                  protein_data_filtered[sample_cols].reset_index(drop=True)], axis=1)
data.index = data['Protein']  # accession as index

# 6. Statistical analysis
config = ptk.StatisticalConfig()
config.analysis_type = 'unpaired'
config.statistical_test_method = 'welch_t'
config.group_column = 'Group'
config.group_labels = ['Control', 'Treatment']  # [reference, study]
config.correction_method = 'fdr_bh'
config.p_value_threshold = 0.05
config.fold_change_threshold = 1.0
config.log_transform_before_stats = True
config.validate()

results = ptk.run_comprehensive_statistical_analysis(
    data, meta_dict, config, protein_annotations=annot
)

# 7. Visualization
ptk.plot_volcano(results, fc_threshold=1.0, gene_column='Protein Gene', label_top_n=15)
ptk.display_analysis_summary(results, config)

# 8. Enrichment
enrich_config = ptk.EnrichmentConfig(
    enrichr_libraries=['GO_Biological_Process_2023', 'KEGG_2021_Human'],
    pvalue_cutoff=0.05,
)
enrich = ptk.run_differential_enrichment(
    results, gene_column='Protein Gene', logfc_column='logFC',
    pvalue_column='adj.P.Val', config=enrich_config,
)
```
### Skyline CSV Workflow

```python
# 1. Load data
protein_data, metadata, peptide_data = ptk.load_skyline_data(
    protein_file='protein_quant.csv',
    metadata_file='metadata.csv',
)

# 2. Process sample names
sample_columns = ptk.data_import.identify_sample_columns(protein_data, metadata)
cleaned_names = ptk.clean_sample_names(sample_columns)

# 3. Parse annotations and filter
processed_data = ptk.parse_protein_identifiers(protein_data)

# 4. Normalize (skip for PRISM — already normalized)
normalized = ptk.median_normalize(processed_data, sample_columns=list(cleaned_names.values()))

# 5. QC plots
ptk.plot_box_plot(normalized, list(cleaned_names.values()), metadata)
ptk.plot_pca(normalized, list(cleaned_names.values()), metadata)
```
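To make step 2 concrete: `clean_sample_names()` removes prefixes/suffixes shared by all sample columns. The snippet below is an illustration of that idea in plain Python (`clean_names` is a hypothetical helper, not the toolkit's implementation, whose heuristics may differ):

```python
import os

def clean_names(columns):
    """Strip the longest prefix and suffix shared by ALL names.
    Illustrative sketch of what clean_sample_names() does conceptually."""
    prefix = os.path.commonprefix(columns)
    # Common suffix = common prefix of the reversed strings
    suffix_len = len(os.path.commonprefix([c[::-1] for c in columns]))
    return {c: c[len(prefix):len(c) - suffix_len] for c in columns}

cols = ["Exp1_Sample_A.raw", "Exp1_Sample_B.raw", "Exp1_Sample_C.raw"]
print(clean_names(cols))  # {'Exp1_Sample_A.raw': 'A', ...}
```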
## Statistical Analysis

All statistical analyses use `StatisticalConfig` + `run_comprehensive_statistical_analysis()`.

### Unpaired comparison (two independent groups)

```python
config = ptk.StatisticalConfig()
config.analysis_type = 'unpaired'
config.statistical_test_method = 'welch_t'  # or 'mann_whitney'
config.group_column = 'Group'
config.group_labels = ['Control', 'Treatment']
config.log_transform_before_stats = 'auto'
config.validate()

results = ptk.run_comprehensive_statistical_analysis(
    data, sample_metadata, config, protein_annotations=annot
)
```

### Paired comparison (before/after per subject)

```python
config = ptk.StatisticalConfig()
config.analysis_type = 'paired'
config.statistical_test_method = 'paired_t'
config.subject_column = 'Subject'
config.paired_column = 'Condition'
config.paired_label1 = 'Before'
config.paired_label2 = 'After'
config.group_column = 'Condition'
config.group_labels = ['Before', 'After']
config.validate()
```

### Mixed-effects model (repeated measures)

```python
config = ptk.StatisticalConfig()
config.analysis_type = 'paired'
config.statistical_test_method = 'mixed_effects'
config.subject_column = 'Subject'
config.paired_column = 'Visit'
config.paired_label1 = 'Baseline'
config.paired_label2 = 'Follow-up'
config.group_column = 'Treatment'
config.group_labels = ['Placebo', 'Drug']
config.interaction_terms = ['Treatment', 'Visit']
config.validate()
```
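For intuition about what a repeated-measures mixed-effects model tests here, the sketch below fits one protein with `statsmodels` (a listed dependency). The formula, simulated data, and column names are illustrative assumptions, not the toolkit's internals:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data for one protein: 20 subjects x 2 visits,
# half Placebo, half Drug; Drug gains +1.5 abundance at Follow-up.
rng = np.random.default_rng(0)
rows = []
for subj in range(20):
    treatment = "Drug" if subj >= 10 else "Placebo"
    baseline = rng.normal(10, 1)
    for label in ["Baseline", "Follow-up"]:
        effect = 1.5 if (treatment == "Drug" and label == "Follow-up") else 0.0
        rows.append({"Subject": subj, "Treatment": treatment, "Visit": label,
                     "abundance": baseline + effect + rng.normal(0, 0.3)})
df = pd.DataFrame(rows)

# Random intercept per subject; the Treatment x Visit interaction asks
# whether the Drug group changes more between visits than Placebo.
model = smf.mixedlm("abundance ~ Treatment * Visit", df, groups=df["Subject"])
result = model.fit()
print(result.params)
```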
Output columns: `Protein`, `logFC`, `P.Value`, `adj.P.Val`, `AveExpr`, `t`, `Protein Gene`, `Description`, `UniProt_Accession`, `Gene`
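A common follow-up is filtering the results table to significant hits. The sketch below uses the documented output columns on a mock DataFrame (the values and thresholds are illustrative; logFC is assumed to be in log2 units):

```python
import pandas as pd

# Mock results table with the documented output columns.
results = pd.DataFrame({
    "Protein": ["P12345", "Q67890", "A11111"],
    "Protein Gene": ["GENE1", "GENE2", "GENE3"],
    "logFC": [2.1, -1.4, 0.2],
    "P.Value": [0.0001, 0.003, 0.40],
    "adj.P.Val": [0.001, 0.02, 0.55],
})

# Significant = BH-adjusted p < 0.05 and |logFC| >= 1.
sig = results[(results["adj.P.Val"] < 0.05) & (results["logFC"].abs() >= 1.0)]
print(sig["Protein Gene"].tolist())  # ['GENE1', 'GENE2']
```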
## Enrichment

Enrichment results use these column names (not the Enrichr web-UI names):

| Column | Description |
|---|---|
| `Term` | Pathway / GO term name |
| `P_Value` | Unadjusted p-value |
| `Adj_P_Value` | BH-adjusted p-value |
| `Z_Score` | Enrichr z-score |
| `Combined_Score` | log(p) × z — used for ranking |
| `Genes` | Semicolon-separated gene list |
| `N_Genes` | Number of overlapping genes |
| `Library` | Source Enrichr library |
## Dependencies
- pandas >= 1.3.0
- numpy >= 1.21.0
- scipy >= 1.7.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scikit-learn >= 1.0.0
- statsmodels >= 0.12.0
- requests >= 2.25.0 (for Enrichr API)
- pyarrow >= 8.0.0 (for PRISM parquet files)
## Module Reference
### data_import.py

- `load_skyline_data()` — Load Skyline protein/peptide CSVs + metadata
- `load_prism_data()` — Load PRISM parquet + metadata
- `identify_sample_columns()` — Auto-detect sample columns
- `clean_sample_names()` — Remove common prefixes/suffixes
- `detect_batch_suffix()` — Detect PRISM `__@__` batch suffix
- `strip_batch_suffix()` — Map batch-suffixed names → short names
- `create_sample_column_mapping()` — Map data columns to metadata sample names
- `match_samples_to_metadata()` — Link samples to metadata rows
- `BATCH_SUFFIX_DELIMITER` — Constant: `"__@__"`
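Conceptually, the batch-suffix mapping splits each column name on the documented `BATCH_SUFFIX_DELIMITER`. The sketch below illustrates the idea with hypothetical replicate names; `strip_batch_suffix()` itself may add validation on top of this:

```python
# Documented delimiter separating replicate name from batch label.
DELIM = "__@__"

def strip_suffix(columns):
    """Map each batch-suffixed column name to its short replicate name."""
    return {c: c.split(DELIM, 1)[0] for c in columns}

cols = ["Rep_01__@__batch1", "Rep_02__@__batch1", "Rep_03__@__batch2"]
print(strip_suffix(cols))
# {'Rep_01__@__batch1': 'Rep_01', ...}
```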
### preprocessing.py

- `parse_protein_identifiers()` — Extract UniProt accessions and databases
- `parse_gene_and_description()` — Parse gene names from descriptions
- `classify_samples()` — Classify samples into groups / controls with color assignment
- `apply_systematic_color_scheme()` — Generate consistent group colors
- `create_standard_data_structure()` — Build standard 5-column annotation + sample layout
- `assess_data_completeness()` — Evaluate missing data patterns
- `filter_proteins_by_completeness()` — Remove proteins below detection threshold
- `calculate_group_colors()` — Generate group color mapping
- `identify_annotation_columns()` — Auto-detect annotation vs sample columns
### normalization.py

- `median_normalize()` — Median-based normalization (preserves original scale)
- `vsn_normalize()` — Variance Stabilizing Normalization (arcsinh-transformed)
- `quantile_normalize()` — Force identical distributions
- `mad_normalize()` — Median absolute deviation normalization
- `z_score_normalize()` — Standardize to mean=0, sd=1
- `rlr_normalize()` — Robust linear regression (log2-transformed)
- `loess_normalize()` — LOESS intensity-dependent (log2-transformed)
- `handle_negative_values()` — Handle negative values from VSN
- `analyze_negative_values()` — Analyze negative value patterns
- `calculate_normalization_stats()` — Evaluate normalization effectiveness
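As a reference point for the first method: median normalization typically scales each sample so its median matches the global median, which is why it preserves the original intensity scale. The sketch below shows one common definition (`median_normalize_sketch` is a hypothetical helper; the toolkit's `median_normalize()` may differ in detail):

```python
import numpy as np
import pandas as pd

def median_normalize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so its median equals the global
    median of all sample medians. Illustration only."""
    sample_medians = df.median()
    global_median = sample_medians.median()
    return df * (global_median / sample_medians)

data = pd.DataFrame({"s1": [1.0, 2.0, 3.0], "s2": [2.0, 4.0, 6.0]})
norm = median_normalize_sketch(data)
print(norm.median())  # both columns now share the same median
```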
### statistical_analysis.py

- `StatisticalConfig` — Configuration class (zero-arg constructor, set attributes individually)
- `run_comprehensive_statistical_analysis()` — Main analysis entry point
- `display_analysis_summary()` — Print/return summary of results
- `run_statistical_analysis()` — Backward-compatible wrapper
### visualization.py

- `plot_box_plot()` — Sample intensity distributions by group
- `plot_volcano()` — Volcano plot with labeled top hits
- `plot_pca()` — PCA with group coloring, optional log-transform
- `plot_comparative_pca()` — Compare PCA across normalization methods
- `plot_normalization_comparison()` — Before/after normalization QC
- `plot_sample_correlation_heatmap()` — Full correlation matrix
- `plot_sample_correlation_triangular_heatmap()` — Lower-triangle correlation
- `plot_control_correlation()` — Control sample correlation with optional clustering
- `plot_control_correlation_analysis()` — Multi-panel control QC
- `plot_control_group_correlation_analysis()` — Group-wise control QC
- `plot_individual_control_pool_analysis()` — Individual control analysis
- `plot_control_cv_distribution()` — CV distribution for control samples
- `plot_grouped_heatmap()` — Heatmap for any grouped data
- `plot_grouped_trajectories()` — Line plots for temporal/dose-response data
- `plot_protein_profile()` — Single protein expression profile
### enrichment.py

- `EnrichmentConfig` — Configuration dataclass (libraries, thresholds, API settings)
- `query_enrichr()` — Query Enrichr API with a gene list
- `parse_enrichr_results()` — Parse raw results into a tidy DataFrame
- `run_enrichment_analysis()` — Complete enrichment on a gene list
- `run_enrichment_by_group()` — Enrichment for each group in a DataFrame
- `run_differential_enrichment()` — Split by up/down-regulated, run enrichment on each
- `plot_enrichment_barplot()` — Horizontal bar plot by Combined Score
- `plot_enrichment_comparison()` — Dot plot comparing enrichment across groups
- `get_available_libraries()` — List common Enrichr libraries
- `merge_enrichment_results()` — Merge multiple enrichment DataFrames
### temporal_clustering.py

- `TemporalClusteringConfig` — Configuration dataclass
- `run_temporal_analysis()` — Complete pipeline: clustering → visualization → enrichment
- `calculate_temporal_means()` — Mean abundance per timepoint across subjects
- `cluster_temporal_trends()` — K-means or hierarchical clustering
- `name_clusters_by_pattern()` — Assign descriptive cluster names
- `classify_trend_pattern()` — Classify individual protein trends
- `merge_with_statistics()` — Merge temporal data with statistical results
- `filter_significant_proteins()` — Filter to significant proteins
- `run_enrichment_by_cluster()` — Enrichment per cluster
- `plot_cluster_heatmap()` — Cluster-organized heatmap
- `plot_cluster_parallel_coordinates()` — Parallel coordinate plots
### validation.py

- `validate_metadata_data_consistency()` — Check metadata matches data columns
- `enhanced_sample_processing()` — Sample processing with validation
- `generate_sample_matching_diagnostic_report()` — Detailed mismatch diagnostics
- `SampleMatchingError` — Exception for sample matching failures
- `ControlSampleError` — Exception for control sample configuration issues
### export.py

- `export_complete_analysis()` — Full export: data + config + results
- `export_analysis_results()` — Export normalized data + differential results
- `export_timestamped_config()` — Save analysis config with timestamp
- `create_config_dict_from_notebook_vars()` — Build config dict from notebook variables
- `export_significant_proteins_summary()` — Export significant results summary
- `export_results()` — General-purpose result export
## See Also

- Usage Guide — Detailed recipe book with usage patterns
- CLAUDE.md — Project conventions and data prep patterns