Python library for PRIDE Affinity Proteomics (PAD) archive data

pyprideap

pyprideap (Python PRIDE Affinity Proteomics) is a library for reading, validating, and analyzing affinity proteomics datasets from the PRIDE Affinity Archive (PAD).

Supports Olink (Explore, Explore HT, Target, Reveal) and SomaScan platforms.

Installation

Install pyprideap directly from PyPI:

pip install pyprideap

Or from source:

pip install "pyprideap[all] @ git+https://github.com/PRIDE-Archive/pyprideap.git"

With plotting and QC report support:

pip install "pyprideap[plots]"

With statistical testing:

pip install "pyprideap[stats]"

Or with all optional dependencies:

pip install "pyprideap[all]"

Quick Start

Read a dataset

import pyprideap as pp

# Auto-detect format from file extension and content
dataset = pp.read("olink_npx.csv")
dataset = pp.read("raw_data.adat")
dataset = pp.read("data.parquet")

# Force platform when auto-detection is ambiguous
dataset = pp.read("ambiguous.csv", platform="olink")
dataset = pp.read("ambiguous.csv", platform="somascan")
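Auto-detection is based on the file extension and, where that is ambiguous, the file content. A minimal sketch of the extension part of the idea (the names and mapping here are illustrative, not pyprideap's actual reader registry):

```python
from pathlib import Path

# Hypothetical extension-to-format hints; the real registry also
# sniffs file content when the extension alone is not decisive.
EXTENSION_HINTS = {
    ".adat": "somascan_adat",
    ".parquet": "olink_parquet",
    ".xlsx": "olink_xlsx",
}

def guess_format(path: str) -> str:
    suffixes = Path(path).suffixes  # e.g. [".npx", ".csv"]
    if suffixes and suffixes[-1] in EXTENSION_HINTS:
        return EXTENSION_HINTS[suffixes[-1]]
    if suffixes[-2:] == [".npx", ".csv"]:
        return "olink_csv"
    if suffixes and suffixes[-1] == ".csv":
        return "ambiguous_csv"  # would fall back to content sniffing
    raise ValueError(f"Unrecognised extension for {path}")
```

This is why `platform="olink"` / `platform="somascan"` exists: a bare `.csv` carries no platform information in its name.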

Generate a QC report

dataset = pp.read("olink_npx.csv")
pp.qc_report(dataset, "my_report.html")

The report includes interactive plots: expression distributions, PCA/t-SNE, LOD analysis, sample correlation, data completeness, CV distributions, and more. All plots are rendered with Plotly and include help tooltips explaining how to interpret each visualization.

Validate against PRIDE-AP guidelines

results = pp.validate(dataset)

for r in results:
    print(f"[{r.level.value}] {r.rule}: {r.message}")

Compute statistics

stats = pp.compute_stats(dataset)
print(stats.summary())

Fetch data from PRIDE Archive

client = pp.PrideClient()
project = client.get_project("PAD000001")
files = client.list_files("PAD000001")
urls = client.get_download_urls("PAD000001")

Command-Line Interface

pyprideap includes a CLI (powered by Click) for generating QC reports:

# From a local file (format auto-detected)
pyprideap report data.npx.csv
pyprideap report data.parquet -o my_report.html

# Force platform type
pyprideap report data.csv -p olink
pyprideap report data.adat -p somascan

# From a PRIDE accession (downloads data automatically)
pyprideap report -a PAD000001

# Generate individual plot files instead of a single report
pyprideap report data.npx.csv --split -o plots_dir/

# Include SDRF metadata for volcano plots
pyprideap report data.npx.csv --sdrf samples.sdrf.tsv

# Enable verbose logging (shows format detection, LOD method, PCA variance, etc.)
pyprideap report data.npx.csv -v

# List proteins above LOD from a local file
pyprideap proteins-above-lod data.npx.csv
pyprideap proteins-above-lod data.npx.csv -t 80 -o proteins.txt

# List proteins above LOD from a PRIDE accession
pyprideap proteins-above-lod -a PAD000001

Or via python -m:

python -m pyprideap report data.npx.csv

Verbose mode

Use -v / --verbose to enable detailed debug logging. This shows progress through each processing stage:

Reading olink_npx.csv...
08:12:01 [DEBUG] pyprideap.io.readers.registry: Format detected: olink_csv
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Sample key selected: SampleID
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Pivot shape: 20 samples x 1470 features
  20 samples, 1470 features (olink_explore)
08:12:01 [DEBUG] pyprideap.processing.lod: LOD method selected: REPORTED
08:12:02 [DEBUG] pyprideap.viz.qc.compute: Computing PCA...
08:12:02 [DEBUG] pyprideap.viz.qc.compute: PCA: variance explained=[0.42, 0.18]
...

QC Report

The HTML report is a self-contained, interactive document with a sidebar table of contents. It includes:

Quality Overview: LOD source comparison, QC x LOD stacked bar
Signal & Distribution: per-sample expression histograms, protein detectability
Data Completeness: per-sample above/below LOD, missing frequency distribution
Sample Relationships: PCA / t-SNE (dropdown toggle), sample correlation heatmap, clustered expression heatmap
Normalization QC: hybridization control scale (SomaScan)
Variability: CV distribution, intra/inter-plate CV

Each plot has a ? help button with guidance on interpretation.

Embedding reports in web pages

Reports automatically detect when loaded inside an <iframe> and switch to an embedded mode that hides the header, sidebar, and footer:

<iframe
  src="my_report.html"
  style="width: 100%; border: none; min-height: 600px;"
  id="qc-report">
</iframe>

<script>
// Auto-resize iframe to fit content
window.addEventListener('message', function(e) {
  if (e.data && e.data.type === 'pride-qc-resize') {
    document.getElementById('qc-report').style.height = e.data.height + 'px';
  }
});
</script>

The embedded report posts pride-qc-resize messages with the document height, allowing the parent page to resize the iframe automatically. The CSS class pride-embedded is added to the body, which:

  • Removes the sidebar navigation, header, and footer
  • Makes the background transparent
  • Removes card shadows for a seamless look

SDRF Integration

pyprideap can read SDRF (Sample and Data Relationship Format) files and merge sample metadata into datasets:

from pyprideap.io.readers.sdrf import read_sdrf, merge_sdrf, get_grouping_columns

# Read and parse an SDRF file
sdrf = read_sdrf("samples.sdrf.tsv")

# Merge SDRF metadata into an existing dataset
dataset = pp.read("olink_npx.csv")
dataset = merge_sdrf(dataset, sdrf)

# Identify columns suitable for differential expression grouping
group_cols = get_grouping_columns(sdrf)
# e.g. ["disease", "sex", "treatment"]

Column names are automatically shortened from the full SDRF syntax (e.g. characteristics[disease] becomes disease). Duplicate column names are disambiguated with numeric suffixes.
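The shortening and deduplication described above can be sketched in a few lines (an illustrative re-implementation, not pyprideap's actual code; in particular, the exact suffix style used for duplicates is an assumption here):

```python
import re

def shorten_sdrf_columns(columns):
    """Strip characteristics[...] / comment[...] / factor value[...]
    wrappers and disambiguate duplicate names with numeric suffixes."""
    short, seen = [], {}
    for col in columns:
        m = re.fullmatch(
            r"(?:characteristics|comment|factor value)\[(.+)\]",
            col.strip(),
            re.IGNORECASE,
        )
        name = m.group(1) if m else col.strip()
        count = seen.get(name, 0)
        seen[name] = count + 1
        short.append(name if count == 0 else f"{name}.{count}")
    return short
```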

Supported File Formats

Format            Platform                 Function
.npx.csv          Olink Explore / Target   pp.read()
.parquet          Olink Explore HT         pp.read()
.xlsx             Olink                    pp.read()
.adat             SomaScan                 pp.read()
.csv (SomaScan)   SomaScan                 pp.read()
.sdrf.tsv         Any                      read_sdrf()

All readers produce an AffinityDataset with a unified structure regardless of input format.

Data Model

@dataclass
class AffinityDataset:
    platform: Platform          # OLINK_EXPLORE, OLINK_EXPLORE_HT, SOMASCAN, etc.
    samples: pd.DataFrame       # Sample metadata (SampleID, SampleType, QC flags, ...)
    features: pd.DataFrame      # Protein/aptamer annotations (OlinkID, UniProt, Panel, ...)
    expression: pd.DataFrame    # Quantification matrix (NPX or RFU)
    metadata: dict              # Platform-specific extras

LOD (Limit of Detection)

pyprideap supports multiple LOD sources with automatic fallback:

  1. Reported LOD — from the LOD column in the data file
  2. NCLOD — computed from negative control samples (requires >= 10 controls)
  3. FixedLOD — pre-computed Olink reference values (bundled for Explore, Explore HT, Reveal)
  4. eLOD — estimated from buffer samples using MAD formula (SomaScan)
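As a rough sketch of the eLOD idea: a common formulation for SomaScan is median of the buffer-well RFUs plus a multiple of their median absolute deviation. The multiplier 3.3 below is a commonly cited choice, not necessarily the constant pyprideap uses:

```python
import statistics

def estimate_elod(buffer_rfus, k=3.3):
    """Estimated LOD from buffer wells: median + k * MAD.
    Illustrative only; the exact constant and MAD scaling may differ."""
    med = statistics.median(buffer_rfus)
    mad = statistics.median(abs(x - med) for x in buffer_rfus)
    return med + k * mad
```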

Statistical Analysis

With pip install "pyprideap[stats]":

# Per-protein t-test between groups
results = pp.ttest(dataset, group_var="Treatment")

# Wilcoxon rank-sum test
results = pp.wilcoxon(dataset, group_var="Treatment")

# ANOVA with covariates
results = pp.anova(dataset, group_var="Treatment", covariates=["Age", "Sex"])

# Post-hoc pairwise comparisons
posthoc = pp.anova_posthoc(dataset, group_var="Treatment")
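Per-protein two-group testing of this kind typically reduces to a Welch t-test on each protein's expression values. A self-contained sketch of that statistic (pyprideap presumably delegates to an established statistics library and adds multiple-testing correction; this is not its internal code):

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's two-sample t statistic and Welch-Satterthwaite
    degrees of freedom, for unequal variances and sizes."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (na - 1) + vb**2 / (nb - 1))
    return t, df
```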

Normalization

# Bridge normalization (combining two runs with shared samples)
normalized = pp.bridge_normalize(dataset1, dataset2, bridge_samples=["S1", "S2"])

# Subset normalization using reference proteins
normalized = pp.subset_normalize(dataset1, dataset2, reference_proteins=["P1", "P2"])

# Reference median normalization
normalized = pp.reference_median_normalize(dataset, reference_medians=medians)

# Select optimal bridge samples
bridges = pp.select_bridge_samples(dataset, n=8)

# Assess bridgeability between product versions
report = pp.assess_bridgeability(dataset1, dataset2)
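The core of bridge normalization is simple: for each protein, compute the median difference between the two runs on the shared bridge samples, then subtract that offset from one run. A sketch on plain nested dicts (pyprideap's `bridge_normalize` operates on AffinityDataset objects, so this mirrors the idea, not the API):

```python
import statistics

def bridge_offsets(run1, run2, bridge_samples):
    """Per-protein median offset of run2 relative to run1,
    computed on the shared bridge samples.
    run1/run2: {protein: {sample: value}}."""
    return {
        protein: statistics.median(
            run2[protein][s] - run1[protein][s] for s in bridge_samples
        )
        for protein in run1
    }

def apply_offsets(run2, offsets):
    """Subtract each protein's offset so run2 aligns with run1."""
    return {p: {s: v - offsets[p] for s, v in samples.items()}
            for p, samples in run2.items()}
```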

Additional normalization methods are available via direct import:

from pyprideap.processing.normalization import (
    lift_somascan,                # Cross-version SomaScan calibration (5k ↔ 7k ↔ 11k)
    quantile_smooth_normalize,    # Quantile normalization with smoothing
    scale_analytes,               # Per-analyte multiplicative scaling
    normalize_n,                  # Multi-step normalization pipeline
)

Preprocessing Pipelines

Platform-specific preprocessing pipelines bundle common QC and filtering steps:

from pyprideap.processing.olink import preprocess_olink
from pyprideap.processing.somascan import preprocess_somascan

# Olink: filter controls, detect outliers, LOD filtering, UniProt dedup
dataset, report = preprocess_olink(
    dataset,
    filter_controls=True,
    filter_qc_outliers=True,
    filter_lod=False,
)

# SomaScan: filter features/controls, RowCheck QC, outlier detection
dataset, report = preprocess_somascan(
    dataset,
    filter_features=True,
    filter_controls=True,
    filter_rowcheck=True,
)

print(report.summary())

Experimental Design

# Randomize samples to plates
plate_assignment = pp.randomize_plates(
    samples=sample_df,
    n_plates=4,
    keep_paired="SubjectID",  # keep longitudinal samples on same plate
    seed=42,
)
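The `keep_paired` constraint means randomization happens at the subject level: all samples from one subject land on the same plate. A minimal sketch of that idea under a simple fill-the-emptiest-plate rule (a hypothetical helper with a different signature; the real `pp.randomize_plates` works on a sample DataFrame and likely balances additional factors):

```python
import random
from collections import defaultdict

def randomize_plates_sketch(sample_ids, subject_ids, n_plates, seed=42):
    """Assign samples to plates at random while keeping all samples
    from the same subject together on one plate."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sid, subj in zip(sample_ids, subject_ids):
        groups[subj].append(sid)
    blocks = list(groups.values())
    rng.shuffle(blocks)  # randomize subject order
    assignment, sizes = {}, [0] * n_plates
    for block in blocks:
        plate = sizes.index(min(sizes))  # fill the emptiest plate
        for sid in block:
            assignment[sid] = plate
        sizes[plate] += len(block)
    return assignment
```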

License

Apache License 2.0
