A comprehensive LC-MS metabolomics data quality control module.
pi-metaboqc: $\pi$-Metabolomics-Quality Control
pi-metaboqc is a high-performance, fully automated data quality control pipeline designed specifically for large-scale, multi-batch clinical metabolomics.
✨ Core Capabilities
- Pure Python Ecosystem & Native Pandas Integration: The core data structure, MetaboInt, inherits directly from pandas.DataFrame. All underlying calculations are implemented with industry-standard libraries such as SciPy and scikit-learn. Classical methods that traditionally relied on R (such as Quantile Normalization and VSN) have been reimplemented in Python with statistically equivalent results, which removes the language barrier.
- Intelligent Missing Value Management: Built-in heuristic algorithms automatically identify and distinguish between MAR (Missing at Random) and MNAR (Missing Not at Random) metabolite features. By evaluating statistical metrics such as NRMSE (Normalized Root Mean Square Error), the pipeline auto-tunes and selects the most appropriate filtering and imputation strategies for your specific dataset.
- Dual-Engine High-Performance Computing: Combines joblib for multi-core parallelization with Numba for Just-In-Time (JIT) compilation. This architecture accelerates computationally intensive tasks, such as baseline modeling and cross-validation, to near-C speeds, reducing turnaround times for large clinical cohorts.
- End-to-End Quality Assessment (QA): Provides data evaluation functions spanning the entire pipeline. From raw data import and missing value handling to signal drift correction and normalization, the distribution and quality of your data can be monitored and controlled at every step.
- Dual-Tier Automated Reporting & Publication-Ready Visualizations: The pipeline records retention metrics and statistical parameters across all stages, and can generate either a Brief (executive summary) or a Comprehensive (deep-dive audit) PDF/Markdown report with a single click. All diagnostic plots are exported in lossless SVG format, so they can be edited directly in Adobe Illustrator or Microsoft PowerPoint for journal submission.
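To illustrate what inheriting from pandas.DataFrame involves, here is a minimal sketch of the standard pandas subclassing pattern. The class name IntensityFrame and the attribute batch_column are hypothetical and are not taken from the pi-metaboqc source; the actual MetaboInt class may be organized differently.

```python
import pandas as pd


class IntensityFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass; not the actual MetaboInt class."""

    # Attribute names listed in _metadata are carried over by pandas
    # when it derives a new object from this one.
    _metadata = ["batch_column"]

    @property
    def _constructor(self):
        # Make slicing and copying return IntensityFrame instead of
        # falling back to a plain DataFrame.
        return IntensityFrame


frame = IntensityFrame(
    {"feature_1": [1200.0, 1350.0], "feature_2": [87.0, None]},
    index=["QC_01", "Sample_01"],
)
frame.batch_column = "batch"  # assumed attribute, for illustration only

# Column selection keeps both the subclass and the custom attribute.
subset = frame[["feature_1"]]
print(type(subset).__name__, subset.batch_column)
```

Because the subclass is still a DataFrame, every regular pandas method (isna, groupby, to_csv, and so on) remains available on it.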
📦 Installation
We strongly recommend installing pi-metaboqc within a Conda virtual environment using Miniforge (preferred), Miniconda, or Anaconda.
Generating high-fidelity HTML and PDF reports requires advanced graphical engines (pandoc, weasyprint, and librsvg). These tools depend on complex, system-level C libraries (e.g., GTK3, Pango) that are notoriously difficult to compile and configure via standard pip, particularly on Windows.
Conda effortlessly resolves these low-level dependencies. To guarantee maximum stability across all operating systems, please follow the Standard Installation guide below.
⚠️ Note: While we have integrated an automatic fallback download feature for missing dependencies, it has not been exhaustively tested across all edge cases. Proceeding with the Conda installation remains the most robust and officially supported approach.
Step 1: Create and Activate Conda Environment
conda create -n metaboqc python=3.13 pip -y
conda activate metaboqc
Step 2: Pre-install Graphical Engines (Recommended)
Install pandoc, weasyprint and librsvg via conda-forge to ensure all necessary system graphical libraries are correctly linked before installing the Python package:
conda install -c conda-forge pandoc weasyprint librsvg -y
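After this step, a quick check from Python can confirm that the tools are visible inside the activated environment. The helper below is a hypothetical sketch and is not part of pi-metaboqc; it assumes that pandoc is exposed as the `pandoc` executable, that librsvg is exposed as `rsvg-convert`, and that weasyprint is importable as a Python package.

```python
import importlib.util
import shutil


def check_report_dependencies() -> dict[str, bool]:
    """Report which rendering tools are visible from the current environment."""
    return {
        # pandoc ships a `pandoc` executable.
        "pandoc": shutil.which("pandoc") is not None,
        # librsvg ships its converter as `rsvg-convert`.
        "librsvg": shutil.which("rsvg-convert") is not None,
        # weasyprint is installed as an importable Python package.
        "weasyprint": importlib.util.find_spec("weasyprint") is not None,
    }


if __name__ == "__main__":
    for name, found in check_report_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

This only checks that the tools can be located. It does not verify that the underlying system libraries (e.g., GTK3, Pango) load correctly at render time.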
Step 3: Install pi-metaboqc
For standard users: Install the stable release directly from PyPI:
pip install pi-metaboqc
Alternatively, install the latest development version directly from GitHub:
pip install git+https://github.com/KaikunXu/pi-metaboqc.git
For developers (Editable mode): If you plan to modify the source code or contribute to the project:
git clone https://github.com/KaikunXu/pi-metaboqc.git
cd pi-metaboqc
pip install -e .
Quickstart & Tutorials
pi-metaboqc is designed for zero-friction deployment. You only need three files to trigger the fully automated pipeline: a sample metadata table, a raw intensity matrix, and a TOML configuration file.
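As a rough picture of how the first two inputs relate, the sketch below aligns a metadata table and an intensity matrix on a shared sample identifier using plain pandas. The column names (sample_id, batch, sample_type) and the samples-as-rows orientation are assumptions made for illustration; consult the bundled demo files for the layout the pipeline actually expects.

```python
import io

import pandas as pd

# Toy stand-ins for the two CSV inputs; the column names are assumptions.
META_CSV = """sample_id,batch,sample_type
S1,1,QC
S2,1,Sample
S3,2,Sample
"""
INTENSITY_CSV = """sample_id,feature_1,feature_2
S2,1500,
S1,1200,87
S4,900,45
"""

meta = pd.read_csv(io.StringIO(META_CSV), index_col="sample_id")
intensity = pd.read_csv(io.StringIO(INTENSITY_CSV), index_col="sample_id")

# Keep only samples present in both tables, and order both the same way.
shared = meta.index.intersection(intensity.index)
meta_aligned = meta.loc[shared]
intensity_aligned = intensity.loc[shared]

# Samples that appear in only one table are worth reporting to the user.
unmatched = meta.index.symmetric_difference(intensity.index)
print(list(shared), list(unmatched))
```

The empty cell for S2 is read as NaN, which is how missing intensities typically enter a pandas-based workflow.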
We provide two execution modes for different use cases in the examples/ directory. For first-time users, we strongly recommend starting with the Interactive Notebook.
1. Interactive Notebook (Recommended for Onboarding)
Interactive Tutorial (interactive_tutorial.ipynb): An end-to-end Jupyter Notebook. This is the optimal way to experience pi-metaboqc. It allows you to step through the pipeline, visually inspect intermediate QA diagnostic dashboards (including model_overview plots with Q2 metrics, natively rendered as high-fidelity SVGs), and intuitively grasp the core algorithmic logic.
Choose the access method that best suits your network environment:
- Static Viewer (nbviewer): Delivers fast, static rendering. Recommended for users in mainland China to ensure all inline SVG plots are displayed reliably without execution overhead or connectivity issues.
- Google Colab: A cloud-executable environment. Best for global users who wish to run the pipeline dynamically with zero local configuration.
2. Headless CLI Execution (For Production & Batch Processing)
For deployment on HPC clusters or integration into larger bioinformatics workflows, utilize our robust command-line interface script (run_pimqc.py).
# Navigate to the examples directory
cd examples
# Option A: Run out-of-the-box with bundled demo data
python run_pimqc.py
# Option B: Run with your own custom clinical cohort
python run_pimqc.py \
--meta /path/to/your_meta.csv \
--intensity /path/to/your_intensity.csv \
--config /path/to/custom_params.toml \
--outdir /path/to/output_directory
# Option C: Run in silent mode
python run_pimqc.py -q
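When the script is called from a larger workflow, the same invocation can be assembled programmatically. The sketch below uses only the flags shown above; the helper names, the per-cohort file names (meta.csv, intensity.csv), and the qc_output directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(meta, intensity, config, outdir, quiet=False):
    """Assemble the documented run_pimqc.py invocation as an argument list."""
    command = [
        sys.executable, "run_pimqc.py",
        "--meta", str(meta),
        "--intensity", str(intensity),
        "--config", str(config),
        "--outdir", str(outdir),
    ]
    if quiet:
        command.append("-q")
    return command


def run_cohorts(cohort_dirs, config, examples_dir="examples"):
    """Run the script once per cohort directory (hypothetical layout)."""
    config = Path(config).resolve()
    for cohort in cohort_dirs:
        # Resolve to absolute paths because the script runs from examples/.
        cohort = Path(cohort).resolve()
        command = build_command(
            cohort / "meta.csv", cohort / "intensity.csv",
            config, cohort / "qc_output", quiet=True,
        )
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, cwd=examples_dir, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.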
⚠️ Troubleshooting Note for VS Code Users: When running the CLI script via the integrated terminal in Visual Studio Code, the IDE may occasionally fail to properly inherit full Conda environment variables. This prevents the PDF rendering engine from locating essential system-level C libraries (e.g., GTK3/Pango), causing the report generation to gracefully degrade and output an HTML report instead.
Resolution: You can bypass this by executing the script from a native system terminal (e.g., Anaconda Prompt, macOS Terminal). Alternatively, to permanently configure VS Code for seamless PDF rendering and resolve PowerShell restrictions, please refer to our VS Code Environment & Troubleshooting Guide.
Automated Refinement Protocol (Under the Hood)
Whichever mode you use, the pipeline runs the following steps in order:
- Building dataset: Parses TOML or JSON configurations to align sample metadata with the raw intensity matrix, instantiating the core MetaboInt data object.
- High-missing value features filtering: Heuristically classifies missing value mechanisms (MAR vs. MNAR) and removes features that exceed predefined missing rate thresholds.
- Intra-batch correction: Corrects injection-order-dependent instrument signal drift within individual analytical batches using robust regression models fitted to pooled QCs (QC-RLSC, QC-RFSC, or QC-SVR).
- Inter-batch correction: Harmonizes analytical variations across multiple independent batches, mitigating systematic batch effects to ensure global data comparability.
- Low-quality features filtering: Removes unreliable features based on noise-filtering criteria, including Blank-to-QC intensity ratios and pooled-QC Relative Standard Deviation (RSD).
- Missing values imputation: Executes stratified, mechanism-aware imputation on remaining missing values, either auto-tuned via NRMSE simulation benchmarks or using user-defined algorithms.
- Normalization: Adjusts for systematic sample-to-sample variations (e.g., biofluid dilution effects) using global scaling techniques such as PQN, Median, TIC, VSN, and Quantile.
- Quality assessment (repeated at every stage): Runs alongside all pipeline stages, continuously capturing statistical metrics to generate a comprehensive, publication-ready Markdown/PDF audit report.
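Three of the quantities named in the steps above can be sketched in a few lines of NumPy. These are simplified textbook forms, not the pi-metaboqc implementations: the function names are hypothetical, the 30% RSD cutoff is an assumed example threshold, PQN is shown without any preliminary scaling step, and NRMSE is scaled by the standard deviation of the true values (other scalings exist).

```python
import numpy as np


def qc_rsd(qc_intensities: np.ndarray) -> np.ndarray:
    """Per-feature RSD (%) across pooled-QC injections (rows = QC samples)."""
    mean = np.nanmean(qc_intensities, axis=0)
    std = np.nanstd(qc_intensities, axis=0, ddof=1)
    return 100.0 * std / mean


def pqn_normalize(intensities: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Probabilistic quotient normalization against a reference profile."""
    quotients = intensities / reference          # feature-wise ratios per sample
    dilution = np.nanmedian(quotients, axis=1)   # one dilution factor per sample
    return intensities / dilution[:, None]


def nrmse(true_values: np.ndarray, imputed_values: np.ndarray) -> float:
    """RMSE of imputed vs. held-out true values, scaled by std of the true values."""
    error = np.asarray(imputed_values) - np.asarray(true_values)
    return float(np.sqrt(np.mean(error ** 2)) / np.std(true_values))


# Toy data: three QC injections, two features.
qc = np.array([[100.0, 50.0], [110.0, 80.0], [90.0, 20.0]])
keep = qc_rsd(qc) <= 30.0  # assumed example threshold of 30%

# Second sample is a two-fold concentrated copy of the reference profile.
samples = np.array([[100.0, 50.0], [200.0, 100.0]])
normalized = pqn_normalize(samples, reference=np.array([100.0, 50.0]))

print(keep, normalized, nrmse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

In the toy data the second feature varies far more across QC injections than the first, so only the first passes the cutoff, and PQN rescales the concentrated sample back onto the reference profile.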
Project Structure
pi-metaboqc/
├── README.md                          # Project documentation and quickstart guide
├── pyproject.toml                     # Modern Python build and dependency config
├── LICENSE                            # MIT license
├── examples/                          # Directory for tutorials and examples
│   ├── interactive_tutorial.ipynb     # Interactive Jupyter Notebook for onboarding
│   └── run_pimqc.py                   # Production-ready CLI execution script
├── src/                               # Core source code directory
│   └── pimqc/                         # Core pi-metaboqc package
│       ├── __init__.py                # Package initialization file
│       ├── core_classes.py            # Core DataStructure class (MetaboInt)
│       ├── visualizer_classes.py      # Core Visualization class (BaseMetaboVisualizer)
│       ├── dataset_builder.py         # MetaboInt instantiation
│       ├── assessment.py              # Data quality assessment
│       ├── correction.py              # Signal drift & batch correction
│       ├── filtering.py               # High-missing & low-quality features filtering
│       ├── imputation.py              # Missing values imputation
│       ├── normalization.py           # Data normalization
│       ├── pipeline.py                # Automated pipeline orchestrator
│       ├── io_utils.py                # I/O operations
│       ├── plot_utils.py              # Plotting utilities
│       ├── pca_utils.py               # Underlying PCA dimensionality reduction
│       ├── stat_utils.py              # Shared statistical utility functions
│       ├── report_utils.py            # Automated markdown and pdf report rendering
│       ├── config_schema.py           # Configuration schema and parameter validation
│       ├── templates/...              # Template file for generating reports...
│       └── data/                      # Demo data and configuration file directory
│           ├── project_meta.csv       # Demo project metadata file
│           ├── project_intensity.csv  # Demo project intensity file
│           └── pipeline_parameters.toml # Demo pipeline parameters file
├── tests/...                          # Unit testing and E2E stress testing...
└── ...                                # Other files required by this module...
💡 Note on Configuration: The entire analytical logic of pi-metaboqc is centrally governed by pipeline_parameters.toml. Users can fine-tune missing value tolerances, SVR kernel parameters, and normalization strategies exclusively through this file, without modifying any underlying Python code.
🤝 Contributing & License
This project is licensed under the MIT License. Contributions, issues, and feature requests are welcome! Feel free to check the issues page.