A comprehensive proteomics quantification library supporting multiple methods (iBAQ, Top3, TopN, MaxLFQ, DirectLFQ)
Project description
mokume
Why "mokume"?
The name comes from mokume-gane (木目金), a Japanese metalworking technique that fuses multiple metal layers into distinctive patterns - similar to how this library melds peptide intensities into unified protein expression profiles.
Overview
mokume is a comprehensive proteomics quantification library that supports multiple protein quantification methods including iBAQ, Top3, TopN, MaxLFQ, and DirectLFQ. It provides feature/peptide normalization, batch correction, and various summarization strategies for the quantms ecosystem. This library is an evolution of ibaqpy, extended to support a broader range of protein quantification methods beyond iBAQ.
Installation
pip install mokume
With optional extras:
# DirectLFQ support
pip install mokume[directlfq]
# Plotting support (for QC reports and visualizations)
pip install mokume[plotting]
# All optional dependencies
pip install mokume[all]
Or install from source:
git clone https://github.com/bigbio/mokume
cd mokume
pip install .
Using conda:
mamba env create -f environment.yaml
conda activate mokume
pip install .
Library Structure
mokume/
├── core/ # Core utilities and constants
│ ├── constants.py # Column names, mappings, utility functions
│ ├── logger.py # Logging utilities
│ └── write_queue.py # Async file writing
│
├── model/ # Data models and enums
│ ├── labeling.py # TMT/iTRAQ/LFQ labeling types
│ ├── normalization.py # Normalization method enums
│ ├── organism.py # Organism metadata (histones, genome size)
│ ├── quantification.py # Quantification method enums
│ ├── summarization.py # Summarization strategy enums
│ └── filters.py # Filter configuration dataclasses
│
├── normalization/ # Normalization implementations
│ ├── feature.py # Feature-level normalization
│ ├── peptide.py # Peptide-level normalization pipeline
│ └── protein.py # Protein-level normalization
│
├── preprocessing/ # Preprocessing filters
│ └── filters/ # Quality control filters
│ ├── base.py # Base filter class
│ ├── intensity.py # Intensity-based filters
│ ├── peptide.py # Peptide-level filters
│ ├── protein.py # Protein-level filters
│ ├── run_qc.py # Run/sample QC filters
│ ├── pipeline.py # Filter pipeline orchestration
│ ├── factory.py # Filter factory functions
│ └── io.py # YAML/JSON config loading
│
├── quantification/ # Protein quantification methods
│ ├── base.py # Abstract base class
│ ├── ibaq.py # iBAQ implementation
│ ├── top3.py # Top3 quantification
│ ├── topn.py # TopN quantification
│ ├── maxlfq.py # MaxLFQ algorithm (parallelized)
│ ├── directlfq.py # DirectLFQ wrapper (optional)
│ └── all_peptides.py # Sum of all peptides
│
├── summarization/ # Intensity summarization strategies
│ ├── base.py # Abstract base class
│ ├── median.py # Median summarization
│ ├── mean.py # Mean summarization
│ └── sum.py # Sum summarization
│
├── imputation/ # Missing value handling
│ └── methods.py # Imputation implementations
│
├── postprocessing/ # Data reshaping and correction
│ ├── reshape.py # Pivot operations (wide/long format)
│ ├── batch_correction.py # ComBat batch correction
│ └── combiner.py # Multi-file combining
│
├── plotting/ # Visualization (optional: pip install mokume[plotting])
│ ├── distributions.py # Distribution and box plots
│ └── pca.py # PCA and t-SNE plots
│
├── io/ # Input/Output utilities
│ ├── parquet.py # Parquet/TSV reading, AnnData creation
│ └── fasta.py # FASTA file handling
│
├── commands/ # CLI commands
│ ├── features2peptides.py # Feature to peptide conversion
│ ├── peptides2protein.py # Protein quantification
│ ├── batch_correct.py # Batch correction
│ └── visualize.py # t-SNE visualization
│
└── data/ # Static data resources
├── organisms.py # Organism histone data
└── organisms.json # Organism registry
Quantification Methods
| Method | Description | Requires FASTA | Class | Optional |
|---|---|---|---|---|
| iBAQ | Intensity-Based Absolute Quantification | Yes | peptides_to_protein() |
No |
| Top3 | Average of 3 most intense peptides | No | Top3Quantification |
No |
| TopN | Average of N most intense peptides | No | TopNQuantification |
No |
| MaxLFQ | Delayed normalization with parallelization | No | MaxLFQQuantification |
No* |
| DirectLFQ | Intensity traces with hierarchical alignment | No | DirectLFQQuantification |
Yes** |
| Sum | Sum of all peptide intensities | No | AllPeptidesQuantification |
No |
*MaxLFQ automatically uses DirectLFQ when installed for best accuracy, falling back to built-in implementation otherwise.
**DirectLFQ requires optional install: pip install mokume[directlfq]
MaxLFQ Algorithm Details
The MaxLFQQuantification class provides two implementations:
-
DirectLFQ backend (default when installed): Uses the DirectLFQ package for maximum accuracy with variance-guided pairwise alignment.
-
Built-in fallback: A parallelized implementation using peptide trace alignment:
- Aligns peptide intensity traces within each protein using median shifts
- Aggregates aligned traces using median per sample
- Scales results to preserve total peptide intensity
- Achieves ~0.95 Spearman correlation with DIA-NN's MaxLFQ values
Use force_builtin=True to always use the built-in implementation, or check maxlfq.using_directlfq to see which backend is active.
CLI Usage
Peptides to Protein Quantification
# Using iBAQ (default) - requires FASTA
mokume peptides2protein --method ibaq \
-f proteome.fasta \
-p peptides.csv \
-o proteins-ibaq.tsv
# Using Top3 - no FASTA required
mokume peptides2protein --method top3 \
-p peptides.csv \
-o proteins-top3.tsv
# Using TopN with N=5
mokume peptides2protein --method topn --topn_n 5 \
-p peptides.csv \
-o proteins-top5.tsv
# Using MaxLFQ with parallelization
# Automatically uses DirectLFQ backend if installed, otherwise built-in
mokume peptides2protein --method maxlfq \
--threads 4 \
-p peptides.csv \
-o proteins-maxlfq.tsv
# Using DirectLFQ (requires: pip install mokume[directlfq])
mokume peptides2protein --method directlfq \
-p peptides.csv \
-o proteins-directlfq.tsv
# Using Sum of all peptides
mokume peptides2protein --method sum \
-p peptides.csv \
-o proteins-sum.tsv
Full iBAQ with TPA and ProteomicRuler
mokume peptides2protein \
-f proteome.fasta \
-p peptides.csv \
-e Trypsin \
--normalize \
--tpa \
--ruler \
--ploidy 2 \
--cpc 200 \
--organism human \
--output proteins-ibaq.tsv \
--verbose \
--qc_report QC.pdf
Features to Peptides
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--remove_decoy_contaminants \
--remove_low_frequency_peptides \
--nmethod median \
--pnmethod globalMedian \
--output peptides-norm.csv
Preprocessing Filters
mokume supports comprehensive preprocessing filters via YAML/JSON configuration files or CLI options.
Generate example filter configuration:
mokume features2peptides --generate-filter-config filters.yaml
Use filter configuration file:
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-config filters.yaml \
--output peptides-filtered.csv
CLI filter overrides (take precedence over config file):
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-config filters.yaml \
--filter-min-intensity 1000 \
--filter-cv-threshold 0.3 \
--filter-charge-states "2,3,4" \
--filter-max-missed-cleavages 2 \
--output peptides-filtered.csv
CLI-only filtering (no config file):
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-min-intensity 500 \
--filter-min-unique-peptides 2 \
--filter-max-missing-rate 0.5 \
--output peptides-filtered.csv
Batch Correction
mokume correct-batches \
-f ibaq_folder/ \
-p "*ibaq.tsv" \
-o corrected_ibaq.tsv \
--export_anndata
t-SNE Visualization
mokume tsne-visualization \
-f protein_folder/ \
-o proteins.tsv
Python API
Quantification Methods
import pandas as pd
from mokume.quantification import (
Top3Quantification,
TopNQuantification,
MaxLFQQuantification,
AllPeptidesQuantification,
get_quantification_method,
is_directlfq_available,
peptides_to_protein, # iBAQ function
)
# Load peptide data
peptides = pd.read_csv("peptides.csv")
# --- Top3 Quantification ---
top3 = Top3Quantification()
result = top3.quantify(
peptides,
protein_column="ProteinName",
peptide_column="PeptideSequence",
intensity_column="NormIntensity",
sample_column="SampleID",
)
# --- TopN Quantification (configurable N) ---
topn = TopNQuantification(n=5)
result = topn.quantify(peptides, protein_column="ProteinName", ...)
# --- MaxLFQ Quantification ---
# Automatically uses DirectLFQ if installed, otherwise falls back to built-in
maxlfq = MaxLFQQuantification(
min_peptides=2, # Minimum peptides required for MaxLFQ (uses median for fewer)
threads=4, # Use 4 parallel cores (-1 for all cores)
)
result = maxlfq.quantify(peptides, protein_column="ProteinName", ...)
# Check which implementation is being used
print(f"Using DirectLFQ: {maxlfq.using_directlfq}")
print(f"Implementation: {maxlfq.name}") # "MaxLFQ (DirectLFQ)" or "MaxLFQ (built-in)"
# For best accuracy, install DirectLFQ: pip install mokume[directlfq]
# Force built-in implementation (for testing/comparison)
maxlfq_builtin = MaxLFQQuantification(min_peptides=2, force_builtin=True)
# Run-level quantification (uses built-in implementation)
result = maxlfq.quantify(
peptides,
protein_column="ProteinName",
sample_column="SampleID",
run_column="Run", # Optional: quantify at run level instead of sample level
)
# --- DirectLFQ Quantification (standalone, optional dependency) ---
if is_directlfq_available():
from mokume.quantification import DirectLFQQuantification
directlfq = DirectLFQQuantification(min_nonan=2)
result = directlfq.quantify(peptides, protein_column="ProteinName", ...)
# --- Sum of All Peptides ---
sum_quant = AllPeptidesQuantification()
result = sum_quant.quantify(peptides, protein_column="ProteinName", ...)
# --- Factory Function ---
method = get_quantification_method("maxlfq", min_peptides=2, threads=-1)
result = method.quantify(peptides, ...)
# --- Check available methods ---
from mokume.quantification import list_quantification_methods
print(list_quantification_methods())
# {'top3': True, 'topn': True, 'maxlfq': True, 'directlfq': False, 'sum': True}
# --- iBAQ with Full Pipeline ---
peptides_to_protein(
fasta="proteome.fasta",
peptides="peptides.csv",
enzyme="Trypsin",
normalize=True,
tpa=True,
ruler=True,
ploidy=2,
cpc=200,
organism="human",
output="proteins-ibaq.tsv",
min_aa=7,
max_aa=30,
verbose=True,
qc_report="QC.pdf",
)
Normalization
from mokume.normalization.peptide import peptide_normalization
# Full peptide normalization pipeline
peptide_normalization(
parquet="features.parquet",
sdrf="experiment.sdrf.tsv",
min_aa=7,
min_unique=2,
remove_ids=None,
remove_decoy_contaminants=True,
remove_low_frequency_peptides=True,
output="peptides-norm.csv",
skip_normalization=False,
nmethod="median", # Feature normalization: mean, median, iqr, none
pnmethod="globalMedian", # Peptide normalization: globalMedian, conditionMedian, none
log2=True,
save_parquet=False,
)
Preprocessing Filters
from mokume.preprocessing.filters import (
load_filter_config,
save_filter_config,
generate_example_config,
get_filter_pipeline,
FilterPipeline,
)
from mokume.model.filters import PreprocessingFilterConfig
# Generate example configuration file
generate_example_config("filters.yaml")
# Load configuration from file
config = load_filter_config("filters.yaml")
# Create configuration programmatically
config = PreprocessingFilterConfig(
name="custom_filters",
enabled=True,
log_filtered_counts=True,
)
config.intensity.min_intensity = 1000.0
config.intensity.cv_threshold = 0.3
config.peptide.allowed_charge_states = [2, 3, 4]
config.peptide.exclude_modifications = ["Oxidation"]
config.protein.min_unique_peptides = 2
config.run_qc.max_missing_rate = 0.5
# Apply CLI-style overrides
config.apply_overrides({
"min_intensity": 500,
"charge_states": [2, 3],
"max_missing_rate": 0.3,
})
# Save configuration
save_filter_config(config, "my_filters.yaml")
# Create and use filter pipeline directly
pipeline = get_filter_pipeline(config)
import pandas as pd
df = pd.read_csv("peptides.csv")
filtered_df, results = pipeline.apply(df)
# Check filter results
for result in results:
print(f"{result.filter_name}: removed {result.removed_count} ({result.removal_rate:.1%})")
# Get pipeline summary
summary = pipeline.summary(results)
print(f"Total removed: {summary['total_removed']} / {summary['total_input']}")
# Use filters with peptide_normalization
from mokume.normalization.peptide import peptide_normalization
peptide_normalization(
parquet="features.parquet",
sdrf="experiment.sdrf.tsv",
output="peptides-filtered.csv",
nmethod="median",
pnmethod="globalMedian",
filter_config=config, # Pass filter configuration
)
Batch Correction
from mokume.io.parquet import combine_ibaq_tsv_files
from mokume.postprocessing.reshape import pivot_wider, pivot_longer
from mokume.postprocessing.batch_correction import apply_batch_correction
# Load and combine multiple TSV files
df = combine_ibaq_tsv_files("data/", pattern="*ibaq.tsv", sep="\t")
# Reshape to wide format (proteins x samples)
df_wide = pivot_wider(
df,
row_name="ProteinName",
col_name="SampleID",
values="Ibaq",
fillna=True
)
# Extract batch IDs from sample names
import pandas as pd
batch_ids = [name.split("-")[0] for name in df_wide.columns]
batch_ids = pd.factorize(batch_ids)[0]
# Apply ComBat batch correction
df_corrected = apply_batch_correction(df_wide, list(batch_ids), kwargs={})
# Reshape back to long format
df_long = pivot_longer(
df_corrected,
row_name="ProteinName",
col_name="SampleID",
values="IbaqCorrected"
)
Data Reshaping
from mokume.postprocessing.reshape import (
pivot_wider,
pivot_longer,
remove_samples_low_protein_number,
remove_missing_values,
describe_expression_metrics,
)
# Long to wide format
df_wide = pivot_wider(df, row_name="ProteinName", col_name="SampleID", values="Ibaq")
# Wide to long format
df_long = pivot_longer(df_wide, row_name="ProteinName", col_name="SampleID", values="Ibaq")
# Quality filtering
df_filtered = remove_samples_low_protein_number(df, min_protein_num=100)
df_filtered = remove_missing_values(df, missingness_percentage=20, expression_column="Ibaq")
# Get expression statistics
metrics = describe_expression_metrics(df)
Creating AnnData Objects
from mokume.io.parquet import create_anndata
# Create AnnData from long-format DataFrame
adata = create_anndata(
df,
obs_col="SampleID", # Observation (sample) column
var_col="ProteinName", # Variable (protein) column
value_col="Ibaq", # Main data values
layer_cols=["IbaqNorm", "IbaqLog", "IbaqBec"], # Additional layers
obs_metadata_cols=["Condition"], # Sample metadata
var_metadata_cols=["GeneName"], # Protein metadata
)
# Save to h5ad
adata.write("proteins.h5ad")
FASTA Handling
from mokume.io.fasta import (
load_fasta,
digest_protein,
extract_fasta,
get_protein_molecular_weights,
)
# Load FASTA file
proteins = load_fasta("proteome.fasta")
# Digest a single protein sequence
peptides = digest_protein(
sequence="MKWVTFISLLFLFSSAYS...",
enzyme="Trypsin",
min_aa=7,
max_aa=30,
)
# Extract info for specific proteins
unique_peptide_counts, mw_dict, found_proteins = extract_fasta(
fasta="proteome.fasta",
enzyme="Trypsin",
proteins=["P12345", "P67890"],
min_aa=7,
max_aa=30,
tpa=True,
)
# Get molecular weights
mw_dict = get_protein_molecular_weights("proteome.fasta", ["P12345", "P67890"])
Organism Metadata
from mokume.model.organism import OrganismDescription
# Get available organisms
organisms = OrganismDescription.registered_organisms()
# ['human', 'mouse', 'yeast', 'drome', 'caeel', 'schpo']
# Get organism description
human = OrganismDescription.get("human")
print(human.genome_size) # Genome size in base pairs
print(human.histone_entries) # List of histone protein accessions
Computed Values Reference
| Column | Description | Method |
|---|---|---|
Ibaq |
Total intensity / theoretical peptides | iBAQ |
IbaqNorm |
ibaq / sum(ibaq) per sample |
iBAQ |
IbaqLog |
10 + log10(IbaqNorm) |
iBAQ |
IbaqPpb |
IbaqNorm * 100,000,000 |
iBAQ |
IbaqBec |
Batch effect corrected | iBAQ + ComBat |
TPA |
NormIntensity / MolecularWeight |
iBAQ |
CopyNumber |
Protein copies per cell | ProteomicRuler |
Concentration[nM] |
Protein concentration | ProteomicRuler |
Top3Intensity |
Average of top 3 peptides | Top3 |
Top{N}Intensity |
Average of top N peptides | TopN |
MaxLFQIntensity |
MaxLFQ algorithm result | MaxLFQ |
DirectLFQIntensity |
DirectLFQ intensity traces | DirectLFQ |
SumIntensity |
Sum of all peptides | Sum |
Normalization Methods
Feature-Level Normalization (--nmethod)
| Method | Description |
|---|---|
median |
Normalize by median across MS runs |
mean |
Normalize by mean across MS runs |
iqr |
Normalize by interquartile range |
none |
Skip feature normalization |
Peptide-Level Normalization (--pnmethod)
| Method | Description |
|---|---|
globalMedian |
Adjust all samples to global median |
conditionMedian |
Adjust samples within each condition to median |
none |
Skip peptide normalization |
Preprocessing Filters
mokume provides a comprehensive filter system for quality control. Filters can be configured via YAML/JSON files or CLI options.
Filter Categories
Intensity Filters
| Filter | Parameter | Default | Description |
|---|---|---|---|
| MinIntensityFilter | min_intensity |
0.0 | Remove features below threshold |
| CVThresholdFilter | cv_threshold |
null | Max CV across replicates |
| ReplicateAgreementFilter | min_replicate_agreement |
1 | Min replicates with detection |
| QuantileFilter | quantile_lower/upper |
0.0/1.0 | Remove intensity outliers |
Peptide Filters
| Filter | Parameter | Default | Description |
|---|---|---|---|
| PeptideLengthFilter | min/max_peptide_length |
7/50 | Peptide length range |
| ChargeStateFilter | allowed_charge_states |
null | Allowed charges (e.g., [2,3,4]) |
| ModificationFilter | exclude_modifications |
[] | Remove specific modifications |
| MissedCleavageFilter | max_missed_cleavages |
null | Max missed cleavages |
| SearchScoreFilter | min_search_score |
null | Min search engine score |
| SequencePatternFilter | exclude_sequence_patterns |
[] | Regex patterns to exclude |
Protein Filters
| Filter | Parameter | Default | Description |
|---|---|---|---|
| ContaminantFilter | remove_contaminants/decoys |
true | Remove contaminants/decoys |
| MinPeptideFilter | min_unique_peptides |
2 | Min unique peptides per protein |
| ProteinFDRFilter | fdr_threshold |
0.01 | Protein-level FDR |
| CoverageFilter | min_coverage |
0.0 | Min sequence coverage |
| RazorPeptideFilter | razor_peptide_handling |
"keep" | Handle shared peptides |
Run/Sample QC Filters
| Filter | Parameter | Default | Description |
|---|---|---|---|
| RunIntensityFilter | min_total_intensity |
0.0 | Min total intensity per run |
| MinFeaturesFilter | min_identified_features |
0 | Min features per run |
| MissingRateFilter | max_missing_rate |
1.0 | Max missing value rate |
| SampleCorrelationFilter | min_sample_correlation |
null | Min replicate correlation |
Example Filter Configurations
We provide several pre-configured filter templates for common use cases in tests/example/filters/:
| Configuration | Use Case | Description |
|---|---|---|
basic_qc.yaml |
General QC | Minimal filtering for standard experiments |
stringent_filtering.yaml |
Publication | High-confidence results with strict thresholds |
tmt_labeling.yaml |
TMT/iTRAQ | Optimized for multiplexed labeling experiments |
dia_analysis.yaml |
DIA | Optimized for DIA-NN, Spectronaut analysis |
exploratory_analysis.yaml |
Exploration | Minimal filtering for data exploration |
Example: Basic QC Configuration
# basic_qc.yaml - Minimal filtering for standard experiments
name: basic_qc
enabled: true
intensity:
remove_zero_intensity: true
peptide:
min_peptide_length: 7
max_peptide_length: 50
protein:
min_unique_peptides: 2
remove_contaminants: true
remove_decoys: true
contaminant_patterns:
- CONTAMINANT
- ENTRAP
- DECOY
Use these configurations directly:
mokume features2peptides \
-p features.parquet \
--filter-config tests/example/filters/basic_qc.yaml \
-o peptides.csv
Data Processing Pipeline
Feature Processing (features2peptides)
- Parse protein identifiers and retain unique peptides
- Remove entries with empty intensity or condition
- Filter peptides by minimum amino acids
- Remove low-confidence proteins (< min_unique peptides)
- Optionally remove decoys, contaminants, and specified proteins
- Normalize at feature level between MS runs
- Merge peptidoforms across fractions and technical replicates
- Normalize at sample level
- Remove low-frequency peptides
- Assemble peptidoforms to peptides
- Optional log2 transformation
iBAQ Calculation
- Load peptide intensity data
- Extract protein info from FASTA (theoretical peptide counts, MW)
- Group peptide intensities by protein, sample, and condition
- Sum protein intensities within each group
- Normalize by detected peptide count
- Divide by theoretical peptide count
- Optional: Calculate TPA, copy number, concentration
Citation
Zheng P, Audain E, Webel H, Dai C, Klein J, Hitz MP, Sachsenberg T, Bai M, Perez-Riverol Y. Ibaqpy: A scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J Proteomics. 2025;317:105440. doi: 10.1016/j.jprot.2025.105440.
Wang H, Dai C, Pfeuffer J, Sachsenberg T, Sanchez A, Bai M, Perez-Riverol Y. Tissue-based absolute quantification using large-scale TMT and LFQ experiments. Proteomics. 2023;23(20):e2300188. doi: 10.1002/pmic.202300188.
License
MIT License - see LICENSE for details.
Credits
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mokume-0.1.0.tar.gz.
File metadata
- Download URL: mokume-0.1.0.tar.gz
- Upload date:
- Size: 87.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2ad876b470c6098e80647e1ae6f839270f6bff60b4b5d98b1511a317ac919fbd
|
|
| MD5 |
80efbd2a84478a451f8a41d2499c8ebe
|
|
| BLAKE2b-256 |
01920f590d7ba2ecf79a2276ff93d2d65da1399a0c574077767cb148edf420b2
|
File details
Details for the file mokume-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mokume-0.1.0-py3-none-any.whl
- Upload date:
- Size: 108.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ce4f8db5a1cc8b1d445c223a311cebfafcbb0db79fa17ac4cc6d802cde025cb5
|
|
| MD5 |
3568b115e297088a3f0a8b2a7c0e81dc
|
|
| BLAKE2b-256 |
ea724425361b8cce797784be370b1b5737f0e7e93dc8c116b2c38a14b79657f0
|