Common lightweight PyTerrier adapters for specific artifact shapes - query and document processors for modular IR pipelines
MP-Procs: Modular Pipeline Query and Document Processors
A PyTerrier-compatible package providing modular query and document processing components for information retrieval pipelines.
Overview
MP-Procs offers a collection of transformers that can be seamlessly integrated into PyTerrier pipelines to enhance query processing and document indexing. The package is designed for experimental IR research, allowing researchers to easily combine and evaluate different processing strategies.
Features
Query Processors (mp_procs.qproc)
- Segmentation-based Query Enhancement
  - weighted_segmentation_boost: Boost terms in multi-word segments
  - append_segmentation_with_or: Append segments using logical OR with #syn/#band operators
  - synonym_segmentation: Treat segments as synonym groups with curly braces
- Query Intelligence
  - intent_trigger_weighted: Add intent-specific trigger phrases based on predicted query intent
  - single_rare_term_emphasis_weighted: Emphasize rare terms based on IDF statistics
- Text Preprocessing
  - sanitize_column_transform: Remove special characters and clean query text
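To make the segmentation-based rewriting concrete, here is a pandas-only sketch of what a weighted segmentation boost might produce. The function name, the `^weight` Terrier-style per-term syntax, and the example data are illustrative assumptions; the actual mp_procs implementation may differ.

```python
import pandas as pd

def boost_segments(df, boost_weight=1.2, seg_col="segmentation"):
    # Illustrative sketch: give every term inside a multi-word segment a
    # Terrier-style weight (term^weight); single-word segments pass through.
    def rewrite(row):
        boosted = []
        for seg in row[seg_col]:
            words = seg.split()
            if len(words) > 1:
                boosted.extend(f"{w}^{boost_weight}" for w in words)
            else:
                boosted.extend(words)
        return " ".join(boosted)

    out = df.copy()
    out["query"] = out.apply(rewrite, axis=1)
    return out

topics = pd.DataFrame({
    "qid": ["1"],
    "query": ["hubble telescope achievements"],
    "segmentation": [["hubble telescope", "achievements"]],
})
print(boost_segments(topics)["query"].iloc[0])
# hubble^1.2 telescope^1.2 achievements
```

The key point is that the transformer consumes and produces a plain pandas DataFrame, so it composes with any other PyTerrier pipeline stage.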
Document Processors (mp_procs.dproc)
- Query Generation
  - append_query_gen: Generate additional queries using DocT5Query
- Content Enhancement
  - process_keyphrases: Extract and append keyphrases to document content
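The document-side enrichment follows the same DataFrame-in, DataFrame-out pattern. The sketch below is an assumption about the general shape of process_keyphrases, not its actual code: it joins precomputed keyphrases onto documents by `docno` and appends them `repeat` times (in the real package, the keyphrases come from a pyterrier-alpha artifact).

```python
import pandas as pd

def append_keyphrases(docs, keyphrases, repeat=1):
    # Illustrative sketch: merge precomputed keyphrases by docno and append
    # them to the document text `repeat` times; docs without keyphrases
    # pass through unchanged.
    merged = docs.merge(keyphrases, on="docno", how="left")

    def enrich(row):
        if isinstance(row["keyphrases"], list):
            extra = " ".join(row["keyphrases"])
            return row["text"] + (" " + extra) * repeat
        return row["text"]

    merged["text"] = merged.apply(enrich, axis=1)
    return merged[["docno", "text"]]

docs = pd.DataFrame({"docno": ["d1"], "text": ["neural ranking models"]})
phrases = pd.DataFrame({"docno": ["d1"], "keyphrases": [["neural ranking"]]})
print(append_keyphrases(docs, phrases, repeat=2)["text"].iloc[0])
# neural ranking models neural ranking neural ranking
```

Repeating keyphrases is a simple way to upweight them in term-frequency-based models such as BM25.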
Installation
# Install from local source
cd /path/to/thesis_code/mp_procs
pip install -e .
Requirements
python-terrier >= 0.10.0
pyterrier-alpha
pandas
Pipeline Integration
Complete IR Pipeline Example
import pyterrier as pt
import pyterrier_alpha as pta
from mp_procs.qproc import weighted_segmentation_boost, sanitize_column_transform
from mp_procs.dproc import process_keyphrases
# Load dataset
dataset = pt.get_dataset("irds:disks45/nocr/trec-robust-2004")
topics, qrels = dataset.get_topics("title"), dataset.get_qrels()
# Query processing
query_artifact = pta.Artifact.from_url("tira:disks45/nocr/trec-robust-2004/ows/query-segmentation-hyb-a")
query_enhancer = weighted_segmentation_boost(boost_weight=1.2)
query_pipeline = query_artifact >> query_enhancer
# Document processing and indexing
# (keyphrase_artifact: a pyterrier-alpha keyphrase artifact, loaded beforehand via pta.Artifact.from_url)
keyphrase_proc = process_keyphrases(artifact=keyphrase_artifact, repeat=2)
indexer = pt.IterDictIndexer("./enhanced_index", overwrite=True, meta={"docno": 20})
index_ref = (keyphrase_proc >> indexer).index(dataset.get_corpus_iter())
# Retrieval
retriever = pt.terrier.Retriever(index_ref, wmodel="BM25")
# Complete pipeline
pipeline = query_pipeline >> retriever
# Evaluate
results = pipeline.transform(topics)
evaluation = pt.Evaluate(results, qrels, metrics=["map", "ndcg_cut_10"])
print(evaluation)
Configuration Options
Query Processors
| Function | Key Parameters | Description |
|---|---|---|
| weighted_segmentation_boost | boost_weight=1.2, seg_col="segmentation" | Boost factor for segment terms |
| append_segmentation_with_or | seg_col="segmentation" | Use #syn/#band operators |
| synonym_segmentation | seg_col="segmentation" | Create {synonym groups} |
| intent_trigger_weighted | trigger_weight=1.5, intent_col="intent_prediction" | Weight for trigger phrases |
| single_rare_term_emphasis_weighted | emphasis_weight=1.5, avg_idf_low=5.0 | Rare term boosting thresholds |
| sanitize_column_transform | source_col="query", target_col="query" | Column mapping for cleaning |
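As one concrete example of these options, a pandas-only sketch of what sanitize_column_transform might do with its source_col/target_col mapping is shown below. The exact character set stripped by mp_procs is an assumption here; the idea is to remove punctuation that clashes with the Terrier query parser.

```python
import pandas as pd

def sanitize_column(df, source_col="query", target_col="query"):
    # Illustrative sketch: replace non-word, non-space characters with
    # spaces, then collapse runs of whitespace. Writing to target_col lets
    # callers keep the raw query in a separate column if they wish.
    out = df.copy()
    cleaned = out[source_col].str.replace(r"[^\w\s]", " ", regex=True)
    out[target_col] = cleaned.str.split().str.join(" ")
    return out

topics = pd.DataFrame({"qid": ["1"], "query": ["what's new? (2024)"]})
print(sanitize_column(topics)["query"].iloc[0])
# what s new 2024
```

Setting target_col to a different name (e.g. "query_clean") preserves the original query for later pipeline stages.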
Document Processors
| Function | Key Parameters | Description |
|---|---|---|
| append_query_gen | repeat=1 | Number of generated queries to append |
| process_keyphrases | repeat=1 | Number of keyphrase extractions |
Entry Points
The package registers entry points for automatic discovery:
[project.entry-points."modpipe.qproc"]
SegmentationWeighted = "mp_procs.qproc:weighted_segmentation_boost"
SegmentationAppendOr = "mp_procs.qproc:append_segmentation_with_or"
SegmentationSynonyms = "mp_procs.qproc:synonym_segmentation"
PredictorBoost = "mp_procs.qproc:single_rare_term_emphasis_weighted"
IntentTriggerAppend = "mp_procs.qproc:intent_trigger_weighted"
SpellingModification = "mp_procs.qproc:sanitize_column_transform"
[project.entry-points."modpipe.dproc"]
DocT5Query = "mp_procs.dproc:append_query_gen"
KeyphraseExtraction = "mp_procs.dproc:process_keyphrases"
Testing
Run the test suite:
cd mp_procs
python -m pytest tests/ -v
Or run specific tests:
python -m unittest tests.test_queryprocessor.TestQueryProcessors.test_weighted_segmentation_boost_basic
Error Handling
All processors gracefully handle missing columns and invalid data:
- Missing required columns → return DataFrame unchanged
- Empty/null values → skip processing for those rows
- Invalid data types → attempt conversion or skip gracefully
- Chain compatibility → all processors accept and return pandas DataFrames
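The missing-column behavior above boils down to a simple guard pattern, sketched here in plain pandas (the required column names and the placeholder processing step are assumptions for illustration):

```python
import pandas as pd

def safe_transform(df, required=("query", "segmentation")):
    # Guard pattern: if any required column is absent, return the frame
    # unchanged instead of raising, so the pipeline keeps flowing.
    if not all(col in df.columns for col in required):
        return df
    out = df.copy()
    out["query"] = out["query"].str.lower()  # placeholder processing step
    return out

topics = pd.DataFrame({"qid": ["1"], "query": ["Hubble"]})  # no segmentation column
print(safe_transform(topics) is topics)  # True: frame returned unchanged
```

Returning the input object untouched makes processors safe to chain even when an upstream artifact failed to attach its column.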
Contributing
This package is part of a research thesis on modular IR pipelines. For issues or contributions:
- Follow PyTerrier transformer conventions
- Add comprehensive tests for new processors
- Update entry points in pyproject.toml
- Maintain backward compatibility
License
This project is part of academic research. Please cite appropriately if used in publications.
Citation
If you use this package in your research, please cite:
@misc{mp_procs2025,
  title={MP-Procs: Modular Pipeline Processors for Information Retrieval},
  author={Stahl, Patrick},
  year={2025},
  note={Thesis code for modular IR pipeline research}
}