
Common lightweight PyTerrier adapters for specific artifact shapes - query and document processors for modular IR pipelines


MP-Procs: Modular Pipeline Query and Document Processors

A PyTerrier-compatible package providing modular query and document processing components for information retrieval pipelines.

Overview

MP-Procs offers a collection of transformers that can be seamlessly integrated into PyTerrier pipelines to enhance query processing and document indexing. The package is designed for experimental IR research, allowing researchers to easily combine and evaluate different processing strategies.

Features

Query Processors (mp_procs.qproc)

  • Segmentation-based Query Enhancement

    • weighted_segmentation_boost: Boost terms in multi-word segments
    • append_segmentation_with_or: Append segments using logical OR with #syn/#band operators
    • synonym_segmentation: Treat segments as synonym groups with curly braces
  • Query Intelligence

    • intent_trigger_weighted: Add intent-specific trigger phrases based on predicted query intent
    • single_rare_term_emphasis_weighted: Emphasize rare terms based on IDF statistics
  • Text Preprocessing

    • sanitize_column_transform: Remove special characters and clean query text
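The query processors above all follow the same pattern: they rewrite the query column of a PyTerrier topics DataFrame. As an illustration of the idea behind segment boosting (a minimal sketch in plain pandas, not the package's implementation), terms inside multi-word segments can be boosted with Terrier's term^weight syntax:

```python
import pandas as pd

# Illustrative sketch only: given a topics DataFrame with a "segmentation"
# column holding a list of segments per query, boost every term that
# appears inside a multi-word segment using Terrier's term^weight syntax.
def boost_segments(topics: pd.DataFrame, boost_weight: float = 1.2,
                   seg_col: str = "segmentation") -> pd.DataFrame:
    if seg_col not in topics.columns:   # mirror the package's contract:
        return topics                   # missing column -> unchanged input
    topics = topics.copy()

    def rewrite(row):
        boosted = set()
        for segment in row[seg_col]:
            if len(segment.split()) > 1:          # multi-word segment
                boosted.update(segment.split())
        return " ".join(f"{t}^{boost_weight}" if t in boosted else t
                        for t in row["query"].split())

    topics["query"] = topics.apply(rewrite, axis=1)
    return topics

topics = pd.DataFrame([{"qid": "1",
                        "query": "new york pizza",
                        "segmentation": ["new york", "pizza"]}])
print(boost_segments(topics)["query"].iloc[0])
# new^1.2 york^1.2 pizza
```

Here "pizza" is a single-word segment, so it is left unweighted, while the terms of "new york" each receive the boost.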

Document Processors (mp_procs.dproc)

  • Query Generation

    • append_query_gen: Generate additional queries using DocT5Query
  • Content Enhancement

    • process_keyphrases: Extract and append keyphrases to document content
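Document processors operate on the corpus iterator side: they consume dicts with docno and text fields and yield enriched copies before indexing. A minimal sketch of the keyphrase-appending idea (a stand-in dict plays the role of the keyphrase artifact; this is not the package's code):

```python
# Illustrative sketch only: append per-document keyphrases to the text
# field of a PyTerrier-style corpus iterator. `repeat` controls how many
# times the keyphrases are appended, boosting their term frequency.
def append_keyphrases(corpus_iter, keyphrases, repeat=1):
    for doc in corpus_iter:
        phrases = keyphrases.get(doc["docno"])
        if phrases:
            suffix = " ".join(phrases)
            doc = {**doc, "text": doc["text"] + (" " + suffix) * repeat}
        yield doc

docs = [{"docno": "d1", "text": "neural ranking models"}]
out = list(append_keyphrases(docs, {"d1": ["neural ranking"]}, repeat=2))
print(out[0]["text"])
# neural ranking models neural ranking neural ranking
```

Repeating the appended text is a common trick to weight the extra evidence more heavily in TF-based models such as BM25.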

Installation

# Install from local source
cd /path/to/thesis_code/mp_procs
pip install -e .

Requirements

  • python-terrier >= 0.10.0
  • pyterrier-alpha
  • pandas

Pipeline Integration

Complete IR Pipeline Example

import pyterrier as pt
import pyterrier_alpha as pta
from mp_procs.qproc import weighted_segmentation_boost, sanitize_column_transform
from mp_procs.dproc import process_keyphrases

# Load dataset
dataset = pt.get_dataset("irds:trec-robust04")
topics, qrels = dataset.get_topics("title"), dataset.get_qrels()

# Query processing
query_artifact = pta.Artifact.from_url("tira:disks45/nocr/trec-robust-2004/ows/query-segmentation-hyb-a")
query_enhancer = weighted_segmentation_boost(boost_weight=1.2)
query_pipeline = query_artifact >> query_enhancer

# Document processing and indexing
# (keyphrase_artifact is assumed to be a previously loaded pta.Artifact)
keyphrase_proc = process_keyphrases(artifact=keyphrase_artifact, repeat=2)
indexer = pt.IterDictIndexer("./enhanced_index", overwrite=True, meta={"docno": 20})
index_ref = (keyphrase_proc >> indexer).index(dataset.get_corpus_iter())

# Retrieval
retriever = pt.terrier.Retriever(index_ref, wmodel="BM25")

# Complete pipeline
pipeline = query_pipeline >> retriever

# Evaluate
results = pipeline.transform(topics)
evaluation = pt.Evaluate(results, qrels, metrics=["map", "ndcg_cut_10"])
print(evaluation)

Configuration Options

Query Processors

| Function | Key Parameters | Description |
| --- | --- | --- |
| weighted_segmentation_boost | boost_weight=1.2, seg_col="segmentation" | Boost factor for segment terms |
| append_segmentation_with_or | seg_col="segmentation" | Use #syn/#band operators |
| synonym_segmentation | seg_col="segmentation" | Create {synonym groups} |
| intent_trigger_weighted | trigger_weight=1.5, intent_col="intent_prediction" | Weight for trigger phrases |
| single_rare_term_emphasis_weighted | emphasis_weight=1.5, avg_idf_low=5.0 | Rare term boosting thresholds |
| sanitize_column_transform | source_col="query", target_col="query" | Column mapping for cleaning |
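The source_col/target_col mapping of the sanitize step can be pictured with a small pandas sketch (assumed behaviour for illustration, not the package source): special characters are replaced and whitespace is collapsed, which keeps the query safe for Terrier's query parser.

```python
import re
import pandas as pd

# Illustrative sketch only: strip characters that are unsafe in Terrier
# query syntax from source_col and write the cleaned text to target_col.
def sanitize(topics, source_col="query", target_col="query"):
    if source_col not in topics.columns:   # missing column -> unchanged
        return topics
    topics = topics.copy()
    topics[target_col] = (topics[source_col]
                          .str.replace(r"[^\w\s]", " ", regex=True)  # drop punctuation
                          .str.split().str.join(" "))                # collapse whitespace
    return topics

topics = pd.DataFrame([{"qid": "1", "query": "what's  C++?"}])
print(sanitize(topics)["query"].iloc[0])
# what s C
```

With a distinct target_col (e.g. target_col="query_clean"), the raw query is preserved alongside the cleaned one.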

Document Processors

| Function | Key Parameters | Description |
| --- | --- | --- |
| append_query_gen | repeat=1 | Number of generated queries to append |
| process_keyphrases | repeat=1 | Number of keyphrase extractions |

Entry Points

The package registers entry points for automatic discovery:

[project.entry-points."modpipe.qproc"]
SegmentationWeighted = "mp_procs.qproc:weighted_segmentation_boost"
SegmentationAppendOr = "mp_procs.qproc:append_segmentation_with_or"
SegmentationSynonyms = "mp_procs.qproc:synonym_segmentation"
PredictorBoost = "mp_procs.qproc:single_rare_term_emphasis_weighted"
IntentTriggerAppend = "mp_procs.qproc:intent_trigger_weighted"
SpellingModification = "mp_procs.qproc:sanitize_column_transform"

[project.entry-points."modpipe.dproc"]
DocT5Query = "mp_procs.dproc:append_query_gen"
KeyphraseExtraction = "mp_procs.dproc:process_keyphrases"

Testing

Run the test suite:

cd mp_procs
python -m pytest tests/ -v

Or run specific tests:

python -m unittest tests.test_queryprocessor.TestQueryProcessors.test_weighted_segmentation_boost_basic

Error Handling

All processors gracefully handle missing columns and invalid data:

  • Missing required columns → return DataFrame unchanged
  • Empty/null values → skip processing for those rows
  • Invalid data types → attempt conversion or skip gracefully
  • Chain compatibility → all processors accept and return pandas DataFrames
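The row-level skip behaviour described above can be sketched as follows (assumed behaviour for illustration, not the package source): rows with empty or null values pass through untouched instead of raising.

```python
import pandas as pd

# Illustrative sketch only: apply a transformation to the "query" column
# while skipping missing or empty values, and return the input unchanged
# if the column itself is absent.
def uppercase_queries(topics: pd.DataFrame) -> pd.DataFrame:
    if "query" not in topics.columns:      # missing column -> unchanged
        return topics
    out = topics.copy()
    mask = out["query"].notna() & (out["query"].str.strip() != "")
    out.loc[mask, "query"] = out.loc[mask, "query"].str.upper()
    return out

topics = pd.DataFrame({"qid": ["1", "2"], "query": ["apple", None]})
print(uppercase_queries(topics)["query"].tolist())
# ['APPLE', None]
```

Because every processor both accepts and returns a DataFrame, a failure to enrich one row never breaks the rest of the chain.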

Contributing

This package is part of a research thesis on modular IR pipelines. For issues or contributions:

  1. Follow PyTerrier transformer conventions
  2. Add comprehensive tests for new processors
  3. Update entry points in pyproject.toml
  4. Maintain backward compatibility

License

This project is part of academic research. Please cite appropriately if used in publications.

Citation

If you use this package in your research, please cite:

@misc{mp_procs2025,
  title={MP-Procs: Modular Pipeline Processors for Information Retrieval},
  author={Patrick Stahl},
  year={2025},
  note={Thesis code for modular IR pipeline research}
}
