
Common lightweight PyTerrier adapters for specific artifact shapes - query and document processors for modular IR pipelines


MP-Procs: Modular Pipeline Query and Document Processors

A PyTerrier-compatible package providing modular query and document processing components for information retrieval pipelines.

Overview

MP-Procs offers a collection of transformers that can be seamlessly integrated into PyTerrier pipelines to enhance query processing and document indexing. The package is designed for experimental IR research, allowing researchers to easily combine and evaluate different processing strategies.

Features

Query Processors (mp_procs.qproc)

  • Segmentation-based Query Enhancement

    • weighted_segmentation_boost: Boost terms in multi-word segments
    • append_segmentation_with_or: Append segments using logical OR with #syn/#band operators
    • synonym_segmentation: Treat segments as synonym groups with curly braces
  • Query Intelligence

    • intent_trigger_weighted: Add intent-specific trigger phrases based on predicted query intent
    • single_rare_term_emphasis_weighted: Emphasize rare terms based on IDF statistics
  • Text Preprocessing

    • sanitize_column_transform: Remove special characters and clean query text
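The query processors above all follow the same pattern: they rewrite the query column of a PyTerrier topics DataFrame. As an illustration of the idea behind segment boosting (a minimal sketch in plain pandas, not the package's implementation), terms inside multi-word segments can be boosted with Terrier's term^weight syntax:

```python
import pandas as pd

# Illustrative sketch only: given a topics DataFrame with a "segmentation"
# column holding a list of segments per query, boost every term that
# appears inside a multi-word segment using Terrier's term^weight syntax.
def boost_segments(topics: pd.DataFrame, boost_weight: float = 1.2,
                   seg_col: str = "segmentation") -> pd.DataFrame:
    if seg_col not in topics.columns:   # mirror the package's contract:
        return topics                   # missing column -> unchanged input
    topics = topics.copy()

    def rewrite(row):
        boosted = set()
        for segment in row[seg_col]:
            if len(segment.split()) > 1:          # multi-word segment
                boosted.update(segment.split())
        return " ".join(f"{t}^{boost_weight}" if t in boosted else t
                        for t in row["query"].split())

    topics["query"] = topics.apply(rewrite, axis=1)
    return topics

topics = pd.DataFrame([{"qid": "1",
                        "query": "new york pizza",
                        "segmentation": ["new york", "pizza"]}])
print(boost_segments(topics)["query"].iloc[0])
# new^1.2 york^1.2 pizza
```

Here "pizza" is a single-word segment, so it is left unweighted, while the terms of "new york" each receive the boost.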

Document Processors (mp_procs.dproc)

  • Query Generation

    • append_query_gen: Generate additional queries using DocT5Query
  • Content Enhancement

    • process_keyphrases: Extract and append keyphrases to document content
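Document processors operate on the corpus iterator side: they consume dicts with docno and text fields and yield enriched copies before indexing. A minimal sketch of the keyphrase-appending idea (a stand-in dict plays the role of the keyphrase artifact; this is not the package's code):

```python
# Illustrative sketch only: append per-document keyphrases to the text
# field of a PyTerrier-style corpus iterator. `repeat` controls how many
# times the keyphrases are appended, boosting their term frequency.
def append_keyphrases(corpus_iter, keyphrases, repeat=1):
    for doc in corpus_iter:
        phrases = keyphrases.get(doc["docno"])
        if phrases:
            suffix = " ".join(phrases)
            doc = {**doc, "text": doc["text"] + (" " + suffix) * repeat}
        yield doc

docs = [{"docno": "d1", "text": "neural ranking models"}]
out = list(append_keyphrases(docs, {"d1": ["neural ranking"]}, repeat=2))
print(out[0]["text"])
# neural ranking models neural ranking neural ranking
```

Repeating the appended text is a common trick to weight the extra evidence more heavily in TF-based models such as BM25.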

Installation

# Install from local source
cd /path/to/thesis_code/mp_procs
pip install -e .

Requirements

  • python-terrier >= 0.10.0
  • pyterrier-alpha
  • pandas

Pipeline Integration

Complete IR Pipeline Example

import pyterrier as pt
import pyterrier_alpha as pta
from mp_procs.qproc import weighted_segmentation_boost, sanitize_column_transform
from mp_procs.dproc import process_keyphrases

# Load dataset
dataset = pt.get_dataset("irds:trec-robust04")
topics, qrels = dataset.get_topics("title"), dataset.get_qrels()

# Query processing
query_artifact = pta.Artifact.from_url("tira:disks45/nocr/trec-robust-2004/ows/query-segmentation-hyb-a")
query_enhancer = weighted_segmentation_boost(boost_weight=1.2)
query_pipeline = query_artifact >> query_enhancer

# Document processing and indexing
# (keyphrase_artifact is assumed to be a previously loaded pta.Artifact)
keyphrase_proc = process_keyphrases(artifact=keyphrase_artifact, repeat=2)
indexer = pt.IterDictIndexer("./enhanced_index", overwrite=True, meta={"docno": 20})
index_ref = (keyphrase_proc >> indexer).index(dataset.get_corpus_iter())

# Retrieval
retriever = pt.terrier.Retriever(index_ref, wmodel="BM25")

# Complete pipeline
pipeline = query_pipeline >> retriever

# Evaluate
results = pipeline.transform(topics)
evaluation = pt.Evaluate(results, qrels, metrics=["map", "ndcg_cut_10"])
print(evaluation)

Configuration Options

Query Processors

| Function | Key Parameters | Description |
| --- | --- | --- |
| weighted_segmentation_boost | boost_weight=1.2, seg_col="segmentation" | Boost factor for segment terms |
| append_segmentation_with_or | seg_col="segmentation" | Use #syn/#band operators |
| synonym_segmentation | seg_col="segmentation" | Create {synonym groups} |
| intent_trigger_weighted | trigger_weight=1.5, intent_col="intent_prediction" | Weight for trigger phrases |
| single_rare_term_emphasis_weighted | emphasis_weight=1.5, avg_idf_low=5.0 | Rare term boosting thresholds |
| sanitize_column_transform | source_col="query", target_col="query" | Column mapping for cleaning |
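The source_col/target_col mapping of the sanitize step can be pictured with a small pandas sketch (assumed behaviour for illustration, not the package source): special characters are replaced and whitespace is collapsed, which keeps the query safe for Terrier's query parser.

```python
import re
import pandas as pd

# Illustrative sketch only: strip characters that are unsafe in Terrier
# query syntax from source_col and write the cleaned text to target_col.
def sanitize(topics, source_col="query", target_col="query"):
    if source_col not in topics.columns:   # missing column -> unchanged
        return topics
    topics = topics.copy()
    topics[target_col] = (topics[source_col]
                          .str.replace(r"[^\w\s]", " ", regex=True)  # drop punctuation
                          .str.split().str.join(" "))                # collapse whitespace
    return topics

topics = pd.DataFrame([{"qid": "1", "query": "what's  C++?"}])
print(sanitize(topics)["query"].iloc[0])
# what s C
```

With a distinct target_col (e.g. target_col="query_clean"), the raw query is preserved alongside the cleaned one.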

Document Processors

| Function | Key Parameters | Description |
| --- | --- | --- |
| append_query_gen | repeat=1 | Number of generated queries to append |
| process_keyphrases | repeat=1 | Number of keyphrase extractions |

Entry Points

The package registers entry points for automatic discovery:

[project.entry-points."modpipe.qproc"]
SegmentationWeighted = "mp_procs.qproc:weighted_segmentation_boost"
SegmentationAppendOr = "mp_procs.qproc:append_segmentation_with_or"
SegmentationSynonyms = "mp_procs.qproc:synonym_segmentation"
PredictorBoost = "mp_procs.qproc:single_rare_term_emphasis_weighted"
IntentTriggerAppend = "mp_procs.qproc:intent_trigger_weighted"
SpellingModification = "mp_procs.qproc:sanitize_column_transform"

[project.entry-points."modpipe.dproc"]
DocT5Query = "mp_procs.dproc:append_query_gen"
KeyphraseExtraction = "mp_procs.dproc:process_keyphrases"

Testing

Run the test suite:

cd mp_procs
python -m pytest tests/ -v

Or run specific tests:

python -m unittest tests.test_queryprocessor.TestQueryProcessors.test_weighted_segmentation_boost_basic

Error Handling

All processors gracefully handle missing columns and invalid data:

  • Missing required columns → return DataFrame unchanged
  • Empty/null values → skip processing for those rows
  • Invalid data types → attempt conversion or skip gracefully
  • Chain compatibility → all processors accept and return pandas DataFrames
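The row-level skip behaviour described above can be sketched as follows (assumed behaviour for illustration, not the package source): rows with empty or null values pass through untouched instead of raising.

```python
import pandas as pd

# Illustrative sketch only: apply a transformation to the "query" column
# while skipping missing or empty values, and return the input unchanged
# if the column itself is absent.
def uppercase_queries(topics: pd.DataFrame) -> pd.DataFrame:
    if "query" not in topics.columns:      # missing column -> unchanged
        return topics
    out = topics.copy()
    mask = out["query"].notna() & (out["query"].str.strip() != "")
    out.loc[mask, "query"] = out.loc[mask, "query"].str.upper()
    return out

topics = pd.DataFrame({"qid": ["1", "2"], "query": ["apple", None]})
print(uppercase_queries(topics)["query"].tolist())
# ['APPLE', None]
```

Because every processor both accepts and returns a DataFrame, a failure to enrich one row never breaks the rest of the chain.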

Contributing

This package is part of a research thesis on modular IR pipelines. For issues or contributions:

  1. Follow PyTerrier transformer conventions
  2. Add comprehensive tests for new processors
  3. Update entry points in pyproject.toml
  4. Maintain backward compatibility

License

This project is part of academic research. Please cite appropriately if used in publications.

Citation

If you use this package in your research, please cite:

@misc{mp_procs2025,
  title={MP-Procs: Modular Pipeline Processors for Information Retrieval},
  author={Patrick Stahl},
  year={2025},
  note={Thesis code for modular IR pipeline research}
}
