PyTerrier components for doc2query

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python
Topic
- Text Processing
- Text Processing :: Indexing

Project description

PyTerrier_doc2query

New: Check out our interactive demo on 🤗 HuggingFace Spaces

New: Improve effectiveness and efficiency using Doc2Query−−

This is the PyTerrier plugin for the docTTTTTquery [Nogueira20] and Doc2Query−− [Gospodinov23] approaches for document expansion by query prediction.

Installation

This repository can be installed using pip.

pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git

What does it do?

A Doc2Query object has a transform() function, which takes the text of each document, and suggests questions for that text.

sample_doc = "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated"

import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query()
doc2query([{"docno" : "d1", "text" : sample_doc}])

The resulting dataframe will have an additional "querygen" column, which contains the generated queries, such as:

docno	querygen
"d1"	'what was the importance of the manhattan project to the united states atom project? what influenced the success of the united states why was the manhattan project a success? why was it important'

Because Doc2Query is a PyTerrier transformer, there are many ways to introduce it into a PyTerrier retrieval process.

By default, the plugin loads macavaney/doc2query-t5-base-msmarco, which is a version of the checkpoint released by the original authors, converted to PyTorch format. To load another T5 model, pass its Hugging Face model name (or a path to the model on the file system) as the first argument:

doc2query = pyterrier_doc2query.Doc2Query('some/other/model')

Using Doc2Query for Indexing

Indexing is as easy as instantiating the Doc2Query object and a PyTerrier indexer:

import pyterrier as pt
dataset = pt.get_dataset("irds:vaswani")
import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query(append=True) # append generated queries to the original document text
index_loc = './index' # any writable directory for the new index
indexer = doc2query >> pt.IterDictIndexer(index_loc)
indexer.index(dataset.get_corpus_iter())
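
To make the effect of append=True concrete, here is a minimal, library-free sketch of the expansion step. The helper name append_queries and the newline separator are assumptions made for illustration; they are not taken from the plugin's internals.

```python
def append_queries(doc, text_field="text", querygen_field="querygen"):
    """Return a copy of doc whose text also contains the generated queries.

    Hypothetical helper: the newline separator is an assumption.
    """
    expanded = dict(doc)  # copy so the input record is left untouched
    expanded[text_field] = doc[text_field] + "\n" + doc[querygen_field]
    return expanded


doc = {
    "docno": "d1",
    "text": "The Manhattan Project relied on communication among scientists.",
    "querygen": "why was the manhattan project a success?",
}
expanded = append_queries(doc)
print(expanded["text"])
```

Indexing the expanded text means that terms appearing only in the generated queries can still match the document at retrieval time.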

Doc2Query−−: When Less is More

The performance of Doc2Query can be significantly improved by removing queries that are not relevant to the documents that generated them. This involves first scoring the generated queries (using QueryScorer) and then filtering out the least relevant ones (using QueryFilter).

import pyterrier as pt
from pyterrier_doc2query import Doc2Query, QueryScorer, QueryFilter
from pyterrier_dr import ElectraScorer

doc2query = Doc2Query(append=False, num_samples=5)
scorer = ElectraScorer()
indexer = pt.IterDictIndexer('./index')
pipeline = doc2query >> QueryScorer(scorer) >> QueryFilter(t=3.21484375) >> indexer # t=3.21484375 is the 70th percentile for generated queries on MS MARCO

pipeline.index(dataset.get_corpus_iter())
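
The score-then-filter idea can be sketched without the library. The example below picks a threshold at a given percentile of the scores and keeps only the queries at or above it. The function names, the nearest-rank percentile method, the inclusive comparison, and the scores themselves are all assumptions made for illustration; the plugin's QueryFilter may differ in these details.

```python
import math


def percentile_threshold(scores, pct):
    """Nearest-rank percentile: the score at rank ceil(pct% of n) in sorted order."""
    ranked = sorted(scores)
    rank = max(1, math.ceil(pct * len(ranked) / 100))
    return ranked[rank - 1]


def filter_queries(queries, scores, t):
    """Keep the queries whose relevance score is at least t (inclusive, by assumption)."""
    return [q for q, s in zip(queries, scores) if s >= t]


# Made-up generated queries and relevance scores, for illustration only.
queries = [f"query {i}" for i in range(1, 11)]
scores = [float(i) for i in range(1, 11)]

t = percentile_threshold(scores, 70)
kept = filter_queries(queries, scores, t)
print(t, kept)
```

In practice the threshold would be computed once over scores from a large sample of generated queries and then reused as a fixed value of t, as in the pipeline above.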

We've also released pre-computed filter scores for various models on HuggingFace datasets.

Using Doc2Query for Retrieval

Doc2Query can also be used at retrieval time (i.e., on retrieved documents) rather than at indexing time.

import pyterrier as pt
import pyterrier_doc2query
doc2query = pyterrier_doc2query.Doc2Query()

dataset = pt.get_dataset("irds:vaswani")
bm25 = pt.terrier.Retriever.from_dataset("vaswani", "terrier_stemmed", wmodel="BM25")
bm25 >> pt.get_text(dataset) >> doc2query >> pt.text.scorer(body_attr="querygen", wmodel="BM25")

Examples

Check out the notebooks, which can also be run on Colab:

Vaswani [Github] [Colab]

Implementation Details

We use a PyTerrier transformer to rewrite documents by doc2query.
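
As a rough illustration of that transformer pattern, the sketch below rewrites each document record by attaching generated queries in a "querygen" field. ToyDoc2Query and fake_generate are hypothetical stand-ins (the real plugin uses a T5 model and PyTerrier's transformer base class), and joining the queries with newlines is an assumption.

```python
class ToyDoc2Query:
    """Illustrative stand-in for a doc2query-style transformer (not the plugin's class)."""

    def __init__(self, generate, num_samples=3):
        # generate: hypothetical callable (text, n) -> list of n query strings
        self.generate = generate
        self.num_samples = num_samples

    def transform(self, docs):
        """Return new records, each with an added 'querygen' field."""
        out = []
        for doc in docs:
            queries = self.generate(doc["text"], self.num_samples)
            out.append({**doc, "querygen": "\n".join(queries)})
        return out


def fake_generate(text, n):
    # Placeholder for a trained model: derives queries from the first word only.
    first_word = text.split()[0].lower()
    return [f"{first_word} query {i}" for i in range(1, n + 1)]


toy = ToyDoc2Query(fake_generate, num_samples=2)
result = toy.transform([{"docno": "d1", "text": "Terrier is a retrieval platform."}])
print(result[0]["querygen"])
```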

References

[Nogueira20]: Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
[Gospodinov23]: Mitko Gospodinov, Sean MacAvaney, and Craig Macdonald. Doc2Query−−: When Less is More. ECIR 2023.
[Macdonald20]: Craig Macdonald and Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271

Credits

Craig Macdonald, University of Glasgow
Sean MacAvaney, University of Glasgow
Mitko Gospodinov, University of Glasgow


Release history

- 0.2.0: Oct 10, 2025
- 0.1.1: Dec 14, 2024
- 0.1.0: Dec 12, 2024 (this version)

Download files

Source Distribution

pyterrier_doc2query-0.1.0.tar.gz (10.6 kB view details)

Uploaded Dec 12, 2024 Source

Built Distribution

pyterrier_doc2query-0.1.0-py3-none-any.whl (11.1 kB view details)

Uploaded Dec 12, 2024 Python 3

File details

Details for the file pyterrier_doc2query-0.1.0.tar.gz.

File metadata

Download URL: pyterrier_doc2query-0.1.0.tar.gz
Upload date: Dec 12, 2024
Size: 10.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyterrier_doc2query-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`84a7786364f3a89edfbcee46191b368d44fb214f20105a4e58885eed7d7ed09d`
MD5	`e6e4bd8e842dfdb567c99e13eae90701`
BLAKE2b-256	`e98e0f6c6ad85b7460e0ba8b1d44f36fd816a7b61ef9b89e902c7972a72ceaad`

File details

Details for the file pyterrier_doc2query-0.1.0-py3-none-any.whl.

File metadata

Download URL: pyterrier_doc2query-0.1.0-py3-none-any.whl
Upload date: Dec 12, 2024
Size: 11.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyterrier_doc2query-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a69919c8d23f47fbe7ee7546800f93b6fe479461988a2a1b60aedaaa5feaf64b`
MD5	`0498b114000d091475088cd6cc1f61de`
BLAKE2b-256	`e7c9e48a1730ece300c38b439d81ba9ac835509c587d1adecbd104dd9fce4c23`
