
Caching for PyTerrier

pyterrier-caching provides several components for caching intermediate results. The right component will depend on your use case.

Installation

Install this package using pip:

pip install pyterrier-caching

Caching Results when Indexing

IndexerCache saves the sequence of documents encountered in an indexing pipeline. It allows you to repeat that sequence, without needing to re-execute the computations up to that point.

Example use case: I want to test how different retrieval engines perform over learned sparse representations, but I don't want to re-compute the representations each time.

You use an IndexerCache the same way you would use an indexer: as the last component of a pipeline. Rather than building an index of the data, the IndexerCache will save your results to a file on disk. This file can be re-read by iterating over the cache object with iter(cache).

Example:

import pyterrier as pt
pt.init()
from pyterrier_caching import IndexerCache

# Setup
cache = IndexerCache('path/to/cache')
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'

# Use the IndexerCache cache object just as you would an indexer
cache_pipeline = MyExpensiveTransformer() >> cache

# The following line will save the results of MyExpensiveTransformer() to path/to/cache
cache_pipeline.index(dataset.get_corpus_iter())

# Now you can build multiple indexes over the results of MyExpensiveTransformer without
# needing to re-run it each time
indexer1 = ... # e.g., pt.IterDictIndexer('./path/to/index.terrier')
indexer1.index(iter(cache))
indexer2 = ... # e.g., pyterrier_pisa.PisaIndex('./path/to/index.pisa')
indexer2.index(iter(cache))


More Details

IndexerCache currently has one implementation, Lz4PickleIndexerCache, which is the default. Lz4PickleIndexerCache saves the sequence as a series of LZ4-compressed pickled dicts in the file data.pkl.lz4. Byte-level offsets for each document are stored as a numpy-compatible uint64 array in offsets.np. If the docno column is present, an npids structure is also stored, facilitating reverse lookups of documents by their docno.
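
For illustration, here is a minimal sketch of how a record could be read back directly from these files, assuming the layout described above (the read_record helper and the convention that the next offset marks the end of a record are assumptions; in practice, prefer iterating with iter(cache)):

import pickle
import lz4.frame
import numpy as np

# Hypothetical sketch: read the i-th record directly from the cache files.
# The exact on-disk layout is an implementation detail of Lz4PickleIndexerCache.
def read_record(cache_dir, i):
    offsets = np.fromfile(f'{cache_dir}/offsets.np', dtype=np.uint64)
    with open(f'{cache_dir}/data.pkl.lz4', 'rb') as f:
        f.seek(int(offsets[i]))
        raw = f.read(int(offsets[i + 1] - offsets[i]))  # assumes offsets[i+1] ends record i
    return pickle.loads(lz4.frame.decompress(raw))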

Caching Results from a Scorer

ScorerCache saves scores keyed on query and docno. When the same query-docno combination is encountered again, the score is read from the cache, avoiding re-computation.

Example use case: I want to test a neural relevance model over several first-stage retrieval models, but they bring back many of the same documents, so I don't want to re-compute the scores each time.

You use a ScorerCache in place of the scorer in a pipeline. It holds a reference to the scorer so that it can compute values that are missing from the cache. You need to "build" a ScorerCache before you can use it; building creates an internal mapping between each string docno and the integer index at which its score is stored.

⚠️ Important Caveats:

  • ScorerCache saves scores based only on the value of the query and docno. All other information (e.g., the text of the document) is ignored. This strategy makes it suitable only when each score depends on the query and docno of a single record (e.g., Mono-style models), not on pairwise or listwise comparisons (e.g., Duo-style models); see the sketch after this list.
  • ScorerCache only stores the result of the score column; all other outputs of the scorer are discarded. (A rank column is also output, but it is calculated by ScorerCache, not by the scorer.)
  • Due to limitations of HDF5, only a single process can have the cache open for writing at a time.
  • Scores are saved as float32 values. Other numeric types will be cast up/down to float32.
  • The value nan is reserved as an indicator that a value is missing from the cache; scorers should not return it.
  • A ScorerCache represents the combination of a single scorer and a single corpus. Do not use one cache across multiple scorers or corpora -- you'll get unexpected/invalid results.
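
To make the first caveat concrete, here is a minimal sketch of a pointwise (Mono-style) scorer that is safe to wrap in a ScorerCache, built with pt.apply.doc_score; my_model is a hypothetical placeholder for your own relevance model:

import pyterrier as pt
from pyterrier_caching import ScorerCache

# Safe to cache: each row is scored independently, so the score is fully
# determined by that row's query and document.
mono_scorer = pt.apply.doc_score(lambda row: my_model.score(row['query'], row['text']))
cached_scorer = ScorerCache('path/to/cache', mono_scorer)

# By contrast, a Duo-style model scores documents relative to one another,
# so its output depends on which other documents are present and should not
# be cached this way.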

Example:

import pyterrier as pt
pt.init()
from pyterrier_caching import ScorerCache

# Setup
cached_scorer = ScorerCache('path/to/cache', MyExpensiveScorer())
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'

# You need to build your cache before you can use it. There are several
# ways to do this:
if not cached_scorer.built():
    # Easiest:
    cached_scorer.build(dataset.get_corpus_iter())
    # If you already have an "npids" file to map the docnos to indexes, you can use:
    # >>> cached_scorer.build(docnos_file='path/to/docnos.npids')
    # This will be faster than iterating over the entire corpus, especially for
    # large datasets.

# Use the ScorerCache cache object just as you would a scorer
cached_pipeline = MyFirstStage() >> cached_scorer

cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())

# Will only compute scores for docnos that were not returned by MyFirstStage()
another_cached_pipeline = AnotherFirstStage() >> cached_scorer
another_cached_pipeline(dataset.get_topics())


More Details

ScorerCache currently has one implementation, Hdf5ScorerCache, which is set as the default. Hdf5ScorerCache saves scores in an HDF5 file.
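
If you're curious about what's stored, the cache is a standard HDF5 file that you can inspect with h5py. A sketch, assuming the data lives in a data.h5 file under the cache path (the file name and internal dataset names are implementation details and may differ):

import h5py

# Walk whatever datasets are present, without assuming their names.
with h5py.File('path/to/cache/data.h5', 'r') as f:
    f.visititems(lambda name, obj: print(name, obj))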

Caching Results from a Retriever

RetrieverCache saves retrieved results keyed on the fields of each row. When the same row is encountered again, the results are read from the cache, avoiding another round of retrieval.

Example use case: I want to test several different re-ranking models over the same initial set of documents, and I want to save time by not re-running the queries each time.

You use a RetrieverCache in place of the retriever in a pipeline. It holds a reference to the retriever so that it can retrieve results for queries that are missing from the cache.

⚠️ Important Caveats:

  • RetrieverCache keys the cache on all input columns by default. A change in any value will result in a cache miss, even if the column does not affect the retriever's output. You can specify a subset of columns using the on parameter, as shown in the sketch after this list.
  • DBM does not support concurrent reads/writes from multiple threads or processes. Keep only a single RetrieverCache pointing to a cache file location open at a time.
  • A RetrieverCache represents the combination of a single retriever and a single corpus. Do not use one cache across multiple retrievers or corpora -- you'll get unexpected/invalid results.
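
For example, if only the query text should determine a cache hit, you could key the cache on that column alone. A sketch, assuming on accepts a list of column names:

from pyterrier_caching import RetrieverCache

# Cache misses now depend only on 'query'; other input columns (e.g., extra
# metadata that doesn't affect retrieval) can't cause spurious misses.
cached_retriever = RetrieverCache('path/to/cache', MyRetriever(), on=['query'])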

Example:

import pyterrier as pt
pt.init()
from pyterrier_caching import RetrieverCache

# Setup
cached_retriever = RetrieverCache('path/to/cache', MyRetriever())
dataset = pt.get_dataset('some-dataset') # e.g., 'irds:msmarco-passage'

# Use the RetrieverCache cache object just as you would a retriever
cached_pipeline = cached_retriever >> MySecondStage()

cached_pipeline(dataset.get_topics())
# Will be faster when you run it a second time, since all values are cached
cached_pipeline(dataset.get_topics())


More Details

RetrieverCache currently has one implementation, DbmRetrieverCache, which is the default. DbmRetrieverCache saves results in a dbm file.
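
Since the backing store is a standard dbm database, you can peek at it with Python's built-in dbm module. A sketch, assuming the cache path resolves directly to the dbm file (the key and value encodings are implementation details of the backend):

import dbm

# Count the cached rows.
with dbm.open('path/to/cache', 'r') as db:
    print(len(db.keys()), 'cached entries')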

Extras

You can load caches from the Hugging Face Hub and push caches to the Hub using .from_hf('id') and .to_hf('id'). Example:

from pyterrier_caching import ScorerCache
cache = ScorerCache.from_hf('macavaney/msmarco-passage.monot5-base.cache')
cache.to_hf('username/dataset')

The following components do not perform caching themselves, but can be helpful when constructing caching pipelines.

Lazy(...) allows you to build a transformer object that is only initialized when it is first executed. This can help avoid the expensive process of reading and loading a model that may never be executed due to caching.

For example, the following uses Lazy with a ScorerCache to avoid loading MyExpensiveTransformer unless it's actually needed:

from pyterrier_caching import Lazy, ScorerCache

lazy_transformer = Lazy(lambda: MyExpensiveTransformer())
cache = ScorerCache('path/to/cache', lazy_transformer)
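
Building on the earlier ScorerCache example, MyExpensiveTransformer is then only instantiated if the pipeline actually encounters a (query, docno) pair that is missing from the cache:

# The lazy transformer is constructed on the first cache miss, not before.
cached_pipeline = MyFirstStage() >> cache
cached_pipeline(dataset.get_topics())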
