Use the ChatNoir search engine in PyTerrier.

🔍 chatnoir-pyterrier

Use the ChatNoir REST API in PyTerrier for retrieval and re-ranking against large corpora such as ClueWeb09, ClueWeb12, ClueWeb22, or MS MARCO.

Powered by the chatnoir-api package.

Installation

Install the package from PyPI:

pip install chatnoir-pyterrier

Usage

You can use the ChatNoirRetrieve PyTerrier module in any PyTerrier pipeline, just as you would use BatchRetrieve.

from chatnoir_pyterrier import ChatNoirRetrieve

chatnoir = ChatNoirRetrieve(index="msmarco-document-v2.1")
chatnoir.search("python library")

Features

ChatNoir provides an extensive set of extra features, such as the full text or, for some indices, the page rank and spam rank. These can be included in the response data frame for use in subsequent PyTerrier re-ranking stages like so:

from chatnoir_pyterrier import ChatNoirRetrieve, Feature

chatnoir_msmarco_snippet = ChatNoirRetrieve(index="msmarco-document-v2.1", features=Feature.SNIPPET_TEXT)
chatnoir_msmarco_snippet.search("python library")

chatnoir_cw09_page_spam_rank = ChatNoirRetrieve(index="clueweb09", features=Feature.PAGE_RANK | Feature.SPAM_RANK)
chatnoir_cw09_page_spam_rank.search("python library")
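
The `|` operator in the second example combines features the way Python bit flags combine. The following stand-alone sketch uses the standard library's `enum.Flag` with a hypothetical `DemoFeature` class to illustrate that idea. It is not the package's actual `Feature` definition, whose members and internals may differ.

```python
from enum import Flag, auto


class DemoFeature(Flag):
    """Hypothetical stand-in for illustration; not chatnoir_pyterrier.Feature."""

    SNIPPET_TEXT = auto()
    PAGE_RANK = auto()
    SPAM_RANK = auto()


def selected_names(features: DemoFeature) -> list[str]:
    """Return the lower-cased names of all flags that are set in `features`."""
    return [member.name.lower() for member in DemoFeature if member in features]


# Combining two flags with `|` yields a value that contains both of them.
combined = DemoFeature.PAGE_RANK | DemoFeature.SPAM_RANK
print(selected_names(combined))  # ['page_rank', 'spam_rank']
```

Because each flag occupies its own bit, any subset of features can be passed as a single value.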

Caching

We recommend wrapping ChatNoirRetrieve in a RetrieverCache, using the pyterrier-caching library:

from chatnoir_pyterrier import ChatNoirRetrieve
from pyterrier_caching import RetrieverCache

chatnoir = ChatNoirRetrieve(index="msmarco-document-v2.1")
cached_chatnoir = RetrieverCache("path/to/cache", chatnoir)

This way, the ChatNoir API is called only once per query, and subsequent experiments can use the cached results. Refer to the pyterrier-caching documentation for more details on how the caching works.
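
To illustrate why caching saves API calls, here is a minimal, library-independent sketch of per-query memoization. `QueryCache` and `fake_search` are hypothetical names made up for this example. `RetrieverCache` stores its results under the given path and operates on PyTerrier data frames, so its actual implementation differs.

```python
from typing import Callable


class QueryCache:
    """Hypothetical in-memory cache: calls the backend at most once per query."""

    def __init__(self, backend: Callable[[str], list[str]]) -> None:
        self._backend = backend
        self._results: dict[str, list[str]] = {}
        self.backend_calls = 0

    def search(self, query: str) -> list[str]:
        if query not in self._results:
            # Cache miss: ask the (expensive) backend and remember the result.
            self.backend_calls += 1
            self._results[query] = self._backend(query)
        return self._results[query]


def fake_search(query: str) -> list[str]:
    """Stand-in for a remote search API call; returns made-up document IDs."""
    return [f"{query}-doc{i}" for i in range(3)]


cache = QueryCache(fake_search)
cache.search("python library")
cache.search("python library")  # Served from the cache, no backend call.
print(cache.backend_calls)  # 1
```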

Advanced usage

Please check out our sample notebook or open it in Google Colab.

We also provide a hands-on guide for the Touché 2023 shared tasks here.

Citation

If you use this package, please cite the following papers by the ChatNoir authors. You can use this BibTeX information for citation:

@InProceedings{bevendorff:2018,
  address =   {Berlin Heidelberg New York},
  author =    {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle = {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =    {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  month =     mar,
  publisher = {Springer},
  series =    {Lecture Notes in Computer Science},
  site =      {Grenoble, France},
  title =     {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =      2018
}
@InProceedings{merker:2025a,
  address =   {Cham, Switzerland},
  author =    {Jan Heinrich Merker and Janek Bevendorff and Maik Fr{\"o}be and Tim Hagen and Harrisen Scells and Matti Wiegmann and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle = {Advances in Information Retrieval. 47th European Conference on IR Research (ECIR 2025)},
  doi =       {10.1007/978-3-031-88720-8_17},
  editor =    {Claudia Hauff and Craig Macdonald and Dietmar Jannach and Gabriella Kazai and Franco Maria Nardini and Fabio Pinelli and Fabrizio Silvestri and Nicola Tonellotto},
  month =     apr,
  pages =     {96--104},
  publisher = {Springer Nature},
  series =    {Lecture Notes in Computer Science},
  site =      {Lucca, Italy},
  title =     {{Web-scale Retrieval Experimentation with chatnoir-pyterrier}},
  volume =    15576,
  year =      2025
}

Experiments

With chatnoir-pyterrier, it is easy to run benchmarks on a number of shared tasks that use larger document collections. We demonstrate this by running ChatNoir retrieval on all supported TREC, CLEF, and NTCIR shared tasks available in ir_datasets.

First install the experiment dependencies:

pip install -e .[experiment]

To run the experiments, first create the runs by running:

ray job submit --runtime-env examples/ray-runtime-env.yml --no-wait -- python examples/experiment.py 

This will create runs for each shared task in parallel and save them to a cache.

After creating the runs, the experiment.ipynb notebook can be used to analyze the results.

Indexing

Head over to the ChatNoir ir_datasets indexer to learn more on how new ir_datasets-compatible datasets are indexed into ChatNoir.

Development

To build this package and contribute to its development, you need to install the build, setuptools, and wheel packages:

pip install build setuptools wheel

(On most systems, these packages are already pre-installed.)

Development installation

Install package and test dependencies:

pip install -e .[test]

Testing

Configure the API keys for testing:

export CHATNOIR_API_KEY="<API_KEY>"
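
When writing your own tests, it can help to fail early with a clear message if the key is missing. A minimal sketch, assuming only that the key is exposed through the `CHATNOIR_API_KEY` environment variable as above; the helper name `require_api_key` is hypothetical and not part of the package:

```python
import os


def require_api_key(name: str = "CHATNOIR_API_KEY") -> str:
    """Return the API key from the environment or raise a descriptive error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set or empty.")
    return key
```

Calling `require_api_key()` at the start of a test session surfaces a missing key before any request is sent.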

Verify your changes against the test suite:

ruff check .                   # Code format and LINT
mypy .                         # Static typing
bandit -c pyproject.toml -r .  # Security
pytest .                       # Unit tests

Please also add tests for your newly developed code.

Build wheels

Wheels for this package can be built with:

python -m build

Support

If you hit any problems using this package, please file an issue. We're happy to help!

License

This repository is released under the MIT license.
