Cross-Language Information Retrieval pipeline

Project description

Patapsco - the SCALE 2021 Pipeline

Requirements

Patapsco requires Python 3.6+ and Java 11+.

Installing Patapsco with Anaconda will add Java to the virtual environment. If you are not using Anaconda, you will need to check your Java version or enable the java module on the grid.

To check your Java version:

javac --version

On the grid, enable Java with:

module add java

Install

Create a Python virtual environment using venv or conda.

With conda

Installing with conda is recommended and will install the GPU-enabled version of PyTorch. As of June 2021, CUDA 11.1.1 will be installed into the environment by default. You do not need to load any CUDA modules on the grid to use the GPUs.

Create and activate the conda environment:

conda env create --file environment.yml
conda activate patapsco

Install Patapsco:

pip install --editable .

With Python's venv module

Create and activate the virtual environment:

python3 -m venv venv
source venv/bin/activate

You may need to upgrade pip and install wheel:

pip install -U pip
pip install -U wheel

Install Patapsco and its dependencies:

pip install --editable .

Note: Python virtual environments do not work properly on the HLTCOE grid!

Windows users

If you do not have a C++ compiler or cannot install pytrec_eval, comment out the lines in environment.yml and setup.py that specify pytrec_eval. For example, in environment.yml:

  - pip:
    - pyserini
    # - pytrec_eval

You will be able to run Patapsco, but not score your runs.

Design

Patapsco is designed to create CLIR runs, not to train CLIR components (such as reranking models). Artifacts generated by Patapsco can be used for training, but the training itself happens outside of Patapsco.

Patapsco consists of two pipelines:

  • Stage 1: creates an index from the documents
  • Stage 2: retrieves results for queries from the indexes and reranks the results

A pipeline consists of a sequence of tasks.

  • Stage 1 tasks:
    • text processing of documents (character normalization, tokenization, etc.)
    • indexing
  • Stage 2 tasks:
    • extract query from topic
    • text processing of query (same as document processing)
    • retrieval of results
    • reranking of results
    • scoring

When a run is complete, its output is written to a run directory. Tasks also store artifacts in the run directory that can be used for other runs. For example, an index created in one run can be used in another.

Patapsco can run partial pipelines. For example, a user can run just stage 1 to generate an index. Or a user may run only stage 2 and have it start with processed queries and a prebuilt index.
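
A stage 2 only run of that kind could, for example, point its tasks at artifacts produced by an earlier run. The sketch below is only an illustration: the key names are assumptions, not the authoritative Patapsco schema, which is described in docs/config.md and the files under samples/configs.

# Hypothetical stage 2 only configuration; key names are illustrative
# assumptions, not the real Patapsco schema.
run:
  name: stage2_reuse_example
queries:
  path: runs/my_stage1_run/processed_queries   # start from already processed queries
retrieve:
  index: runs/my_stage1_run/index              # reuse an index built by an earlier run
score:
  metrics: [map]

A stage 1 only run would instead keep only the document processing and indexing sections.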

Configuration

Patapsco uses YAML or JSON files for configuration. The stage 1 and stage 2 pipelines are built from the configuration. The output, including any artifacts (such as processed queries or an index), is stored in a run directory. For more information on configuration, see docs/config.md.
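
For orientation, a complete two-stage configuration might look roughly like the sketch below. The key names are illustrative assumptions rather than the real schema; consult docs/config.md and samples/configs/eng_basic.yml for the actual fields.

# Hypothetical two-stage configuration; key names are illustrative
# assumptions, not the real Patapsco schema.
run:
  name: eng_basic_example          # also names the run directory under runs/
documents:
  path: data/docs.jsonl            # stage 1: input corpus
  lang: eng
index:
  name: lucene                     # hypothetical indexer choice
topics:
  path: data/topics.xml            # stage 2: topics to extract queries from
  lang: eng
retrieve:
  name: bm25                       # hypothetical retriever
score:
  metrics: [map, ndcg]

Any value in such a file can also be overridden from the command line with --set, as described under Running.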

Running

After installing Patapsco, a sample run is started with:

patapsco samples/configs/eng_basic.yml

By default, the output for the run is written to a runs directory in the working directory. If a run is complete, Patapsco will not overwrite it.

To turn on more detailed logging and full exception stack traces, use the debug flag:

patapsco --debug samples/configs/eng_basic.yml

Any variable in the configuration can be overridden on the command line:

patapsco --set run.name=my_test_run samples/configs/eng_basic.yml

Submitting Results

A run's output file plus the configuration used to generate the run can be submitted at the website: https://scale21.org

Bug Reports

Use issues on GitLab to report bugs or request new features. For a bug report, include:

  • a description of what was expected and what actually happened
  • any stack trace or error message
  • the configuration file if the bug only happens with that configuration

Development

Developers should install Patapsco in editable mode along with development dependencies:

pip install -e .[dev]

Unit Tests

To run the unit tests:

pytest

Some tests load models and are normally skipped. To run those:

pytest --runslow

Code Style

The code should conform to PEP 8, except for leniency on line length.

To automatically fix style issues, you can use autopep8. To run it on a file:

autopep8 -i [path to file]

To check PEP 8 compliance, run:

flake8 patapsco

Download files

Download the file for your platform.

Source Distribution

patapsco-0.9.7.tar.gz (127.8 kB)

Built Distribution

patapsco-0.9.7-py3-none-any.whl (137.4 kB)

File details

Details for the file patapsco-0.9.7.tar.gz.

File metadata

  • Download URL: patapsco-0.9.7.tar.gz
  • Upload date:
  • Size: 127.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.2.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.6

File hashes

Hashes for patapsco-0.9.7.tar.gz:

  • SHA256: c46b4144bc28b347690804dc50671da7ad404700f4321f99eee22c345c774ee7
  • MD5: c839f77b7ad9ef877fbb40364ad2f6c7
  • BLAKE2b-256: b8f2aed91d5c4e6964a4258666a8b6dc7bcdda2fc9112ead6b3159bf40681a81

File details

Details for the file patapsco-0.9.7-py3-none-any.whl.

File metadata

  • Download URL: patapsco-0.9.7-py3-none-any.whl
  • Upload date:
  • Size: 137.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.2.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.6

File hashes

Hashes for patapsco-0.9.7-py3-none-any.whl:

  • SHA256: 35ab8a9887fa137aca12c424e8c4a4c689631e4d5c5dc9ae2dbbb3b5abac738e
  • MD5: 3cfd689deb149600824dfc134c4c521e
  • BLAKE2b-256: 42dd40fa6ac8862d4a59a54c62d2d65981c2b86510a282f7986c2d648cb7c853
