Efficient Implementation of Probabilistic Structured Queries

This package implements the Probabilistic Structured Queries (PSQ) algorithm for cross-language information retrieval. It uses alignment tables from statistical machine translation to translate each document's bag of tokens into the query language.

Raw translation tables are available on Hugging Face Models at hltcoe/psq_translation_tables.
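
To make the translation step concrete, here is a minimal sketch that maps a document's token counts to expected counts in the query language. The function name `translate_bag` and the toy table with its probabilities are hypothetical and not part of fast_psq; the package's actual implementation may differ.

```python
from collections import Counter, defaultdict


def translate_bag(doc_tokens, table):
    """Map document-language token counts to expected query-language counts.

    `table[src][tgt]` is read as the alignment probability of `tgt` given `src`.
    Tokens that do not appear in the table are skipped.
    """
    translated = defaultdict(float)
    for src, tf in Counter(doc_tokens).items():
        for tgt, prob in table.get(src, {}).items():
            # Each occurrence of `src` contributes `prob` to the target token.
            translated[tgt] += tf * prob
    return dict(translated)


# Toy alignment table with made-up probabilities (illustration only).
toy_table = {
    "猫": {"cat": 0.9, "kitty": 0.1},
    "狗": {"dog": 1.0},
}

# "鸟" has no entry in the toy table, so it contributes nothing.
bag = translate_bag(["猫", "猫", "狗", "鸟"], toy_table)
```

The result is a weighted bag of query-language tokens, which can then be matched against query terms directly.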

Get started

fast_psq is available on PyPI.

pip install fast_psq

Alternatively, you can install directly from the main branch on GitHub with the following command.

pip install git+https://github.com/hltcoe/PSQ

fast_psq works well with ir_datasets and ir_measures for accessing IR evaluation collections and evaluating results. You can install both packages with the following command.

pip install ir_datasets ir_measures

Indexing

The indexing script takes a translation table (i.e., an alignment matrix) and a document jsonl file. We release a number of translation tables on Hugging Face Models; the script downloads one automatically when you pass it to the --psq_file flag in the format {repo_id}:{file_path}. Alternatively, you can pass a local .json.gz file containing a dictionary of dictionaries that maps source tokens (strings) to target tokens (strings) to alignment probabilities. Note that the script's default tokenizer is mosestokenizer, which may not match the tokenization used to build your own alignment matrix. Either use mosestokenizer when aligning the bitext or replace the tokenizer with your own.
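
For reference, a local table in the dictionary-of-dictionaries layout described above could be written and read back as follows. This is a sketch using only the standard library; the file name `toy.table.json.gz` and the probabilities are made up for illustration, and the exact layout fast_psq expects should be checked against the released tables.

```python
import gzip
import json
import os
import tempfile

# Source token -> target token -> alignment probability (made-up values).
table = {
    "猫": {"cat": 0.9, "kitty": 0.1},
    "狗": {"dog": 1.0},
}

path = os.path.join(tempfile.mkdtemp(), "toy.table.json.gz")

# Write the table as gzip-compressed JSON text.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(table, f, ensure_ascii=False)

# Read it back to confirm the round trip preserves the nested structure.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)
```

Keeping `ensure_ascii=False` stores non-Latin tokens as readable UTF-8 rather than escape sequences; either form loads to the same dictionary.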

The document file should be a jsonl file with one document per line. Specify the field names for the document ID, title, and body text with --docid, --title, and --body, respectively. Alternatively, you can use --doc_source with the irds: prefix to read a dataset from ir_datasets.
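
As an illustration of the expected input, the sketch below writes a small jsonl file with one JSON object per line. The field names `doc_id`, `title`, and `text` match the values passed to --docid, --title, and --body in the example command in this section; the file name and document contents are made up.

```python
import json
import os
import tempfile

# Two toy documents; the field names are whatever you later pass to
# --docid, --title, and --body.
docs = [
    {"doc_id": "d1", "title": "First title", "text": "Body of the first document."},
    {"doc_id": "d2", "title": "Second title", "text": "Body of the second document."},
]

path = os.path.join(tempfile.mkdtemp(), "docs.jsonl")

# One JSON object per line, UTF-8 encoded.
with open(path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read the file back line by line to check that every row parses.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]
```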

The following is an example indexing command.

python -m fast_psq.index \
--doc_file irds:neuclir/1/zh/trec-2022 \
--lang zh \
--psq_file hltcoe/psq_translation_tables:zh.table.dict.gz \
--min_translation_prob 1e-4 \
--max_translation_alternatives 64 \
--max_translation_cdf 0.99 \
--docid doc_id \
--title title \
--body text \
--output_dir ./indexes/neuclir-zh.f32/ \
--compression \
--nworkers 64

Please use python -m fast_psq.index --help for more information about the arguments.
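
The three pruning flags in the example command limit how many translation alternatives are kept per source token. The sketch below shows one plausible reading of them: sort alternatives by probability, then stop at the first one that falls below the minimum probability, exceeds the maximum count, or arrives after the cumulative probability has reached the CDF cap. The function name `prune_alternatives`, the order in which the limits are applied, and the absence of renormalization are all assumptions; consult the --help output for the actual semantics.

```python
def prune_alternatives(alternatives, min_prob=1e-4, max_alternatives=64, max_cdf=0.99):
    """Prune one source token's translation alternatives (illustrative only).

    `alternatives` maps target tokens to alignment probabilities.
    """
    # Consider the most probable translations first.
    ranked = sorted(alternatives.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for target, prob in ranked:
        # Because `ranked` is sorted, once one limit is hit every
        # remaining alternative would be rejected as well.
        if prob < min_prob or len(kept) >= max_alternatives or cumulative >= max_cdf:
            break
        kept[target] = prob
        cumulative += prob
    return kept


# Toy distribution with made-up probabilities.
toy = {"cat": 0.70, "kitty": 0.25, "feline": 0.045, "hat": 0.005}
pruned = prune_alternatives(toy, min_prob=0.01, max_alternatives=64, max_cdf=0.99)
```

Tighter limits shrink the index and speed up search at the cost of dropping low-probability translations.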

Searching

The search script takes the index and a tsv query file and outputs a TREC-style result file. As with indexing, ir_datasets is supported through the irds: prefix in both the --query_source and --qrels arguments.

The following command is an example.

python -m fast_psq.search \
--query_source irds:neuclir/1/zh/trec-2022 \
--query_field title \
--index_dir ./indexes/neuclir-zh.f32/ \
--qrels irds:neuclir/1/zh/trec-2022 \
--query_lang en \
--output_file ./neuclir-zh.en.title.f32.trec

Please use python -m fast_psq.search --help for more information about the arguments.
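
For readers unfamiliar with TREC-style result files, the sketch below formats and parses the conventional six-column layout: query ID, the literal `Q0`, document ID, rank, score, and run tag. The helper names, the run tag `fast_psq`, and the score precision are assumptions for illustration; the file written by the search script may differ in those details.

```python
def format_trec_run(scores, run_tag="fast_psq", top_k=1000):
    """Turn {query_id: {doc_id: score}} into TREC-style run lines."""
    lines = []
    for qid, doc_scores in scores.items():
        # Rank documents by descending score and keep at most `top_k`.
        ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


def parse_trec_run(lines):
    """Parse TREC-style run lines back into {query_id: {doc_id: score}}."""
    run = {}
    for line in lines:
        qid, _q0, doc_id, _rank, score, _tag = line.split()
        run.setdefault(qid, {})[doc_id] = float(score)
    return run


# Toy scores (made-up values).
toy_scores = {"q1": {"d1": 1.25, "d2": 3.5}, "q2": {"d3": 0.75}}
run_lines = format_trec_run(toy_scores)
```

A file in this layout can be passed to standard evaluation tools such as ir_measures together with the matching relevance judgments.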

Citation

@article{psq-repro,
    title = {Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval},
    author = {Eugene Yang and Suraj Nair and Dawn Lawrie and James Mayfield and Douglas W. Oard and Kevin Duh},
    journal = {arXiv preprint arXiv},
    year = {2024}
}
