Efficient Implementation of Probabilistic Structured Queries
This package is an implementation of the Probabilistic Structured Queries (PSQ) algorithm for cross-language information retrieval. It leverages alignment tables from statistical machine translation to translate the document bag-of-tokens into the query language.
Raw translation tables are available on Hugging Face Models at hltcoe/psq_translation_tables.
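To make the idea concrete, here is a minimal sketch of how a document's bag of tokens could be mapped into the query language with an alignment table. The function name translate_bag and the toy table are hypothetical and are not part of the fast_psq API; the package's actual implementation may differ.

```python
from collections import Counter, defaultdict


def translate_bag(doc_tokens, table):
    """Map document-language token counts to expected query-language counts.

    `table` maps a source (document-language) token to a dict of
    target (query-language) tokens and their alignment probabilities.
    """
    source_counts = Counter(doc_tokens)
    translated = defaultdict(float)
    for src, count in source_counts.items():
        # Tokens missing from the table contribute nothing in this sketch.
        for tgt, prob in table.get(src, {}).items():
            translated[tgt] += count * prob
    return dict(translated)


# Toy alignment table with made-up probabilities, for illustration only.
toy_table = {
    "猫": {"cat": 0.9, "kitten": 0.1},
    "狗": {"dog": 1.0},
}
weights = translate_bag(["猫", "猫", "狗", "鱼"], toy_table)
```

Each query-language token ends up with an expected count, so a standard bag-of-words scoring function can then be applied in the query language.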
Get started
fast_psq is available on PyPI.
pip install fast_psq
Alternatively, you can install directly from the main branch on GitHub with the following command.
pip install git+https://github.com/hltcoe/PSQ
fast_psq works well with ir_datasets and ir_measures for accessing IR evaluation collections
and evaluating results. You can install the two packages with the following command.
pip install ir_datasets ir_measures
Indexing
The indexing script takes a translation table (i.e., an alignment matrix) and a document jsonl file.
We release a number of translation tables on Hugging Face Models, which the script downloads automatically
when you pass the path to the --psq_file flag in the format {repo_id}:{file_path}.
Alternatively, you can pass in a local .json.gz file that contains a dictionary of dictionaries, mapping from
source tokens (string) to target tokens (string) to alignment probabilities.
Note that the default tokenizer in the script is mosestokenizer, which may not match the one used to build your own
alignment matrix. You should either use mosestokenizer when aligning the bitext or replace the tokenizer with yours.
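As an illustration of the local table format described above, the following sketch writes and reads a gzip-compressed JSON dictionary of dictionaries. The file name my_table.json.gz, the helper names, and the probabilities are made up for the example; only the nested source-to-target-to-probability layout comes from the description above.

```python
import gzip
import json
import os
import tempfile

# Hypothetical toy table: source token -> target token -> alignment probability.
table = {
    "猫": {"cat": 0.9, "kitten": 0.1},
    "狗": {"dog": 1.0},
}


def save_table(table, path):
    """Write the nested dictionary as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(table, f, ensure_ascii=False)


def load_table(path):
    """Read a gzip-compressed JSON table back into nested dictionaries."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Write to a temporary directory so the example leaves no files behind.
table_path = os.path.join(tempfile.mkdtemp(), "my_table.json.gz")
save_table(table, table_path)
loaded = load_table(table_path)
```

A file produced this way should be usable as the local path passed to --psq_file, provided the source tokens were produced by the same tokenizer the indexing script uses.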
The document file should be a jsonl file with one document in each row.
You can specify the fields for the document id, title, and body text by passing the corresponding field names
through --docid, --title, and --body, respectively.
Alternatively, you can also use --doc_source with irds: as prefix to use a dataset in ir_datasets.
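For reference, a document file in this layout could be produced and read back as in the sketch below. The field names doc_id, title, and text mirror the values used in the example indexing command; the helper read_docs and the choice to join title and body with a space are assumptions for illustration, not the package's own reader.

```python
import json
import os
import tempfile

# Two made-up documents, one JSON object per line.
docs = [
    {"doc_id": "d1", "title": "第一篇", "text": "这是第一篇文档。"},
    {"doc_id": "d2", "title": "第二篇", "text": "这是第二篇文档。"},
]

doc_path = os.path.join(tempfile.mkdtemp(), "docs.jsonl")
with open(doc_path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_docs(path, docid="doc_id", title="title", body="text"):
    """Yield (document id, text) pairs using configurable field names."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            # Joining title and body is an assumption made for this sketch.
            yield row[docid], (row.get(title, "") + " " + row[body]).strip()


parsed = list(read_docs(doc_path))
```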
The following is an example indexing command.
python -m fast_psq.index \
--doc_file irds:neuclir/1/zh/trec-2022 \
--lang zh \
--psq_file hltcoe/psq_translation_tables:zh.table.dict.gz \
--min_translation_prob 0.00010 \
--max_translation_alternatives 64 \
--max_translation_cdf 0.99 \
--docid doc_id \
--title title \
--body text \
--output_dir ./indexes/neuclir-zh.f32/ \
--compression \
--nworkers 64
Please use python -m fast_psq.index --help for more information about the arguments.
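The three pruning flags in the command above are not described in detail here. One plausible reading, sketched below, is that each source token's translation alternatives are ranked by probability and truncated by a probability floor, a maximum count, and a cumulative probability cap. The function prune_translations and its exact stopping rules are assumptions; consult --help and the source code for the actual behavior.

```python
def prune_translations(alternatives, min_prob=1e-4, max_alternatives=64, max_cdf=0.99):
    """Keep the most probable translations of one source token.

    Assumed rules (not confirmed by the package): stop once a probability
    falls below `min_prob`, once `max_alternatives` entries are kept, or
    once the kept probability mass has reached `max_cdf`.
    """
    kept = {}
    cumulative = 0.0
    ranked = sorted(alternatives.items(), key=lambda kv: kv[1], reverse=True)
    for target, prob in ranked:
        if prob < min_prob or len(kept) >= max_alternatives or cumulative >= max_cdf:
            break
        kept[target] = prob
        cumulative += prob
    return kept


# Made-up alternatives for a single source token.
example = {"cat": 0.6, "kitten": 0.3, "feline": 0.095, "hat": 0.005}
pruned = prune_translations(example)
top_two = prune_translations(example, max_alternatives=2)
```

Under this reading, tighter thresholds shrink the index and speed up search at some cost in recall, which is the tradeoff the flags are meant to expose.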
Searching
The search script takes the index and a tsv query file and outputs a TREC-style result file.
ir_datasets is supported here as well, through the irds: prefix in both the --query_source and --qrels arguments.
The following command is an example.
python -m fast_psq.search \
--query_source irds:neuclir/1/zh/trec-2022 \
--query_field title \
--index_dir ./indexes/neuclir-zh.f32/ \
--qrels irds:neuclir/1/zh/trec-2022 \
--query_lang en \
--output_file ./neuclir-zh.en.title.f32.trec
Please use python -m fast_psq.search --help for more information about the arguments.
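The output follows the standard six-column TREC run layout (query id, the literal Q0, document id, rank, score, run tag). The sketch below writes and parses that layout with the standard library only; the helper names, the run tag my_run, and the scores are hypothetical and are not taken from fast_psq.

```python
import os
import tempfile


def write_trec_run(results, path, tag="my_run"):
    """Write {qid: {docid: score}} as a TREC-style run file."""
    with open(path, "w", encoding="utf-8") as f:
        for qid, scored in results.items():
            ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
            for rank, (docid, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}\n")


def read_trec_run(path):
    """Parse a TREC-style run file into {qid: {docid: score}}."""
    run = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 6:
                continue  # skip malformed or empty lines
            qid, _q0, docid, _rank, score, _tag = parts
            run.setdefault(qid, {})[docid] = float(score)
    return run


# Made-up scores for one query.
results = {"q1": {"d2": 3.5, "d1": 7.25}}
run_path = os.path.join(tempfile.mkdtemp(), "example.trec")
write_trec_run(results, run_path)
run = read_trec_run(run_path)
with open(run_path, encoding="utf-8") as f:
    first_line = f.readline().strip()
```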
Citation
@article{psq-repro,
title = {Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval},
author = {Eugene Yang and Suraj Nair and Dawn Lawrie and James Mayfield and Douglas W. Oard and Kevin Duh},
journal = {arXiv preprint arXiv},
year = {2024}
}