Approach Zero Python Interface
Note: This branch is for replicating the results in our recent SIGIR paper, "One Blade for One Purpose: Advancing Math Information Retrieval using Hybrid Search".
The MABOWDOR :guide_dog: training data and checkpoints can be found in our training and inference subdirectory. Feel free to ask us any questions via email: w32zhong@uwaterloo.ca
If you find our work useful, please cite our paper:
@inproceedings{zhong2023mabowdor,
booktitle={Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
title={One Blade for One Purpose: Advancing Math Information Retrieval using Hybrid Search},
author={Zhong, Wei and Sheng-Chieh Lin and Jheng-Hong Yang and Jimmy Lin},
year={2023}
}
About
PyA0 is a Python wrapper for the Approach Zero search engine. It provides a Python interface that makes the search engine core easy to experiment with.
To build this Python module, place this repository as a git submodule of its parent repository, then run make in this directory.
Quick Start
Install the PyPI package using pip:
$ sudo pip3 install --upgrade pya0
If pip cannot find the package, update to the latest pip and try again:
$ sudo apt-get install curl python3-distutils
$ curl https://bootstrap.pypa.io/get-pip.py | python3
$ sudo pip install -i https://pypi.python.org/simple/ --trusted-host pypi.org pya0
$ python3 -c 'import pya0'
Math token lexing
Test a simple math token scanner:
import pya0
lst = pya0.lex('\\lim_{n \\to \\infty} (1 + 1/n)^n')
print(lst)
Result:
[(269, 'LIM', 'lim'), (274, 'SUBSCRIPT', 'subscript'), (260, 'VAR', "normal`n'"), (270, 'ARROW', 'to'), (260, 'INFTY', 'infty'), (259, 'ONE', "`1'"), (261, 'ADD', 'plus'), (259, 'ONE', "`1'"), (264, 'FRAC', 'frac'), (260, 'VAR', "normal`n'"), (275, 'SUPSCRIPT', 'supscript'), (260, 'VAR', "normal`n'")]
Refer to the tests/ directory for more complete usage examples.
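As a small illustration of working with this output, the sketch below tallies token names from the result shown above. It does not call pya0; the list is copied from the sample result, and the reading of each tuple as (token id, token name, symbol name) is an assumption based on that sample. The helper name count_token_names is hypothetical.

```python
from collections import Counter

# Sample copied from the lexer result above. Each tuple is assumed to be
# (token id, token name, symbol name); this layout is inferred, not documented.
tokens = [
    (269, 'LIM', 'lim'), (274, 'SUBSCRIPT', 'subscript'),
    (260, 'VAR', "normal`n'"), (270, 'ARROW', 'to'),
    (260, 'INFTY', 'infty'), (259, 'ONE', "`1'"),
    (261, 'ADD', 'plus'), (259, 'ONE', "`1'"),
    (264, 'FRAC', 'frac'), (260, 'VAR', "normal`n'"),
    (275, 'SUPSCRIPT', 'supscript'), (260, 'VAR', "normal`n'"),
]


def count_token_names(tokens):
    """Count how often each token name (second tuple field) occurs."""
    return Counter(name for _token_id, name, _symbol in tokens)


counts = count_token_names(tokens)
print(counts.most_common(3))
```

With real data, the list returned by pya0.lex could be passed to count_token_names in place of the copied sample.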
Download prebuilt index and search
See what prebuilt indexes we have available:
python -m pya0 --list-prebuilt-indexes
ntcir-wfb
description NTCIR-12 Wikipedia Formula Browsing
image_filesystem reiserfs
md5 6e87fc52a8f02c05113034c4b14b3e06
urls [https://vault.cs.uwaterloo.ca/s/gySLti89gZF8xz6/download]
...
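Each listing entry includes an md5 field. The sketch below shows one way to check a downloaded file against such a checksum, assuming the md5 refers to the file served at the listed URL (the listing itself does not say so). The helper names md5_of_file and matches_md5 are hypothetical and not part of pya0.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_md5("download", "6e87fc52a8f02c05113034c4b14b3e06") would compare a locally saved file (here hypothetically named download) with the checksum listed for ntcir-wfb.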
One-line command to download the NTCIR-12 WFB index and run its topics:
rm -f ntcir12_wfb.run
python -m pya0 --use-fallback-parser --verbose \
--index ntcir-wfb --collection ntcir12-math-browsing-concrete --trec-output ntcir12_wfb.run
Now evaluate the generated run:
$ ./eval-ntcir12.sh tsv ./ntcir12_wfb.run
./ntcir12_wfb.run 0.6304 0.5411
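To inspect a generated run before evaluating it, a minimal parser can help. The sketch below assumes the file written by --trec-output follows the conventional six-column TREC run layout (query id, Q0, document id, rank, score, run tag); that is an assumption about pya0's output, and the query and document ids in the sample are made up for illustration.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Group (docid, rank, score) tuples by query id from TREC-style run lines."""
    run = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rank, score, _run_tag = fields
        run[qid].append((docid, int(rank), float(score)))
    return dict(run)


# Hypothetical sample lines, not real pya0 output.
sample = [
    "q1 Q0 doc-a 1 12.5 demo",
    "q1 Q0 doc-b 2 9.75 demo",
    "",
    "q2 Q0 doc-c 1 3.0 demo",
]
parsed = parse_trec_run(sample)
print({qid: len(hits) for qid, hits in parsed.items()})
```

For a real file, an open file object can be passed directly, for example parse_trec_run(open("ntcir12_wfb.run")).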
Example Usage
Generate NTCIR-12 run:
python -m pya0 --use-fallback-parser --index ../../indexes/mnt-ntcir12_wfb.img/ --collection ntcir12-math-browsing-concrete --trec-output runs/ntcir12_wfb.run
Generate ARQMath (2022) runs:
# task 1
python -m pya0 --stemmer porter --index ../../indexes/mnt-arqmath-task1.img/ --collection arqmath-2022-task1-manual --trec-output runs/arqmath_task1.run
# task 2
python -m pya0 --index ../../indexes/mnt-arqmath-task2.img/ --collection arqmath-2022-task2-refined --trec-output runs/arqmath_task2.run
Generate grid-search runs:
python -m pya0 --use-fallback-parser --index ../../indexes/mnt-ntcir12_wfb.img/ --collection ntcir12-math-browsing-concrete --auto-eval ./experiments/auto_eval--symbol-scores.tsv
Evaluate a run (evaluating ARQMath Task 2 requires downloading slt_representation_v3):
wget https://vault.cs.uwaterloo.ca/s/TpSPrZY4xxRYGS2/download -O training-and-inference/datasets/latex_representation_v3.zip
unzip training-and-inference/datasets/latex_representation_v3.zip
./eval-arqmath3/task2/eval.sh --tsv=./slt_representation_v3/ --nojudge
Transformer Models
Please check out ./training-and-inference
Making Search Index (optional)
First, use utils/corpus_converter.py to convert the raw dataset files to a jsonl file, which can then be fed to the approach0 indexerd using the a0-crawlers feeder.
Datasets can be found here: https://vault.cs.uwaterloo.ca/s/RTJ27g9Ek2kanRe
Pre-processed jsonl files are also available; download them using:
wget https://vault.cs.uwaterloo.ca/s/s2bcWssfAHHyeTF/download -O ntcir12_wfb.jsonl
wget https://vault.cs.uwaterloo.ca/s/ANg5XQyGLsZPXLL/download -O arqmath3_task1.jsonl
wget https://vault.cs.uwaterloo.ca/s/tY5SfDgErgkBr28/download -O arqmath3_task2.jsonl
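Before feeding a jsonl file to the indexer, it can be useful to check that every line parses and to see which fields the records carry. The sketch below is a generic JSON Lines reader; the field names in the sample ("id", "text") are hypothetical, since the actual schema of these files is not described here.

```python
import json


def iter_jsonl(lines):
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize_jsonl(lines):
    """Return (record count, sorted top-level keys seen across all records)."""
    count, keys = 0, set()
    for record in iter_jsonl(lines):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)


# Hypothetical records; the real files may use different fields.
sample = [
    '{"id": 1, "text": "a^2 + b^2 = c^2"}',
    '',
    '{"id": 2, "text": "e^{i\\\\pi} + 1 = 0", "url": "n/a"}',
]
print(summarize_jsonl(sample))
```

For a downloaded file, an open file object works as the argument, for example summarize_jsonl(open("ntcir12_wfb.jsonl")).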
Second, create index images and mount them as loop devices:
cd a0-engine
sudo ./indexerd/scripts/vdisk-setup.sh
vdisk-creat.sh reiserfs 1K
vdisk-mount.sh reiserfs vdisk.img
To allocate enough space for common datasets, consider these example image sizes:
df -h
/dev/loop6 12G 8.2G 3.9G 68% /home/w32zhong/indexes/mnt-arqmath-task1.img
/dev/loop8 12G 9.0G 3.1G 75% /home/w32zhong/indexes/mnt-arqmath-task2.img
/dev/loop9 5.0G 609M 4.5G 12% /home/w32zhong/indexes/mnt-ntcir12_wfb.img
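A quick free-space check before creating an image of a given size might look like the sketch below. It uses only shutil.disk_usage from the standard library; the helper name has_free_space and the 12 GiB figure (taken loosely from the df output above) are illustrative assumptions.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


# Check whether the current directory's filesystem could hold a 12 GiB image.
print(has_free_space(".", 12 * GIB))
```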
Third, run the indexerd daemon to create an index for NTCIR-12 WFB:
cd a0-engine/indexerd
./run/indexerd.out -o ~/indexes/mnt-ntcir12_wfb.img/ -p 8935 -e
Then, run the feeder to feed the jsonl file to indexerd:
cd a0-crawlers/feeder
python feeder.py --indexd-url http://localhost:8935/index --bye --corpus ntcir12_wfb ./feeder.ini ~/corpus/ntcir12_wfb.jsonl
For the ARQMath datasets, use arqmath_task1_default__use_porter_stemmer and arqmath_task2_v3 as corpus names.
Building This Package
Build for Local Package
Build and install package locally (for testing):
$ make clean
$ sudo python3 setup.py install
Then you can import it as a library from the system path:
import pya0
print(dir(pya0))
Build for Manylinux Distribution
Install Docker:
apt-get update
which docker || curl -fsSL https://get.docker.com -o get-docker.sh
which docker || sh get-docker.sh
Pull and run the image quay.io/pypa/manylinux_2_24_x86_64 from the parent source directory of approach0, assuming $HOME is where you put the Indri and Jieba code:
sudo docker run -it -v `pwd`:/code -v $HOME:/host quay.io/pypa/manylinux_2_24_x86_64 bash
Inside the Docker container, build pya0 as instructed below so that you have a Linux wheel, e.g., ./dist/pya0-0.1-cp35-cp35m-linux_x86_64.whl.
Typical build process:
# Inside docker, setup system environment...
apt update
apt install -y git build-essential g++ cmake wget flex bison python3
apt install -y libz-dev libevent-dev libopenmpi-dev libxml2-dev libfl-dev
apt install -y libiberty-dev
apt install -y build-essential python-dev python3-pip python3-venv
python3 -m pip install --upgrade build # install pip-build tool
# Now, start building (or if you enter from the quickstart image)...
cd /code
./configure --indri-path=/host/indri --jieba-path=/host/cppjieba
(cd /host/indri && make clean && make) # this one takes minutes to build
make clean && make
cd ./pya0 && make clean && make
Use docker commit $(docker ps -q | head -1) quickstart to save the container for later reuse:
sudo docker run -it -v `pwd`:/code -v $HOME:/host quickstart bash
Create a pip distribution package:
$ rm -rf dist wheelhouse
$ python3 -m build
Upload to PyPI.org
Edit setup.py and bump the version number.
Install twine:
$ apt install rustc libssl-dev libffi-dev
$ python3 -m pip install --user --upgrade twine
Then inspect the wheel:
$ auditwheel show ./dist/pya0-*.whl
pya0-0.1-cp35-cp35m-linux_x86_64.whl is consistent with the following
platform tag: "linux_x86_64".
The wheel references external versioned symbols in these system-
provided shared libraries: libgcc_s.so.1 with versions {'GCC_3.0'},
libz.so.1 with versions {'ZLIB_1.2.0', 'ZLIB_1.2.3.3',
'ZLIB_1.2.2.3'}, libstdc++.so.6 with versions {'GLIBCXX_3.4.10',
'GLIBCXX_3.4.11', 'GLIBCXX_3.4.21', 'GLIBCXX_3.4.15', 'CXXABI_1.3',
'CXXABI_1.3.8', 'GLIBCXX_3.4', 'CXXABI_1.3.9', 'GLIBCXX_3.4.9',
'CXXABI_1.3.1', 'GLIBCXX_3.4.20'}, libpthread.so.0 with versions
{'GLIBC_2.2.5', 'GLIBC_2.3.2', 'GLIBC_2.3.3'}, libc.so.6 with versions
{'GLIBC_2.7', 'GLIBC_2.17', 'GLIBC_2.3.4', 'GLIBC_2.15', 'GLIBC_2.3',
'GLIBC_2.3.2', 'GLIBC_2.4', 'GLIBC_2.22', 'GLIBC_2.2.5',
'GLIBC_2.14'}, libdl.so.2 with versions {'GLIBC_2.2.5'}, libm.so.6
with versions {'GLIBC_2.2.5'}, liblzma.so.5 with versions {'XZ_5.0'}
This constrains the platform tag to "manylinux_2_24_x86_64". In order
to achieve a more compatible tag, you would need to recompile a new
wheel from source on a system with earlier versions of these
libraries, such as a recent manylinux image.
Here auditwheel suggests the platform tag manylinux_2_24_x86_64.
Repair the wheel for that platform:
$ auditwheel repair ./dist/*.whl --plat manylinux_2_24_x86_64 -w ./wheelhouse
INFO:auditwheel.main_repair:Repairing pya0-0.2.8-py3-none-any.whl
INFO:auditwheel.wheeltools:Previous filename tags: any
INFO:auditwheel.wheeltools:New filename tags: manylinux_2_24_x86_64
INFO:auditwheel.wheeltools:Previous WHEEL info tags: py3-none-any
INFO:auditwheel.wheeltools:Changed wheel type to Platlib
INFO:auditwheel.wheeltools:New WHEEL info tags: py3-none-manylinux_2_24_x86_64
INFO:auditwheel.main_repair:
Fixed-up wheel written to /code/pya0/wheelhouse/pya0-0.2.8-py3-none-manylinux_2_24_x86_64.whl
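The tags that auditwheel reports are encoded in the wheel file name. The sketch below splits a file name into its components following the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The function name parse_wheel_filename is hypothetical, and the parser is deliberately minimal: it does not handle compressed tag sets or validate individual fields.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected number of fields in %r" % filename)
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("pya0-0.2.8-py3-none-manylinux_2_24_x86_64.whl")
print(info["platform"])
```

Comparing the platform field before and after auditwheel repair is a quick way to confirm the retagging took effect.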
Then you should be able to upload to PyPI:
$ python3 -m twine upload --repository pypi wheelhouse/*.whl
(Use the username __token__ and the API token you created on https://pypi.org.)
Use unzip to check that the shared libraries are present in the manylinux wheel:
root@1c06f5c28b7b:/host/a0-engine/pya0# unzip -l wheelhouse/pya0-0.1.7-py3-none-manylinux_2_24_x86_64.whl
Archive: wheelhouse/pya0-0.1.7-py3-none-manylinux_2_24_x86_64.whl
Length Date Time Name
--------- ---------- ----- ----
927 2021-03-08 19:00 setup.py
2065112 2021-03-08 19:01 pya0.libs/libxml2-bbd52ef6.so.2.9.4
2020736 2021-03-08 19:01 pya0.libs/libicuuc-5743fca1.so.57.1
43296 2021-03-08 19:01 pya0.libs/libltdl-e9c06fbe.so.7.3.1
272392 2021-03-08 19:01 pya0.libs/libhwloc-811858d2.so.5.7.2
312216 2021-03-08 19:01 pya0.libs/libevent-2-6d3aa264.0.so.5.1.9
3805032 2021-03-08 19:01 pya0.libs/libicui18n-03536ef3.so.57.1
159384 2021-03-08 19:01 pya0.libs/liblzma-5b8415cf.so.5.2.2
640624 2021-03-08 19:01 pya0.libs/libopen-rte-6abe1f34.so.20.1.0
108624 2021-03-08 19:01 pya0.libs/libz-7fd423a0.so.1.2.8
1079848 2021-03-08 19:01 pya0.libs/libmpi-69c5bc42.so.20.0.2
785248 2021-03-08 19:01 pya0.libs/libopen-pal-321722b9.so.20.2.0
48432 2021-03-08 19:01 pya0.libs/libnuma-c8473f23.so.1.0.0
25678440 2021-03-08 19:01 pya0.libs/libicudata-79cf9efa.so.57.1
1 2021-03-08 19:01 pya0-0.1.7.dist-info/top_level.txt
133 2021-03-08 19:01 pya0-0.1.7.dist-info/WHEEL
5581 2021-03-08 19:01 pya0-0.1.7.dist-info/METADATA
1757 2021-03-08 19:01 pya0-0.1.7.dist-info/RECORD
24 2021-03-08 18:51 pya0/__init__.py
75878488 2021-03-08 19:01 pya0/pya0.so
--------- -------
112906295 20 files
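The same check can be scripted. The sketch below uses the standard zipfile module to list archive members whose file name contains ".so", which is a rough heuristic for shared libraries and not how auditwheel itself identifies them. The function name bundled_shared_libs is hypothetical.

```python
import zipfile


def bundled_shared_libs(wheel):
    """List archive members whose base name looks like a shared library.

    `wheel` may be a path or a binary file-like object. The ".so" substring
    test is a heuristic and could match unrelated files.
    """
    with zipfile.ZipFile(wheel) as zf:
        return sorted(
            member.filename
            for member in zf.infolist()
            if ".so" in member.filename.rsplit("/", 1)[-1]
        )
```

For the wheel listed above, bundled_shared_libs("wheelhouse/pya0-0.1.7-py3-none-manylinux_2_24_x86_64.whl") would be expected to report the pya0.libs/ entries along with pya0/pya0.so.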