The NTCIR-10 Math Converter package converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR format.
Introduction
The retrieval unit in the NTCIR-10 Math task dataset is an arXiv document, and the judgement unit in its relevance judgements is an XML element. In contrast, the retrieval and judgement unit in the NTCIR-11 Math-2 and NTCIR-12 MathIR task datasets and relevance judgements is an arXiv document paragraph. This mismatch makes it difficult to use the datasets together in a single evaluation.
NTCIR-10 Math Converter is a Python 3 command-line utility that converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR format by splitting the dataset into paragraphs and redirecting the relevance judgements from elements to their ancestral paragraphs. As a result, the NTCIR-10 Math dataset and relevance judgements can be used together with the NTCIR-11 Math-2 and NTCIR-12 MathIR datasets and relevance judgements in a single evaluation.
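To make the idea of redirecting judgements from elements to their ancestral paragraphs more concrete, here is a minimal sketch that builds an element-to-paragraph mapping for a toy document. It is not the converter's own code: the tag names, the `id` attributes, and the function `map_elements_to_paragraphs` are assumptions made for illustration, and the real NTCIR-10 markup may differ.

```python
import xml.etree.ElementTree as ET

# Toy document; the tag names and identifiers are hypothetical and do not
# reproduce the actual NTCIR-10 Math markup.
TOY_DOCUMENT = """
<document>
  <p id="p1">Let <math id="m1"><mi id="m1.1">x</mi></math> be real.</p>
  <p id="p2">Then <math id="m2"><mi id="m2.1">y</mi></math> follows.</p>
</document>
"""


def map_elements_to_paragraphs(xml_text, paragraph_tag="p"):
    """Map each identified element to the identifier of its enclosing paragraph."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent lookup.
    parents = {child: parent for parent in root.iter() for child in parent}
    mapping = {}
    for element in root.iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        # Walk up the tree until a paragraph element (or the root) is reached.
        node = element
        while node is not None and node.tag != paragraph_tag:
            node = parents.get(node)
        if node is not None and node.get("id") is not None:
            mapping[element_id] = node.get("id")
    return mapping


mapping = map_elements_to_paragraphs(TOY_DOCUMENT)
print(mapping["m1.1"])  # identifier of the paragraph that contains element m1.1
```

With such a mapping, a judgement attached to a formula element can be reassigned to the paragraph that contains it.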
Usage
Displaying the usage:
$ ntcir10-math-converter --help
usage: ntcir10-math-converter [-h] --dataset DATASET [DATASET ...]
[--judgements JUDGEMENTS [JUDGEMENTS ...]]
[--num-workers NUM_WORKERS]
Convert NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2,
and NTCIR-12 MathIR format.
optional arguments:
-h, --help show this help message and exit
--dataset DATASET [DATASET ...]
A path to a directory containing the NTCIR-10 Math
dataset, and a path to a non-existent directory that
will contain resulting dataset in the NTCIR-11 Math-2,
and NTCIR-12 MathIR format. If only the path to the
NTCIR-10 Math dataset is specified, the dataset will
be read to find out the mapping between element
identifiers, and paragraph identifiers. This is
required for converting the relevance judgements.
--judgements JUDGEMENTS [JUDGEMENTS ...]
Paths to the files containing NTCIR-10 Math relevance
judgements (odd arguments), followed by paths to the
files that will contain resulting relevance judgements
in the NTCIR-11 Math-2, and NTCIR-12 MathIR format
(even arguments).
--num-workers NUM_WORKERS
The number of processes that will be used for
processing the NTCIR-10 Math dataset. Defaults to 1.
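The `--judgements` option takes input and output paths in alternation: the first, third, ... arguments are inputs, and the second, fourth, ... arguments are the matching outputs. The following sketch shows one way such a flat list could be split into pairs. The function `pair_judgement_paths` is hypothetical and is not taken from the tool's source.

```python
def pair_judgement_paths(paths):
    """Split a flat [input, output, input, output, ...] list into pairs.

    Hypothetical helper for illustration; not part of ntcir10-math-converter.
    """
    if len(paths) % 2 != 0:
        raise ValueError("every input judgement file needs an output path")
    # Odd (1st, 3rd, ...) arguments are inputs, even ones are outputs.
    return list(zip(paths[0::2], paths[1::2]))


pairs = pair_judgement_paths([
    "data/NTCIR_10_Math-qrels_ft.dat", "data/NTCIR_10_Math-qrels_ft-converted.dat",
    "data/NTCIR_10_Math-qrels_fs.dat", "data/NTCIR_10_Math-qrels_fs-converted.dat",
])
for source, target in pairs:
    print(source, "->", target)
```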
Converting both a dataset and relevance judgements using 64 worker processes:
$ ntcir10-math-converter --num-workers 64 \
> --dataset data/ntcir-10 data/ntcir-10-converted \
> --judgements \
> data/NTCIR_10_Math-qrels_ft.dat data/NTCIR_10_Math-qrels_ft-converted.dat \
> data/NTCIR_10_Math-qrels_fs.dat data/NTCIR_10_Math-qrels_fs-converted.dat
2018-05-18 18:44:58,555 : INFO : Retrieving judged document names, and element identifiers from data/NTCIR_10_Math-qrels_ft.dat
100%|██████████████████████████████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9634.03it/s]
2018-05-18 18:44:58,707 : INFO : Retrieving judged document names, and element identifiers from data/NTCIR_10_Math-qrels_fs.dat
100%|██████████████████████████████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9671.33it/s]
2018-05-18 18:44:58,928 : INFO : Processing dataset data/ntcir-10
100%|███████████████████████████████████████████████████████████████████████████| 100000/100000 [06:45<00:00, 246.50it/s]
2018-05-18 18:51:46,219 : INFO : Converting relevance judgements data/NTCIR_10_Math-qrels_ft.dat -> data/NTCIR_10_Math-qrels_ft-converted.dat
100%|████████████████████████████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
2018-05-18 18:51:46,228 : INFO : Converting relevance judgements data/NTCIR_10_Math-qrels_fs.dat -> data/NTCIR_10_Math-qrels_fs-converted.dat
100%|████████████████████████████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
Converting only a dataset using 64 worker processes:
$ ntcir10-math-converter --num-workers 64 --dataset data/ntcir-10 data/ntcir-10-converted
2018-05-18 19:09:08,162 : INFO : Converting dataset data/ntcir-10 -> data/ntcir-10-converted/xhtml5
100%|███████████████████████████████████████████████████████████████████████████| 100000/100000 [07:34<00:00, 220.10it/s]
Converting only relevance judgements using 64 worker processes:
$ ntcir10-math-converter --num-workers 64 \
> --dataset data/ntcir-10 \
> --judgements \
> data/NTCIR_10_Math-qrels_ft.dat data/NTCIR_10_Math-qrels_ft-converted.dat \
> data/NTCIR_10_Math-qrels_fs.dat data/NTCIR_10_Math-qrels_fs-converted.dat
2018-05-18 19:18:02,024 : INFO : Retrieving judged document names, and element identifiers from data/NTCIR_10_Math-qrels_ft.dat
100%|██████████████████████████████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9539.55it/s]
2018-05-18 19:18:02,178 : INFO : Retrieving judged document names, and element identifiers from data/NTCIR_10_Math-qrels_fs.dat
100%|██████████████████████████████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9332.81it/s]
2018-05-18 19:18:02,408 : INFO : Processing dataset data/ntcir-10
2018-05-18 19:18:02,408 : INFO : Building a mapping between element identifiers, and paragraph identifiers
100%|███████████████████████████████████████████████████████████████████████████████| 2405/2405 [00:16<00:00, 144.41it/s]
2018-05-18 19:18:26,246 : INFO : Converting relevance judgements data/NTCIR_10_Math-qrels_ft.dat -> data/NTCIR_10_Math-qrels_ft-converted.dat
100%|████████████████████████████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 260760.14it/s]
2018-05-18 19:18:26,256 : INFO : Converting relevance judgements data/NTCIR_10_Math-qrels_fs.dat -> data/NTCIR_10_Math-qrels_fs-converted.dat
100%|████████████████████████████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 299442.45it/s]
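The judgement conversion step in the logs above can be pictured as a lookup followed by a merge. The sketch below works on in-memory tuples and does not model the on-disk judgement file format. Two choices in it are assumptions rather than documented behaviour of the tool: judgements without a known paragraph are skipped, and when several elements fall into the same paragraph the highest relevance is kept. The function name `redirect_judgements` and all sample data are hypothetical.

```python
def redirect_judgements(judgements, element_to_paragraph):
    """Redirect (topic, document, element, relevance) judgements to paragraphs.

    Assumptions for illustration only: unknown elements are skipped, and the
    maximum relevance wins when several elements share a paragraph.
    """
    redirected = {}
    for topic, document, element_id, relevance in judgements:
        paragraph_id = element_to_paragraph.get((document, element_id))
        if paragraph_id is None:
            continue  # no known ancestral paragraph for this element
        key = (topic, paragraph_id)
        redirected[key] = max(relevance, redirected.get(key, relevance))
    return redirected


# Hypothetical sample data: two judged elements share one paragraph.
element_to_paragraph = {
    ("doc-a", "m1"): "doc-a_1_1",
    ("doc-a", "m2"): "doc-a_1_1",
    ("doc-b", "m7"): "doc-b_1_3",
}
judgements = [
    ("topic-1", "doc-a", "m1", 1),
    ("topic-1", "doc-a", "m2", 3),
    ("topic-1", "doc-b", "m7", 0),
    ("topic-1", "doc-c", "m9", 2),  # element missing from the mapping
]
converted = redirect_judgements(judgements, element_to_paragraph)
for (topic, paragraph_id), relevance in sorted(converted.items()):
    print(topic, paragraph_id, relevance)
```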
Hashes for ntcir10_math_converter-0.1.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 63ad03444a5056d18753a780ac358bc7ef53bcf85f34e00c9f7fae2483e884ad
MD5 | d6cecfcc76cb140e22c509383dcea38d
BLAKE2b-256 | b24c74b5f9a073288c5c483e8aca9c931812bf157f55aabd13f0214978d7d39f
Hashes for ntcir10_math_converter-0.1.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | cbe69c567b44d34b11e2c27a2d252be82b05080e64954485ac49aba70500b50a
MD5 | 657ad027b8758afaf17e63c4c7f0d3fc
BLAKE2b-256 | 04955848df81ff193a9df74d2c2781a9843638bf7f2325c8942a883b721dbfba
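A downloaded archive can be checked against the digests listed above with Python's standard `hashlib` module. The following is a minimal sketch. The helper names `sha256_of_stream` and `file_matches` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
import io

# SHA256 digest listed above for ntcir10_math_converter-0.1.1.tar.gz.
EXPECTED_SHA256 = "63ad03444a5056d18753a780ac358bc7ef53bcf85f34e00c9f7fae2483e884ad"


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Return the hexadecimal SHA256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, expected=EXPECTED_SHA256):
    """Check whether the file at `path` has the expected SHA256 digest."""
    with open(path, "rb") as stream:
        return sha256_of_stream(stream) == expected


# Self-check on in-memory bytes, so the sketch runs without a downloaded file.
sample = b"example bytes"
assert sha256_of_stream(io.BytesIO(sample)) == hashlib.sha256(sample).hexdigest()
```

Calling `file_matches("ntcir10_math_converter-0.1.1.tar.gz")` on the saved source distribution should return `True` if the download is intact.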