The NTCIR-10 Math Converter package converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR format.

Introduction

In the NTCIR-10 Math task, the retrieval unit of the dataset is an arXiv document and the judgement unit of the relevance judgements is an XML element. In the NTCIR-11 Math-2 and NTCIR-12 MathIR tasks, by contrast, both the retrieval unit and the judgement unit are an arXiv document paragraph. This mismatch makes it difficult to use the datasets together in a single evaluation.

NTCIR-10 Math Converter is a Python 3 command-line utility that converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR format. It does so by splitting the dataset into paragraphs and redirecting each relevance judgement from an element to the paragraph that contains it. As a result, the NTCIR-10 Math dataset and relevance judgements can be used together with the NTCIR-11 Math-2 and NTCIR-12 MathIR datasets and relevance judgements in a single evaluation.
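
The redirection of judgements from elements to their ancestral paragraphs can be illustrated with a small sketch. The toy document, the use of `<p>` as the paragraph element, the `id` attributes, the judgement tuples, and the function names are all assumptions made for illustration; none of them is taken from the converter's source or from the actual NTCIR file formats.

```python
import xml.etree.ElementTree as ET

# Toy document: paragraph and element identifiers are made up for illustration.
DOCUMENT = """
<html>
  <body>
    <p id="p1">Let <math id="m1"><mi id="m1.1">x</mi></math> be real.</p>
    <p id="p2">Then <math id="m2"><mi id="m2.1">y</mi></math> follows.</p>
  </body>
</html>
"""


def build_element_to_paragraph_map(xml_text):
    """Map every identified element to the identifier of its ancestral paragraph."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for paragraph in root.iter("p"):
        paragraph_id = paragraph.get("id")
        for element in paragraph.iter():
            element_id = element.get("id")
            # iter() also yields the paragraph itself, which is skipped here.
            if element is not paragraph and element_id is not None:
                mapping[element_id] = paragraph_id
    return mapping


def redirect_judgements(judgements, mapping):
    """Redirect (topic, element_id, relevance) judgements to paragraphs.

    When several judged elements share a paragraph, the highest relevance is
    kept; this tie-breaking rule is an assumption of the sketch.
    """
    redirected = {}
    for topic, element_id, relevance in judgements:
        key = (topic, mapping[element_id])
        redirected[key] = max(relevance, redirected.get(key, relevance))
    return redirected


mapping = build_element_to_paragraph_map(DOCUMENT)
judgements = [("topic-1", "m1.1", 2), ("topic-1", "m1", 1), ("topic-1", "m2", 0)]
paragraph_judgements = redirect_judgements(judgements, mapping)
```

In this sketch, the two judgements on elements inside `p1` collapse into a single judgement on `p1`, which is the kind of merging a paragraph-level evaluation needs. How the real converter resolves such collisions is not stated here.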

Usage

Installing:

$ pip install ntcir10-math-converter

Displaying the usage:

$ ntcir10-math-converter --help
usage: ntcir10-math-converter [-h] --dataset DATASET [DATASET ...]
                              [--judgements JUDGEMENTS [JUDGEMENTS ...]]
                              [--num-workers NUM_WORKERS]

Convert NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 Math-2,
and NTCIR-12 MathIR format.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET [DATASET ...]
                        A path to a directory containing the NTCIR-10 Math
                        dataset, and a path to a non-existent directory that
                        will contain resulting dataset in the NTCIR-11 Math-2,
                        and NTCIR-12 MathIR format. If only the path to the
                        NTCIR-10 Math dataset is specified, the dataset will
                        be read to find out the mapping between element
                        identifiers, and paragraph identifiers. This is
                        required for converting the relevance judgements.
  --judgements JUDGEMENTS [JUDGEMENTS ...]
                        Paths to the files containing NTCIR-10 Math relevance
                        judgements (odd arguments), followed by paths to the
                        files that will contain resulting relevance judgements
                        in the NTCIR-11 Math-2, and NTCIR-12 MathIR format
                        (even arguments).
  --num-workers NUM_WORKERS
                        The number of processes that will be used for
                        processing the NTCIR-10 Math dataset. Defaults to 1.
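
The help text describes two conventions that are easy to misread: `--dataset` takes one or two paths, and `--judgements` takes a flat list in which input and output paths alternate. The following sketch shows one way such an interface could be declared and interpreted with `argparse`. It is an illustration only, not the package's actual source; `build_parser`, `pair_judgements`, and `convert_dataset` are hypothetical names.

```python
import argparse


def build_parser():
    # Mirrors the options shown in the help text; not the package's actual source.
    parser = argparse.ArgumentParser(prog="ntcir10-math-converter")
    parser.add_argument("--dataset", nargs="+", required=True)
    parser.add_argument("--judgements", nargs="+", default=[])
    parser.add_argument("--num-workers", type=int, default=1)
    return parser


def pair_judgements(paths):
    """Split a flat argument list into (input, output) pairs."""
    if len(paths) % 2 != 0:
        raise ValueError("judgement paths must come in input/output pairs")
    return list(zip(paths[0::2], paths[1::2]))


args = build_parser().parse_args([
    "--dataset", "ntcir-10",
    "--judgements",
    "NTCIR_10_Math-qrels_ft.dat", "NTCIR_10_Math-qrels_ft-converted.dat",
    "NTCIR_10_Math-qrels_fs.dat", "NTCIR_10_Math-qrels_fs-converted.dat",
])
pairs = pair_judgements(args.judgements)
# A second --dataset path would name the output directory for the converted dataset.
convert_dataset = len(args.dataset) == 2
```

With a single `--dataset` path, as here, the dataset is only read to build the mapping needed for the judgements, and no converted dataset is written.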

Converting both a dataset and relevance judgements using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9634.03it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9671.33it/s]
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████| 100000/100000 [06:45<00:00, 246.50it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]

Converting only a dataset using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
100%|████████████████████████████████████████████████████| 100000/100000 [07:34<00:00, 220.10it/s]

Converting only relevance judgements using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9539.55it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9332.81it/s]
Processing dataset ntcir-10
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████████| 2405/2405 [00:16<00:00, 144.41it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 260760.14it/s]
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 299442.45it/s]
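
When the converter is driven from a Python script, the three invocations above differ only in which arguments are present. The sketch below assembles the command line from Python values using only the options shown in the help text. `build_command` and its parameters are hypothetical helper names, and the `subprocess.run` call is left commented out because it requires the package and the data to be available.

```python
import subprocess


def build_command(dataset, converted_dataset=None, judgements=None, num_workers=1):
    """Assemble an ntcir10-math-converter command line as a list of arguments."""
    command = ["ntcir10-math-converter", "--num-workers", str(num_workers)]
    command += ["--dataset", dataset]
    if converted_dataset is not None:
        # The second path names the directory for the converted dataset.
        command.append(converted_dataset)
    if judgements:
        command.append("--judgements")
        # Input and output paths alternate, as the help text describes.
        for source, target in judgements.items():
            command += [source, target]
    return command


command = build_command(
    "ntcir-10",
    judgements={
        "NTCIR_10_Math-qrels_ft.dat": "NTCIR_10_Math-qrels_ft-converted.dat",
        "NTCIR_10_Math-qrels_fs.dat": "NTCIR_10_Math-qrels_fs-converted.dat",
    },
    num_workers=64,
)
# Uncomment to run the converter; requires the package and the data to be present.
# subprocess.run(command, check=True)
```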
