The NTCIR-10 Math Converter package converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format.


NTCIR-10 Math Converter – Converts NTCIR-10 Math datasets and judgements into the NTCIR-11 and NTCIR-12 format

The retrieval unit in the NTCIR-10 Math task dataset is an arXiv document, and the judgement unit in the relevance judgements is an XML element. In contrast, both the retrieval unit and the judgement unit in the NTCIR-11 Math-2 and NTCIR-12 MathIR task datasets and relevance judgements are an arXiv document paragraph. This makes it difficult to use the datasets together in a single evaluation.

NTCIR-10 Math Converter is a Python 3 command-line utility that converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format by splitting the dataset into paragraphs and redirecting the relevance judgements from elements to their ancestral paragraphs. As a result, the NTCIR-10 Math dataset and relevance judgements can be used together with the NTCIR-11 Math-2 and NTCIR-12 MathIR datasets and relevance judgements in a single workflow.
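The redirection of a judgement from an element to its ancestral paragraph can be pictured with a minimal sketch. The toy document, the `find_ancestral_paragraph` helper, and the assumption that a paragraph is a `div` element with the class `para` are all hypothetical and serve only to illustrate the idea; the converter's actual logic and the identifiers it produces may differ.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A toy document (hypothetical); real NTCIR-10 documents are far larger.
DOCUMENT = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="para" id="p1"><span id="id101">x + y</span></div>
    <div class="para" id="p2"><span id="id102">a = b</span></div>
    <span id="id103">outside</span>
  </body>
</html>
"""


def find_ancestral_paragraph(root, element_id):
    """Return the id of the closest paragraph ancestor, or None."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    element = next((e for e in root.iter() if e.get("id") == element_id), None)
    while element is not None:
        # Assumption: paragraphs are <div class="para"> elements.
        if element.tag == f"{{{XHTML_NS}}}div" and element.get("class") == "para":
            return element.get("id")
        element = parents.get(element)
    return None  # the element is missing or appears outside a paragraph


root = ET.fromstring(DOCUMENT)
print(find_ancestral_paragraph(root, "id101"))  # p1
print(find_ancestral_paragraph(root, "id103"))  # None
```

In this sketch, an element with no paragraph ancestor yields `None`, which corresponds to a judgement that cannot be redirected.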

Usage

Installing

The package can be installed by executing the following command:

$ pip install ntcir10-math-converter

Displaying the usage

Usage information for the package can be displayed by executing the following command:

$ ntcir10-math-converter --help
usage: ntcir10-math-converter [-h] --dataset DATASET [DATASET ...]
                              [--judgements JUDGEMENTS [JUDGEMENTS ...]]
                              [--num-workers NUM_WORKERS]

Convert NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11
Math-2, and NTCIR-12 MathIR XHTML5 format.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET [DATASET ...]
                        A path to a directory containing the NTCIR-10 Math
                        XHTML5 dataset, and a path to a non-existent directory
                        that will contain resulting dataset in the NTCIR-11
                        Math-2, and NTCIR-12 MathIR XHTML5 format. If only the
                        path to the NTCIR-10 Math dataset is specified, the
                        dataset will be read to find out the mapping between
                        element identifiers, and paragraph identifiers. This
                        is required for converting the relevance judgements.
  --judgements JUDGEMENTS [JUDGEMENTS ...]
                        Paths to the files containing NTCIR-10 Math relevance
                        judgements (odd arguments), followed by paths to the
                        files that will contain resulting relevance judgements
                        in the NTCIR-11 Math-2, and NTCIR-12 MathIR format
                        (even arguments).
  --num-workers NUM_WORKERS
                        The number of processes that will be used for
                        processing the NTCIR-10 Math dataset. Defaults to 1.
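The odd/even convention of `--judgements` means that the flat argument list is read as consecutive (input, output) pairs. The following sketch shows one way such a list could be paired up; the `pair_judgement_paths` helper is hypothetical and is not taken from the package.

```python
def pair_judgement_paths(paths):
    """Split a flat list of paths into (input_path, output_path) pairs."""
    if len(paths) % 2 != 0:
        raise ValueError("expected an even number of paths")
    # Odd arguments (1st, 3rd, ...) are inputs, even ones are outputs.
    return list(zip(paths[0::2], paths[1::2]))


pairs = pair_judgement_paths([
    "NTCIR_10_Math-qrels_ft.dat", "NTCIR_10_Math-qrels_ft-converted.dat",
    "NTCIR_10_Math-qrels_fs.dat", "NTCIR_10_Math-qrels_fs-converted.dat",
])
for source, target in pairs:
    print(f"{source} -> {target}")
```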

Converting a dataset and relevance judgements

The following command converts both a dataset and relevance judgements using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9634.03it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9671.33it/s]
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████| 100000/100000 [06:45<00:00, 246.50it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements
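The input / output counts reported above show how many judgements did not make it into the converted files. The short calculation below only subtracts the reported numbers; whether every missing judgement corresponds to a skipped identifier is not stated in the log, so the sketch makes no claim about the cause.

```python
def dropped(input_count, output_count):
    """Return the number and percentage of judgements missing from the output."""
    lost = input_count - output_count
    return lost, 100.0 * lost / input_count


# Counts copied from the log output above.
for name, n_in, n_out in [("ft", 1425, 1394), ("fs", 2129, 2076)]:
    lost, percent = dropped(n_in, n_out)
    print(f"{name}: {lost} fewer judgements ({percent:.1f} %)")
# ft: 31 fewer judgements (2.2 %)
# fs: 53 fewer judgements (2.5 %)
```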

The following command converts only a dataset using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
100%|████████████████████████████████████████████████████| 100000/100000 [07:34<00:00, 220.10it/s]

The following command converts only relevance judgements using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9539.55it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9332.81it/s]
Processing dataset ntcir-10
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████████| 2405/2405 [00:16<00:00, 144.41it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements
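When the converter is driven from a Python script, the three invocations above differ only in which arguments are present. The sketch below assembles the argument list from the documented options (`--dataset`, `--judgements`, `--num-workers`); the `build_command` helper is hypothetical, and the call that would actually run the tool is left commented out.

```python
import shlex


def build_command(dataset_in, dataset_out=None, judgement_pairs=(), num_workers=1):
    """Assemble an ntcir10-math-converter command line as a list of arguments."""
    command = ["ntcir10-math-converter", "--num-workers", str(num_workers)]
    command += ["--dataset", dataset_in]
    if dataset_out is not None:
        # Omitting the output directory reads the dataset without converting it.
        command.append(dataset_out)
    if judgement_pairs:
        command.append("--judgements")
        for source, target in judgement_pairs:
            command += [source, target]
    return command


command = build_command(
    "ntcir-10",
    judgement_pairs=[
        ("NTCIR_10_Math-qrels_ft.dat", "NTCIR_10_Math-qrels_ft-converted.dat"),
    ],
    num_workers=64,
)
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```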

Release history

0.2.2: Jun 21, 2018 (this version)
0.2.1: Jun 16, 2018
0.1.6: Jun 6, 2018
0.1.5: May 21, 2018
0.1.4: May 19, 2018
0.1.3: May 18, 2018
0.1.2: May 18, 2018
0.1.1: May 18, 2018
