The NTCIR Math Density Estimator package uses the NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR datasets to compute density and probability estimators.

Introduction

NTCIR Math Density Estimator is a Python 3 command-line utility that computes and plots density and probability estimators from judged datasets in the NTCIR-11 Math-2 and NTCIR-12 MathIR format. Most importantly, the package estimates the probability P(relevant | position), where position is the position of a paragraph within its document.
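
The tool's log output names two fitted densities, a prior p(position) and a conditional p(position | relevant). One natural way to combine them into P(relevant | position) is Bayes' rule, written out below. This is an assumption about how the quantities relate, not a statement of the package's exact implementation; P(relevant) denotes the overall fraction of relevant paragraphs.

```latex
P(\text{relevant} \mid \text{position})
  = \frac{p(\text{position} \mid \text{relevant}) \; P(\text{relevant})}
         {p(\text{position})}
```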

Usage

Installing:

$ pip install ntcir-math-density

Displaying the usage:

$ ntcir-math-density --help
usage: ntcir-math-density [-h] [--datasets DATASETS [DATASETS ...]]
                          [--judgements JUDGEMENTS [JUDGEMENTS ...]]
                          [--plots PLOTS [PLOTS ...]] [--positions POSITIONS]
                          [--estimators ESTIMATORS] [--num-workers NUM_WORKERS]

Use NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR datasets to compute
density, and probability estimators.

optional arguments:
-h, --help            show this help message and exit
--datasets DATASETS [DATASETS ...]
                        Paths to the directories containing the datasets. Each
                        path must be prefixed with a unique single-letter
                        label (e.g. "A=/some/path"). Note that all the
                        datasets must be in the NTCIR-11 Math-2, and NTCIR-12
                        MathIR format, even the NTCIR-10 Math dataset.
--judgements JUDGEMENTS [JUDGEMENTS ...]
                        Paths to the files containing relevance judgements.
                        Each path must be prefixed with single-letter labels
                        corresponding to the judged datasets (e.g.
                        "A:/some/path/judgement.dat"). Note that all the
                        judgements must be in the NTCIR-11 Math-2, and
                        NTCIR-12 MathIR format, even the NTCIR-10 Math dataset
                        judgements.
--plots PLOTS [PLOTS ...]
                        The path to the files, where the probability
                        estimators will plotted. When no datasets are
                        specified, the estimators file will be loaded.
--positions POSITIONS
                        The path to the file, where the estimated positions of
                        all paragraph identifiers from all datasets will be
                        stored. Defaults to positions.pkl.gz.
--estimators ESTIMATORS
                        The path to the file, where the density, and
                        probability estimators will be stored. When no
                        datasets are specified, this file will be loaded to
                        provide the estimators for plotting. Defaults to
                        estimators.pkl.gz.
--num-workers NUM_WORKERS
                        The number of processes that will be used for
                        processing the NTCIR-10 Math dataset, and for
                        computing the density, and probability estimates.
                        Defaults to 1.
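
The label prefixes described in the help text can be illustrated with a small parsing sketch. The helper names below are hypothetical and are not part of the package. The sketch also assumes that a judgement prefix may list several single-letter labels (e.g. "AB:..."), which the help text suggests but does not spell out.

```python
def parse_dataset_arg(arg):
    """Split 'A=/some/path' into ('A', '/some/path')."""
    label, sep, path = arg.partition("=")
    if not sep or len(label) != 1 or not path:
        raise ValueError("expected LABEL=PATH with a single-letter label: %r" % arg)
    return label, path


def parse_datasets(args):
    """Map each unique single-letter label to its dataset path."""
    datasets = {}
    for arg in args:
        label, path = parse_dataset_arg(arg)
        if label in datasets:
            raise ValueError("duplicate dataset label: %r" % label)
        datasets[label] = path
    return datasets


def parse_judgement_arg(arg):
    """Split 'A:/some/path/judgement.dat' into (['A'], '/some/path/judgement.dat').

    Assumption: the prefix is a run of one or more single-letter labels.
    """
    labels, sep, path = arg.partition(":")
    if not sep or not labels or not path:
        raise ValueError("expected LABELS:PATH: %r" % arg)
    return list(labels), path


datasets = parse_datasets(["A=ntcir-10-converted", "B=ntcir-11-12"])
labels, path = parse_judgement_arg("B:NTCIR11_Math-qrels.dat")
# Every label used by a judgement file should refer to a declared dataset.
unknown_labels = [label for label in labels if label not in datasets]
```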

Extracting the density and probability estimators and plotting the estimates, using 64 worker processes:

$ ntcir-math-density --num-workers 64 \
>     --datasets A=ntcir-10-converted B=ntcir-11-12 \
>     --judgements A:NTCIR_10_Math-qrels_fs-converted.dat A:NTCIR_10_Math-qrels_ft-converted.dat \
>                  B:NTCIR11_Math-qrels.dat B:NTCIR12_Math-qrels_agg.dat \
>                  B:NTCIR12_Math_simto-qrels_agg.dat \
>     --estimators estimators.pkl.gz --positions positions.pkl.gz \
>     --plots plot.pdf plot.svg
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 334959.05it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 353201.94it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR11_Math-qrels.dat
100%|█████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 343345.12it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 4251/4251 [00:00<00:00, 342252.50it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math_simto-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 654/654 [00:00<00:00, 314428.57it/s]
Retrieving all paragraph identifiers, and positions from ntcir-10-converted
get_all_identifiers(ntcir-10-converted): 5405167it [04:30, 19946.57it/s]
get_all_positions(ntcir-10-converted): 100%|█████████| 5405167/5405167 [08:44<00:00, 10306.72it/s]
Retrieving all paragraph identifiers, and positions from ntcir-11-12
get_all_identifiers(ntcir-11-12): 8301578it [08:08, 16985.19it/s]
get_all_positions(ntcir-11-12): 100%|█████████████████| 8301578/8301578 [44:30<00:00, 3108.88it/s]
1043 / 3146 / 5405167 relevant / judged / total identifiers in dataset ntcir-10-converted
1742 / 7059 / 8301578 relevant / judged / total identifiers in dataset ntcir-11-12
Pickling positions.pkl.gz
Fitting density, and probability estimators
Fitting prior p(position) density estimator
Fitting conditional p(position | relevant) density estimator
Pickling estimators.pkl.gz
Computing density, and probability estimates for a plot
p(position): 100%|████████████████████████████████████████████████| 64/64 [01:19<00:00,  1.24s/it]
p(position|relevant): 100%|███████████████████████████████████████| 64/64 [01:19<00:00,  1.24s/it]
Plotting plot.svg
Plotting plot.pdf
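
To make the fitting steps in the log above more concrete, here is a minimal, self-contained sketch of the idea: fit one density over all paragraph positions, fit another over the relevant ones, and combine them with Bayes' rule. Everything here is illustrative. The plain Gaussian kernel estimator, the bandwidth, the toy positions normalized to [0, 1], and the use of relevant / judged as the prior P(relevant) are all assumptions, not the package's actual method.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def relevance_probability(position, p_position, p_position_given_relevant, p_relevant):
    """Bayes' rule: P(relevant | position) from a prior and two densities."""
    evidence = p_position(position)
    if evidence == 0.0:
        return 0.0
    # The two densities are fitted independently, so the ratio can slightly
    # exceed 1; clamp it to keep a valid probability.
    return min(1.0, p_position_given_relevant(position) * p_relevant / evidence)


# Toy positions invented for illustration (0 = start of document, 1 = end).
all_positions = [0.05, 0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9, 0.95]
relevant_positions = [0.05, 0.1, 0.2]

p_position = gaussian_kde(all_positions, bandwidth=0.1)
p_position_given_relevant = gaussian_kde(relevant_positions, bandwidth=0.1)

# Prior taken from the counts logged for ntcir-10-converted (relevant / judged).
p_relevant = 1043 / 3146

early = relevance_probability(0.1, p_position, p_position_given_relevant, p_relevant)
late = relevance_probability(0.9, p_position, p_position_given_relevant, p_relevant)
print("P(relevant | 0.1) = %.3f, P(relevant | 0.9) = %.3f" % (early, late))
```

With this toy data, where the relevant paragraphs cluster near the start of the document, the estimate is much higher for early positions than for late ones.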

Extracting the density and probability estimators using 64 worker processes:

$ ntcir-math-density --num-workers 64 \
>     --datasets A=ntcir-10-converted B=ntcir-11-12 \
>     --judgements A:NTCIR_10_Math-qrels_fs-converted.dat A:NTCIR_10_Math-qrels_ft-converted.dat \
>                  B:NTCIR11_Math-qrels.dat B:NTCIR12_Math-qrels_agg.dat \
>                  B:NTCIR12_Math_simto-qrels_agg.dat \
>     --estimators estimators.pkl.gz --positions positions.pkl.gz
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 334959.05it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 353201.94it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR11_Math-qrels.dat
100%|█████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 343345.12it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 4251/4251 [00:00<00:00, 342252.50it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math_simto-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 654/654 [00:00<00:00, 314428.57it/s]
Retrieving all paragraph identifiers, and positions from ntcir-10-converted
get_all_identifiers(ntcir-10-converted): 5405167it [04:30, 19946.57it/s]
get_all_positions(ntcir-10-converted): 100%|█████████| 5405167/5405167 [08:44<00:00, 10306.72it/s]
Retrieving all paragraph identifiers, and positions from ntcir-11-12
get_all_identifiers(ntcir-11-12): 8301578it [08:08, 16985.19it/s]
get_all_positions(ntcir-11-12): 100%|█████████████████| 8301578/8301578 [44:30<00:00, 3108.88it/s]
1043 / 3146 / 5405167 relevant / judged / total identifiers in dataset ntcir-10-converted
1742 / 7059 / 8301578 relevant / judged / total identifiers in dataset ntcir-11-12
Pickling positions.pkl.gz
Fitting density, and probability estimators
Fitting prior p(position) density estimator
Fitting conditional p(position | relevant) density estimator
Pickling estimators.pkl.gz

Plotting the estimates using 64 worker processes:

$ ntcir-math-density --num-workers 64 \
>     --estimators estimators.pkl.gz --plots plot.pdf plot.svg
Unpickling estimators.pkl.gz
Computing density, and probability estimates for a plot
p(position): 100%|████████████████████████████████████████████████| 64/64 [01:19<00:00,  1.24s/it]
p(position|relevant): 100%|███████████████████████████████████████| 64/64 [01:19<00:00,  1.24s/it]
Plotting plot.svg
Plotting plot.pdf
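
The .pkl.gz file names suggest gzip-compressed pickles, so the stored files can probably be inspected from Python with the standard library alone. The sketch below assumes exactly that. The structure of the unpickled objects is not documented here, and unpickling will likely need the package and its dependencies to be importable. The helper names are hypothetical. Only unpickle files that you trust.

```python
import gzip
import os
import pickle


def save_pickle_gz(obj, path):
    """Write `obj` to `path` as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle_gz(path):
    """Read a gzip-compressed pickle from `path`."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Inspect the stored estimators if the file from the examples above exists.
if os.path.exists("estimators.pkl.gz"):
    estimators = load_pickle_gz("estimators.pkl.gz")
    print(type(estimators))
```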


Files for ntcir-math-density, version 0.1.3

Filename                                       Size     File type  Python version
ntcir_math_density-0.1.3-py2.py3-none-any.whl  10.6 kB  Wheel      py2.py3
ntcir_math_density-0.1.3-py3.6.egg             18.5 kB  Egg        3.6
ntcir_math_density-0.1.3.tar.gz                8.9 kB   Source     None
