The NTCIR Math Density Estimator package uses the NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR datasets to compute density and probability estimates.
Introduction
NTCIR Math Density Estimator is a Python 3 command-line utility that computes
and plots density and probability estimates from judged datasets in the
NTCIR-11 Math-2 and NTCIR-12 MathIR format. Most importantly, the package
estimates the probability P(relevant | position), where position is the
position of a paragraph in a document.
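The log messages in the usage examples mention a prior p(position) density estimator and a conditional p(position | relevant) density estimator. One natural way to combine the two is Bayes' rule, P(relevant | position) = p(position | relevant) · P(relevant) / p(position). The following is a minimal sketch of that idea, not the package's implementation: the Gaussian kernel, the bandwidth, the positions normalized to the interval [0, 1], the toy data, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def relevance_probability(all_positions, relevant_positions, prior_relevant, bandwidth=0.1):
    """Combine two density estimates into P(relevant | position) using Bayes' rule."""
    p_position = gaussian_kde(all_positions, bandwidth)
    p_position_given_relevant = gaussian_kde(relevant_positions, bandwidth)

    def probability(x):
        return p_position_given_relevant(x) * prior_relevant / p_position(x)

    return probability


# Toy data (hypothetical): positions of paragraphs, normalized to [0, 1].
all_positions = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
relevant_positions = [0.05, 0.1, 0.2]  # a subset of all_positions
prior_relevant = len(relevant_positions) / len(all_positions)

estimate = relevance_probability(all_positions, relevant_positions, prior_relevant)
for position in (0.1, 0.5, 0.9):
    print(f"P(relevant | position={position}) ~ {estimate(position):.3f}")
```

In this toy data, the relevant paragraphs sit near the beginning of the documents, so the estimate decreases as the position grows.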
Usage
Installing:
$ pip install ntcir-math-density
Displaying the usage:
$ ntcir-math-density --help
usage: ntcir-math-density [-h] [--datasets DATASETS [DATASETS ...]]
[--judgements JUDGEMENTS [JUDGEMENTS ...]]
[--plots PLOTS [PLOTS ...]] [--positions POSITIONS]
[--estimates ESTIMATES] [--num-workers NUM_WORKERS]
Use NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR datasets to compute
density, and probability estimates.
optional arguments:
-h, --help show this help message and exit
--datasets DATASETS [DATASETS ...]
Paths to the directories containing the datasets. Each
path must be prefixed with a unique single-letter
label (e.g. "A=/some/path"). Note that all the
datasets must be in the NTCIR-11 Math-2, and NTCIR-12
MathIR format, even the NTCIR-10 Math dataset.
--judgements JUDGEMENTS [JUDGEMENTS ...]
Paths to the files containing relevance judgements.
Each path must be prefixed with single-letter labels
corresponding to the judged datasets (e.g.
"A:/some/path/judgement.dat"). Note that all the
judgements must be in the NTCIR-11 Math-2, and
NTCIR-12 MathIR format, even the NTCIR-10 Math dataset
judgements.
--plots PLOTS [PLOTS ...]
The path to the files, where the probability
estimates will plotted. When no datasets are
specified, the estimates file will be loaded.
--positions POSITIONS
The path to the file, where the estimated positions of
all paragraph identifiers from all datasets will be
stored. Defaults to positions.pkl.gz.
--estimates ESTIMATES
The path to the file, where the density, and
probability estimates will be stored. When no
datasets are specified, this file will be loaded to
provide the estimates for plotting. Defaults to
estimates.pkl.gz.
--num-workers NUM_WORKERS
The number of processes that will be used for
processing the NTCIR-10 Math dataset, and for
computing the density, and probability estimates.
Defaults to 1.
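The help text above describes two labelled path formats: "A=/some/path" for datasets and "A:/some/path/judgement.dat" for judgements. The sketch below shows one way such arguments could be split and cross-checked. The function names are hypothetical, and treating the judgement prefix as one or more single-letter labels is only one reading of the help text, so the package itself may parse these arguments differently.

```python
def parse_dataset_argument(argument):
    """Split 'A=/some/path' into the label 'A' and the path '/some/path'."""
    label, separator, path = argument.partition("=")
    if separator != "=" or len(label) != 1 or not path:
        raise ValueError(f"expected LABEL=PATH with a single-letter label, got {argument!r}")
    return label, path


def parse_judgement_argument(argument, known_labels):
    """Split 'A:/some/path' into a list of dataset labels and the path."""
    labels, separator, path = argument.partition(":")
    if separator != ":" or not labels or not path:
        raise ValueError(f"expected LABELS:PATH, got {argument!r}")
    unknown = [label for label in labels if label not in known_labels]
    if unknown:
        raise ValueError(f"labels {unknown} do not match any dataset in {argument!r}")
    return list(labels), path


# Map each dataset label to its directory, then check the judgements against the labels.
datasets = dict(
    parse_dataset_argument(argument)
    for argument in ["A=ntcir-10-converted", "B=ntcir-11-12"]
)
judgements = [
    parse_judgement_argument(argument, datasets)
    for argument in ["A:NTCIR_10_Math-qrels_fs-converted.dat", "B:NTCIR11_Math-qrels.dat"]
]
print(datasets)
print(judgements)
```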
Extracting density and probability estimates, and plotting them, using 64 worker processes:
$ ntcir-math-density --num-workers 64 \
> --datasets A=ntcir-10-converted B=ntcir-11-12 \
> --judgements A:NTCIR_10_Math-qrels_fs-converted.dat A:NTCIR_10_Math-qrels_ft-converted.dat \
> B:NTCIR11_Math-qrels.dat B:NTCIR12_Math-qrels_agg.dat \
> B:NTCIR12_Math_simto-qrels_agg.dat \
> --estimates estimates.pkl.gz --positions positions.pkl.gz \
> --plots plot.pdf plot.svg
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 334959.05it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 353201.94it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR11_Math-qrels.dat
100%|█████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 343345.12it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 4251/4251 [00:00<00:00, 342252.50it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math_simto-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 654/654 [00:00<00:00, 314428.57it/s]
Retrieving all paragraph identifiers, and positions from ntcir-10-converted
get_all_identifiers(ntcir-10-converted): 5405167it [04:30, 19946.57it/s]
get_all_positions(ntcir-10-converted): 100%|█████████| 5405167/5405167 [08:44<00:00, 10306.72it/s]
Retrieving all paragraph identifiers, and positions from ntcir-11-12
get_all_identifiers(ntcir-11-12): 8301578it [08:08, 16985.19it/s]
get_all_positions(ntcir-11-12): 100%|█████████████████| 8301578/8301578 [44:30<00:00, 3108.88it/s]
1043 / 3146 / 5405167 relevant / judged / total identifiers in dataset ntcir-10-converted
1742 / 7059 / 8301578 relevant / judged / total identifiers in dataset ntcir-11-12
Pickling positions.pkl.gz
Fitting density, and probability estimators
Fitting prior p(position) density estimator
Fitting conditional p(position | relevant) density estimator
Computing density, and probability estimates
p(position): 100%|████████████████████████████████████████████████| 64/64 [01:19<00:00, 1.24s/it]
p(position | relevant): 100%|█████████████████████████████████████| 64/64 [01:19<00:00, 1.24s/it]
Pickling estimates.pkl.gz
Plotting plot.svg
Plotting plot.pdf
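The summary lines in the output above report the number of relevant, judged, and total identifiers for each dataset. The sketch below parses such a line and derives two ratios from it. The regular expression is written against the two lines shown above only, and the function name is hypothetical. Whether the package uses these ratios internally is not stated here.

```python
import re

SUMMARY_PATTERN = re.compile(
    r"(\d+) / (\d+) / (\d+) relevant / judged / total identifiers in dataset (\S+)"
)


def parse_summary(line):
    """Extract the counts and the dataset name from a summary line."""
    match = SUMMARY_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    relevant, judged, total = (int(group) for group in match.groups()[:3])
    return {"dataset": match.group(4), "relevant": relevant, "judged": judged, "total": total}


lines = [
    "1043 / 3146 / 5405167 relevant / judged / total identifiers in dataset ntcir-10-converted",
    "1742 / 7059 / 8301578 relevant / judged / total identifiers in dataset ntcir-11-12",
]
summaries = [parse_summary(line) for line in lines]
for summary in summaries:
    relevant_among_judged = summary["relevant"] / summary["judged"]
    judged_among_total = summary["judged"] / summary["total"]
    print(
        f"{summary['dataset']}: "
        f"{relevant_among_judged:.1%} of judged identifiers are relevant, "
        f"{judged_among_total:.3%} of all identifiers are judged"
    )
```

The ratios make it easy to see that only a small fraction of the paragraphs in either dataset has been judged.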
Extracting density and probability estimates using 64 worker processes:
$ ntcir-math-density --num-workers 64 \
> --datasets A=ntcir-10-converted B=ntcir-11-12 \
> --judgements A:NTCIR_10_Math-qrels_fs-converted.dat A:NTCIR_10_Math-qrels_ft-converted.dat \
> B:NTCIR11_Math-qrels.dat B:NTCIR12_Math-qrels_agg.dat \
> B:NTCIR12_Math_simto-qrels_agg.dat \
> --estimates estimates.pkl.gz --positions positions.pkl.gz
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_fs-converted.dat
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 334959.05it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR_10_Math-qrels_ft-converted.dat
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 353201.94it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR11_Math-qrels.dat
100%|█████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 343345.12it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 4251/4251 [00:00<00:00, 342252.50it/s]
Retrieving judged paragraph identifiers, and scores from NTCIR12_Math_simto-qrels_agg.dat
100%|█████████████████████████████████████████████████████| 654/654 [00:00<00:00, 314428.57it/s]
Retrieving all paragraph identifiers, and positions from ntcir-10-converted
get_all_identifiers(ntcir-10-converted): 5405167it [04:30, 19946.57it/s]
get_all_positions(ntcir-10-converted): 100%|█████████| 5405167/5405167 [08:44<00:00, 10306.72it/s]
Retrieving all paragraph identifiers, and positions from ntcir-11-12
get_all_identifiers(ntcir-11-12): 8301578it [08:08, 16985.19it/s]
get_all_positions(ntcir-11-12): 100%|█████████████████| 8301578/8301578 [44:30<00:00, 3108.88it/s]
1043 / 3146 / 5405167 relevant / judged / total identifiers in dataset ntcir-10-converted
1742 / 7059 / 8301578 relevant / judged / total identifiers in dataset ntcir-11-12
Pickling positions.pkl.gz
Fitting density, and probability estimators
Fitting prior p(position) density estimator
Fitting conditional p(position | relevant) density estimator
Computing density, and probability estimates
p(position): 100%|████████████████████████████████████████████████| 64/64 [01:19<00:00, 1.24s/it]
p(position | relevant): 100%|█████████████████████████████████████| 64/64 [01:19<00:00, 1.24s/it]
Pickling estimates.pkl.gz
Plotting the estimates using 64 worker processes:
$ ntcir-math-density --num-workers 64 \
> --estimates estimates.pkl.gz --plots plot.pdf plot.svg
Unpickling estimates.pkl.gz
Plotting plot.svg
Plotting plot.pdf
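The "Pickling" and "Unpickling" messages and the .pkl.gz extension suggest that the positions and estimates files are gzip-compressed pickles. Under that assumption, such a file could be read from Python as sketched below. The helper names are hypothetical, and the structure of the stored object is not documented here, so the example only does a round trip with placeholder data instead of inspecting real estimates. Unpickle only files that come from a trusted source, because unpickling can execute arbitrary code.

```python
import gzip
import os
import pickle
import tempfile


def save_pickle_gz(obj, path):
    """Write `obj` to `path` as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle_gz(path):
    """Read a gzip-compressed pickle from `path`."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Round trip with placeholder data; the real file contents may look different.
placeholder = {"note": "placeholder, not real estimates", "values": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "example.pkl.gz")
    save_pickle_gz(placeholder, path)
    restored = load_pickle_gz(path)

print(type(restored).__name__, restored == placeholder)
```

With a real file, a call such as load_pickle_gz("estimates.pkl.gz") followed by type() on the result is a cautious first step for finding out what the file contains.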