trec_dd

TREC Dynamic Domain (DD) evaluation test harness for simulating user interaction with a search engine

Development Status
- 3 - Alpha
License
- OSI Approved :: MIT License
Topic
- Utilities

trec-dd-simulation-harness

This is the official “jig” for simulating a user interacting with a TREC DD system during an interactive query session.

Usage

After installation (see below), use the trec_dd_harness command to run the harness and generate a run file.

The harness interacts with your TREC DD system by issuing queries to it and providing feedback (truth data) for the results it produces. While doing so, it records those results in a run file. After generating a run file with this harness, you can score the run using trec_dd_scorer.

The harness is run via three commands: start, step, stop. Typically, a system will invoke start, then invoke step multiple times, and then invoke stop. Every invocation must include the -c argument with a path to a valid config.yaml file, as illustrated in example/config.yaml.
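
As a rough illustration of the invocation pattern described above, the following Python sketch builds the argument list for one harness command and runs it with the subprocess module. Only the trec_dd_harness name, the -c flag, and the command names come from this document. The helper names harness_argv and call_harness are hypothetical, and the assumption that the JSON response arrives on stdout is not confirmed here. How extra arguments (such as results for step) are passed is left to the caller; trec_dd/system/ambassador_cli.py shows the actual convention.

```python
import json
import subprocess

# Commands named in this document.
HARNESS_COMMANDS = ("load", "init", "start", "step", "stop")


def harness_argv(config_path, command, extra_args=()):
    """Build the argv list for one harness invocation (hypothetical helper)."""
    if command not in HARNESS_COMMANDS:
        raise ValueError("unknown harness command: %r" % (command,))
    # Every invocation must carry -c with the path to config.yaml.
    return ["trec_dd_harness", "-c", config_path, command] + list(extra_args)


def call_harness(config_path, command, extra_args=()):
    """Run the harness and decode its response.

    Assumption: the JSON response is written to stdout.
    """
    argv = harness_argv(config_path, command, extra_args)
    output = subprocess.check_output(argv)
    return json.loads(output.decode("utf-8"))
```

Keeping the argv construction separate from the subprocess call makes it possible to check the command line without a harness installed.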

For efficiency, the first time you run with a new configuration, the truth data must be loaded into your database using the load command.

(Postgres example, using psql)
CREATE USER trec_dd_user PASSWORD 'some_password';
CREATE DATABASE trec_dd OWNER trec_dd_user;

(back at the Unix shell)
(set up config.yaml to point to the database and the truth data file)
trec_dd_harness -c config.yaml load

By default, when you score a system using the harness, all of the topics are applied to the system in an order selected by the harness. You can limit the topics that are used by specifying the topic_ids property in config.yaml.

The harness keeps track of the topic_ids that have not yet been used in building your system’s run file. To reset this state, you must run the init command.

To progress through the topics, your system must execute this double while loop, which is exactly what is implemented in the trec_dd/system/ambassador_cli.py example:

`init`
while 1:
    topic_id <-- `start`
    if topic_id is None: break
    while 1:
        results <-- run your system
        feedback <-- `step(results)`
        if feedback is None or len(feedback) < batch_size:
            break
        else:
            your system processes the feedback
    `stop`
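
The double loop above can be sketched in Python as follows. This is a minimal illustration, not the code in ambassador_cli.py: the harness and system objects and their method names (init, start, step, stop, search, process_feedback) are hypothetical interfaces chosen to mirror the pseudocode, so that the control flow can be exercised without a real harness.

```python
def run_session(harness, system, batch_size):
    """Drive a full run: one outer iteration per topic, one inner per batch.

    harness: hypothetical object exposing init(), start(), step(results), stop()
    system:  hypothetical object exposing search(topic_id) and
             process_feedback(topic_id, feedback)
    Returns the list of topic ids that were processed.
    """
    harness.init()  # reset the harness's record of used topics
    topics_done = []
    while True:
        topic_id = harness.start()
        if topic_id is None:
            break  # no topics left
        while True:
            results = system.search(topic_id)
            feedback = harness.step(results)
            # A missing or short batch of feedback ends this topic.
            if feedback is None or len(feedback) < batch_size:
                break
            system.process_feedback(topic_id, feedback)
        harness.stop()
        topics_done.append(topic_id)
    return topics_done
```

Passing the harness in as an object means the same loop can be driven by a command-line wrapper or by an in-memory stand-in during testing.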

Each of the five commands returns JSON, which your system can read using a JSON library. After a step command, the response is a list of dictionaries that looks like:

[
 {
     "topic_id": "DD15-1",
     "confidence": 0.987,
     "on_topic": 1,
     "stream_id": "1335424206-b5476b1b8bf25b179bcf92cfda23d975",
     "subtopics": [
         {
             "passage_text": "this is a passage of relevant text from the document 'stream_id', relevant to the 'subtopic_id' below with the 'rating' below",
             "rating": 3,
             "subtopic_id": "DD15-1.4",
             "subtopic_name": "a label for this subtopic"
         }
     ]
 },
 { ... }
]
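
A system might read such a response along the following lines. This is a sketch only: the function name summarize_feedback is hypothetical, and it assumes that results without truth data may omit the on_topic or subtopics keys, which this document does not specify.

```python
import json


def summarize_feedback(feedback_json):
    """Return (on-topic stream_ids, best rating seen per subtopic_id)."""
    feedback = json.loads(feedback_json)
    on_topic_ids = []
    best_rating = {}
    for item in feedback:
        # Assumption: missing keys mean "no truth data for this result".
        if item.get("on_topic"):
            on_topic_ids.append(item["stream_id"])
        for sub in item.get("subtopics", []):
            sid = sub["subtopic_id"]
            best_rating[sid] = max(best_rating.get(sid, 0), sub["rating"])
    return on_topic_ids, best_rating
```

The per-subtopic summary is one simple way for a system to decide which subtopics still need coverage in the next batch.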

The harness always provides feedback for every result, even if the feedback is that the system has no truth data for that result. Note that your use of the harness must call stop in the next iteration after any step in which you submit fewer than batch_size results. If you fail to do this, the harness will exit.

See trec_dd/system/ambassador_cli.py for an example of using the harness from Python.

The harness outputs a runfile, whose path is set in the configuration file.

To score a runfile (see “Scoring the System”):

trec_dd_scorer -c config.yaml run_file_in.txt run_file_scored.json > pretty_table.txt 2> log.txt &

run_file_in.txt is the run file output by the harness. The scorer outputs a scored run file in run_file_scored.json, and scores to stdout.

This repository also provides a baseline system that randomizes subtopic ordering (see “Example TREC DD Systems”). In particular, this baseline system shows how to hook a system up to the jig in Python. Hooking a system up to the jig via the command line is further documented below.

trec_dd_random_system -c config.yaml &> log.txt &

The scores for this baseline system using the TREC DD truth data are:

Score	Metric
0.438	average_err_arithmetic
0.298	average_err_harmonic
0.125	modified_precision_at_recall
0.981	precision_at_recall
0.075	reciprocal_rank_at_recall

Installation

The recommended way to install and use the scorer is with python virtualenv, which is a standard tool on all widely used platforms. For example on Ubuntu:

apt-get install python-virtualenv
virtualenv vpy

or on CentOS:

yum install python-virtualenv
virtualenv vpy

or on Mac OS X:

brew install pyenv-virtualenv
pyenv-virtualenv vpy

or on Windows.

You will also need a database. We recommend postgres or mysql. You can install this on your system using standard tools. The connection information must be written into the config.yaml file referenced in the commands above. See config.yaml for an example.

Once you have a virtualenv, the following commands will install the trec_dd scorer. You should choose whether you are using mysql or postgres and specify that as a pip extras declaration in square brackets as follows:

. vpy/bin/activate
pip install trec_dd[mysql]

or to use postgres:

. vpy/bin/activate
pip install trec_dd[postgres]

That will create the shell entry points for running the two commands illustrated at the top of this file.
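
One quick way to confirm that the entry points landed on your PATH is a check like the sketch below. It assumes the two commands are trec_dd_harness and trec_dd_scorer, as used earlier in this file. The function name missing_commands is hypothetical, and shutil.which requires Python 3.3 or later, so this check may need a different interpreter from the one running the harness.

```python
import shutil

# The two commands illustrated at the top of this file.
EXPECTED_COMMANDS = ("trec_dd_harness", "trec_dd_scorer")


def missing_commands(commands=EXPECTED_COMMANDS, which=shutil.which):
    """Return the commands that cannot be found on the current PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("not found on PATH: " + ", ".join(missing))
    else:
        print("all entry points found")
```

Run it with the virtualenv activated; an empty result means both commands resolve.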

Simulation Harness

If you wish to evaluate a TREC DD system, you must run it against the TREC DD simulation harness. A system interacting with the simulation harness will produce a “runfile” that summarizes the simulation session. For each of the system’s responses, the “runfile” encodes information such as (1) “was the system’s response on topic?” (2) “what subtopics were contained within the system’s response?” and (3) “how relevant was the system’s response?”. Please see the specification for a “runfile” for more information.

A TREC DD system interacts with the simulation harness by invoking commands at the command line. Systems written in Python may use the HarnessAmbassadorCLI to facilitate this communication. The HarnessAmbassadorCLI also serves as documentation of how to interact with the harness via the command line.

Once you have a “runfile”, you may then score your run. Please see the section “Gathering Scores” for more information.

Example TREC DD Systems

The directory trec_dd/system holds example TREC DD systems to demonstrate interaction with the simulation harness using a TREC DD system. Right now, the only example system is random_system.py.

Executing the Random System

Requirements

To run the example systems, you must have a truth data XML file. Make sure your database is set up as per your config.yaml, and load the truth data into the database:

trec_dd_harness -c config.yaml load

Running the System

You can run the random system in the simulation harness by calling

trec_dd_random_system -c config.yaml >log.txt 2>&1

After this command executes, you should find the resulting system runfile at the path you specified in the configuration. The runfile summarizes the responses the random system gave to the harness, as well as the harness’s feedback on those responses. This runfile captures everything needed to score a system.

Scoring the System

To score your runfile, you may use the trec_dd/scorer/run.py script, which is run through the trec_dd_scorer command:

trec_dd_scorer -c config.yaml run_file_in.txt run_file_scored.json > pretty_table.txt 2> log.txt &

Please see the section titled “Gathering Scores” for more information on the scoring subsystem.

Gathering Scores

Requirements

You must have a runfile generated for your system if you wish to score it. You must also have access to the truth data used by the harness when generating the runfile.

Running the Scorer

There are two scoring scripts used to compute evaluation scores. bin/cubeTest.pl is used to compute Cube Test results. To run it:

bin/cubeTest.pl cubetest-qrels runfile cutoff

where runfile is the output runfile from the jig, cubetest-qrels is a specially-formatted version of the truth data (and available from the same place), and cutoff is the number of iterations for running the Cube Test.

trec_dd/scorer/run.py is used to generate other evaluation scores including u-ERR. To run it:

trec_dd_scorer -c config.yaml run_file_in.txt run_file_scored.json > pretty_table.txt 2> log.txt &

This will go through your runfile and run each configured TREC DD scorer. run_file_in.txt is the runfile produced as output by the harness. The scorer outputs an annotated version of your run in run_file_scored.json, and the scores to stdout.

If you wish to run specific scorers, rather than all of them, please see the ‘--scorer’ option on the trec_dd_scorer command. The scorers specified after the --scorer option must be the names of scorers known to the system. These are exactly the following:

reciprocal_rank_at_recall
precision_at_recall
modified_precision_at_recall
average_err_arithmetic
average_err_harmonic
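
Since only these names are accepted, a wrapper script may want to validate them before launching the scorer. The sketch below does that and builds a command line. The function name scorer_argv is hypothetical, and placing --scorer last with space-separated names is an assumption about the option's syntax; check the scorer's own help output before relying on it.

```python
# Scorer names listed in this document.
KNOWN_SCORERS = (
    "reciprocal_rank_at_recall",
    "precision_at_recall",
    "modified_precision_at_recall",
    "average_err_arithmetic",
    "average_err_harmonic",
)


def scorer_argv(config_path, run_file_in, run_file_scored, scorers=()):
    """Build an argv list for trec_dd_scorer, rejecting unknown scorer names."""
    unknown = [name for name in scorers if name not in KNOWN_SCORERS]
    if unknown:
        raise ValueError("unknown scorers: " + ", ".join(unknown))
    argv = ["trec_dd_scorer", "-c", config_path, run_file_in, run_file_scored]
    if scorers:
        # Assumption: names follow --scorer as separate arguments.
        argv += ["--scorer"] + list(scorers)
    return argv
```

With no scorers given, the result matches the plain command shown above, which runs every configured scorer.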

Description of Scorers

The Cube Test is a search effectiveness measurement that measures the speed of gaining relevant information (could be documents or passages) in a dynamic search process. It measures the amount of relevant information a search system could gather for the entire search process with multiple runs of retrieval. A higher Cube Test score means a better DD system, which ranks relevant information (documents and/or passages) for a complex search topic as much as possible and as early as possible.
reciprocal_rank_at_recall calculates the reciprocal of the rank by which every subtopic for a topic is accounted for.
precision_at_recall calculates the precision of all results up to the point where every subtopic for a topic is accounted for.
average_err_arithmetic calculates the expected reciprocal rank for each subtopic, and then averages the scores across subtopics using an arithmetic average. It uses graded relevance for computing stopping probabilities.
average_err_harmonic calculates the expected reciprocal rank for each subtopic, and then averages the scores across subtopics using a harmonic average. It uses graded relevance for computing stopping probabilities.
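
To make these descriptions concrete, here is a small Python sketch of the recall-based and ERR-based measures. It is not the code in trec_dd/scorer and may differ from it in detail. It assumes each result is a dict mapping subtopic_id to rating (empty for an off-topic result), uses the commonly cited expected reciprocal rank stopping probability (2^rating - 1) / 2^max_rating, picks max_rating=4 arbitrarily, and returns 0.0 when full recall is never reached or when a harmonic average would divide by zero.

```python
from __future__ import division


def rank_at_full_recall(results, all_subtopics):
    """1-based rank at which every subtopic has appeared, or None."""
    remaining = set(all_subtopics)
    for rank, ratings in enumerate(results, start=1):
        remaining -= set(ratings)
        if not remaining:
            return rank
    return None


def reciprocal_rank_at_recall(results, all_subtopics):
    rank = rank_at_full_recall(results, all_subtopics)
    return 0.0 if rank is None else 1.0 / rank


def precision_at_recall(results, all_subtopics):
    rank = rank_at_full_recall(results, all_subtopics)
    if rank is None:
        return 0.0  # assumption: no credit without full recall
    relevant = sum(1 for ratings in results[:rank] if ratings)
    return relevant / rank


def err_for_subtopic(results, subtopic_id, max_rating=4):
    """Expected reciprocal rank for one subtopic with graded relevance."""
    err, p_continue = 0.0, 1.0
    for rank, ratings in enumerate(results, start=1):
        grade = ratings.get(subtopic_id, 0)
        p_stop = (2 ** grade - 1) / (2 ** max_rating)
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err


def average_err(results, all_subtopics, harmonic=False, max_rating=4):
    """Average per-subtopic ERR with an arithmetic or harmonic mean."""
    scores = [err_for_subtopic(results, s, max_rating) for s in all_subtopics]
    if not harmonic:
        return sum(scores) / len(scores)
    if any(score == 0.0 for score in scores):
        return 0.0  # assumption: harmonic mean with a zero term is zero
    return len(scores) / sum(1.0 / score for score in scores)
```

The harmonic average penalizes a run that neglects any single subtopic far more than the arithmetic average does, which is the practical difference between the two ERR variants.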

Release history

0.3.7.dev4 pre-release

Oct 8, 2015

0.3.7.dev3 pre-release

Oct 8, 2015

0.3.7.dev2 pre-release

Oct 8, 2015

0.3.7.dev1 pre-release

Sep 18, 2015

This version

0.3.6

Jul 22, 2015

0.3.6.dev3 pre-release

Jul 22, 2015

0.3.5

Jul 15, 2015

0.3.5.dev2 pre-release

Jul 15, 2015

0.3.4

Jul 8, 2015

0.3.4.dev1 pre-release

Jul 8, 2015

0.3.3.dev9 pre-release

Jun 29, 2015

0.3.3.dev8 pre-release

Jun 26, 2015

0.3.3.dev7 pre-release

Jun 25, 2015

0.3.3.dev6 pre-release

Jun 23, 2015

0.3.3.dev5 pre-release

Jun 23, 2015

0.3.3.dev4 pre-release

Jun 23, 2015

0.3.3.dev1 pre-release

Jun 23, 2015

0.3.2

Jun 16, 2015

0.3.1

Jun 12, 2015

0.3.1.dev5 pre-release

Jun 12, 2015

0.3.1.dev4 pre-release

Jun 12, 2015

0.3.1.dev2 pre-release

Jun 12, 2015

0.2.2.dev26 pre-release

Jun 11, 2015

0.2.2.dev24 pre-release

Jun 11, 2015

0.2.2.dev23 pre-release

Jun 11, 2015

0.2.2.dev22 pre-release

Jun 11, 2015

0.2.2.dev21 pre-release

Jun 11, 2015

0.2.2.dev20 pre-release

Jun 11, 2015

0.2.2.dev19 pre-release

Jun 11, 2015

0.2.2.dev17 pre-release

Jun 11, 2015

0.2.2.dev16 pre-release

Jun 11, 2015

0.2.2.dev11 pre-release

Jun 11, 2015

0.2.2.dev10 pre-release

Jun 11, 2015

0.2.2.dev9 pre-release

Jun 11, 2015

0.2.2.dev8 pre-release

Jun 10, 2015

0.2.2.dev7 pre-release

Jun 10, 2015

0.2.2.dev6 pre-release

Jun 10, 2015

0.2.2.dev5 pre-release

Jun 10, 2015

0.2.2.dev1 pre-release

Jun 9, 2015

0.2.1

Mar 24, 2015

0.2.1.dev16 pre-release

Feb 19, 2015

0.2.1.dev15 pre-release

Feb 17, 2015

0.2.1.dev13 pre-release

Feb 16, 2015

0.2.1.dev12 pre-release

Feb 13, 2015

0.2.0

Jan 17, 2015

Download files

Source Distribution

trec_dd-0.3.6.tar.gz (27.6 kB view details)

Uploaded Jul 22, 2015 Source

Built Distribution

trec_dd-0.3.6-py2.7.egg (57.3 kB view details)

Uploaded Jul 22, 2015 Egg

File details

Details for the file trec_dd-0.3.6.tar.gz.

File metadata

Download URL: trec_dd-0.3.6.tar.gz
Upload date: Jul 22, 2015
Size: 27.6 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for trec_dd-0.3.6.tar.gz
Algorithm	Hash digest
SHA256	`dcdb2d89e3f7ef4a487c3c3fe3b5ca3d7be7f85455624b4a5921ef86a724e1ba`
MD5	`ddf2aaf1607fe30ba4aa7579172f975e`
BLAKE2b-256	`8a8ca76dd304055e8347b3c29774145253c65444d8c320688b950cdab93cc2fe`

File details

Details for the file trec_dd-0.3.6-py2.7.egg.

File metadata

Download URL: trec_dd-0.3.6-py2.7.egg
Upload date: Jul 22, 2015
Size: 57.3 kB
Tags: Egg
Uploaded using Trusted Publishing? No

File hashes

Hashes for trec_dd-0.3.6-py2.7.egg
Algorithm	Hash digest
SHA256	`9537137c845464f463afe9836594b912d1697524eb16404028105aefd997a63c`
MD5	`1800bf52b70c01d974c808db65c08401`
BLAKE2b-256	`933ef3200b5393704106c65043397268c94988b41862b91c3f82564b7a37350f`