
# trec-dd-simulation-harness

This is the official "jig" for simulating a user interacting with a TREC DD system during an interactive query session.

# Example Recommender Systems

The directory trec\_dd/system holds example recommender systems to demonstrate usage
of the simulation harness. Right now, the only example system is
random_system.py.

# Executing the Random Recommender System

## Requirements

To run the random recommender system, you must have truth data stored
in a file-backed kvlayer store. The truth data must be stored as
dossier.label Label objects, using a LabelStore. If all you have is a
runfile generated by human assessors, the utility script
trec\_dd/harness/generate\_labels\_from\_runfile.py can help you convert
your truth data into the required format.

You also need a "topic sequence" file that describes *how* you want to
evaluate your system. The "topic sequence" file specifies which topics
to explore and how many batches to request for each topic. The file
should be YAML: simply a mapping from each topic\_id to an integer
giving how many batches to execute for that topic\_id. You can find an
example "topic sequence" file at trec_dd/system/example\_topic\_seq.py.
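
For illustration, a minimal topic sequence file might look like the
sketch below. The topic IDs and batch counts are placeholders; use
topic IDs that actually appear in your truth data.

    # Map each topic_id to the number of batches to run for that topic.
    dd_topic_01: 5
    dd_topic_02: 10
    dd_topic_03: 2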

## Running the System

You can run the random recommender system in the simulation harness by
calling

    python random_system.py path/to/topic_sequence.yaml path/to/truth_data.kvl path/to/runfile_out.runfile

After this command executes, you should find the resulting system
runfile at the path you specified in the command. The runfile summarizes
the responses the random system gave to the harness, as well as the
feedback the harness returned for those responses. This runfile captures
everything needed to score the system.

## Scoring the System

To score your runfile, you may use the trec_dd/scorer/run.py script.

    python trec_dd/scorer/run.py path/to/runfile path/to/truthdata.kvl --scorer scorer1 scorer2 scorer3 ...

Please see the section titled "Gathering Scores" for more information on the scoring
subsystem.

# Gathering Scores

## Requirements

You must have a runfile generated for your system if you wish to score
it. You must also have access to the truth data used by the harness
when generating the runfile.

## Running the Scorer

The top-level scoring script trec\_dd/scorer/run.py is used to generate
scores. To run it:

    python run.py path/to/runfile path/to/truthdata.kvl --scorer scorer1 scorer2 ...

This will go through your runfile and use all of the specified scorers to
evaluate the run of your system. The scorers specified after the --scorer
option must be the names of scorers known to the system. These are
exactly the following:

* reciprocal\_rank\_at\_recall
* precision\_at\_recall
* modified\_precision\_at\_recall
* average\_err\_arithmetic
* average\_err\_harmonic
* average\_err\_arithmetic\_binary
* average\_err\_harmonic\_binary
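
For example, assuming a runfile named my_system.runfile and truth data
in truth_data.kvl (both names are placeholders), a scoring invocation
might look like:

    python trec_dd/scorer/run.py my_system.runfile truth_data.kvl --scorer precision_at_recall average_err_arithmetic average_err_harmonic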

# Description of Scorers

* reciprocal\_rank\_at\_recall calculates the reciprocal of the rank at which
every subtopic for a topic has been accounted for. For example, if the last
remaining subtopic is first covered at rank 5, the score is 1/5.

* precision\_at\_recall calculates the precision of all results up to the point
where every subtopic for a topic is accounted for.

* average\_err\_arithmetic calculates the expected reciprocal rank
for each subtopic, and then averages the scores across subtopics
using an arithmetic average. It uses graded relevance for computing
stopping probabilities.

* average\_err\_arithmetic\_binary calculates the expected reciprocal
rank for each subtopic, and then averages the scores across
subtopics using an arithmetic average. It uses binary relevance for
computing stopping probabilities. Hence, this scorer ignores the
'rating' field in the runfile.

* average\_err\_harmonic calculates the expected reciprocal rank for
each subtopic, and then averages the scores across subtopics using
a harmonic average. It uses graded relevance for computing
stopping probabilities.

* average\_err\_harmonic\_binary calculates the expected reciprocal rank for
each subtopic, and then averages the scores across subtopics using
a harmonic average. It uses binary relevance for computing stopping
probabilities. Hence, this scorer ignores the 'rating' field in the runfile.
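
For readers unfamiliar with expected reciprocal rank, the sketch below
illustrates the standard ERR computation over one subtopic's ranked
relevance grades (the formulation of Chapelle et al.). It is an
illustration of the general formula, not the harness's exact
implementation; in particular, the maximum grade and the gain mapping
are assumptions.

    def expected_reciprocal_rank(grades, max_grade=4):
        """Illustrative ERR over one subtopic's ranked relevance grades.

        grades[i] is the relevance grade of the result at rank i + 1.
        The stopping probability uses the common graded mapping
        R = (2**g - 1) / 2**max_grade; for the *_binary scorers the
        grade would simply be 0 or 1 rather than the runfile 'rating'.
        """
        err = 0.0
        p_continue = 1.0  # probability the user reaches this rank
        for rank, grade in enumerate(grades, 1):
            r = (2 ** grade - 1) / float(2 ** max_grade)  # stop probability
            err += p_continue * r / rank
            p_continue *= 1.0 - r
        return err

    # e.g. a highly relevant hit at rank 1 and a mildly relevant hit at rank 3:
    print(expected_reciprocal_rank([4, 0, 2, 0]))

The per-subtopic ERR values are then combined across subtopics with an
arithmetic or harmonic mean, depending on which scorer is requested.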
