firs: a Framework for Information Retrieval Systems
firs: a (python) Framework for Information Retrieval Systems.
firs is a Python package, based on PyTerrier, developed to support experimentation in Information Retrieval.
firs has multiple functions:
- It imports and evaluates traditional TREC collections
- It computes an experimental Grid of Points (GoP)
- It computes and handles replicates (such as shards or reformulations)
Setting firs up
Install
To install firs, use the pip command:
pip install firs
The configuration file
To work, firs relies on a configuration file. The configuration file needs a section for the paths and a section for each of the collections that you want to work on. In the "paths" section, it is mandatory to specify the path to the JDK. Note that firs is based on PyTerrier and therefore requires a JDK ≥ 11.
An example of a configuration file:
[paths]
JAVAHOME = /usr/lib/jdk-11.0.11
[collections.robust04]
runs_path = ./data/TREC/TREC_13_2004_Robust/runs/
qrel_path = ./data/TREC/TREC_13_2004_Robust/pool/qrels.robust2004.txt
coll_path = ./EXPERIMENTAL_COLLECTIONS/TIPSTER/CORPUS
shrd_path = ./data/shardings/
[collections.trec08]
runs_path = ./data/TREC/TREC_08_1999_AdHoc/runs/all/
qrel_path = ./data/TREC/TREC_08_1999_AdHoc/pool/qrels.trec8.adhoc.txt
shrd_path = ./data/shardings/
Non-public elements, such as the qrels, are not provided by firs. They need to be placed at the paths specified in the configuration file. In any case, firs can be used to build runs and grids of points starting from a collection.
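Before handing a configuration file to firs, it can help to check that the mandatory entries are present. The sketch below uses Python's standard configparser module and is only an illustration: the helper name check_config is hypothetical, and the way firs itself parses the file may differ.

```python
import configparser

# A shortened copy of the example configuration above.
EXAMPLE = """
[paths]
JAVAHOME = /usr/lib/jdk-11.0.11

[collections.robust04]
runs_path = ./data/TREC/TREC_13_2004_Robust/runs/
qrel_path = ./data/TREC/TREC_13_2004_Robust/pool/qrels.robust2004.txt
"""


def check_config(text):
    """Return the collection names declared in the configuration text.

    Raises ValueError if the [paths] section or its JAVAHOME entry is missing.
    (Hypothetical helper, not part of firs.)
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_option("paths", "JAVAHOME"):
        raise ValueError("missing JAVAHOME in the [paths] section")
    prefix = "collections."
    return [s[len(prefix):] for s in parser.sections() if s.startswith(prefix)]


collections = check_config(EXAMPLE)
print(collections)
```

The returned names are the ones that would be passed as collectionName when working with a collection.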
Initializing firs
Once the configuration file is ready, it is possible to start working with firs. Import firs and configure it:
import firs
firs.configure(<path to configuration file>)
firs as Collections Manager
Importing a Collection
To import a TREC collection, run:
#import the metadata of the collection
collection = firs.TrecCollection(collectionName=<name of the collection>)
#import the collection: the operation might be very time consuming
collection = collection.import_collection()
The function import_collection takes nThreads as an additional parameter to import the runs in parallel. If you want to import the runs using 10 processors, do:
collection = collection.import_collection(nThreads=10)
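To give an idea of what a parallel import involves, here is a minimal sketch that parses run files in the standard six-column TREC run format (topic, Q0, document id, rank, score, tag) with a thread pool. The functions read_run and read_runs are hypothetical and are not part of firs; whether firs uses threads or processes internally is not stated here, so the thread pool is an assumption.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_run(path):
    """Parse one TREC run file into a list of (topic, doc_id, rank, score) tuples."""
    rows = []
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 6:
                continue  # skip blank or malformed lines
            topic, _q0, doc_id, rank, score, _tag = parts
            rows.append((topic, doc_id, int(rank), float(score)))
    return rows


def read_runs(run_dir, n_threads=1):
    """Parse every file in run_dir, using up to n_threads worker threads."""
    paths = sorted(Path(run_dir).iterdir())
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parsed = pool.map(read_run, paths)
        return {path.name: rows for path, rows in zip(paths, parsed)}


# Tiny demonstration on two made-up run files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sysA").write_text("301 Q0 d1 1 2.5 sysA\n301 Q0 d2 2 1.5 sysA\n")
    Path(tmp, "sysB").write_text("301 Q0 d2 1 3.0 sysB\n")
    runs = read_runs(tmp, n_threads=2)
```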
Computing measures
To compute the measures on the selected collection, using the given qrels, run:
measures = collection.evaluate()
Note that this command assumes that the full collection (qrels alongside runs) is available and imported. In some cases, the number of runs might be extremely high, and it might be preferable to compute the measures run by run on the fly, without loading all the runs. By running
measures = collection.parallel_evaluate(nThreads=<number of threads>)
it is possible to compute the measures in parallel, without preloading all the runs.
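The idea of evaluating run by run, without keeping every run in memory, can be sketched with a generator. The example below computes average precision by hand on made-up data; the function names and the choice of measure are assumptions made for illustration and do not describe how firs computes its measures.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list, given the set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate_runs(run_iter, qrels):
    """Yield (run_name, topic, AP), consuming one run at a time from run_iter."""
    for name, ranking_by_topic in run_iter:
        for topic, ranked in ranking_by_topic.items():
            yield name, topic, average_precision(ranked, qrels.get(topic, set()))


# Made-up qrels and runs: one topic, two systems.
qrels = {"301": {"d1", "d3"}}
runs = [
    ("sysA", {"301": ["d1", "d2", "d3"]}),
    ("sysB", {"301": ["d2", "d3"]}),
]
measures = list(evaluate_runs(iter(runs), qrels))
```

Because evaluate_runs only needs an iterator, each run can be read from disk, scored, and discarded before the next one is loaded.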
Finally, it might be preferable, if available, to directly import a measure file. The path to the measure file needs to be specified in the configuration file, under the proper collection, using the label msrs_path=<path to the csv containing the measures>.
Using the command:
measures = collection.import_measures()
it is possible to directly import the proper measure file.
Note that, when using either parallel_evaluate or import_measures, there is no need to run import_collection on the collection object beforehand.
Replicates
Replicates represent multiple instances of the same experiment. An experiment is characterized by a subject (in IR, usually a topic) and the experimental conditions (in IR, usually the system used). Several approaches have been proposed to obtain replicates. The simplest consists in considering human-made query reformulations. Note that we do not provide any dataset containing replicates: we only provide a strategy to handle them. A second approach consists in using reformulations.
Shardings
The sharding procedure consists in inflating the number of observations by splitting the corpus into multiple sub-corpora and running a specific experiment (a system applied to a specific query) multiple times, once over each sub-corpus.
A sharding on a collection is characterized by 3 elements:
- The number of shards
- The number of documents in each shard
- Whether empty shards are allowed or not
By calling:
sharding = firs.Shuttering(collection, sampling=<type of sampling>, nShards=<number of shards>, emptyShards=<empty label>)
it is possible to obtain a sharding of the collection. A sharding is practically identical to a collection object, with the difference that both the qrels and the runs are split according to a division of the collection into shards. The instruction:
sharded_measure = sharding.evaluate()
evaluates the systems on the sharded collection.
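Splitting the qrels and the runs according to a division of the collection can be pictured as follows. This is a minimal sketch on made-up document ids, assuming a mapping from each document to its shard is already available; shard_run and shard_qrels are hypothetical helpers, not firs functions.

```python
def shard_run(ranked_docs, shard_of):
    """Split one ranked list into per-shard ranked lists, keeping the original order."""
    per_shard = {}
    for doc_id in ranked_docs:
        per_shard.setdefault(shard_of[doc_id], []).append(doc_id)
    return per_shard


def shard_qrels(relevant, shard_of):
    """Split the set of relevant documents of one topic into per-shard sets."""
    per_shard = {}
    for doc_id in relevant:
        per_shard.setdefault(shard_of[doc_id], set()).add(doc_id)
    return per_shard


# Made-up assignment of four documents to two shards.
shard_of = {"d1": 0, "d2": 1, "d3": 0, "d4": 1}
run_parts = shard_run(["d2", "d1", "d4", "d3"], shard_of)
qrel_parts = shard_qrels({"d1", "d4"}, shard_of)
```

Each shard then has its own ranked list and its own set of relevant documents, so a system can be scored once per shard.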
Concerning the arguments passed to the constructor of the sharding:
- sampling: it can be either EVEN, where all the shards have the same size, or RNDM, where different shards can have different sizes
- nShards: it needs to be an integer
- emptyShards: it can be either E, which allows shards without any relevant document, or NE, in which every shard should have at least one relevant document for each topic
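The two sampling options and the NE constraint can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names are hypothetical, EVEN is modeled as a shuffled round-robin split (sizes differ by at most one document when the corpus size is not a multiple of the number of shards), and RNDM as an independent uniform assignment of each document to a shard. The actual procedure used by firs may differ.

```python
import random


def make_shards(doc_ids, n_shards, sampling="EVEN", seed=0):
    """Partition doc_ids into n_shards lists using the given sampling strategy."""
    rng = random.Random(seed)
    docs = list(doc_ids)
    rng.shuffle(docs)
    if sampling == "EVEN":
        # Round-robin split: shard sizes differ by at most one document.
        return [docs[i::n_shards] for i in range(n_shards)]
    if sampling == "RNDM":
        # Each document lands in a uniformly chosen shard, so sizes vary.
        shards = [[] for _ in range(n_shards)]
        for doc in docs:
            shards[rng.randrange(n_shards)].append(doc)
        return shards
    raise ValueError(f"unknown sampling: {sampling!r}")


def satisfies_ne(shards, relevant_by_topic):
    """True if every shard holds at least one relevant document for every topic."""
    return all(
        any(doc in relevant for doc in shard)
        for shard in shards
        for relevant in relevant_by_topic.values()
    )


docs = [f"d{i}" for i in range(10)]
even = make_shards(docs, 3, "EVEN")
rndm = make_shards(docs, 3, "RNDM")
```

A sharding that must respect NE could, for instance, be resampled with a different seed until satisfies_ne returns True.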
Reformulations
firs as Grid of Points (GoP) Experimental tool
The configuration file
Besides the information on the collection, to obtain a Grid of Points, the configuration file needs to be updated with some additional sections:
[GoP]
stoplists = <list of comma-separated stoplists>
stemmers = <list of comma-separated stemmers>
models = <list of comma-separated models>
queryexpansions = <list of comma-separated query expansion models>
[stoplists]
stoplist.stoplist1 = <path to the stoplist>
stoplist.stoplist2 = <path to the stoplist>
[stemmers]
stoplist.stemmer1 = <name of the terrier class implementing the stemmer 1>
stoplist.stemmer2 = <name of the terrier class implementing the stemmer 2>
[models]
[queryexpansions]
Use the keyword none to avoid using a specific component (possible only for the stoplist, the stemmer and the query expansion model).
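A Grid of Points is the set of all combinations of the listed components. The sketch below reads a [GoP] section with Python's standard configparser module and enumerates the combinations with itertools.product. The component names (stoplist1, stemmer1, model1, model2) are placeholders, and the helper grid_of_points is hypothetical; firs may build its grid differently.

```python
import configparser
from itertools import product

# Placeholder component names; "none" disables a component.
EXAMPLE = """
[GoP]
stoplists = none, stoplist1
stemmers = none, stemmer1
models = model1, model2
queryexpansions = none
"""

COMPONENTS = ("stoplists", "stemmers", "models", "queryexpansions")


def grid_of_points(text):
    """Return every (stoplist, stemmer, model, query expansion) combination."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    choices = [
        [item.strip() for item in parser["GoP"][key].split(",")]
        for key in COMPONENTS
    ]
    return list(product(*choices))


grid = grid_of_points(EXAMPLE)
for point in grid:
    print(point)
```

With two stoplists, two stemmers, two models and one query expansion option, the grid contains 2 × 2 × 2 × 1 = 8 systems.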
Indexing
Retrieving