package for Question-Answer driven Semantic Role Labeling for Nominalizations (QANom)
Project description
QANom - Annotating Nominal Predicates with QA-SRL
This repository is the reference point for the data and software described in the paper QANom: Question-Answer driven SRL for Nominalizations (COLING 2020).
Pre-requisite
- Python 3.7
Installation
From pypi:
pip install qanom
If you want to install from source, clone this repository and then install requirements:
git clone https://github.com/kleinay/QANom.git
cd QANom
pip install requirements.txt
Dataset
The original QANom Dataset can be downloaded from this google drive directory.
Alternatively, if you are working with Huggingface's Datasets library or are willing to install it (pip install datasets
),
you can retrieve the QANom datasets by:
import datasets
qanom_dataset = datasets.load_dataset('biu-nlp/qanom')
Crowdsourcing QANom via MTurk
The QANom dataset was collected through Amazon Mechanical Turk (MTurk). It was annotated by pre-selected crowd workers who exhibited good performance when previously annotating QA-SRL. Workers first thoroughly read and comprehend the annotation guidelines - both for question generation and for QA consolidation.
We adjusted (on a side branch) the qasrl-crowdsourcing software in order to run the QANom Annotation task interface on MTurk. The QANom annotation pipeline is different from QA-SRL pipeline in three aspects:
- QANom annotation is running after raw sentences have been preprocessed with the
candidate_extraction
module, which results in a JSON file describing the nominalization candidates and their heuristically extracted verbal counterparts (verb_form
). In our branch, theqasrl-crowdsourcing
system is expecting this JSON input instead of raw text input. - The annotation tasks are running on MTurk, but only workers granted with our Qualifications can work on the task. There are unique qualifications for the different annotation life-cycle phases (Trap, Training, Production) as well of for the different tasks (Generation, Consolidation).
- In the QANom task interface, the QA-SRL annotation phase (i.e. generating QA-SRL questions and highlighting their answers in the sentence) is preceded with a predicate detection yes/no question.
Note: the original qasrl-crowdsourcing package has been refactored since our usage, which might result in some installation errors. Feel free to contact us for guidance.
Candidate Extraction
You can use the candidate extraction module to for several purposes:
- As a pre-processing for the QANom crowd-annotation task
- As a pre-processing for the
predicate_detector
model - For custom usage.
Module-specific pre-requisite
pip install pandas nltk git+git://github.com/pattern3/pattern
Usage
python qanom/extract_candidates.py <input_file> <output_file> --read csv|josnl|raw --write csv|json [--no-wordnet] [--no-catvar] [--no-affixes]
The script handle three input formats:
csv
(default): a comma-separated file, with asentence
column stating the raw string of the sentence, and asentence_id
orqasrl_id
column for sentence identifiers.jsonl
: a JSON-lines file, similar to AllenNLP predictors' inputs- each line is{"sentence": <actual sentence string> }
.raw
: text file, each sentence is in a new line.
The script handle two output formats:
json
(default): used as input in theqasrl-crowdsourcing
system when crowdsourcing QANom annotations.csv
: QANom Dataset common format. This is the format which thepredicate_detector
model expects as input.
By default, the module uses (the union of) all three filters - wordnet
, catvar
, and affixes_heuristic
(see specification below).
One can deactivate a filter using the [--no-wordnet] [--no-catvar] [--no-affixes]
boolean flags.
Implementation Details:
The entry point is the module file qanom\candidate_extraction\candidate_extraction.py
.
The module uses a POS tagger to pos_tag
the sentence, and filter outs anything except common nouns (get_common_nouns
).
Then, it applies another filter - an "or" combination of two kinds of lexical-based filtering algorithms:
- Lexical Resources based - Use WordNet & CatVar derivations. Any noun with verbal related derivation would be predicted as nom.
- Affixes + seed based - Create a (possible-nominalization -> verb) list out of a verb seed list, using simple nominalization-suffixes substitution rules. Being in the list will be considered as being a nominalization candidate.
Filter 1 (wordnet_util.py
+ catvar.py
) requires
wordnet (available via nltk) and CatVar.
Run ./scripts/download_catvar.sh
for downloading CatVar into the resources
directory.
Filter 2 (verb_to_nom.py
) uses pattern.en package (pip install git+git://github.com/pattern3/pattern
).
The package is not maintained by the authors and contain a few minor errors that is easy to fix on your local installation.
The version we are installing here works only for Windows. on Linux, use --no-affixes
to disable this filter.
If there are multiple derivationally related verbs, we select the verb that minimizes the edit distance with the noun.
Models
The instructions in this section assume you have cloned the QANom repo, and your working directory is the QANom directory.
QANom Predicate Detector
The predicate_detector
classifies nominalization candidates (extracted with the candidate_extraction
module) as verbal vs. non-verbal.
We supply a model based on a vanilla BERT-based model
implemented by fine-tuning bert-base-cased pre-trained model on QANom dataset.
- Format data to generate files in CoNLL format given the CSV files produced during candidate extraction.
python qanom/predicate_detector/prepare_qanom_data.py [--INPUT_DIR input_dir] [--OUTPUT_DIR
output_dir]
- If you want to train a new model (else, you can skip to the next step and use the pretrained model):
sh qanom/predicate_detector/train_nom_id.sh
- Predict using a trained model:
sh qanom/predicate_detector/predict_nom_id.sh
- Convert CoNLL file produced by predicate detector to QANom's CSV format given the CSV input file: produced during candidate extraction.
python qanom/predicate_detector/convert_conll_to_qanom_csv.py [--INPUT_CONLL_FILE input_conll_file]
[--INPUT_CSV_FILE input_csv_file]
[--OUTPUT_FILE output_file]
QANom Baseline parser
The qanom_parser
is essentially the nrl-qasrl parser for QA-SRL, presented in
Large-Scale QA-SRL Parsing (FitzGerald et. al., 2018).
To adapt the parser to QANom specifications (e.g. that the verb in the question is not the predicate itself)
and format (csv), we have our own qanom
branch on the nrl-qasrl
repository. This branch uses the qanom
package.
Run ./scripts/setup_parser.sh
to clone the parser into qanom_parser
directory and prepare its prerequisites.
Then cd qanom_parser
to run model-related commands as those described for the rest of this section.
Training models
Follow the README
in qanom_parser
for instructions on training new verbal QA-SRL models.
A QANom parser is trained using a CSV file (QANom format) as input, with the QANomReader
DatasetReader
(in nrl/data/dataset_readers/qanom_reader.py
).
You should specify the path of the input files in the jsonnet
config files.
For example:
# first train a span-predictor for identifying answer spans (i.e. arguments)
allennlp train configs/train_qanom_span_elmo.jsonnet --include-package nrl -s ../models/<span-model-name>
# then train the question-generator model, predicting QA-SRL question slots given an answer-span
allennlp train configs/train_qanom_quesgen_bert.jsonnet --include-package nrl -s ../models/<quesgen-model-name>
# Combine span-model and quesgen model into one model, which can then be run for prediction
python scripts/combine_models.py --span ./models/<span-model-name> --ques ../models/<quesgen-model-name> --out ../models/<full-model-name>.tar.gz
If you want to use the trained parsers from the Large Scale QA-SRL (2018) and the QANom (2020) papers,
run ./qanom_parser/scripts/download_pretrained.sh
. This downloads both qasrl_parser_elmo
and qanom_parser_elmo
full models into ./models
directory (which is where we suggest to put your own models if you train any).
Inference
To run prediction on new texts, you can use the allennlp predict
command:
allennlp predict <model-dir-or-archive> <input-file> --include-package nrl --predictor qanom_parser --output-file <output-file>
This takes a JSON-lines , with one line for each sentence, in the following format:
{"qasrl_id": "Wiki1k:wikinews:1007169:1:0", "sentence": "She said in a statement : `` With an amazing portfolio of cars and trucks and the strongest financial performance in our recent history , this is an exciting time at today 's GM .", "predicate_indices": [4, 19], "verb_forms": ["state", "perform"]}
where "qasrl_id" is optional (and can be alternatively named "sentence_id" or "SentenceId").
Notice this input format requires more information than the qasrl_parser
predictor
(i.e., additional "predicate_indices" and "verb_forms" fields).
This is because it expects a predicate detector module to pre-identify the nominal predicates
for which it will generate QA annotations, along with their corresponding verbs.
Also note that no tokenizer model is applied on the sentence string -
we assume the sentence is pre-tokenized (and joined with spaces).
The output-file will also be a JSON-lines file, in the following format:
{
"words": ["She", "said", "in", "a", "statement", ":", "``", "With", "an", "amazing", "portfolio", "of", "cars", "and", "trucks", "and", "the", "strongest", "financial", "performance", "in", "our", "recent", "history", ",", "this", "is", "an", "exciting", "time", "at", "today", "'s", "GM", "."],
"verbs": [
{
"verb": "state",
"qa_pairs": [
{
"question": "Who stated something?",
"spans": [{"start": 0,"end": 0,"text": "She","score": 0.42914873361587527}],
"slots": {"wh": "who","aux": "_","subj": "_","verb_slot_inflection": "Past","obj": "something","prep": "_","obj2": "_","is_passive": "False","is_negated": "False"}
}
],
"index": 4
},
...
],
"qasrl_id": "Wiki1k:wikinews:1007169:1:0"
}
This is the same output format as of the QA-SRL parser, which is why predicates are called "verbs" even though for QANom they are nominal.
Predicting from and to QANom CSV format
For running the QANom predictor on CSV-formatted input file - as those outputted by predicate_detector
,
with nominal predicate information (crucially, target_idx
and is_verbal
columns) - run:
python scripts/convert_csv_to_jsonl_input_for_predictor.py <qanom-predicate-data.csv>
This will generate a file in the JSON-lines format expected by qanom_predictor
.
The output file would have the same name as the input except for the file extension (qanom-predicate-data.jsonl
).
To convert the predictor's output back into QANom's CSV format, run:
python scripts/convert_predictor_output_to_csv.py <predicted-qanom.jsonl>
Similarly, this would generate a predicted-qanom.csv
file in a format equivalent to the QANom Dataset files.
Evaluate
Given two QANom CSV files, e.g. predicted.csv
and gold.csv
, evaluate predicted with gold as reference using the command:
python qanom/evaluate predicted.csv gold.csv
Run QANom end-to-end pipeline
todo
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file qanom-0.0.11.tar.gz
.
File metadata
- Download URL: qanom-0.0.11.tar.gz
- Upload date:
- Size: 391.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5e605eca51fe6566970cdc0f18cc3dee4f1c3e8da0c3f998e256378b63b91691 |
|
MD5 | 0186d967a055256b5c75ce9ab5040153 |
|
BLAKE2b-256 | ebb128afd56e68436732dc8f0668ddc83e6ebc1c6fe99c07a4a76fc8c949f19c |