
Genie NLP library


This library contains the NLP models for the Genie toolkit for virtual assistants. It is derived from the decaNLP library by Salesforce, but has diverged significantly.

The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, tasks with three parts:

  • text or structured input as context
  • text input as question
  • text or structured output as answer

As the work by McCann et al. shows, many NLP tasks can be framed in this way. Genie primarily uses the library for paraphrasing, translation, semantic parsing, and dialogue state tracking, and this is what the models work best for.

Installation

genienlp is available on PyPI. You can install it with:

pip3 install genienlp

After installation, the genienlp command becomes available.

Usage

Training a semantic parser

The general form is:

genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> <flags>

The <datadir> should contain a single folder called "almond" (the name of the task). That folder should contain the files "train.tsv" and "eval.tsv" for the train and dev sets, respectively.
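
For illustration, the data directory would be laid out as follows (the TSV column layout shown is an assumption based on Genie's usual ID / sentence / target-program format; check your task definition for the exact columns):

<datadir>/
  almond/
    train.tsv    # one example per line, tab-separated: <id> <input sentence> <target program>
    eval.tsv     # same format, used as the dev set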

To train a BERT-LSTM (or other MLM-based model), use:

genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
  --model TransformerLSTM --pretrained_model bert-base-cased --trainable_decoder_embedding 50

To train a BART or other Seq2Seq model, use:

genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
  --model TransformerSeq2Seq --pretrained_model facebook/bart-large --gradient_accumulation_steps 20

The default batch sizes are tuned for training on a single V100 GPU. Use --train_batch_tokens and --val_batch_size to control the batch sizes. See genienlp train --help for the full list of options.

NOTE: the BERT-LSTM model used by the current version of the library is not comparable with the one used in our published paper (cited below), because the input preprocessing is different. If you wish to compare with published results, you should use genienlp <= 0.5.0.

Inference on a semantic parser

In batch mode:

genienlp predict --tasks almond --data <datadir> --path <model_dir> --eval_dir <output>

The <datadir> should contain a single folder called "almond" (the name of the task). That folder should contain the files "train.tsv" and "eval.tsv" for the train and dev sets, respectively. The result of batch prediction will be saved in <output>/almond/valid.tsv, a TSV file containing the ID and prediction for each example.
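
For illustration, each line of the output file pairs an example ID with its prediction, tab-separated (hypothetical IDs; predictions elided):

ex-001	<prediction for ex-001>
ex-002	<prediction for ex-002>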

In interactive mode:

genienlp server --path <model_dir>

This opens a TCP server that listens for requests formatted as JSON objects containing id (the ID of the request), task (the name of the task), context, and question. The server replies with JSON objects containing id and answer. By default the server listens on port 8401; use --port to specify a different port, or --stdin to use standard input/output instead of TCP.
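
A minimal client sketch in Python, assuming the server exchanges one JSON object per line over the TCP connection (the line framing is an assumption; only the field names come from the description above, and the context/question values are hypothetical):

# hypothetical client for the genienlp server; start `genienlp server --path <model_dir>` first
import json
import socket

request = {
    "id": "example-1",                        # ID echoed back in the response
    "task": "almond",                         # name of the task
    "context": "show me nearby restaurants",  # hypothetical context string
    "question": "translate to thingtalk",     # hypothetical question string
}

with socket.create_connection(("127.0.0.1", 8401)) as conn:   # default port 8401
    conn.sendall((json.dumps(request) + "\n").encode("utf-8"))
    reply = conn.makefile("r", encoding="utf-8").readline()
    response = json.loads(reply)              # JSON object with "id" and "answer"
    print(response["answer"])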

Calibrating a trained model

Calibrate the confidence scores of a trained model:

  1. Calculate and save the confidence features of the evaluation set in a pickle file:

    genienlp predict --task almond --data <datadir> --path <model_dir> --save_confidence_features --confidence_feature_path <confidence_feature_file>
    
  2. Train a boosted tree to map confidence features to a score between 0 and 1:

    genienlp calibrate --confidence_path <confidence_feature_file> --save <calibrator_directory> --name_prefix <calibrator_name>
    
  3. Now, if you provide --calibrator_paths during prediction, the command will output a confidence score for each prediction:

    genienlp predict --tasks almond --data <datadir> --path <model_dir> --calibrator_paths <calibrator_directory>/<calibrator_name>.calib
    

Paraphrasing

Train a paraphrasing model:

genienlp train-paraphrase --train_data_file <train_data_file> --eval_data_file <dev_data_file> --output_dir <model_dir> --model_type gpt2 --do_train --do_eval --evaluate_during_training --logging_steps 1000 --save_steps 1000 --max_steps 40000 --save_total_limit 2 --gradient_accumulation_steps 16 --per_gpu_eval_batch_size 4 --per_gpu_train_batch_size 4 --num_train_epochs 1 --model_name_or_path <gpt2/gpt2-medium/gpt2-large/gpt2-xl>

Generate paraphrases:

genienlp run-paraphrase --model_name_or_path <model_dir> --temperature 0.3 --repetition_penalty 1.0 --num_samples 4 --batch_size 32 --input_file <input_tsv_file> --input_column 1
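
For illustration, assuming --input_column is a 0-based column index (an assumption), --input_column 1 reads the sentence to paraphrase from the second tab-separated column of each line, e.g. (hypothetical content):

ex-001	show me the weather in san francisco
ex-002	play some jazz music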

Named Entity Disambiguation

First, run a Bootleg model to extract mentions, entity candidates, and contextual embeddings for the mentions:

genienlp bootleg-dump-features --train_tasks <train_task_names> --save <savedir> --preserve_case --data <dataset_dir> --train_batch_tokens 400 --val_batch_size 400 --database_type json --database_dir <database_dir> --ned_features type_id type_prob --ned_features_size 1 1 --ned_features_default_val 0 1.0 --num_workers 0 --min_entity_len 1 --max_entity_len 4 --bootleg_model <bootleg_model>

This command generates several output files. In <dataset_dir> you should see a prep directory containing the preprocessed data (e.g. data converted to memory-mapped format, several arrays to facilitate embedding lookup, etc.). If your dataset doesn't change, you can reuse these files. The command also generates several files in the <results_temp> folder. In eval_bootleg/[train|eval]/<bootleg_model>/bootleg_labels.jsonl you can see the examples, mentions, predicted candidates, and their probabilities according to Bootleg.

Now you can use the features extracted by Bootleg in downstream tasks such as semantic parsing, to improve named entity understanding and, consequently, generation:

genienlp train --train_tasks <train_task_names> --train_iterations 60000 --preserve_case --save <savedir> --data <dataset_dir> --model TransformerLSTM --pretrained_model bert-base-uncased --trainable_decoder_embeddings 50 --train_batch_tokens 1000 --val_batch_size 1000 --do_ned --database_type json --database_dir <database_dir> --ned_retrieve_method bootleg --ned_features type_id type_prob --ned_features_size 1 1 --ned_features_default_val 0 1.0 --num_workers 0 --min_entity_len 1 --max_entity_len 4 --bootleg_model <bootleg_model>

See genienlp --help and genienlp <command> --help for more details about each argument.

Citation

If you use the multitask question answering (MQAN) model in your work, please cite The Natural Language Decathlon: Multitask Learning as Question Answering.

@article{McCann2018decaNLP,
  title={The Natural Language Decathlon: Multitask Learning as Question Answering},
  author={Bryan McCann and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:1806.08730},
  year={2018}
}

If you use the BERT-LSTM model (Identity encoder + MQAN decoder), please cite Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web.

@InProceedings{xu2020schema2qa,
  title={{Schema2QA}: High-Quality and Low-Cost {Q\&A} Agents for the Structured Web},
  author={Silei Xu and Giovanni Campagna and Jian Li and Monica S. Lam},
  booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
  year={2020},
  doi={10.1145/3340531.3411974}
}

If you use the paraphrasing model (BART or GPT-2 fine-tuned on a paraphrasing dataset), please cite AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data.

@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei  and Semnani, Sina  and Campagna, Giovanni  and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}

If you use MarianMT, mBART, or T5 for translation tasks, or the XLMR-LSTM model for Seq2Seq tasks, please cite Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation, as well as the original paper that introduced the model.

@inproceedings{moradshahi-etal-2020-localizing,
    title = "Localizing Open-Ontology {QA} Semantic Parsers in a Day Using Machine Translation",
    author = "Moradshahi, Mehrad and Campagna, Giovanni and Semnani, Sina and Xu, Silei and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.481",
    pages = "5970--5983",
}
