Direct Attentive Dependency Parser

DiaParser


DiaParser provides a state-of-the-art direct attentive dependency parser based on the Biaffine Parser (Dozat and Manning, 2017) architecture.

The parser can work directly on plain text or on tokenized text. It automatically downloads pretrained models as well as tokenizers, and produces dependency parse trees, as detailed in Usage.

You can also train your own models and contribute them to the repository, to share with others.

DiaParser uses pretrained contextual embeddings from models in the transformers library to represent its input.

Pretrained tokenizers are provided by Stanza.

As an alternative to contextual embeddings, DiaParser can also use CharLSTM layers to produce character/subword-level features. Both BERT and CharLSTM embeddings avoid the need to generate POS tags.

DiaParser is derived from SuPar, which provides additional variants of dependency and constituency parsers.

Contents

  • Installation
  • Performance
  • Usage
  • Training
  • Evaluation
  • TODO
  • References

Installation

DiaParser can be installed via pip:

$ pip install -U diaparser

Installing from source is also possible:

$ git clone https://github.com/Unipisa/diaparser && cd diaparser
$ python setup.py install

The package has the following requirements:

Performance

DiaParser provides pretrained models for English, Chinese, and 17 other languages from the IWPT 2020 Shared Task. English models are trained on the Penn Treebank (PTB) with Stanford Dependencies (39,832 training sentences), while Chinese models are trained on the Penn Chinese Treebank version 7 (CTB7, 46,572 training sentences). Models for the other languages are trained on the Universal Dependencies treebanks v2.5.

The performance and parsing speed of these models are listed in the following table. Notably, punctuation is ignored in all evaluation metrics for PTB, but included in all the others. The numbers in bold represent state-of-the-art values.

Language    Corpus      Name                      UAS    LAS    Speed (Sents/s)
English     PTB         en_ptb.electra            96.03  94.37  352
Arabic      PADT        ar_padt.bert              87.75  83.25  99
Bulgarian   BTB         bg_btb.DeepPavlov         95.02  92.20  479
Czech       PDT         cs_pdt.DeepPavlov         94.02  92.06  403
English     EWT         en_ewt.electra            91.66  89.51  397
Estonian    EDT, EWT    et_edt.mbert              86.39  82.44  247
Finnish     TDT         fi_tdt.turkunlp           94.28  92.56  364
French      Sequoia     fr_sequoia.camembert      92.81  89.55  200
Italian     ISDT        it_isdt.dbmdz             95.40  93.78
Latvian     LVTB        lv_lvtb.mbert             87.46  83.51  290
Lithuanian  ALKSNIS     lt_alksnis.mbert          80.09  75.14  290
Dutch       Alpino      nl_alpino.wietsedv        90.80  88.34  367
Polish      PDB, LFG    pl_pdb.dkleczek           94.38  91.70  563
Russian     SynTagRus   ru_syntagrus.DeepPavlov   94.97  93.72  445
Slovak      SNK         sk_snk.mbert              93.11  90.44  381
Swedish     Talbanken   sv_talbanken.KB           90.79  88.08  491
Tamil       TTB         ta_ttb.mbert              74.20  66.49  175
Ukrainian   IU          uk_iu.TurkuNLP            90.39  87.61  301
Chinese     CTB         zh_ptb.hfl                92.14  85.74  319

These results were obtained on a server with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz and Nvidia T4 GPU.
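
For reference, UAS (unlabeled attachment score) is the percentage of tokens that receive the correct head, while LAS (labeled attachment score) additionally requires the correct dependency relation. The following is a minimal illustrative sketch of these metrics; it is not part of DiaParser's API, whose built-in evaluation is shown in the Evaluation section below:

def attachment_scores(gold, pred, ignore=set()):
    """Toy UAS/LAS computation over (head, relation) pairs.
    Tokens whose gold relation is in `ignore` (e.g. {'punct'}) are skipped."""
    total = uas = las = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if g_rel in ignore:
            continue
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    return 100 * uas / total, 100 * las / total

# gold and (hypothetical) predicted analyses of "She enjoys playing tennis ." (cf. Usage below)
gold = [(2, 'nsubj'), (0, 'root'), (2, 'xcomp'), (3, 'dobj'), (2, 'punct')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'xcomp'), (2, 'dobj'), (2, 'punct')]
print(attachment_scores(gold, pred, ignore={'punct'}))  # (75.0, 75.0)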

Usage

DiaParser is very easy to use. You can download a pretrained model and run syntactic parsing over sentences with a few lines of code:

>>> from diaparser.parsers import Parser
>>> parser = Parser.load('en_ewt-electra')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], prob=True, verbose=False)
100%|####################################| 1/1 00:00<00:00, 85.15it/s

The call to parser.predict returns an instance of diaparser.utils.Dataset containing the predicted syntactic trees. You can access any sentence within the dataset, as well as individual fields (arcs, relations, probabilities) across all of its tokens.

>>> print(dataset.sentences[0])
1       She     _       _       _       _       2       nsubj   _       _
2       enjoys  _       _       _       _       0       root    _       _
3       playing _       _       _       _       2       xcomp   _       _
4       tennis  _       _       _       _       3       dobj    _       _
5       .       _       _       _       _       2       punct   _       _

>>> import torch
>>> print(f"arcs:  {dataset.arcs[0]}\n"
          f"rels:  {dataset.rels[0]}\n"
          f"probs: {dataset.probs[0].gather(1, torch.tensor(dataset.arcs[0]).unsqueeze(1)).squeeze(-1)}")
arcs:  [2, 0, 2, 3, 2]
rels:  ['nsubj', 'root', 'xcomp', 'dobj', 'punct']
probs: tensor([1.0000, 0.9999, 0.9642, 0.9686, 0.9996])

Probabilities can be returned along with the results if prob=True.
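
As mentioned above, the parser can also be run on plain, untokenized text, in which case the pretrained Stanza tokenizer for the given language is downloaded and applied automatically. A minimal sketch, assuming that predict accepts a raw string together with a text= language code (check the current API documentation, since this option may differ across versions):

>>> # hypothetical raw-text call: text='en' selects the English tokenizer (assumed option name)
>>> dataset = parser.predict('She enjoys playing tennis.', text='en')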

If there are many sentences to parse, DiaParser also supports loading them from a file and saving the results to the file specified with the option pred.

>>> dataset = parser.predict('data/ptb/test.conllx', pred='pred.conllx')
2020-07-25 18:13:50 INFO Loading the data
2020-07-25 18:13:52 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:13:52 INFO Making predictions on the dataset
100%|####################################| 13/13 00:01<00:00, 10.58it/s
2020-07-25 18:13:53 INFO Saving predicted results to pred.conllx
2020-07-25 18:13:54 INFO 0:00:01.335261s elapsed, 1809.38 Sents/s

Please make sure the file is in CoNLL-X or CoNLL-U format. If some fields are missing, you can use underscores as placeholders. An interface is provided for converting a list of tokens to a string in CoNLL-X format.

>>> from diaparser.utils import CoNLL
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _

The CoNLL-U format for Universal Dependencies (UD) is also supported, with comments and extra annotations preserved and restored in the output.

>>> import os
>>> import tempfile
>>> text = '''# text = But I found the location wonderful and the neighbors very kind.
1\tBut\t_\t_\t_\t_\t_\t_\t_\t_
2\tI\t_\t_\t_\t_\t_\t_\t_\t_
3\tfound\t_\t_\t_\t_\t_\t_\t_\t_
4\tthe\t_\t_\t_\t_\t_\t_\t_\t_
5\tlocation\t_\t_\t_\t_\t_\t_\t_\t_
6\twonderful\t_\t_\t_\t_\t_\t_\t_\t_
7\tand\t_\t_\t_\t_\t_\t_\t_\t_
7.1\tfound\t_\t_\t_\t_\t_\t_\t_\t_
8\tthe\t_\t_\t_\t_\t_\t_\t_\t_
9\tneighbors\t_\t_\t_\t_\t_\t_\t_\t_
10\tvery\t_\t_\t_\t_\t_\t_\t_\t_
11\tkind\t_\t_\t_\t_\t_\t_\t_\t_
12\t.\t_\t_\t_\t_\t_\t_\t_\t_

'''
>>> path = os.path.join(tempfile.mkdtemp(), 'data.conllx')
>>> with open(path, 'w') as f:
...     f.write(text)
...
>>> print(parser.predict(path, verbose=False).sentences[0])
100%|####################################| 1/1 00:00<00:00, 68.60it/s
# text = But I found the location wonderful and the neighbors very kind.
1       But     _       _       _       _       3       cc      _       _
2       I       _       _       _       _       3       nsubj   _       _
3       found   _       _       _       _       0       root    _       _
4       the     _       _       _       _       5       det     _       _
5       location        _       _       _       _       6       nsubj   _       _
6       wonderful       _       _       _       _       3       xcomp   _       _
7       and     _       _       _       _       6       cc      _       _
7.1     found   _       _       _       _       _       _       _       _
8       the     _       _       _       _       9       det     _       _
9       neighbors       _       _       _       _       11      dep     _       _
10      very    _       _       _       _       11      advmod  _       _
11      kind    _       _       _       _       6       conj    _       _
12      .       _       _       _       _       3       punct   _       _

Training

To train a model from scratch, the command-line interface is preferred, since it is more flexible and customizable. Here are some training examples:

# Biaffine Dependency Parser
# some common and default arguments are stored in config.ini
$ python -m diaparser.cmds.biaffine_dependency train -b -d 0  \
    -c config.ini  \
    -p exp/en_ptb.char/model  \
    -f char
# to use BERT, `-f bert` and `--bert` (which defaults to bert-base-cased) should be specified
$ python -m diaparser.cmds.biaffine_dependency train -b -d 0  \
    -p exp/en_ptb.bert-base/model  \
    -f bert  \
    --bert bert-base-cased

For further instructions on training, please type python -m diaparser.cmds.<parser> train -h.

Alternatively, DiaParser provides an equivalent command entry point, diaparser, registered in setup.py:

$ diaparser train -b -d 0 -c config.ini -p exp/en_ptb.electra-base/model -f bert --bert google/electra-base-discriminator

For handling large models, distributed training is also supported:

$ python -m torch.distributed.launch --nproc_per_node=4 --master_port=10000  \
    -m diaparser.cmds.biaffine_dependency train -b -d 0,1,2,3  \
    -p exp/en_ptb.electra-base/model  \
    -f bert --bert google/electra-base-discriminator

You may consult the PyTorch documentation and tutorials for more details.

Evaluation

The evaluation process resembles prediction:

>>> parser = Parser.load('en_ptb.electra')
>>> loss, metric = parser.evaluate('data/ptb/test.conllx')
2020-07-25 20:59:17 INFO Loading the data
2020-07-25 20:59:19 INFO
Dataset(n_sentences=2416, n_batches=11, n_buckets=8)
2020-07-25 20:59:19 INFO Evaluating the dataset
2020-07-25 20:59:20 INFO loss: 0.2326 - UCM: 61.34% LCM: 50.21% UAS: 96.03% LAS: 94.37%
2020-07-25 20:59:20 INFO 0:00:01.253601s elapsed, 1927.25 Sents/s

TODO

  • Provide a repository to which models can be uploaded, similar to HuggingFace.

References

  • Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.