
Syntactic Parsing Models

Project description

SuPar


SuPar provides a collection of state-of-the-art syntactic parsing models, with the Biaffine Parser (Dozat and Manning, 2017) as the basic architecture:

  • Biaffine Dependency Parser (Dozat and Manning, 2017)
  • CRFNP Dependency Parser
  • CRF Dependency Parser (Zhang et al., 2020)
  • CRF2o Dependency Parser (Zhang et al., 2020)
  • CRF Constituency Parser (Zhang et al., 2020)

You can load released pretrained models for the above parsers and obtain dependency/constituency parsing trees very conveniently, as detailed in Usage.

Implementations of several popular and well-known algorithms, such as MST (Chu-Liu/Edmonds), Eisner, CKY, MatrixTree and TreeCRF, are also integrated into this package.
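
As a quick illustration of one of these, the following is a minimal sketch (our own code, not SuPar's API; all names are ours) of the Matrix-Tree theorem, which sums the weights of all non-projective dependency trees via a single determinant:

import torch

def matrix_tree_partition(arc_weights, root_weights):
    # arc_weights: (n, n) tensor, arc_weights[h, m] = exp(score of arc h -> m)
    # root_weights: (n,) tensor, root_weights[m] = exp(score of root -> m)
    n = arc_weights.size(0)
    A = arc_weights * (1 - torch.eye(n))  # zero out self-loops
    # Laplacian: each modifier's diagonal entry holds its total incoming weight
    L = torch.diag(root_weights + A.sum(0)) - A
    return torch.det(L)  # Matrix-Tree theorem

# e.g., the partition function for a 5-word sentence with random weights
Z = matrix_tree_partition(torch.rand(5, 5), torch.rand(5))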

Besides the POS tag embeddings used by the vanilla Biaffine Parser as auxiliary inputs to the encoder, SuPar can optionally utilize CharLSTM/BERT layers to produce character/subword-level features. Among them, CharLSTM is the default option, since it avoids the additional requirement of generating POS tags as well as the inefficiency of BERT. The BERT module in SuPar extracts BERT representations from pretrained models in transformers. It is also compatible with other language models, such as XLNet, RoBERTa and ELECTRA.
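
The snippet below sketches how such subword-level features can be obtained from transformers and pooled back to word level (an illustration with our own naming, not SuPar's internal code; it assumes transformers v4 with its last_hidden_state output):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert = AutoModel.from_pretrained('bert-base-cased')

words = ['She', 'enjoys', 'playing', 'tennis', '.']
# tokenize word by word so that subwords can be pooled back to words
# (special tokens like [CLS]/[SEP] are omitted for brevity)
pieces = [tokenizer.tokenize(w) for w in words]
ids = tokenizer.convert_tokens_to_ids([p for ps in pieces for p in ps])
with torch.no_grad():
    hidden = bert(torch.tensor([ids])).last_hidden_state[0]
# mean-pool the subword vectors belonging to each word
feats, offset = [], 0
for ps in pieces:
    feats.append(hidden[offset:offset + len(ps)].mean(0))
    offset += len(ps)
feats = torch.stack(feats)  # one feature vector per input word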

The CRF models for Dependency/Constituency parsing are our recent works published in ACL 2020 and IJCAI 2020 respectively. If you are interested in them, please cite:

@inproceedings{zhang-etal-2020-efficient,
  title     = {Efficient Second-Order {T}ree{CRF} for Neural Dependency Parsing},
  author    = {Zhang, Yu and Li, Zhenghua and Zhang, Min},
  booktitle = {Proceedings of ACL},
  year      = {2020},
  url       = {https://www.aclweb.org/anthology/2020.acl-main.302},
  pages     = {3295--3305}
}

@inproceedings{zhang-etal-2020-fast,
  title     = {Fast and Accurate Neural {CRF} Constituency Parsing},
  author    = {Zhang, Yu and Zhou, Houquan and Li, Zhenghua},
  booktitle = {Proceedings of IJCAI},
  year      = {2020},
  doi       = {10.24963/ijcai.2020/560},
  url       = {https://doi.org/10.24963/ijcai.2020/560},
  pages     = {4046--4053}
}


Installation

SuPar can be installed via pip:

$ pip install -U supar

Alternatively, you can install from source:

$ git clone https://github.com/yzhangcs/parser && cd parser
$ python setup.py install

As a prerequisite, the dependencies declared in setup.py should be satisfied; pip resolves them automatically when installing from PyPI.

Performance

Currently, SuPar provides pretrained models for English and Chinese. English models are trained on Penn Treebank (PTB) with 39,832 training sentences, while Chinese models are trained on Penn Chinese Treebank version 7 (CTB7) with 46,572 training sentences.

The performance and parsing speed of these models are listed in the table below. Note that punctuation is ignored in all evaluation metrics for PTB, but retained for CTB7.

Dataset  Type          Name                  Metric   Performance    Speed (Sents/s)
PTB      Dependency    biaffine-dep-en       UAS/LAS  96.03 / 94.37  1826.77
                       biaffine-dep-bert-en  UAS/LAS  96.69 / 95.15   646.66
                       crfnp-dep-en          UAS/LAS  96.01 / 94.42  2197.15
                       crf-dep-en            UAS/LAS  96.12 / 94.50   652.41
                       crf2o-dep-en          UAS/LAS  96.14 / 94.55   465.64
         Constituency  crf-con-en            F1       94.18           923.74
                       crf-con-bert-en       F1       95.26           503.99
CTB7     Dependency    biaffine-dep-zh       UAS/LAS  88.77 / 85.63  1155.50
                       biaffine-dep-bert-zh  UAS/LAS  91.81 / 88.94   395.28
                       crfnp-dep-zh          UAS/LAS  88.78 / 85.64  1323.75
                       crf-dep-zh            UAS/LAS  88.98 / 85.84   354.65
                       crf2o-dep-zh          UAS/LAS  89.35 / 86.25   217.09
         Constituency  crf-con-zh            F1       88.67           639.27
                       crf-con-bert-zh       F1       91.40           300.15

All results were obtained on a machine with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz and an Nvidia GeForce GTX 1080 Ti GPU.
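
For reference, UAS/LAS in the table are the percentages of tokens that receive the correct head (UAS), and the correct head together with the correct relation label (LAS); as noted above, punctuation is skipped for PTB. A minimal sketch of the computation (our own code, not SuPar's evaluation module):

def attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels, is_punct):
    # each argument is a per-token list; is_punct marks tokens to ignore
    total = uas = las = 0
    for gh, gr, ph, pr, punct in zip(gold_heads, gold_rels,
                                     pred_heads, pred_rels, is_punct):
        if punct:
            continue
        total += 1
        if ph == gh:
            uas += 1
            if pr == gr:
                las += 1
    return 100 * uas / total, 100 * las / total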

Usage

SuPar is very easy to use. You can download a pretrained model and run syntactic parsing over sentences with a few lines of code:

>>> from supar import Parser
>>> parser = Parser.load('biaffine-dep-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], prob=True, verbose=False)
100%|####################################| 1/1 [00:00<00:00, 85.15it/s]

The call to parser.predict returns an instance of supar.utils.Dataset containing the predicted syntactic trees. For dependency parsing, you can either access each sentence held in the dataset or an individual field (e.g., arcs, rels, probs) of all the parsed sentences.

>>> print(dataset.sentences[0])
1       She     _       _       _       _       2       nsubj   _       _
2       enjoys  _       _       _       _       0       root    _       _
3       playing _       _       _       _       2       xcomp   _       _
4       tennis  _       _       _       _       3       dobj    _       _
5       .       _       _       _       _       2       punct   _       _

>>> import torch
>>> print(f"arcs:  {dataset.arcs[0]}\n"
...       f"rels:  {dataset.rels[0]}\n"
...       f"probs: {dataset.probs[0].gather(1, torch.tensor(dataset.arcs[0]).unsqueeze(1)).squeeze(-1)}")
arcs:  [2, 0, 2, 3, 2]
rels:  ['nsubj', 'root', 'xcomp', 'dobj', 'punct']
probs: tensor([1.0000, 0.9999, 0.9642, 0.9686, 0.9996])

Probabilities can be returned along with the results if prob=True. For CRF parsers, marginal probabilities are available if mbr=True, i.e., when using MBR decoding.
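
For example, with the crf-dep-en model listed in the table above (a sketch assuming predict accepts the mbr flag just described):

>>> parser = Parser.load('crf-dep-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], prob=True, mbr=True, verbose=False)
>>> # dataset.probs[0] now holds marginal probabilities from MBR decoding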

Note that SuPar requires pre-tokenized sentences as input. If you'd like to parse un-tokenized raw text, you can call nltk.word_tokenize to do the tokenization first:

>>> import nltk
>>> text = nltk.word_tokenize('She enjoys playing tennis.')
>>> print(parser.predict([text], verbose=False).sentences[0])
100%|####################################| 1/1 [00:00<00:00, 74.20it/s]
1       She     _       _       _       _       2       nsubj   _       _
2       enjoys  _       _       _       _       0       root    _       _
3       playing _       _       _       _       2       xcomp   _       _
4       tennis  _       _       _       _       3       dobj    _       _
5       .       _       _       _       _       2       punct   _       _

If there are plenty of sentences to parse, SuPar also supports loading them from a file and saving the predicted results to the file specified by pred:

>>> dataset = parser.predict('data/ptb/test.conllx', pred='pred.conllx')
2020-07-25 18:13:50 INFO Load the data
2020-07-25 18:13:52 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:13:52 INFO Make predictions on the dataset
100%|####################################| 13/13 [00:01<00:00, 10.58it/s]
2020-07-25 18:13:53 INFO Save predicted results to pred.conllx
2020-07-25 18:13:54 INFO 0:00:01.335261s elapsed, 1809.38 Sents/s

Please make sure the file is in CoNLL-X format; if some fields are missing, you can use underscores as placeholders. An interface is provided for transforming a list of tokens into a CoNLL-X format string:

>>> from supar.utils import CoNLL
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _

For Universal Dependencies (UD), CoNLL-U files are also allowed; comment lines in the file are preserved before prediction and restored during post-processing:

>>> import os
>>> import tempfile
>>> text = '''# text = But I found the location wonderful and the neighbors very kind.
1\tBut\t_\t_\t_\t_\t_\t_\t_\t_
2\tI\t_\t_\t_\t_\t_\t_\t_\t_
3\tfound\t_\t_\t_\t_\t_\t_\t_\t_
4\tthe\t_\t_\t_\t_\t_\t_\t_\t_
5\tlocation\t_\t_\t_\t_\t_\t_\t_\t_
6\twonderful\t_\t_\t_\t_\t_\t_\t_\t_
7\tand\t_\t_\t_\t_\t_\t_\t_\t_
7.1\tfound\t_\t_\t_\t_\t_\t_\t_\t_
8\tthe\t_\t_\t_\t_\t_\t_\t_\t_
9\tneighbors\t_\t_\t_\t_\t_\t_\t_\t_
10\tvery\t_\t_\t_\t_\t_\t_\t_\t_
11\tkind\t_\t_\t_\t_\t_\t_\t_\t_
12\t.\t_\t_\t_\t_\t_\t_\t_\t_

'''
>>> path = os.path.join(tempfile.mkdtemp(), 'data.conllx')
>>> with open(path, 'w') as f:
...     f.write(text)
...
>>> print(parser.predict(path, verbose=False).sentences[0])
100%|####################################| 1/1 [00:00<00:00, 68.60it/s]
# text = But I found the location wonderful and the neighbors very kind.
1       But     _       _       _       _       3       cc      _       _
2       I       _       _       _       _       3       nsubj   _       _
3       found   _       _       _       _       0       root    _       _
4       the     _       _       _       _       5       det     _       _
5       location        _       _       _       _       6       nsubj   _       _
6       wonderful       _       _       _       _       3       xcomp   _       _
7       and     _       _       _       _       6       cc      _       _
7.1     found   _       _       _       _       _       _       _       _
8       the     _       _       _       _       9       det     _       _
9       neighbors       _       _       _       _       11      dep     _       _
10      very    _       _       _       _       11      advmod  _       _
11      kind    _       _       _       _       6       conj    _       _
12      .       _       _       _       _       3       punct   _       _

Constituency trees can be parsed in a similar manner. The returned dataset holds all predicted trees, represented as nltk.Tree objects.

>>> parser = Parser.load('crf-con-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], verbose=False)
100%|####################################| 1/1 [00:00<00:00, 75.86it/s]
>>> print(f"trees:\n{dataset.trees[0]}")
trees:
(TOP
  (S
    (NP (_ She))
    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
    (_ .)))
>>> dataset = parser.predict('data/ptb/test.pid', pred='pred.pid')
2020-07-25 18:21:28 INFO Load the data
2020-07-25 18:21:33 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:21:33 INFO Make predictions on the dataset
100%|####################################| 13/13 [00:02<00:00,  5.30it/s]
2020-07-25 18:21:36 INFO Save predicted results to pred.pid
2020-07-25 18:21:36 INFO 0:00:02.455740s elapsed, 983.82 Sents/s

Analogous to dependency parsing, a sentence can be conveniently transformed into an empty nltk.Tree, with each word placed under a placeholder tag directly beneath the root:

>>> from supar.utils import Tree
>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], root='TOP'))
(TOP (_ She) (_ enjoys) (_ playing) (_ tennis) (_ .))

Training

To train a model from scratch, it is preferable to use the command-line interface, which is more flexible and customizable. Here are some training examples:

# Biaffine Dependency Parser
# some common and default arguments are stored in config.ini
$ python -m supar.cmds.biaffine_dependency train -b -d 0  \
    -c config.ini  \
    -p exp/ptb.biaffine.dependency.char/model  \
    -f char
# to use BERT, `-f bert` should be specified; `--bert` (defaulting to bert-base-cased) selects the model
# if you'd like to use XLNet, you can type `--bert xlnet-base-cased`
$ python -m supar.cmds.biaffine_dependency train -b -d 0  \
    -p exp/ptb.biaffine.dependency.bert/model  \
    -f bert  \
    --bert bert-base-cased

# CRF Dependency Parser
# for CRF dependency parsers, you should use `--proj` to discard all non-projective training instances
# optionally, you can use `--mbr` to perform MBR decoding
$ python -m supar.cmds.crf_dependency train -b -d 0  \
    -p exp/ptb.crf.dependency.char/model  \
    -f char  \
    --mbr  \
    --proj

# CRF Constituency Parser
# the training of the CRF constituency parser resembles that of the dependency parsers
$ python -m supar.cmds.crf_constituency train -b -d 0 -p exp/ptb.crf.constituency.char/model -f char  \
    --mbr

For more instructions on training, please type `python -m supar.cmds.<parser> train -h`.

Alternatively, SuPar provides some equivalent command entry points registered in setup.py: biaffine-dependency, crfnp-dependency, crf-dependency, crf2o-dependency and crf-constituency.

$ biaffine-dependency train -b -d 0 -c config.ini -p exp/ptb.biaffine.dependency.char/model -f char

To accommodate large models, distributed training is also supported:

$ python -m torch.distributed.launch --nproc_per_node=4 --master_port=10000  \
    -m supar.cmds.biaffine_dependency train -b -d 0,1,2,3  \
    -p exp/ptb.biaffine.dependency.char/model  \
    -f char

You can consult the PyTorch documentation and tutorials for more details.

Evaluation

The evaluation process resembles prediction:

>>> parser = Parser.load('biaffine-dep-en')
>>> loss, metric = parser.evaluate('data/ptb/test.conllx')
2020-07-25 20:59:17 INFO Load the data
2020-07-25 20:59:19 INFO
Dataset(n_sentences=2416, n_batches=11, n_buckets=8)
2020-07-25 20:59:19 INFO Evaluate the dataset
2020-07-25 20:59:20 INFO loss: 0.2326 - UCM: 61.34% LCM: 50.21% UAS: 96.03% LAS: 94.37%
2020-07-25 20:59:20 INFO 0:00:01.253601s elapsed, 1927.25 Sents/s



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

supar-1.0.0.tar.gz (55.9 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

supar-1.0.0-py3-none-any.whl (76.6 kB)

File details

Details for the file supar-1.0.0.tar.gz.

File metadata

  • Download URL: supar-1.0.0.tar.gz
  • Upload date:
  • Size: 55.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for supar-1.0.0.tar.gz
Algorithm    Hash digest
SHA256       df86a5358ffc7a8700a4a45b08d9abd57abc06b548c48cd579884d1e947e370f
MD5          f83a43096fb0c9e8c9e93ac21b127a21
BLAKE2b-256  1c1a74687150657b31081990cb2b948cae299cb175952b5164664c1656953337

See more details on using hashes here.
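
For instance, a downloaded sdist can be checked against the SHA256 digest above with Python's standard hashlib (a minimal sketch; the archive is assumed to sit in the current directory):

import hashlib

expected = 'df86a5358ffc7a8700a4a45b08d9abd57abc06b548c48cd579884d1e947e370f'
with open('supar-1.0.0.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, 'hash mismatch: file may be corrupted or tampered with'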

File details

Details for the file supar-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: supar-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 76.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for supar-1.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       2eb6dc0129b35607fd8fde787157107aca0ffd637e5ce9f6e348e82839317325
MD5          c5dcb67e994c253923247fc1b758f919
BLAKE2b-256  ea02f026dee15f3997119d676eb6b6f55b985dda2dbff992b581b2668527562a

See more details on using hashes here.
