A parser for natural language based on combinatory categorial grammar
depccg v2
Codebase for A* CCG Parsing with a Supertag and Dependency Factored Model
2021/07/12 Updates (v2)
- Increased stability and efficiency
- (Replaced OpenMP with multiprocessing)
- More integration with AllenNLP
- The parser is now callable from within a
predictor
(see here)
- The parser is now callable from within a
- A friendlier way to define your own grammar (w.r.t. languages or treebanks)
  - See depccg/grammar/{en,ja}.py for example grammars.
Requirements
- Python >= 3.6.0
- A C++ compiler supporting C++11 standard (in case of gcc, must be >= 4.8)
Installation
Using pip:
➜ pip install cython numpy depccg
Usage
Using a pretrained English parser
Currently, the following models are available for English:
Name | Description | unlabeled/labeled F1 on CCGbank | Download
---|---|---|---
basic | model trained on the combination of CCGbank and the tri-training dataset (Yoshikawa et al., 2017) | 94.0% / 88.8% | link (189M)
elmo | basic model with its embeddings replaced with ELMo (Peters et al., 2018) | 94.98% / 90.51% | link (649M)
rebank | basic model trained on rebanked CCGbank (Honnibal et al., 2010) | - | link (337M)
elmo_rebank | ELMo model trained on rebanked CCGbank | - | link (1G)
The basic model can be downloaded with:
➜ depccg_en download
To use:
➜ echo "this is a test sentence ." | depccg_en
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP XX XX this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP XX XX is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N XX XX a NP[nb]/N>) (<T N 0 2> (<L N/N XX XX test N/N>) (<L N XX XX sentence N>) ) ) ) ) (<L . XX XX . .>) )
You can download other models by specifying their names:
➜ depccg_en download elmo
To use it, make sure AllenNLP is installed:
➜ echo "this is a test sentence ." | depccg_en --model elmo
You can also pass the --model option the path of a model file (a tar.gz available from the links above).
Using a GPU (via the --gpu option) is recommended if possible.
There are several output formats (see below).
➜ echo "this is a test sentence ." | depccg_en --format deriv
ID=1, Prob=-0.0006299018859863281
this is a test sentence .
NP (S[dcl]\NP)/NP NP[nb]/N N/N N .
---------------->
N
-------------------------->
NP
------------------------------------------>
S[dcl]\NP
------------------------------------------------<
S[dcl]
---------------------------------------------------<rp>
S[dcl]
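The derivation above is built from CCG combinatory rules such as forward application (marked `>`) and backward application (marked `<`). As an illustration of how these rules combine category strings (a minimal sketch, not depccg's internal implementation):

```python
def split_top(cat, slash):
    """Split a category at its first top-level slash, respecting parentheses."""
    depth = 0
    for i, ch in enumerate(cat):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == slash and depth == 0:
            return cat[:i], cat[i + 1:]
    return None

def _apply(functor, arg, slash):
    """Shared logic: functor X{slash}Y combines with argument Y to give X."""
    parts = split_top(functor, slash)
    if parts is None:
        return None
    result, expected = parts
    # drop one layer of outer parentheses, e.g. "(S[dcl]\\NP)" -> "S[dcl]\\NP"
    if expected.startswith('(') and expected.endswith(')'):
        expected = expected[1:-1]
    if expected != arg:
        return None
    if result.startswith('(') and result.endswith(')'):
        result = result[1:-1]
    return result

def forward_apply(functor, arg):
    """Forward application (>): X/Y  Y  =>  X."""
    return _apply(functor, arg, '/')

def backward_apply(arg, functor):
    """Backward application (<): Y  X\\Y  =>  X."""
    return _apply(functor, arg, '\\')
```

For example, `forward_apply("NP[nb]/N", "N")` yields `"NP[nb]"`, reproducing the `NP[nb]/N N` step in the derivation above.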
By default, the input is expected to be pre-tokenized. If you want to process untokenized sentences, pass the --tokenize option.
The POS and NER tags in the output are filled with XX by default. You can replace them with tags predicted using spaCy:
➜ echo "this is a test sentence ." | depccg_en --annotator spacy
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )
The parser uses spaCy's en_core_web_sm model.
Alternatively, you can use the POS/NER taggers implemented in C&C, which may be useful in some parsing experiments:
➜ export CANDC=/path/to/candc
➜ echo "this is a test sentence ." | depccg_en --annotator candc
ID=1, log prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )
By default, depccg expects the POS and NER models to be placed in $CANDC/models/pos and $CANDC/models/ner, but you can specify them explicitly via the CANDC_MODEL_POS and CANDC_MODEL_NER environment variables.
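The fallback logic above can be mimicked in a few lines of Python; this is an illustrative sketch of the documented lookup order (environment-variable override first, then $CANDC/models/{pos,ner}), not depccg's actual code:

```python
import os

def candc_model_path(kind):
    """Resolve a C&C tagger model directory for kind 'pos' or 'ner'.

    CANDC_MODEL_POS / CANDC_MODEL_NER override the default
    $CANDC/models/{pos,ner} location, mirroring the behavior
    described above.
    """
    override = os.environ.get("CANDC_MODEL_" + kind.upper())
    if override is not None:
        return override
    return os.path.join(os.environ["CANDC"], "models", kind.lower())
```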
It is also possible to obtain logical formulas using ccg2lambda's semantic parsing algorithm.
➜ echo "This is a test sentence ." | depccg_en --format ccg2lambda --annotator spacy
ID=0 log probability=-0.0006299018859863281
exists x.(_this(x) & exists z1.(_sentence(z1) & _test(z1) & (x = z1)))
Using a pretrained Japanese parser
The best-performing model can be downloaded with:
➜ depccg_ja download
It can be downloaded directly here (56M).
The parser provides almost the same interface as the English one, with slight differences including the default output format, which is compatible with the Japanese CCGbank:
➜ echo "これはテストの文です。" | depccg_ja
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}
You can pass pre-tokenized sentences as well:
➜ echo "これ は テスト の 文 です 。" | depccg_ja --pre-tokenized
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}
Available output formats
- auto - the most standard format, following the AUTO format in the English CCGbank
- auto_extended - extension of the auto format with combinator info and POS/NER tags
- deriv - visualized derivations in ASCII art
- xml - XML format compatible with C&C's XML format (only for English parsing)
- conll - CoNLL format
- html - visualized trees in MathML
- prolog - Prolog-like format
- jigg_xml - XML format compatible with Jigg
- ptb - Penn Treebank-style format
- ccg2lambda - logical formula converted from a derivation using ccg2lambda
- jigg_xml_ccg2lambda - jigg_xml format with ccg2lambda logical formulas inserted
- json - JSON format
- ja - a format adopted in the Japanese CCGbank (only for Japanese)
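The auto format shown in the examples above can be read back with a small recursive reader. The sketch below is illustrative only (depccg ships its own tree readers); it assumes the two node shapes visible in the outputs above, `(<T cat head nchildren> child... )` for internal nodes and `(<L cat pos1 pos2 word supertag>)` for leaves:

```python
import re

# One token per AUTO node opener "(<...>" or closer ")".
TOKEN = re.compile(r"\(<[^>]+>|\)")

def parse_auto(line):
    """Parse one AUTO-format derivation line into a nested dict tree."""
    stack = []
    root = None
    for tok in TOKEN.findall(line):
        if tok == ")":
            node = stack.pop()
            if stack:
                stack[-1]["children"].append(node)
            else:
                root = node
        else:
            fields = tok[2:-1].split()  # drop "(<" and ">"
            if fields[0] == "T":        # internal: T cat head nchildren
                stack.append({"cat": fields[1], "children": []})
            else:                       # leaf: L cat pos1 pos2 word supertag
                stack.append({"cat": fields[1], "word": fields[4]})
    return root

def words(node):
    """Collect the leaf words of a parsed tree, left to right."""
    if "word" in node:
        return [node["word"]]
    return [w for child in node["children"] for w in words(child)]
```

Applied to the English example output above, `parse_auto` recovers a tree whose root category is `S[dcl]` and whose leaves spell out the input sentence.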
Programmatic Usage
Please look into depccg/__main__.py.
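If you only need the command-line interface from Python, a thin wrapper around the depccg_en command shown earlier also works. This is a minimal sketch; it assumes only the depccg_en executable and the --format and --model options documented above:

```python
import subprocess

def build_cli_command(fmt="auto", model=None):
    """Assemble the depccg_en command line for the given output format/model."""
    cmd = ["depccg_en", "--format", fmt]
    if model is not None:
        cmd += ["--model", model]
    return cmd

def parse_with_cli(sentence, fmt="auto", model=None):
    """Run depccg_en on one pre-tokenized sentence and return its raw stdout.

    Assumes depccg_en is on PATH (installed via `pip install depccg` plus
    `depccg_en download`).
    """
    result = subprocess.run(
        build_cli_command(fmt, model),
        input=sentence,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `parse_with_cli("this is a test sentence .", fmt="deriv")` returns the ASCII-art derivation shown earlier.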
Train your own parsing model
You can use my allennlp-based supertagger and extend it.
To train a supertagger, prepare the English CCGbank and download vocab:
➜ cat ccgbank/data/AUTO/{0[2-9],1[0-9],20,21}/* > wsj_02-21.auto
➜ cat ccgbank/data/AUTO/00/* > wsj_00.auto
➜ wget http://cl.naist.jp/~masashi-y/resources/depccg/vocabulary.tar.gz
➜ tar xvf vocabulary.tar.gz
then,
➜ vocab=vocabulary train_data=wsj_02-21.auto test_data=wsj_00.auto gpu=0 \
encoder_type=lstm token_embedding_type=char \
allennlp train --include-package depccg --serialization-dir results depccg/allennlp/configs/supertagger.jsonnet
Training configs are passed either through environment variables or by directly editing the jsonnet config files, supertagger.jsonnet or supertagger_tritrain.jsonnet. The latter is a config for training on the tri-training silver data (309M) constructed in (Yoshikawa et al., 2017), on top of the English CCGbank.
To use the trained supertagger,
➜ echo '{"sentence": "this is a test sentence ."}' > input.jsonl
➜ allennlp predict results/model.tar.gz --include-package depccg --output-file weights.json input.jsonl
or alternatively, you can perform CCG parsing:
➜ allennlp predict --include-package depccg --predictor parser-predictor --predictor-args '{"grammar_json_path": "depccg/models/config_en.jsonnet"}' model.tar.gz input.jsonl
Evaluation in terms of predicate-argument dependencies
The standard CCG parsing evaluation can be performed with the following script:
➜ cat ccgbank/data/PARG/00/* > wsj_00.parg
➜ export CANDC=/path/to/candc
➜ python -m depccg.tools.evaluate wsj_00.parg wsj_00.predicted.auto
The script depends on C&C's generate program, which is only available by compiling C&C from source.
(Currently, the C&C page is down; you can find the C&C parser here or here.)
Miscellaneous
Diff tool
During error analysis, you will often want to view diffs between trees in an intuitive way. depccg.tools.diff does exactly this:
➜ python -m depccg.tools.diff file1.auto file2.auto > diff.html
which outputs an HTML file in which trees on the same lines of the two files are compared, with the diffs highlighted in color.
Citation
If you make use of this software, please cite the following:
@inproceedings{yoshikawa:2017acl,
author={Yoshikawa, Masashi and Noji, Hiroshi and Matsumoto, Yuji},
title={A* CCG Parsing with a Supertag and Dependency Factored Model},
booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher={Association for Computational Linguistics},
year={2017},
pages={277--287},
location={Vancouver, Canada},
doi={10.18653/v1/P17-1026},
url={http://aclweb.org/anthology/P17-1026}
}
Licence
MIT Licence
Contact
For questions and usage issues, please contact yoshikawa@tohoku.jp.
Acknowledgement
In creating the parser, I owe very much to: