catgram

Basic tools for working with categorial grammars

This is a simple Python package providing some basic tools for working with categorial grammars. Development is on-again, off-again. Bug reports and feature requests are welcome, especially for items on the TODO list below, as you'd be providing extra motivation! :)

This package also includes a CCG dependency evaluation script that implements decomposed scoring as specified in "Decomposed scoring of CCG dependencies" (Bhargava and Penn, 2023). See the Decomposed scoring section below for examples. If you use decomposed scoring in your research, please cite that paper.

The script can also perform the regular CCG dependency evaluation (see Standard CCG scoring below).

In general, if you use this package in your research, please include a link to the GitHub repository, and remember to cite any appropriate research papers depending on your usage (e.g., for decomposed scoring as mentioned above; cite (Lewis and Steedman, 2014) if you use this package's implementation of their head-finding rules; etc.).

Requirements

  • Python 3.10+
  • lambda_calculus Python package (installed automatically by the pip command below)

Installation

In your environment of choice:

$ pip3 install catgram

If you only want to run the evaluation script, you might want to consider using pipx to keep the installation isolated:

$ pipx install catgram
$ ccg_depeval -h

Or use it to run the script in a temporary environment:

$ pipx run --spec catgram ccg_depeval -h

Examples

Decomposed scoring

In addition to subcategorial labelling and alignment, decomposed scoring specifies the inclusion of root nodes. Most parsers do not explicitly specify these, and if they do, they must be extracted from heads specified in the .auto file (as far as I know, EasyCCG is the only parser that does this). The ccg_depeval script can extract root dependencies as necessary from parser .auto files (most statistical CCG parsers at least have the option to output these).

Usage is as follows:

$ ccg_depeval ground_truth_deps sys_deps ground_truth.auto sys.auto

where:

  • ground_truth_deps is the ground-truth dependencies, usually as produced by the parg2ccgbank_deps script from C&C (the actual filenames will be wsj00.ccgbank_deps for the dev set or wsj23.ccgbank_deps for the test set). The original PARG file format from CCGbank can also be used.
  • sys_deps is the dependencies predicted by a statistical parser, usually as produced by the generate program from C&C (I recommend using what's available in the Java version of C&C as it is updated compared to what's in the original C&C package).
  • ground_truth.auto is the ground-truth .auto file (e.g., straight from CCGbank). The heads specified in this file are followed directly according to the syntax specified in CCGbank.
  • sys.auto is the parse predicted by the statistical parser. By default, the head-finding rules of Lewis and Steedman (2014) are followed to extract the root node.

Note: instead of .auto files, you can also provide root node information directly in a .roots file. The format is as produced by the ccg_roots script (see Extracting roots below).

A warning will be issued if there is no root available for a sentence, including if the last two arguments aren't specified. You can use the -r option to suppress this warning if you don't want to fuss with root nodes:

$ ccg_depeval -r ground_truth_deps sys_deps

This can be handy for, e.g., evaluating the Java version of the C&C parser, which doesn't produce a .auto file and instead produces a .deps file directly. Of course, omitting the root nodes will produce different scores. See (Bhargava and Penn, 2023) and (Bhargava, 2022, chapter 5) for examples of why you should include root nodes.
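If you drive the evaluation from a Python experiment script, the two invocations above can be assembled programmatically. The following is only a sketch: the helper names build_depeval_command and run_depeval are hypothetical and not part of catgram, and the rule that root sources must be given as a pair is an assumption of this sketch rather than documented behaviour. Only the positional arguments and the -r option described above are used.

```python
from __future__ import annotations

import shutil
import subprocess


def build_depeval_command(
    gold_deps: str,
    sys_deps: str,
    gold_roots: str | None = None,
    sys_roots: str | None = None,
) -> list[str]:
    """Build an argv list for ccg_depeval (hypothetical helper).

    gold_roots and sys_roots stand for the last two positional arguments
    (.auto or .roots files). If neither is given, -r is added so that the
    missing-root warning is suppressed.
    """
    # Assumption of this sketch: root sources are supplied together or not at all.
    if (gold_roots is None) != (sys_roots is None):
        raise ValueError("provide both root sources or neither")
    if gold_roots is None:
        return ["ccg_depeval", "-r", gold_deps, sys_deps]
    return ["ccg_depeval", gold_deps, sys_deps, gold_roots, sys_roots]


def run_depeval(argv: list[str]) -> None:
    """Run the command, letting the script print its report to the terminal."""
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} is not on PATH; is catgram installed?")
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # File names are the placeholders used in the usage line above.
    print(build_depeval_command("ground_truth_deps", "sys_deps"))
    print(build_depeval_command("ground_truth_deps", "sys_deps",
                                "ground_truth.auto", "sys.auto"))
```

Keeping command construction separate from execution makes it easy to log the exact command line alongside the scores it produced.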

Other options of the script allow you to control whether subcategorial labelling and/or alignment are used as well, or to print per-sentence scores. See the script's help for full details:

$ ccg_depeval -h

Standard CCG scoring

For convenience, you can use the -s flag when running ccg_depeval to revert to the standard CCG scoring method:

$ ccg_depeval -s ground_truth_deps sys_deps
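For readers new to CCG dependency evaluation, the core of a labelled score is set overlap between gold and predicted dependencies. The sketch below illustrates that idea only; it is not the code used by ccg_depeval, and it leaves out details a real evaluation handles (such as dependencies that are ignored by convention). The tuple layout (head index, head category, argument slot, dependent index) and the function name are assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Illustrative layout: (head index, head category, argument slot, dependent index)
Dependency = Tuple[int, str, int, int]


def precision_recall_f1(
    gold: Iterable[Dependency], predicted: Iterable[Dependency]
) -> Tuple[float, float, float]:
    """Labelled precision, recall and F1 over two collections of dependencies."""
    gold_set, predicted_set = set(gold), set(predicted)
    correct = len(gold_set & predicted_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [
        (2, r"(S[dcl]\NP)/NP", 1, 1),
        (2, r"(S[dcl]\NP)/NP", 2, 4),
        (4, "N/N", 1, 3),
    ]
    # The parser attached the object to the wrong word (3 instead of 4).
    predicted = [
        (2, r"(S[dcl]\NP)/NP", 1, 1),
        (2, r"(S[dcl]\NP)/NP", 2, 3),
        (4, "N/N", 1, 3),
    ]
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

An unlabelled variant would compare only the head and dependent indices of each tuple.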

Extracting roots

This package also includes a standalone root-extraction script. For example:

$ ccg_roots -m ls14 sys.auto
will_8 S[dcl]
is_3 S[dcl]
...

It's important to use -m ls14 for a .auto file generated by most statistical parsers and -m autofile (which is the same as omitting the -m option) for a .auto file where the heads specified as per the .auto file syntax are indeed the desired heads. The latter case is applicable to CCGbank's .auto files but not those produced by most parsers, since they do not indicate the semantic heads. (EasyCCG is the biggest exception to this, and indeed, the ls14 rules are the same as used by that parser.) If you use the ls14 option for something in your research, make sure to cite Lewis and Steedman (2014) as Section 3.5 of that paper is where the rules were originally specified.
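If you want to post-process the root output in Python, each line can be split into its parts. This is a minimal sketch that assumes every line has the form word_index category, which is inferred solely from the sample output above; the Root type and parse_root_line function are hypothetical names, not part of catgram, and the sketch does not cover however the script reports a sentence without a root.

```python
from typing import NamedTuple


class Root(NamedTuple):
    word: str
    index: int      # token position as printed by ccg_roots
    category: str   # CCG category of the root, e.g. S[dcl]


def parse_root_line(line: str) -> Root:
    """Parse one line such as 'will_8 S[dcl]' (format assumed from the sample)."""
    token, category = line.split()
    # Split on the last underscore so that words containing '_' stay intact.
    word, _, index = token.rpartition("_")
    return Root(word, int(index), category)


if __name__ == "__main__":
    for line in ["will_8 S[dcl]", "is_3 S[dcl]"]:
        print(parse_root_line(line))
```

A line that does not match the assumed shape raises ValueError, which makes unexpected output easy to spot.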

See the script help for full usage details:

$ ccg_roots -h

TODO

  • Tests
  • Ability to directly evaluate CCG .auto files
    • This would re-implement the functionality of the generate program from C&C (or the more directly-integrated version of this process in Java C&C) so that new parsers wouldn't need to go back to C&C to do the evaluations
  • Examples for basic usage (for CategoryTree and TermGraph)
    • For now, take a look at dependencies.py for examples of how to use CategoryTree. If examples are even slightly of interest to you, please submit a GitHub issue asking for them as doing so will help motivate me to add them!
  • Other tools that might be useful?
    • Evaluation scripts (e.g., for evaluating statistical parsers)
    • Visualization tools (CCG dependency graphs, LCG term graphs; outputs to SVG, LaTeX...)

License

Unless otherwise stated, all files in this package are subject to the below copyright and license. The main exception is candc_ignore.py, which is derived from the original C&C package and thus covered by the C&C System Licence Agreement. The code therein is reproduced with permission for inclusion in this package.

Copyright 2023 Aditya Bhargava

Licensed under the Apache License, Version 2.0 (the "License"); you may not use the files in this repository except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 or in the LICENSE file in this repository.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
