Python tools for Universal Dependencies

UD Tools

This package contains Python tools for Universal Dependencies.

The official UD/CoNLL-U validator

Reads a CoNLL-U file and verifies that it complies with the UD specification. For more details on UD validation, visit the description on the UD website.

The most up-to-date version of the validator always resides in the master branch of the tools repository on GitHub. It is possible to run the script validate.py from your local copy of the repository even without installing the udtools package via pip. Nevertheless, you will need a few third-party modules the validator depends on. You can install them like this: pip install -r requirements.txt.

If the root folder of the tools repository is in your system PATH, you do not have to be in that folder when launching the script:

cat la_proiel-ud-train.conllu | python validate.py --lang la --max-err=0

You can run python validate.py --help for a list of available options.
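
If you prefer to drive the command-line script from Python rather than from a shell pipeline, something along the following lines may work. This is only a sketch: the helper names build_validate_command and run_validator are made up for illustration, the script is assumed to be reachable as validate.py from the current folder, and only the --lang and --max-err options shown above are used.

```python
import subprocess
import sys


def build_validate_command(script="validate.py", lang="la", max_err=0):
    """Return the argv list for one validator run (hypothetical helper)."""
    return [sys.executable, script, "--lang", lang, f"--max-err={max_err}"]


def run_validator(conllu_path, **options):
    """Feed one CoNLL-U file to the script on STDIN, like the `cat ... |` pipeline."""
    with open(conllu_path, "rb") as handle:
        # check=False: we inspect the finished process ourselves instead of raising.
        return subprocess.run(
            build_validate_command(**options),
            stdin=handle,
            capture_output=True,
            check=False,
        )
```

The returned object carries the captured STDERR, which is where the validator prints its errors and warnings by default.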

Invoking validation from your Python program

To use the validator from your Python code, first install udtools (possibly after creating and activating a virtual environment). This should give you access to a fairly recent version of the validator but it will not necessarily be the authoritative version, as it may lack some modifications of the language-specific data.

pip install --upgrade udtools

from udtools import Validator

validator = Validator(lang='la')
state = validator.validate_files(['la_proiel-ud-train.conllu', 'la_proiel-ud-dev.conllu', 'la_proiel-ud-test.conllu'])
print(state)

The state is an object with various pieces of information collected during the validation run. Its string representation is a summary of the warnings and errors found, as well as the string "*** PASSED ***" or "*** FAILED ***". You can also use the state in a boolean context (condition), where "passed" evaluates as True and "failed" as False. Note however that the default behavior of the validator is still to print errors and warnings to STDERR as soon as they are detected. To suppress printing, the only possibility at present is to supply the --quiet option as if it came from the command line:

import sys
from udtools.argparser import parse_args
from udtools import Validator

sys.argv = ['validate.py', '--lang=la', '--quiet']
args = parse_args()
validator = Validator(lang='la', args=args)
state = validator.validate_files(['la_proiel-ud-train.conllu', 'la_proiel-ud-dev.conllu', 'la_proiel-ud-test.conllu'])
if state:
    print('Yay!')
else:
    print('Oh no ☹')
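
Overwriting sys.argv as above changes it for the rest of your program. If that is a concern, one option is to swap it in only while parse_args() runs. The context manager below is a minimal sketch of that idea; temporary_argv is a hypothetical helper, not part of udtools.

```python
import contextlib
import sys


@contextlib.contextmanager
def temporary_argv(argv):
    """Replace sys.argv inside the with-block and restore it afterwards."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


# Intended use with the calls shown above (requires udtools to be installed):
#
#     with temporary_argv(['validate.py', '--lang=la', '--quiet']):
#         args = parse_args()
#     validator = Validator(lang='la', args=args)
```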

Instead of printing the errors to STDERR as soon as they are found, you can have them saved in the validation state and process them later the way you prefer. Note that the number of incidents saved (per category) is limited by default. This saves memory if you do not need to keep the errors (some treebanks have hundreds of thousands of errors and warnings). Setting --max-store=0 turns this limit off.

import sys
from udtools.argparser import parse_args
from udtools import Validator
from udtools.incident import IncidentType

sys.argv = ['validate.py', '--lang=la', '--quiet', '--max-store=0']
args = parse_args()
validator = Validator(lang='la', args=args)
state = validator.validate_files(['la_proiel-ud-train.conllu', 'la_proiel-ud-dev.conllu', 'la_proiel-ud-test.conllu'])
all_errors = []
# Take only errors, skip warnings.
for testclass in state.error_tracker[IncidentType.ERROR]:
    for incident in state.error_tracker[IncidentType.ERROR][testclass]:
        all_errors.append(incident)
all_errors.sort(key=lambda incident: incident.testid)
for error in all_errors:
    print(error)
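
Once the incidents are collected, a per-test summary is often more readable than the full list. The sketch below assumes only what the example above uses: a mapping from test class to a list of incidents, and a testid attribute on each incident. The function names are hypothetical.

```python
from collections import Counter


def flatten_errors(tracker_for_type):
    """Flatten {testclass: [incident, ...]} into a single list of incidents."""
    return [incident for bucket in tracker_for_type.values() for incident in bucket]


def count_by_testid(incidents):
    """Count how many incidents were reported for each test id."""
    return Counter(incident.testid for incident in incidents)


# Intended use with the state from the example above:
#
#     errors = flatten_errors(state.error_tracker[IncidentType.ERROR])
#     for testid, count in count_by_testid(errors).most_common():
#         print(count, testid)
```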

Entry points

The validator has several other entry points in addition to validate_files():

  • validate_file() takes just one file name (path), reads that file and tests its validity. If the file name is '-', the validator reads from STDIN instead. Note that validate_files() calls validate_file() for each file in turn, and then calls validate_end() to perform checks that can only be done after the whole treebank has been read. If you call validate_file() directly, you have to call validate_end() yourself.
    • validate_end() takes just the state from the validation performed so far, and checks that the observations saved in the state are not in conflict.
  • validate_file_handle() takes the object associated with an open file (or sys.stdin). Otherwise it is analogous to validate_file() (and is in fact called from validate_file()).
  • validate_sentence() takes a list of CoNLL-U lines corresponding to one sentence, including the sentence-terminating empty line. When called from validate_file_handle(), it will have at most one empty line and this will be the last line in the list, as it is how the file reader detected the sentence end. However, the method is aware that other callers could supply lists with empty lines in the middle, and it will report an error if this happens.
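
To call validate_sentence() yourself, you first need the list of lines for each sentence. The generator below is a sketch of one way to cut a CoNLL-U text into such lists, each ending with the sentence-terminating empty line. Whether the validator expects the lines with or without trailing newline characters is not stated here; this sketch strips them, which is an assumption. The name split_into_sentences is hypothetical.

```python
def split_into_sentences(text):
    """Yield one list of lines per sentence; each complete list ends with ''."""
    current = []
    for line in text.splitlines():
        current.append(line)
        if line == "":
            # The empty line terminates the sentence and stays in the list.
            yield current
            current = []
    if current:
        # Input ended without an empty line: pass the rest on unchanged
        # and let the validator decide what to report.
        yield current


# Intended use (state handling as described in this section):
#
#     state = None
#     for lines in split_into_sentences(conllu_text):
#         state = validator.validate_sentence(lines, state)
```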

All the validate_*() methods mentioned above return a State object. All of them can optionally take a State from previous runs as an argument (named state), in which case they will base their decisions on this state, and save their observations in it, too.
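
Passing the state along makes it possible to process files one at a time, for example to report progress between files, and still finish with validate_end(). The following is a sketch under two assumptions: that state=None on the first call means "start with a fresh state" (as the default in the subclassing example further below suggests), and that validate_end() returns the state like the other methods. The function validate_one_by_one and its callback are hypothetical.

```python
def validate_one_by_one(validator, paths, on_file_done=None):
    """Validate files in turn, threading one state through all calls."""
    state = None
    for path in paths:
        # Each call bases its decisions on the state from the previous files.
        state = validator.validate_file(path, state=state)
        if on_file_done is not None:
            on_file_done(path, state)
    if state is not None:
        # Treebank-wide checks that only make sense after the last file.
        state = validator.validate_end(state)
    return state
```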

The validator uses data files with specifications of feature values, lemmas of auxiliaries etc. for each language. These files change more often than the validator code itself, so it is likely that your pip-installed udtools does not have the most up-to-date version. Therefore, you may want to have a local copy of the tools repository, regularly update it by calling git pull, and tell the validator where to load the data files from (instead of its installation location):

validator = Validator(lang='la', datapath='/my/copy/of/ud/tools/data')
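
If your code runs on several machines, the local copy of the repository may not always exist. The sketch below falls back to the data shipped with the installed package when no candidate folder is found. Both helper names are hypothetical; only the lang and datapath arguments come from the example above.

```python
import os


def first_existing_dir(candidates):
    """Return the first candidate path that is an existing directory, else None."""
    for path in candidates:
        if path and os.path.isdir(path):
            return path
    return None


def validator_kwargs(lang, candidates):
    """Build constructor arguments, adding datapath only if a folder was found."""
    kwargs = {"lang": lang}
    datapath = first_existing_dir(candidates)
    if datapath is not None:
        kwargs["datapath"] = datapath
    return kwargs


# Intended use:
#
#     validator = Validator(**validator_kwargs('la', ['/my/copy/of/ud/tools/data']))
```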

Selecting only some tests

UD defines several levels of validity of CoNLL-U files. By default, validity on the highest level 5 is required; this is the level that UD treebanks must pass in order to be released as part of Universal Dependencies. It is possible to request a lower level of validity; for example, you can check only the backbone file structure and omit any linguistic checks of the annotation guidelines. When invoking validate.py from the command line, the numeric option --level (e.g., --level 1) tells the validator to skip tests on levels 2 and above. The same argument can be given directly to the constructor of the Validator class. The lowest level is not specific to individual languages, so we can give the generic language "ud" instead.

validator = Validator(lang='ud', level=1)
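
One possible use of the level argument is a two-stage run: check the file structure first and start the full validation only if that passes. This is just a sketch of the pattern, not something the package provides; validate_in_two_stages is a hypothetical function, and make_validator stands for the Validator class or any compatible factory. Note that the files are read twice.

```python
def validate_in_two_stages(make_validator, paths, lang):
    """Run a level-1 check first; run the full validation only if it passes."""
    structural = make_validator(lang='ud', level=1).validate_files(paths)
    if not structural:
        # The state is False in a boolean context when validation failed.
        return structural
    return make_validator(lang=lang).validate_files(paths)


# Intended use:
#
#     state = validate_in_two_stages(Validator, ['la_proiel-ud-dev.conllu'], 'la')
```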

One may want to filter the tests along various other dimensions: errors only (skipping warnings); selected test classes (FORMAT, MORPHO, SYNTAX, ENHANCED, METADATA etc.); individual test ids (e.g., obl-should-be-nmod). It is always possible to do what we showed above, i.e., to collect all incidents, then process them and show only the selected ones. However, this approach has its drawbacks: we waste time running tests whose results we do not want to see; for large treebanks it is not practical to postpone showing the first results until the whole treebank is processed; and keeping all the unneeded incidents in memory may be costly.
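
For completeness, the collect-then-filter approach could look roughly like this. The sketch assumes the structure used earlier in this document (a mapping from test class to a list of incidents with a testid attribute). How test classes are represented as keys is an assumption: the code accepts either enum-like objects with a name attribute or plain strings. The function select_incidents is hypothetical.

```python
def select_incidents(tracker_for_type, testclasses=None, testids=None):
    """Return incidents whose test class and test id match the given filters.

    A filter set to None means "do not filter on this dimension".
    """
    selected = []
    for testclass, incidents in tracker_for_type.items():
        # Enum-like keys expose .name; plain string keys are used as they are.
        label = getattr(testclass, "name", str(testclass))
        if testclasses is not None and label not in testclasses:
            continue
        for incident in incidents:
            if testids is None or incident.testid in testids:
                selected.append(incident)
    return selected


# Intended use with a state collected using --max-store=0:
#
#     syntax_errors = select_incidents(
#         state.error_tracker[IncidentType.ERROR], testclasses={'SYNTAX'})
```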

You may try to get around this by implementing your own alternative to validate_sentence() and calling individual tests directly. There are some dangers though, which you should consider first:

  • The tests are not documented at present, so you have to consult the source code. The relevant functions are methods of Validator and their names start with check_ (as opposed to validate_, which signals the better documented entry points). Note that one check_ method may generate multiple different incident types, whose ids are not reflected in the name of the method; and a few incidents can even occur outside any check_ method (e.g., directly in a validate_ method).
  • The interface is far from stable. Names of methods may change at any time, as well as the types of incidents they generate, the arguments they expect, their return values (if any) or side effects. Some checks only look at individual cells in the CoNLL-U tabular format, others expect the fully built tree structure.
  • There are dependencies among the tests. Some check_ methods can be run safely only if other check_ methods have been run previously and did not find any errors.
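
Since the tests are identified only by their check_ prefix, a first step when exploring them might be to list the candidates in the version you have installed. The sketch below uses plain introspection; list_check_methods is a hypothetical helper, and the list it prints says nothing about arguments, dependencies or the incidents each method generates.

```python
import inspect


def list_check_methods(cls, prefix="check_"):
    """Return the sorted names of callable members of cls that start with prefix."""
    return sorted(
        name
        for name, member in inspect.getmembers(cls, callable)
        if name.startswith(prefix)
    )


# Intended use (requires udtools to be installed):
#
#     from udtools import Validator
#     for name in list_check_methods(Validator):
#         print(name)
```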

Adding your own tests

You may want to add language-specific consistency tests beyond what the official validator can do (e.g., ensuring that all personal pronouns have a non-empty value of the Person feature), or even treebank/project-specific tests (e.g., all tokens should have a valid Ref attribute in MISC). One way of doing this would be to derive your own validator class based on udtools.Validator.

from udtools import Validator

class MyValidator(Validator):

    def validate_sentence(self, lines, state=None):
        state = super().validate_sentence(lines, state)
        self.check_my_own_stuff(state, lines)
        return state

    def check_my_own_stuff(self, state, lines):
        ...

validator = MyValidator(lang='la')
state = validator.validate_files(['la_proiel-ud-train.conllu', 'la_proiel-ud-dev.conllu', 'la_proiel-ud-test.conllu'])
print(state)
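
As an illustration of what check_my_own_stuff() might do, here is a sketch of the Person test mentioned above, written as a standalone function over the sentence lines. It relies on the standard ten-column CoNLL-U layout and treats tokens with UPOS PRON and PronType=Prs as personal pronouns. It only returns its findings; how to register them as incidents in the state is not covered here, so that part is left out. The function name is hypothetical.

```python
def pronouns_without_person(lines):
    """Return (id, form) pairs of personal pronouns lacking a Person feature."""
    problems = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # sentence-final empty line or comment
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 10:
            continue  # malformed line; leave it to the official tests
        token_id, form, upos, feats = columns[0], columns[1], columns[3], columns[5]
        if not token_id.isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        # FEATS is either "_" or a "|"-separated list of Name=Value pairs.
        features = dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)
        if upos == "PRON" and features.get("PronType") == "Prs" and not features.get("Person"):
            problems.append((token_id, form))
    return problems
```

Inside MyValidator, check_my_own_stuff(state, lines) could call this function and then report the result in whatever way suits your project.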
