
A set of convenience tools for Natural Language Processing work with CoNLL-U files, UD treebanks, and annotated corpora.


CoNLL-U Tools


CoNLL-U Tools is a Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora. It provides utilities for format conversion, validation, evaluation, pattern matching, and morphological normalization, supporting workflows with CoNLL-U and brat standoff formats.

Read the documentation

Features

  • Format Conversion: Bidirectional conversion between brat standoff and CoNLL-U formats
  • Validation: Check CoNLL-U files for format compliance and annotation guideline adherence
  • Evaluation: Score parser outputs against gold-standard files with attachment metrics such as UAS and LAS
  • Pattern Matching: Find tokens and sentences matching complex linguistic criteria
  • Morphological Utilities: Normalize features, convert between tagsets (Perseus, ITTB, PROIEL, LLCT)
  • Extensible: Add custom tagset converters and feature mappings

For detailed information about each feature, see the User Guide.
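For readers new to the format, the sketch below shows what a single CoNLL-U token line contains. It uses only the standard library and is not part of CoNLL-U Tools; the helper names `parse_token_line` and `parse_feats` are hypothetical, and the sample line is made up for illustration.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(line: str) -> dict[str, str]:
    """Split one CoNLL-U token line into a field-name -> value mapping."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return dict(zip(CONLLU_FIELDS, columns))


def parse_feats(feats: str) -> dict[str, str]:
    """Turn a FEATS value such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":  # an underscore marks an empty field
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative token line (not taken from a real treebank).
line = "1\tRex\trex\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_"
token = parse_token_line(line)
features = parse_feats(token["feats"])
print(token["form"], token["upos"], features["Case"])
```

Comment lines (starting with `#`), multiword token ranges, and empty nodes are left out of this sketch.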

Installation

Quick Install

pip install conllu_tools

For detailed installation instructions, including platform-specific guidance and troubleshooting, see the Installation Guide.

Quick Start

Convert CoNLL-U to brat

from conllu_tools.io import conllu_to_brat

conllu_to_brat(
    conllu_filename='path/to/conllu/yourfile.conllu',
    output_directory='path/to/brat/files',
    sents_per_doc=10,
    output_root=True,
)

# Outputs .ann and .txt files to 'path/to/brat/files', alongside
# annotation.conf, tools.conf, visual.conf, and metadata.json
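To give a sense of what the generated .ann files hold, here is a minimal sketch that reads brat standoff lines with the standard library. It follows the general brat format (text-bound annotations prefixed `T`, relations prefixed `R`) and assumes continuous spans; the labels that `conllu_to_brat` actually emits are not shown in this README, so the sample lines and the helper `parse_ann_line` are illustrative assumptions.

```python
def parse_ann_line(line: str) -> dict:
    """Parse one brat standoff line (text-bound annotation or relation)."""
    ann_id, body, *rest = line.rstrip("\n").split("\t")
    if ann_id.startswith("T"):
        # Text-bound annotation: "T1<TAB>LABEL START END<TAB>covered text"
        label, start, end = body.split(" ")
        return {
            "id": ann_id,
            "kind": "entity",
            "label": label,
            "start": int(start),
            "end": int(end),
            "text": rest[0],
        }
    if ann_id.startswith("R"):
        # Binary relation: "R1<TAB>LABEL Arg1:T2 Arg2:T1"
        label, arg1, arg2 = body.split(" ")
        return {
            "id": ann_id,
            "kind": "relation",
            "label": label,
            "arg1": arg1.split(":", 1)[1],
            "arg2": arg2.split(":", 1)[1],
        }
    raise ValueError(f"unsupported annotation type: {ann_id}")


# Hypothetical annotation lines for the text "Rex venit".
sample = [
    "T1\tNOUN 0 3\tRex",
    "T2\tVERB 4 9\tvenit",
    "R1\tnsubj Arg1:T2 Arg2:T1",
]
annotations = [parse_ann_line(line) for line in sample]
for annotation in annotations:
    print(annotation)
```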

Convert brat to CoNLL-U

from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data

feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True,
)

# Outputs yourfile-from_brat.conllu to 'path/to/conllu'

Validate CoNLL-U Files

from conllu_tools import ConlluValidator

validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/yourfile.conllu')

# Print error count
print(f'Errors found: {reporter.get_error_count()}')

# Inspect the first error, if any were found
if reporter.errors:
    sent_id, order, testlevel, error = reporter.errors[0]
    print(f'Sentence ID: {sent_id}')  # e.g. 34
    print(f'Testing at level: {testlevel}')  # e.g. 2
    print(f'Error test level: {error.testlevel}')  # e.g. 1
    print(f'Error type: {error.error_type}')  # e.g. "Metadata"
    print(f'Test ID: {error.testid}')  # e.g. "text-mismatch"
    print(f'Error message: {error.msg}')  # Full error message (see below)

# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)

# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text 
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....' 
# Reconstructed: 'Una scala ....' (first diff at position 9)
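The text-mismatch error in the example output comes from comparing the sentence's text attribute with the string implied by the FORM and SpaceAfter=No values. The sketch below illustrates that idea with the standard library only. It is not the validator's implementation: `reconstruct_text` and `first_diff` are hypothetical helpers, multiword tokens are ignored, and positions are counted from zero.

```python
def reconstruct_text(tokens: list[tuple[str, str]]) -> str:
    """Join FORMs, adding a space unless MISC contains SpaceAfter=No."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


def first_diff(expected: str, reconstructed: str) -> int:
    """Return the 0-based index of the first difference, or -1 if equal."""
    for index, (a, b) in enumerate(zip(expected, reconstructed)):
        if a != b:
            return index
    if len(expected) != len(reconstructed):
        return min(len(expected), len(reconstructed))
    return -1


expected = "Una scala...."

# (FORM, MISC) pairs; no SpaceAfter=No on 'scala', so a space is inserted.
broken = [("Una", "_"), ("scala", "_"), ("....", "_")]
# Marking 'scala' with SpaceAfter=No makes the reconstruction match.
fixed = [("Una", "_"), ("scala", "SpaceAfter=No"), ("....", "_")]

print(repr(reconstruct_text(broken)), first_diff(expected, reconstruct_text(broken)))
print(repr(reconstruct_text(fixed)), first_diff(expected, reconstruct_text(fixed)))
```

In this toy case the fix is to add SpaceAfter=No to the MISC column of the token before the punctuation, or to correct the text attribute if the annotation is right.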

Evaluate CoNLL-U Files

from conllu_tools import ConlluEvaluator

evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
    gold_path='path/to/gold_standard.conllu',
    system_path='path/to/system_output.conllu',
)

print(f'UAS: {scores["UAS"].f1:.2%}')
print(f'LAS: {scores["LAS"].f1:.2%}')

# Example output:
# UAS: 64.82%
# LAS: 48.16%
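As a reminder of what the two scores mean, here is a simplified sketch of unlabeled and labeled attachment scoring. It assumes gold and system files share the same tokenization, in which case precision, recall, and F1 coincide with plain accuracy. `ConlluEvaluator` may align tokens and handle further cases that this sketch skips; `attachment_scores` is a hypothetical helper and the data is invented.

```python
def attachment_scores(
    gold: list[tuple[int, str]],
    system: list[tuple[int, str]],
) -> tuple[float, float]:
    """Return (UAS, LAS) for two aligned lists of (HEAD, DEPREL) pairs."""
    if len(gold) != len(system):
        raise ValueError("this sketch assumes identical tokenization")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == s[0] for g, s in zip(gold, system))
    # LAS: both the head and the dependency label match.
    las_hits = sum(g == s for g, s in zip(gold, system))
    return uas_hits / total, las_hits / total


# Invented four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
system = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, system)
print(f"UAS: {uas:.2%}")
print(f"LAS: {las:.2%}")
```

LAS can never exceed UAS, because every labeled hit is also an unlabeled hit.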

Pattern Matching

import conllu
from conllu_tools.matching import build_pattern, find_in_corpus

# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())

# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])

for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f"  Forms: {match.forms}")
    print(f"  Lemmata: {match.lemmata}")

# More pattern examples:
build_pattern('NOUN:lemma=rex')                    # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)')             # Ablative noun
build_pattern('DET+ADJ{0,2}+NOUN')                 # Det + 0-2 adjectives + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)')         # Preposition + accusative noun
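To illustrate the idea behind sequence patterns such as 'DET+ADJ{0,2}+NOUN', the sketch below matches UPOS sequences by translating the pattern into a regular expression over a space-delimited tag string. This is an illustration only, not how `build_pattern` or `find_in_corpus` work internally, and it covers only bare UPOS tags with optional `{m,n}` quantifiers (no lemma or feats conditions). `compile_upos_pattern` and `find_sequences` are hypothetical names.

```python
import re


def compile_upos_pattern(pattern: str) -> re.Pattern:
    """Translate e.g. 'DET+ADJ{0,2}+NOUN' into a regex over 'TAG TAG ... '."""
    parts = []
    for element in pattern.split("+"):
        match = re.fullmatch(r"([A-Z]+)(\{\d+,\d+\})?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        tag, quantifier = match.group(1), match.group(2) or ""
        # The lookbehind stops 'CONJ' from matching inside 'CCONJ'.
        parts.append(f"(?:(?<![A-Z]){tag} ){quantifier}")
    return re.compile("".join(parts))


def find_sequences(forms: list[str], upos_tags: list[str], pattern: str) -> list[list[str]]:
    """Return the word forms of every non-overlapping match in one sentence."""
    regex = compile_upos_pattern(pattern)
    tag_string = "".join(f"{tag} " for tag in upos_tags)
    results = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # skip empty matches from fully optional patterns
        # Each tag is followed by one space, so spaces count tokens.
        start = tag_string[: match.start()].count(" ")
        end = tag_string[: match.end()].count(" ")
        results.append(forms[start:end])
    return results


forms = ["the", "old", "grey", "cat", "sleeps"]
upos_tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB"]

print(find_sequences(forms, upos_tags, "DET+ADJ{0,2}+NOUN"))
print(find_sequences(forms, upos_tags, "ADJ+NOUN"))
```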

For more examples and detailed usage, see the Quickstart Guide.

Documentation

The full documentation includes:

Acknowledgments

This toolkit builds upon and extends code from several sources:

License

The project is licensed under the MIT License, allowing free use, modification, and distribution.
