A set of convenience tools for Natural Language Processing work with CoNLL-U files, UD treebanks, and annotated corpora.
CoNLL-U Tools
CoNLL-U Tools is a Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora. It provides utilities for format conversion, validation, evaluation, pattern matching, and morphological normalization, supporting workflows with CoNLL-U and brat standoff formats.
Features
- Format Conversion: Bidirectional conversion between brat standoff and CoNLL-U formats
- Validation: Check CoNLL-U files for format compliance and annotation guideline adherence
- Evaluation: Score parser outputs against gold-standard files, reporting metrics such as UAS and LAS
- Pattern Matching: Find tokens and sentences matching complex linguistic criteria
- Morphological Utilities: Normalize features, convert between tagsets (Perseus, ITTB, PROIEL, LLCT)
- Extensible: Add custom tagset converters and feature mappings
For detailed information about each feature, see the User Guide.
Installation
Quick Install
pip install conllu_tools
For detailed installation instructions, including platform-specific guidance and troubleshooting, see the Installation Guide.
Quick Start
Convert CoNLL-U to brat
from conllu_tools.io import conllu_to_brat
conllu_to_brat(
    conllu_filename='path/to/conllu/yourfile.conllu',
    output_directory='path/to/brat/files',
    sents_per_doc=10,
    output_root=True,
)
# Outputs .ann and .txt files to 'path/to/brat/files', alongside
# annotation.conf, tools.conf, visual.conf, and metadata.json
Convert brat to CoNLL-U
from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data
feature_set = load_language_data('feats', language='la')
brat_to_conllu(
    input_directory='path/to/brat/files',
    output_directory='path/to/conllu',
    ref_conllu='yourfile.conllu',
    feature_set=feature_set,
    output_root=True,
)
# Outputs yourfile-from_brat.conllu to 'path/to/conllu'
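To make the two formats concrete, here is a minimal sketch of how one dependency-annotated sentence can be written as brat standoff: a plain-text line plus `T` (text-bound) and `R` (relation) annotation lines. The `to_standoff` helper is hypothetical and is not part of `conllu_tools`; the files the library actually writes may use different annotation types, IDs, or root handling.

```python
def to_standoff(tokens):
    """Return (text, ann_lines) for one sentence.

    tokens: list of (id, form, upos, head, deprel) tuples with 1-based ids.
    Hypothetical helper for illustration only.
    """
    text_parts, ann_lines, offset = [], [], 0
    for tok_id, form, upos, _head, _deprel in tokens:
        start, end = offset, offset + len(form)
        # Text-bound annotation: ID, type with character offsets, covered text
        ann_lines.append(f"T{tok_id}\t{upos} {start} {end}\t{form}")
        text_parts.append(form)
        offset = end + 1  # tokens are joined by a single space
    rel_id = 1
    for tok_id, _form, _upos, head, deprel in tokens:
        if head == 0:
            continue  # this sketch emits no arc for the root token
        # Relation annotation: head as Arg1, dependent as Arg2 (an assumption)
        ann_lines.append(f"R{rel_id}\t{deprel} Arg1:T{head} Arg2:T{tok_id}")
        rel_id += 1
    return " ".join(text_parts), ann_lines


text, ann = to_standoff([
    (1, "rex", "NOUN", 2, "nsubj"),
    (2, "venit", "VERB", 0, "root"),
])
print(text)
print("\n".join(ann))
```

The offsets in each `T` line index into the text line, which is why brat keeps the `.txt` and `.ann` files side by side.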
Validate CoNLL-U Files
from conllu_tools import ConlluValidator
validator = ConlluValidator(lang='la', level=2)
reporter = validator.validate_file('path/to/yourfile.conllu')
# Print error count
print(f'Errors found: {reporter.get_error_count()}')
# Inspect first error
sent_id, order, testlevel, error = reporter.errors[0]
print(f'Sentence ID: {sent_id}') # e.g. 34
print(f'Testing at level: {testlevel}') # e.g. 2
print(f'Error test level: {error.testlevel}') # e.g. 1
print(f'Error type: {error.error_type}') # e.g. "Metadata"
print(f'Test ID: {error.testid}') # e.g. "text-mismatch"
print(f'Error message: {error.msg}') # Full error message (see below)
# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)
# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....'
# Reconstructed: 'Una scala ....' (first diff at position 9)
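The `text-mismatch` message above compares the sentence's `# text` attribute with the text implied by the token forms. The following is a rough sketch of that idea, not the validator's actual code: `reconstruct_text` and `first_diff` are hypothetical helpers, and multiword tokens are ignored for brevity.

```python
def reconstruct_text(tokens):
    """Rebuild sentence text from (form, misc) pairs.

    A token is followed by a space unless its MISC column
    contains SpaceAfter=No.
    """
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


def first_diff(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


text_attr = "Una scala...."
tokens = [("Una", "_"), ("scala", "_"), ("....", "_")]
rebuilt = reconstruct_text(tokens)
if rebuilt != text_attr:
    pos = first_diff(text_attr, rebuilt)
    print(f"Expected: {text_attr!r}")
    print(f"Reconstructed: {rebuilt!r} (first diff at position {pos})")
# Expected: 'Una scala....'
# Reconstructed: 'Una scala ....' (first diff at position 9)
```

In this example the fix is to add `SpaceAfter=No` to the MISC column of `scala`, after which the reconstructed text matches the attribute.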
Evaluate CoNLL-U Files
from conllu_tools import ConlluEvaluator
evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
gold_path='path/to/gold_standard.conllu',
system_path='path/to/system_output.conllu',
)
print(f'UAS: {scores["UAS"].f1:.2%}')
print(f'LAS: {scores["LAS"].f1:.2%}')
# Example output:
# UAS: 64.82%
# LAS: 48.16%
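As a reminder of what the two scores measure, here is a simplified sketch. UAS counts words whose head is correct, and LAS additionally requires the correct dependency label. The `attachment_scores` function is hypothetical and assumes gold and system files share the same tokenization, in which case precision, recall, and F1 coincide. The CoNLL 2018 evaluation script this toolkit builds on also handles differing tokenization by aligning words first, which this sketch does not attempt.

```python
def attachment_scores(gold, system):
    """Compute (UAS, LAS) for two identically tokenized sentences.

    Each argument is a list of (head, deprel) pairs, one per word.
    Hypothetical helper for illustration only.
    """
    if len(gold) != len(system):
        raise ValueError("this sketch assumes identical tokenization")
    if not gold:
        return 0.0, 0.0
    # UAS: the head index matches
    head_hits = sum(g[0] == s[0] for g, s in zip(gold, system))
    # LAS: both the head index and the relation label match
    label_hits = sum(g == s for g, s in zip(gold, system))
    return head_hits / len(gold), label_hits / len(gold)


gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
system = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, system)
print(f"UAS: {uas:.2%}")  # UAS: 75.00%
print(f"LAS: {las:.2%}")  # LAS: 50.00%
```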
Pattern Matching
import conllu
from conllu_tools.matching import build_pattern, find_in_corpus
# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())
# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])
for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f"    Forms: {match.forms}")
    print(f"    Lemmata: {match.lemmata}")
# More pattern examples:
build_pattern('NOUN:lemma=rex') # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)') # Ablative noun
build_pattern('DET+ADJ{0,2}+NOUN') # Det + 0-2 adjectives + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)') # Preposition + accusative noun
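To illustrate how a pattern such as `DET+ADJ{0,2}+NOUN` reads, the sketch below matches UPOS sequences in plain Python. It is not the library's matcher: `parse_pattern`, `match_at`, and `find_spans` are hypothetical names, the sketch takes `+` to mean adjacency and `{m,n}` to mean repetition (as the comments above suggest), and it assumes greedy, non-overlapping matching. Conditions such as `lemma=` or `feats=` are left out.

```python
import re

# One pattern element: a UPOS tag with an optional {min,max} quantifier
ELEMENT = re.compile(r"^([A-Z]+)(?:\{(\d+),(\d+)\})?$")


def parse_pattern(pattern):
    """Split 'DET+ADJ{0,2}+NOUN' into (upos, min, max) triples."""
    elements = []
    for part in pattern.split("+"):
        m = ELEMENT.match(part)
        if m is None:
            raise ValueError(f"unsupported pattern element: {part!r}")
        upos, lo, hi = m.groups()
        elements.append((upos, int(lo) if lo else 1, int(hi) if hi else 1))
    return elements


def match_at(tags, start, elements):
    """Return the end index of a match beginning at start, or None."""
    if not elements:
        return start
    upos, lo, hi = elements[0]
    # Count consecutive tokens carrying the wanted tag, up to the maximum
    run = 0
    while run < hi and start + run < len(tags) and tags[start + run] == upos:
        run += 1
    # Try the longest run first, then backtrack to shorter ones
    for count in range(run, lo - 1, -1):
        end = match_at(tags, start + count, elements[1:])
        if end is not None:
            return end
    return None


def find_spans(tags, pattern):
    """Return non-overlapping (start, end) token spans, left to right."""
    elements = parse_pattern(pattern)
    spans, i = [], 0
    while i < len(tags):
        end = match_at(tags, i, elements)
        if end is not None and end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


forms = ["haec", "magna", "antiqua", "urbs", "cadit", "hic", "rex"]
tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "DET", "NOUN"]
for start, end in find_spans(tags, "DET+ADJ{0,2}+NOUN"):
    print(" ".join(forms[start:end]))
# haec magna antiqua urbs
# hic rex
```

Matching on whole tags rather than on a joined tag string avoids accidental hits such as `X` matching inside `AUX`.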
For more examples and detailed usage, see the Quickstart Guide.
Documentation
The full documentation includes:
- Installation Guide: Detailed installation instructions and troubleshooting
- Quickstart Guide: Get started quickly with common tasks
- User Guide: Comprehensive guides for all features
  - Conversion: CoNLL-U ↔ brat conversion
  - Validation: Validation framework and recipes
  - Evaluation: Metrics and evaluation workflows
  - Pattern Matching: Find complex linguistic patterns
  - Utilities: Tagset conversion and normalization
- API Reference: Complete API documentation
Acknowledgments
This toolkit builds upon and extends code from several sources:
- CoNLL-U/brat conversion logic is based on the tools made available by the brat team.
- CoNLL-U evaluation is based on the work of Milan Straka and Martin Popel for the CoNLL 2018 UD shared task, and Gosse Bouma for the IWPT 2020 shared task.
- CoNLL-U validation is based on work by Filip Ginter and Sampo Pyysalo.
License
The project is licensed under the MIT License, allowing free use, modification, and distribution.