Converts CoNLL-U files whose features use the Universal Dependencies (UD) annotation schema into files whose features use the Universal Morphology (UniMorph) schema.
ud-compatibility
Utility for converting Universal Dependencies–annotated corpora to UniMorph
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of a language. Each project also provides corpora of annotated text in many languages: UD at the token level and UniMorph at the type level. Because each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. To ease interoperability between the two projects, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema.
Prerequisites
- termcolor: pip install termcolor
- Python 3.5 or later; Anaconda is a simple way to install it.
Usage
The driver of the entire endeavor is the file marry.py, which marries a UD dataset to its affiliated UniMorph.
Conversion
To convert one file to UniMorph, give the path (and optionally the specific language converter you'd like to use).
python marry.py convert --ud my/ud/path/rw-ud-dev.conllu
python marry.py convert --ud my/ud/path/da-ud-dev.conllu -l da
To convert your UD dataset to UniMorph, list the languages you'd like to convert:
python marry.py convert --langs he ro de it no_bokmaal
(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)
When the input looks like this:
# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1 Tiene tener VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
2 2 2 NUM _ NumType=Card 3 nummod _ _
3 madres madre NOUN _ Gender=Masc|Number=Plur 1 obj _ SpaceAfter=No
4 . . PUNCT _ _ 1 punct _ _
The output will look like this:
# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1 Tiene tener VERB _ PRS;V;FIN;3;IND;SG 0 root _ _
2 2 2 NUM _ NUM 3 nummod _ _
3 madres madre NOUN _ N;PL;MASC 1 obj _ SpaceAfter=No
4 . . PUNCT _ _ 1 punct _ _
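The example above shows the FEATS column being rewritten while every other column is left alone. As a rough illustration only, the sketch below reproduces that rewrite for the four tokens shown. The tables UPOS_TO_UM and FEAT_TO_UM and the helpers ud_to_um and convert_line are hypothetical names; they cover only the pairs visible in the example and are not the project's Translator implementation. The sketch assumes tab-separated CoNLL-U columns and emits tags in alphabetical order, which may differ from the order marry.py produces.

```python
# Illustrative subset only: these pairs are read off the example above,
# not taken from the project's own mapping tables.
UPOS_TO_UM = {"VERB": "V", "NOUN": "N", "NUM": "NUM"}
FEAT_TO_UM = {
    ("Mood", "Ind"): "IND",
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Person", "3"): "3",
    ("Tense", "Pres"): "PRS",
    ("VerbForm", "Fin"): "FIN",
    ("Gender", "Masc"): "MASC",
}


def ud_to_um(upos, feats):
    """Return the set of UniMorph tags for one token (hypothetical helper)."""
    tags = set()
    if upos in UPOS_TO_UM:
        tags.add(UPOS_TO_UM[upos])
    if feats != "_":
        for pair in feats.split("|"):
            key, _, value = pair.partition("=")
            tag = FEAT_TO_UM.get((key, value))
            if tag is not None:  # features without a known mapping are dropped
                tags.add(tag)
    return tags


def convert_line(line):
    """Rewrite the FEATS column (index 5) of one tab-separated CoNLL-U line."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line  # comments and blank lines pass through unchanged
    cols = line.split("\t")
    tags = ud_to_um(cols[3], cols[5])
    cols[5] = ";".join(sorted(tags)) if tags else "_"
    return "\t".join(cols)
```

Because the tags are collected in a set, comparing two converted bundles does not depend on the order in which the features were written.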
Evaluation
To assess a conversion (using either one of the included Translator objects or your own), the syntax is similar:
python marry.py evaluate --langs he ro de it no_bokmaal
(As with conversion, you'll need to update the paths in paths.py to reflect where your UD and UniMorph data are stored.)
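This README does not say which metric marry.py reports. As one possible way to compare a converted feature bundle against a UniMorph entry by hand, the sketch below treats each bundle as a set of semicolon-separated tags and computes precision and recall. The function name tag_overlap and the gold bundle in the usage example are hypothetical and serve only as illustration.

```python
def tag_overlap(predicted, gold):
    """Return (precision, recall) between two semicolon-separated tag bundles.

    Hypothetical helper: a plain set comparison, not the evaluation
    performed by marry.py.
    """
    pred = set(predicted.split(";")) - {"_", ""}
    ref = set(gold.split(";")) - {"_", ""}
    common = pred & ref
    precision = len(common) / len(pred) if pred else 0.0
    recall = len(common) / len(ref) if ref else 0.0
    return precision, recall


# Converted bundle from the example above vs. a made-up gold bundle.
precision, recall = tag_overlap("PRS;V;FIN;3;IND;SG", "V;IND;PRS;3;SG")
```

Here every gold tag is present in the converted bundle, while the converted bundle carries one extra tag (FIN), so recall is perfect and precision is not.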
Replication
To replicate the experiments from the paper, use:
python marry.py replicate
Data
The individual datasets for Universal Dependencies v2 and UniMorph can be downloaded from their respective projects on GitHub.
Contributing
You're welcome to submit a pull request, harmonizing your UD dataset with the corresponding UniMorph.
- Write your own Translator subclass.
- Register it in the languages.py list.
- Submit the pull request.
License
This project is licensed under the GNU GPL v3 license; see the LICENSE.md file for details.
Citation
@InProceedings{mccarthy2018udw,
author = "McCarthy, Arya D.
and Silfverberg, Miikka
and Cotterell, Ryan
and Hulden, Mans
and Yarowsky, David",
title = "Marrying {U}niversal {D}ependencies and {U}niversal {M}orphology",
booktitle = "Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "91--101",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/W18-6011"
}