Skip to main content

Converts conllu files with features that use the Universal Dependency (UD) annotation schema to features that use the Universal Morphology (UM) schema.

Project description

ud-compatibility

Utility for converting Universal Dependencies–annotated corpora to UniMorph

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of a language. Each project also provides corpora of annotated text in many languages—UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema.

Prerequisites

  • termcolor: pip install termcolor
  • Python 3.5 or later; Anaconda is a simple way to install it.

Usage

The driver of the entire endeavor is the file marry.py, which marries a UD dataset to its affiliated UniMorph.

Conversion

To convert one file to UniMorph, give the path (and optionally the specific language converter you'd like to use).

python marry.py convert --ud my/ud/path/rw-ud-dev.conllu
python marry.py convert --ud my/ud/path/da-ud-dev.conllu -l da

To convert your UD dataset to UniMorph, list the languages you'd like to convert:

python marry.py convert --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)

When the input looks like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
2	2	2	NUM	_	NumType=Card	3	nummod	_	_
3	madres	madre	NOUN	_	Gender=Masc|Number=Plur	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_

The output will look like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	PRS;V;FIN;3;IND;SG	0	root	_	_
2	2	2	NUM	_	NUM	3	nummod	_	_
3	madres	madre	NOUN	_	N;PL;MASC	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_

Evaluation

To assess a conversion (either of the included Translator objects or your own), the syntax is similar:

python marry.py evaluate --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)

Replication

To replicate the experiments from the paper, use:

python marry.py replicate 

Data

The individual datasets for Universal Dependencies v2 and UniMorph can be downloaded from their respective projects on GitHub.

Contributing

You're welcome to submit a pull request, harmonizing your UD dataset with the corresponding UniMorph.

  1. Write your own Translator subclass.
  2. Register it in the languages.py list.
  3. Submit the Pull Request.

License

This project is licensed under the GNU GPL v3 license; see the LICENSE.md file for details.

Citation

@InProceedings{mccarthy2018udw,
  author = 	"McCarthy, Arya D.
		and Silfverberg, Miikka
		and Cotterell, Ryan
		and Hulden, Mans
		and Yarowsky, David",
  title = 	"Marrying {U}niversal {D}ependencies and {U}niversal {M}orphology",
  booktitle = 	"Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"91--101",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/W18-6011"
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ud_compatibility-0.1.0.tar.gz (28.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ud_compatibility-0.1.0-py3-none-any.whl (26.5 kB view details)

Uploaded Python 3

File details

Details for the file ud_compatibility-0.1.0.tar.gz.

File metadata

  • Download URL: ud_compatibility-0.1.0.tar.gz
  • Upload date:
  • Size: 28.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for ud_compatibility-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0a62386a015586d5b5ad527f4bd1a025d44cf39ba857f1d8d675d4e6478268de
MD5 ddfe1188292de5ed22038eee90610572
BLAKE2b-256 b5284a29fdb4063e8c12d16a82fd17419a77d26e3076001dfdcf14499788912e

See more details on using hashes here.

File details

Details for the file ud_compatibility-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ud_compatibility-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 778bff527ec49902207d5600bd84d539f376cea6c184dd8c1a5dc2bbd773560e
MD5 7a539f43d8d2138b7b0325b4c898fd5c
BLAKE2b-256 333c42113f3a86b1b76895055b9cfce95bb4022dce5be4e42a01cccbb2f23255

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page