Converts CoNLL-U files whose features use the Universal Dependencies (UD) annotation schema into features that use the Universal Morphology (UniMorph) schema.

ud-compatibility

Utility for converting Universal Dependencies–annotated corpora to UniMorph

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of a language. Each project also provides corpora of annotated text in many languages: UD at the token level and UniMorph at the type level. Because each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. To ease interoperability between the two, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema.

Prerequisites

  • termcolor: pip install termcolor
  • Python 3.5 or later; Anaconda is a simple way to install it.

Usage

The driver of the entire endeavor is the file marry.py, which marries a UD dataset to its affiliated UniMorph.

Conversion

To convert one file to UniMorph, give the path (and optionally the specific language converter you'd like to use).

python marry.py convert --ud my/ud/path/rw-ud-dev.conllu
python marry.py convert --ud my/ud/path/da-ud-dev.conllu -l da

To convert entire UD datasets to UniMorph, list the languages you'd like to convert:

python marry.py convert --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)

When the input looks like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
2	2	2	NUM	_	NumType=Card	3	nummod	_	_
3	madres	madre	NOUN	_	Gender=Masc|Number=Plur	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_

The output will look like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	PRS;V;FIN;3;IND;SG	0	root	_	_
2	2	2	NUM	_	NUM	3	nummod	_	_
3	madres	madre	NOUN	_	N;PL;MASC	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_
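
The rewrite of the FEATS column shown above can be approximated with a small lookup table. The following is a minimal sketch under stated assumptions, not the package's Translator code: the tables POS_TO_UM and FEATURE_TO_UM and the functions ud_to_um_tags and convert_line are hypothetical names, and the tables cover only the features that appear in the example. Tags are sorted alphabetically here, so their order differs from the tool's output.

```python
# Hypothetical lookup tables, limited to the tags in the example above.
POS_TO_UM = {"VERB": "V", "NOUN": "N", "NUM": "NUM"}

FEATURE_TO_UM = {
    ("Mood", "Ind"): "IND",
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Person", "3"): "3",
    ("Tense", "Pres"): "PRS",
    ("VerbForm", "Fin"): "FIN",
    ("Gender", "Masc"): "MASC",
}


def ud_to_um_tags(upos, feats):
    """Map a UD part of speech and FEATS string to a set of UniMorph tags."""
    tags = set()
    if upos in POS_TO_UM:
        tags.add(POS_TO_UM[upos])
    if feats != "_":
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            tag = FEATURE_TO_UM.get((name, value))
            if tag is not None:  # features without a mapping are dropped
                tags.add(tag)
    return tags


def convert_line(line):
    """Rewrite the FEATS column (index 5) of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # comments and blank lines pass through unchanged
    cols = line.split("\t")
    tags = ud_to_um_tags(cols[3], cols[5])
    cols[5] = ";".join(sorted(tags)) if tags else "_"
    return "\t".join(cols)


if __name__ == "__main__":
    row = "1\tTiene\ttener\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_"
    print(convert_line(row))
```

A real converter also has to handle language-specific annotation decisions, which is why the tool accepts a specific language converter via -l.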

Evaluation

To assess a conversion (either one of the included Translator objects or your own), the syntax is similar:

python marry.py evaluate --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)
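
To give a sense of what assessing a conversion can involve, here is a minimal sketch that checks converted tag sets against UniMorph entries. It is an assumption, not a description of what marry.py evaluate computes: the metric (exact match of tag sets over lemma and form pairs attested in UniMorph), the function names parse_unimorph and exact_match_rate, and the toy data are all illustrative. It relies only on the UniMorph file layout of lemma, form, and semicolon-separated tags in tab-separated columns.

```python
def parse_unimorph(lines):
    """Read UniMorph triples: lemma <TAB> form <TAB> semicolon-separated tags."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        lemma, form, tags = line.split("\t")
        # One (lemma, form) pair may carry several analyses.
        entries.setdefault((lemma, form), set()).add(frozenset(tags.split(";")))
    return entries


def exact_match_rate(converted, unimorph):
    """Share of attested (lemma, form) tokens whose tag set matches UniMorph.

    `converted` is an iterable of (lemma, form, tags) triples taken from a
    converted corpus; pairs missing from UniMorph are skipped, not penalized.
    """
    attested = matched = 0
    for lemma, form, tags in converted:
        gold = unimorph.get((lemma, form))
        if gold is None:
            continue
        attested += 1
        if frozenset(tags) in gold:
            matched += 1
    return matched / attested if attested else 0.0


if __name__ == "__main__":
    # Toy data invented for illustration only.
    unimorph = parse_unimorph(["tener\ttiene\tV;IND;PRS;3;SG"])
    converted = [
        ("tener", "tiene", {"V", "IND", "PRS", "3", "SG"}),
        ("tener", "tiene", {"V", "SBJV"}),
        ("madre", "madres", {"N", "PL"}),  # not in the toy UniMorph data
    ]
    print(exact_match_rate(converted, unimorph))
```

Comparing sets rather than strings keeps the score independent of tag order.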

Replication

To replicate the experiments from the paper, use:

python marry.py replicate 

Data

The individual datasets for Universal Dependencies v2 and UniMorph can be downloaded from their respective projects on GitHub.

Contributing

You're welcome to submit a pull request that harmonizes your UD dataset with the corresponding UniMorph data.

  1. Write your own Translator subclass.
  2. Register it in the languages.py list.
  3. Submit the pull request.

License

This project is licensed under the GNU GPL v3 license; see the LICENSE.md file for details.

Citation

@InProceedings{mccarthy2018udw,
  author = 	"McCarthy, Arya D.
		and Silfverberg, Miikka
		and Cotterell, Ryan
		and Hulden, Mans
		and Yarowsky, David",
  title = 	"Marrying {U}niversal {D}ependencies and {U}niversal {M}orphology",
  booktitle = 	"Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"91--101",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/W18-6011"
}
