Converts CoNLL-U files whose features use the Universal Dependencies (UD) annotation schema into features that use the Universal Morphology (UniMorph) schema.

ud-compatibility

Utility for converting Universal Dependencies–annotated corpora to UniMorph

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of a language. Each project also provides corpora of annotated text in many languages: UD at the token level and UniMorph at the type level. Because each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. To ease interoperability between the two, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema.

Prerequisites

  • termcolor: pip install termcolor
  • Python 3.5 or later; Anaconda is a simple way to install it.

Usage

The entry point is the file marry.py, which marries a UD dataset to its corresponding UniMorph data.

Conversion

To convert one file to UniMorph, give the path (and optionally the specific language converter you'd like to use).

python marry.py convert --ud my/ud/path/rw-ud-dev.conllu
python marry.py convert --ud my/ud/path/da-ud-dev.conllu -l da

To convert entire UD datasets to UniMorph, list the languages you'd like to convert:

python marry.py convert --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)
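If you have many files to convert, the single-file command can be scripted. The sketch below only assembles the documented `python marry.py convert --ud FILE [-l LANG]` invocation for each .conllu file in a directory. The helper names `build_convert_command` and `commands_for_directory` are made up for this illustration and are not part of the project, and the sketch assumes it is run from the directory that contains marry.py.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_convert_command(ud_file: str, lang: Optional[str] = None) -> List[str]:
    """Build the argv for `python marry.py convert --ud FILE [-l LANG]`.

    Hypothetical helper: it only assembles the command shown in this README.
    """
    command = [sys.executable, "marry.py", "convert", "--ud", ud_file]
    if lang is not None:
        # -l selects a specific language converter, as in the `-l da` example.
        command += ["-l", lang]
    return command


def commands_for_directory(directory: str, lang: Optional[str] = None) -> List[List[str]]:
    """Return one convert command per .conllu file found directly in `directory`."""
    files = sorted(Path(directory).glob("*.conllu"))
    return [build_convert_command(str(path), lang) for path in files]
```

Each returned list can be passed to `subprocess.run(command, check=True)` to run the conversion.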

When the input looks like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
2	2	2	NUM	_	NumType=Card	3	nummod	_	_
3	madres	madre	NOUN	_	Gender=Masc|Number=Plur	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_

The output will look like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	PRS;V;FIN;3;IND;SG	0	root	_	_
2	2	2	NUM	_	NUM	3	nummod	_	_
3	madres	madre	NOUN	_	N;PL;MASC	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_
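To make the transformation concrete, here is a minimal sketch of a feature-by-feature lookup that rewrites the sixth (FEATS) column of a CoNLL-U token line. It is not the project's Translator implementation: the two tables are hypothetical and contain only the correspondences visible in the Spanish example, and the function names are invented for illustration. Tag order within a bundle is not preserved, so results should be compared as sets.

```python
# Hypothetical lookup tables, filled in only from the Spanish example above.
POS_TO_UNIMORPH = {"VERB": "V", "NOUN": "N", "NUM": "NUM"}
FEATURE_TO_UNIMORPH = {
    "Mood=Ind": "IND",
    "Number=Sing": "SG",
    "Number=Plur": "PL",
    "Person=3": "3",
    "Tense=Pres": "PRS",
    "VerbForm=Fin": "FIN",
    "Gender=Masc": "MASC",
}


def convert_feats(upos, feats):
    """Return a UniMorph tag bundle for one token, or "_" when nothing maps."""
    tags = []
    if upos in POS_TO_UNIMORPH:
        tags.append(POS_TO_UNIMORPH[upos])
    if feats != "_":
        for pair in feats.split("|"):
            tag = FEATURE_TO_UNIMORPH.get(pair)
            if tag is not None:  # features missing from the table are dropped
                tags.append(tag)
    return ";".join(tags) if tags else "_"


def convert_line(line):
    """Rewrite the FEATS column of a token line; pass comments and blanks through."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    columns = line.split("\t")
    columns[5] = convert_feats(columns[3], columns[5])  # UPOS is column 4, FEATS column 6
    return "\t".join(columns)
```

With these tables, the NUM token keeps only its part-of-speech tag because `NumType=Card` has no entry, and the PUNCT token stays `_`, as in the sample output.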

Evaluation

To assess a conversion (either of the included Translator objects or your own), the syntax is similar:

python marry.py evaluate --langs he ro de it no_bokmaal 

(Evaluation reads both corpora, so the paths in paths.py must point to where your UD data and your UniMorph data are stored.)
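This README does not say which scores `evaluate` reports. As a rough illustration of what comparing a converted bundle against a UniMorph entry can involve, the sketch below computes set-based precision and recall over tags. The function names, and the choice to score an empty set as 1.0, are assumptions made for this example and may differ from what marry.py does.

```python
def tag_set(bundle):
    """Split a UniMorph bundle such as "N;PL;MASC" into a set of tags."""
    return set() if bundle == "_" else set(bundle.split(";"))


def overlap_scores(converted, reference):
    """Return (precision, recall) of the converted tags against the reference tags.

    Assumption for this sketch: an empty set on either side scores 1.0.
    """
    predicted = tag_set(converted)
    gold = tag_set(reference)
    shared = predicted & gold
    precision = len(shared) / len(predicted) if predicted else 1.0
    recall = len(shared) / len(gold) if gold else 1.0
    return precision, recall
```

For example, `overlap_scores("PRS;V;FIN;3;IND;SG", "V;IND;PRS;3;SG")` gives a recall of 1.0 and a precision of 5/6, because the converted bundle has one tag (`FIN`) that the reference lacks.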

Replication

To replicate the experiments from the paper, use:

python marry.py replicate 

Data

The individual datasets for Universal Dependencies v2 and UniMorph can be downloaded from their respective projects on GitHub.

Contributing

You're welcome to submit a pull request that harmonizes your UD dataset with the corresponding UniMorph data.

  1. Write your own Translator subclass.
  2. Register it in the languages.py list.
  3. Submit the pull request.

License

This project is licensed under the GNU GPL v3; see the LICENSE.md file for details.

Citation

@InProceedings{mccarthy2018udw,
  author = 	"McCarthy, Arya D.
		and Silfverberg, Miikka
		and Cotterell, Ryan
		and Hulden, Mans
		and Yarowsky, David",
  title = 	"Marrying {U}niversal {D}ependencies and {U}niversal {M}orphology",
  booktitle = 	"Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"91--101",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/W18-6011"
}
