mmax2conll

Script to convert coreference data in MMAX format to CoNLL format or raw text files.

See CoNLL-specification.md and MMAX-specification.md for extensive descriptions of the CoNLL and MMAX formats.

Usage

`mmax2conll.py`

The COREA corpus stores its sentence information in the `*_words.xml` files, whereas the SoNaR-1 corpus stores it separately in `*_sentence_level.xml` files. Specifying a sentences file is therefore optional.

To automatically find all (sub)folders that contain a Basedata and Markables folder as direct children and convert all data in those folders, run:

python -m mmax2conll path/to/config.yml path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]
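
The folder search described above can be sketched roughly as follows. This is only an illustration of the rule "a folder qualifies if it has `Basedata` and `Markables` as direct children"; the function name `find_mmax_dirs` is hypothetical and not part of mmax2conll, and the real script may treat symlinks or name casing differently.

```python
from pathlib import Path


def find_mmax_dirs(root):
    """Return every folder at or below `root` that directly contains
    both a `Basedata` and a `Markables` folder.

    Hypothetical helper for illustration; not the mmax2conll implementation.
    """
    root = Path(root)
    # Check the root itself, then all of its subfolders in sorted order.
    candidates = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    return [
        folder
        for folder in candidates
        if (folder / "Basedata").is_dir() and (folder / "Markables").is_dir()
    ]


if __name__ == "__main__":
    for folder in find_mmax_dirs("path/to/some/folder"):
        print(folder)
```

Checking only direct children means that a `Basedata` folder nested deeper inside a document folder does not make its grandparent qualify.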

To only convert one pair (or triple) of files, run:

python -m mmax2conll path/to/config.yml path/to/output.conll path/to/some_words.xml path/to/a_coref_level.xml [path/to/a_sentence_level.xml]

`mmax2raw.py`

To automatically find all (sub)folders that contain a Basedata and Markables folder as direct children and convert all data in those folders, run:

mmax2raw.py path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]

To only convert one file, run:

mmax2raw.py path/to/output.txt path/to/some_words.xml

Columns of CoNLL output

These scripts were first used to convert data from the COREA dataset (Hendrickx et al. 2013, Ch. 7, p. 115-128) to CoNLL. COREA does not contain the following information:

POS tags
constituency tree
predicates
speaker/author information
named entities

Strictly speaking, these scripts therefore do not output data in CoNLL format. The following values and placeholders are used:

Column	Description	Value	Conforms to CoNLL specification?
1	Document ID	file path without extension	Yes
2	Part number	`0` or as extracted from `<word>.alpsent` from MMAX `*_words.xml` files [1]	Yes
3	Word number	`<word>.alppos` or `<word>.pos` from MMAX `*_words.xml` files	Yes
4	Word itself	content of `<word>` tags from MMAX `*_words.xml` files	Yes
5	POS	`[POS]`	No
6	Parse bit	`*`	No
7	Predicate lemma	`-`	Yes
8	Predicate Frameset ID	`-`	Yes
9	Word sense	`-`	Yes
10	Speaker/Author	`UNKNOWN`	???
11	Named Entities	`*`	Yes
-	Predicate Arguments	None: column(s) left out entirely	Yes, following the example in CoNLL 2012
12	Coreference	extracted from MMAX `*_coref_level.xml` files (ISSUE! [2])	Yes
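
To make the table concrete, here is a minimal sketch of how one output line could be assembled from these values. The function name `conll_row`, its parameters, and the tab separator are assumptions made for illustration; they are not taken from the mmax2conll source.

```python
def conll_row(doc_id, part, word_number, word, coref="-"):
    """Build one 12-column line using the placeholders from the table.

    Hypothetical helper; the tab separator is an assumption.
    """
    columns = [
        doc_id,            # 1: document ID (file path without extension)
        str(part),         # 2: part number
        str(word_number),  # 3: word number
        word,              # 4: the word itself
        "[POS]",           # 5: POS placeholder (not in COREA)
        "*",               # 6: parse bit placeholder
        "-",               # 7: predicate lemma
        "-",               # 8: predicate frameset ID
        "-",               # 9: word sense
        "UNKNOWN",         # 10: speaker/author
        "*",               # 11: named entities
        coref,             # 12: coreference, "-" when the word is in no span
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    print(conll_row("some/document", 0, 1, "Hallo", "(10"))
```

The predicate arguments columns are simply absent, so every line has exactly twelve fields.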

[1]: The part numbers of DCOI start at 1, whereas the part numbers in a CoNLL file start at 0. To keep the origin of the data clear, this 1-based part number is not changed; instead, an empty part 0 is added to those files.

[2]: The reference spans are not closed in the correct order if they end at the same word. The following is an example of output from mmax2conll:

          (10
            -
      (52|(55
          52)
            -
10)|55)|(133)

The pedantically correct output would be:

          (10
            -
      (55|(52
          52)
            -
(133)|55)|10)
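
One way to obtain the properly nested order is sketched below. It is not the mmax2conll implementation: the function `coref_column` and the `(entity_id, start, end)` span tuples (inclusive, 0-based word indices) are hypothetical. The idea is that spans opening at the same word are written outermost first (latest end first), and spans closing at the same word are written innermost first (latest start first), with single-word spans in between.

```python
from collections import defaultdict


def coref_column(n_words, spans):
    """Return the coreference column for `n_words` words.

    `spans` is an iterable of (entity_id, start, end) with inclusive,
    0-based word indices. Hypothetical helper for illustration only.
    """
    opens = defaultdict(list)    # word index -> [(end, order, entity_id)]
    singles = defaultdict(list)  # word index -> [entity_id]
    closes = defaultdict(list)   # word index -> [(start, order, entity_id)]
    for order, (entity_id, start, end) in enumerate(spans):
        if start == end:
            singles[start].append(entity_id)
        else:
            opens[start].append((end, order, entity_id))
            closes[end].append((start, order, entity_id))

    column = []
    for i in range(n_words):
        parts = []
        # Outermost span (latest end) opens first.
        for _, _, eid in sorted(opens[i], key=lambda s: (-s[0], s[1])):
            parts.append(f"({eid}")
        # Single-word spans sit inside everything that is open here.
        parts.extend(f"({eid})" for eid in singles[i])
        # Innermost span (latest start) closes first.
        for _, _, eid in sorted(closes[i], key=lambda s: (-s[0], -s[1])):
            parts.append(f"{eid})")
        column.append("|".join(parts) if parts else "-")
    return column


if __name__ == "__main__":
    example = [(10, 0, 5), (52, 2, 3), (55, 2, 5), (133, 5, 5)]
    for cell in coref_column(6, example):
        print(cell)
```

For the spans in the example above, this yields `(10`, `-`, `(55|(52`, `52)`, `-`, `(133)|55)|10)`. Spans that genuinely cross each other cannot be nested by any ordering, so the sketch makes no attempt to handle them specially.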

Issues

Skipping a whole file on any error is too wasteful
'on_missing' config key is not validated before use
basedata_dir and markables_dir should not be configuration keys
Too many methods in main.py are marked as class methods

References

Christoph Müller and Michael Strube.
Multi-Level Annotation in MMAX
In Proceedings of the 4th SIGDIAL Workshop. 2003.
URL http://www.speech.cs.cmu.edu/sigdial2003/proceedings/07_LONG_strube_paper.pdf

Iris Hendrickx, Gosse Bouma, Walter Daelemans and Véronique Hoste.
COREA: Coreference Resolution for Extracting Answers for Dutch
In Essential Speech and Language Technology for Dutch, Ch. 7, p. 115-128. 2013.
Editors: Peter Spyns, Jan Odijk
URL https://link.springer.com/book/10.1007/978-3-642-30910-6
DOI 10.1007/978-3-642-30910-6

SoNaR: https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus

Release history

1.0.1

Apr 15, 2019

1.0.0

Jul 12, 2018

Download files

Source Distribution

mmax2conll-1.0.0.zip (30.7 kB)

Uploaded Jul 12, 2018 Source

Hashes for mmax2conll-1.0.0.zip
Algorithm	Hash digest
SHA256	`899f43dbd2bf5bbba3f1f91bbdb964c382c4843a8d729bb2e7fc5c9974a1474d`
MD5	`39b62f977e857b52ff67528fbf64b9ba`
BLAKE2b-256	`101c4151a19041793023352644ecdb3e4a2801e9413c5cc3cfdb9cb76c7c2d4f`