mmax2conll
Script to convert coreference data in MMAX format to CoNLL format or raw text files.
See CoNLL-specification.md and MMAX-specification.md for extensive descriptions of the CoNLL and MMAX formats.
Usage
mmax2conll.py
Because the COREA corpus stores its sentence information in the *_words.xml files, whereas the SoNaR-1 corpus stores it separately in *_sentence_level.xml files, specifying a sentences file is optional.
To automatically find all (sub)folders that contain a Basedata and a Markables folder as direct children and convert all data in those folders, run:
python -m mmax2conll path/to/config.yml path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]
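The folder discovery described above can be sketched as follows. This is only an illustration, not the code used by mmax2conll: the function name `find_mmax_dirs` and its parameters are hypothetical, and the folder names are assumed to be exactly Basedata and Markables.

```python
import os


def find_mmax_dirs(root, basedata="Basedata", markables="Markables"):
    """Yield every directory under `root` (including `root` itself) that has
    both a Basedata and a Markables folder as direct children.

    Hypothetical helper: the names and defaults are assumptions.
    """
    for dirpath, dirnames, _filenames in os.walk(root):
        # `dirnames` lists only the direct child folders of `dirpath`.
        if basedata in dirnames and markables in dirnames:
            yield dirpath
```

Calling `list(find_mmax_dirs("path/to/some/folder"))` would give the folders whose contents would be converted when passing `-d path/to/some/folder`.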
To only convert one pair (or triple) of files, run:
python -m mmax2conll path/to/config.yml path/to/output.conll path/to/some_words.xml path/to/a_coref_level.xml [path/to/a_sentence_level.xml]
mmax2raw.py
To automatically find all (sub)folders that contain a Basedata and a Markables folder as direct children and convert all data in those folders, run:
mmax2raw.py path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]
To only convert one file, run:
mmax2raw.py path/to/output.txt path/to/some_words.xml
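As a rough idea of what extracting raw text from a *_words.xml file involves, here is a minimal sketch. It assumes that every token is the text content of a `<word>` element and that tokens are joined by single spaces; the function name `words_to_text` and the sample document are made up for illustration, and the actual mmax2raw.py may handle sentence boundaries and whitespace differently.

```python
import xml.etree.ElementTree as ET


def words_to_text(xml_string):
    """Join the content of all <word> elements with single spaces.

    Hypothetical helper, not the implementation of mmax2raw.py.
    """
    root = ET.fromstring(xml_string)
    return " ".join((word.text or "") for word in root.iter("word"))


# Made-up sample in the style of an MMAX words file.
sample = """<words>
<word id="word_1">Dit</word>
<word id="word_2">is</word>
<word id="word_3">een</word>
<word id="word_4">zin</word>
<word id="word_5">.</word>
</words>"""

print(words_to_text(sample))  # Dit is een zin .
```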
Columns of CoNLL output
These scripts were first used to convert the COREA dataset (Ch.7 p.115 -- 128) to CoNLL. COREA does not contain the following information:
- POS tags
- constituency tree
- predicates
- speaker/author information
- named entities
Therefore, strictly speaking, these scripts do not output data in CoNLL format. The following values and placeholders are used:
Column | Description | Value | Conforms to CoNLL specification? |
---|---|---|---|
1 | Document ID | file path without extension | Yes |
2 | Part number | 0 or as extracted from <word>.alpsent from MMAX *_words.xml files [1] | Yes |
3 | Word number | <word>.alppos or <word>.pos from MMAX *_words.xml files | Yes |
4 | Word itself | content of <word> tags from MMAX *_words.xml files | Yes |
5 | POS | [POS] | No |
6 | Parse bit | * | No |
7 | Predicate lemma | - | Yes |
8 | Predicate Frameset ID | - | Yes |
9 | Word sense | - | Yes |
10 | Speaker/Author | UNKNOWN | ??? |
11 | Named Entities | * | Yes |
- | Predicate Arguments | None: column(s) left out entirely | Yes, conforming to the example in CoNLL 2012 |
12 | Coreference | extracted from MMAX *_coref_level.xml files (ISSUE! [2]) | Yes |
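To make the column layout concrete, the sketch below assembles one output line from the values listed in the table. It assumes that the placeholders are written literally as shown, that columns are separated by tabs, and that `-` is used when a word has no coreference information; the function name `conll_row` and the separator are assumptions, not taken from mmax2conll.

```python
# Fixed values for columns 5 to 11, copied from the table above.
PLACEHOLDERS = ("[POS]", "*", "-", "-", "-", "UNKNOWN", "*")


def conll_row(doc_id, part, word_number, word, coref="-", sep="\t"):
    """Build one 12-column line (hypothetical helper; separator is assumed)."""
    columns = [doc_id, str(part), str(word_number), word, *PLACEHOLDERS, coref]
    return sep.join(columns)


print(conll_row("path/to/doc", 0, 0, "Dit", "(10"))
```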
[1]: The part numbers of DCOI start at 1, whereas the part numbers in a CoNLL file start at 0. To keep the origin of the data clear, this 1-based part number is not changed; instead, an empty part 0 is added to those files.
[2]: The reference spans are not closed in the correct order if they end at the same word. The following is an example of output from mmax2conll:
(10
-
(52|(55
52)
-
10)|55)|(133)
The pedantically correct output would be:
(10
-
(55|(52
52)
-
(133)|55)|10)
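One way to obtain the nested order is sketched below. It is not the code used by mmax2conll, and the function name `coref_column` and the input shape (chain ID, first word index, last word index, both inclusive) are assumptions. At each word, spans that end later are opened first, one-word spans come next, and spans that started later are closed first.

```python
from collections import defaultdict


def coref_column(n_tokens, mentions):
    """Return the coreference column for `n_tokens` words.

    `mentions` is a sequence of (chain_id, start, end) with inclusive,
    0-based word indices. Hypothetical helper for illustration only.
    """
    opens = defaultdict(list)
    singles = defaultdict(list)
    closes = defaultdict(list)
    for idx, (chain_id, start, end) in enumerate(mentions):
        if start == end:
            singles[start].append(chain_id)
        else:
            # Sort keys: open the span that ends last first ...
            opens[start].append((-end, idx, chain_id))
            # ... and close the span that started last first.
            closes[end].append((-start, -idx, chain_id))

    column = []
    for i in range(n_tokens):
        parts = [f"({cid}" for _, _, cid in sorted(opens[i])]
        parts += [f"({cid})" for cid in singles[i]]
        parts += [f"{cid})" for _, _, cid in sorted(closes[i])]
        column.append("|".join(parts) or "-")
    return column


# The spans from the example above.
mentions = [(10, 0, 5), (52, 2, 3), (55, 2, 5), (133, 5, 5)]
print(coref_column(6, mentions))
# ['(10', '-', '(55|(52', '52)', '-', '(133)|55)|10)']
```

Spans that cross each other without being nested cannot be bracketed correctly in any order; this sketch does not detect that case.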
Issues
- Skipping a whole file on any error is too wasteful
- 'on_missing' config key is not validated before use
- basedata_dir and markables_dir should not be configuration keys
- Too many methods in main.py are marked as class methods
References
Christoph Müller and Michael Strube.
Multi-Level Annotation in MMAX.
In Proceedings of the 4th SIGDIAL Workshop. 2003.
URL: http://www.speech.cs.cmu.edu/sigdial2003/proceedings/07_LONG_strube_paper.pdf
Iris Hendrickx, Gosse Bouma, Walter Daelemans and Véronique Hoste.
COREA: Coreference Resolution for Extracting Answers for Dutch.
Essential Speech and Language Technology for Dutch, Ch.7, p.115 -- 128. 2013.
Editors: Peter Spyns, Jan Odijk.
URL: https://link.springer.com/book/10.1007/978-3-642-30910-6
DOI: 10.1007/978-3-642-30910-6
SoNaR: https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus