
CoNLL-U Parser

CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.

Why should you use conllu?

  • It's simple: ~300 lines of code.
  • Works with both Python 2 and Python 3.
  • It has no dependencies.
  • Nice set of tests, run continuously on Travis CI.
  • It has 100% test coverage (and has undergone mutation testing).
  • It has lots of downloads.

Installation

pip install conllu

Or, if you are using conda:

conda install -c conda-forge conllu

Notes on updating from 0.1 to 1.0

I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.

Example usage

At the top level, conllu provides two functions, parse and parse_tree. The first parses sentences and returns them as flat token lists; the second returns nested tree structures. Let's go through them one by one.

Use parse() to parse into a list of sentences

>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""

Now you have the data in a variable called data. Let's parse it:

>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]

Advanced usage: If you have many sentences (say over a megabyte) to parse at once, you can avoid loading them all into memory by using parse_incr() instead of parse. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenLists out. Here's how you would use it:

from io import open
from conllu import parse_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)

For most files, parse works fine.

Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.

>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>

The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:

>>> token = sentence[0]
>>> token
OrderedDict([
    ('id', 1),
    ('form', 'The'),
    ('lemma', 'the'),
    ...
])
>>> token["form"]
'The'

New in conllu 2.0: filter() a TokenList

>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>
>>> sentence.filter(form="quick")
TokenList<quick>

By using filter(field1__field2=value) you can filter based on subelements further down in a parsed token.

>>> sentence.filter(feats__Degree="Pos")
TokenList<quick, brown, lazy>

Filters can also be chained (meaning you can do sentence.filter(...).filter(...)), and filtering on multiple properties at the same time (sentence.filter(field1=value1, field2=value2)) means that ALL properties must match.

Parse metadata from a CoNLL-U file

Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.

>>> sentence.metadata
OrderedDict([
    ('text', 'The quick brown fox jumps over the lazy dog.')
])

Turn a TokenList back into CoNLL-U

If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:

>>> sentence.serialize()
# text = The quick brown fox jumps over the lazy dog.
1   The     the     DET    DT   Definite=Def|PronType=Art   4   det    _   _
2   quick   quick   ADJ    JJ   Degree=Pos                  4   amod   _   _
3   brown   brown   ADJ    JJ   Degree=Pos                  4   amod   _   _
4   fox     fox     NOUN   NN   Number=Sing                 5   nsubj  _   _
5   jumps   jump    VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root   _   _
6   over    over    ADP    IN   _                           9   case   _   _
7   the     the     DET    DT   Definite=Def|PronType=Art   9   det    _   _
8   lazy    lazy    ADJ    JJ   Degree=Pos                  9   amod   _   _
9   dog     dog     NOUN   NN   Number=Sing                 5   nmod   _   SpaceAfter=No
10  .       .       PUNCT  .    _                           5   punct  _   _

Turn a TokenList into a TokenTree (see below)

You can also convert a TokenList to a TokenTree by using to_tree:

>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>

That's it!

Use parse_tree() to parse into a list of dependency trees

Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.

>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]

Advanced usage: If you have many sentences (say over a megabyte) to parse at once, you can avoid loading them all into memory by using parse_tree_incr() instead of parse_tree. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenTrees out. Here's how you would use it:

from io import open
from conllu import parse_tree_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokentree in parse_tree_incr(data_file):
    print(tokentree)

Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.

>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps}, children=[...]>

To quickly visualize the tree structure you can call print_tree on a TokenTree.

>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]

To access the token corresponding to the current node in the tree, use token:

>>> root.token
OrderedDict([
    ('id', 5),
    ('form', 'jumps'),
    ('lemma', 'jump'),
    ...
])

To start walking down the children of the current node, use the children attribute:

>>> children = root.children
>>> children
[
    TokenTree<token={id=4, form=fox}, children=[...]>,
    TokenTree<token={id=9, form=dog}, children=[...]>,
    TokenTree<token={id=10, form=.}, children=None>
]

Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.

>>> root.metadata
OrderedDict([
    ('text', 'The quick brown fox jumps over the lazy dog.')
])

If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:

>>> root.serialize()
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
...

If you want to write it back to a file, you can use something like this:

>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> 
>>> # Make some change to sentences here
>>> 
>>> with open('file-to-write-to', 'w') as f:
...     f.writelines([sentence.serialize() + "\n" for sentence in sentences])

Customizing parsing to handle strange variations of CoNLL-U

Far from all CoNLL-U files found in the wild follow the CoNLL-U format specification. conllu tries to parse even files that are malformed according to the specification, but sometimes that doesn't work. For those situations, you can change how conllu parses your files.

A normal CoNLL-U file consists of a specific set of fields (id, form, lemma, and so on). Let's walk through how to parse a custom format using the three options fields, field_parsers, and metadata_parsers. Here's the custom format we'll use:

>>> data = """
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
1   My       TAG1|TAG2
2   custom   TAG3
3   format   TAG4

"""

Now, let's parse this with the default settings and look specifically at the first token to see how it was parsed.

>>> sentences = parse(data)
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('lemma', 'TAG1|TAG2')])

The parser has assumed (incorrectly) that the third field must be the default lemma field and parsed it as such. Let's customize this so the parser gets the name right, by setting the fields parameter when calling parse.

>>> sentences = parse(data, fields=["id", "form", "tag"])
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('tag', 'TAG1|TAG2')])

The only difference is that you now get the correct field name back when parsing. Now, let's say you want those two tags returned as a list instead of as a string you'd have to split yourself. This can be done using field_parsers.

>>> split_func = lambda line, i: line[i].split("|")
>>> sentences = parse(data, fields=["id", "form", "tag"], field_parsers={"tag": split_func})
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('tag', ['TAG1', 'TAG2'])])

That's much better! field_parsers specifies a mapping from a field name, to a function that can parse that field. In our case, we specify that the field with custom logic is "tag" and that the function to handle it is split_func. Each field_parser gets sent two parameters:

  • line: The whole list of values from this line, split on whitespace. The reason you get the full line is so you can merge several tokens into one using a field_parser if you wanted.
  • i: The current location in the line where you currently are. Most often, you'll use line[i] to get the current value.

In our case, we return line[i].split("|"), which returns a list, just like we want.

Let's look at the metadata in this example.

"""
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
"""

Neither of these comments is valid CoNLL-U metadata, but since the first line follows the key-value format of other (valid) fields, conllu will parse it anyway:

>>> sentences = parse(data)
>>> sentences[0].metadata
OrderedDict([('tagset', 'TAG1|TAG2|TAG3|TAG4')])

Let's return this as a list using the metadata_parsers parameter.

>>> sentences = parse(data, metadata_parsers={"tagset": lambda key, value: (key, value.split("|"))})
>>> sentences[0].metadata
OrderedDict([('tagset', ['TAG1', 'TAG2', 'TAG3', 'TAG4'])])

A metadata parser behaves similarly to a field parser, but since most comments you'll see are of the form "key = value", these values are parsed and cleaned first, and then sent to your custom metadata parser. Here we just take the value, split it on "|", and return a list. And lo and behold, we get what we wanted!

Now, let's deal with the "sentence-123" comment. Specifying another metadata_parser won't work, because this is an ID that will be different for each sentence. Instead, let's use a special metadata parser, called __fallback__.

>>> sentences = parse(data, metadata_parsers={
...    "tagset": lambda key, value: (key, value.split("|")),
...    "__fallback__": lambda key, value: ("sentence-id", key)
... })
>>> sentences[0].metadata
OrderedDict([
    ('tagset', ['TAG1', 'TAG2', 'TAG3', 'TAG4']),
    ('sentence-id', 'sentence-123')
])

Just what we wanted! __fallback__ gets called any time none of the other metadata_parsers match, and just like the others, it gets sent the key and value of the current line. In our case, the line contains no "=" to split on, so key will be "sentence-123" and value will be empty. We can return whatever we want here, but let's just say we want to call this field "sentence-id" so we return that as the key, and "sentence-123" as our value.

Finally, consider an even trickier case.

>>> data = """
# id=1-document_id=36:1047-span=1
1   My       TAG1|TAG2
2   custom   TAG3
3   format   TAG4

"""

This is actually three different comments, but somehow they are separated by "-" instead of on their own lines. To handle this, we get to use the ability of a metadata_parser to return multiple matches from a single line.

>>> sentences = parse(data, metadata_parsers={
...    "__fallback__": lambda key, value: [pair.split("=") for pair in (key + "=" + value).split("-")]
... })
>>> sentences[0].metadata
OrderedDict([
    ('id', '1'),
    ('document_id', '36:1047'),
    ('span', '1')
])

Our fallback parser returns a list of matches, one per key-value pair it finds. The key + "=" + value trick is needed because by default conllu assumes this is a valid comment, so key is "id" and value is everything after the first "=", i.e. 1-document_id=36:1047-span=1 (note the missing "id=" at the beginning). We need to add it back before splitting on "-".

And that's it! Using these tricks you should be able to parse all the strange files you stumble into.

Develop locally and run the tests

  1. Make a fork of the repository to your own GitHub account.

  2. Clone the repository locally on your computer:

    git clone git@github.com:YOURUSERNAME/conllu.git conllu
    cd conllu
    
  3. Install the library used for running the tests:

    pip install tox
    
  4. Now you can run the tests:

    tox
    

    This runs tox across all supported versions of Python, and also runs checks for code coverage, syntax errors, and import sorting.

  5. (Alternative) If you have just one version of Python installed and don't want the hassle of installing multiple versions (hint: install pyenv and pyenv-tox), it's fine to run tox with just that one version:

    tox -e py36
    
  6. Make a pull request. Here's a good guide on PRs from GitHub.

Thanks for helping conllu become a better library!
