CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary

CoNLL-U Parser

CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.

Why should you use conllu?

It's simple. ~300 lines of code.
It works with both Python 2 and Python 3.
It has no dependencies.
It has a nice set of tests with CI set up.
It has 100% test coverage.

Installation

pip install conllu

Or, if you are using conda:

conda install -c conda-forge conllu

Notes on updating from 0.1 to 1.0

I don't like breaking backwards compatibility, but I felt I had to in order to add new features. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.

Example usage

At the top level, conllu provides two functions, parse() and parse_tree(). The first one parses sentences and returns a flat list. The other returns a nested tree structure. Let's go through them one by one.

Use parse() to parse into a list of sentences

>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""

Now you have the data in a variable called data. Let's parse it:

>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, ...>]

Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_incr() instead of parse(). It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() on it to get the TokenLists out. Here's how you would use it:

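from io import open
from conllu import parse_incr

# io.open accepts encoding= on both Python 2 and Python 3.
# The with block closes the file once all sentences have been read.
with open("huge_file.conllu", "r", encoding="utf-8") as data_file:
    for tokenlist in parse_incr(data_file):
        print(tokenlist)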

For most files, parse works fine.
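
To illustrate why a generator helps with large files, here is a minimal sketch using only the standard library. It is not how conllu implements parse_incr(); the helper name iter_sentence_blocks is hypothetical, and the sketch only assumes that sentences are separated by blank lines. The token lines in the fake file are shortened to two columns to keep the example small.

```python
import io


def iter_sentence_blocks(lines):
    """Hypothetical helper: yield the lines of one sentence at a time."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line ends the current sentence.
            yield block
            block = []
    if block:
        # Last sentence when the input has no trailing blank line.
        yield block


# io.StringIO stands in for an opened file here.
fake_file = io.StringIO(
    u"# text = Hi.\n1\tHi\n2\t.\n\n# text = Bye.\n1\tBye\n2\t.\n"
)
for block in iter_sentence_blocks(fake_file):
    # Only one sentence's lines are held in memory at this point.
    print(len(block), block[0])
```

Because the helper is a generator, nothing is read until you start iterating, which is the same reason you need to loop over (or call list() on) the result of parse_incr().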

Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.

>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, ...>

The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:

>>> token = sentence[0]
>>> token
OrderedDict([
    ('id', 1),
    ('form', 'The'),
    ('lemma', 'the'),
    ...
])
>>> token["form"]
'The'
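
To make the shape of a token more concrete, here is a minimal sketch of how one token line could be turned into such a dictionary with only the standard library. This is not conllu's actual code. The helper name split_token_line is hypothetical, the field names other than id, form, lemma, upostag and deprel are assumed from the ten-column CoNLL-U layout, and splitting on a tab or on two or more spaces is an assumption made so that the sample data above works.

```python
import re
from collections import OrderedDict

# Assumed field names, following the ten-column CoNLL-U layout.
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def split_token_line(line):
    """Hypothetical helper: map one token line to an OrderedDict."""
    # Assumption: columns are separated by a tab or by 2+ spaces.
    values = re.split(r"\t| {2,}", line.strip())
    token = OrderedDict(zip(FIELDS, values))
    # This sketch only handles plain integer ids (no ranges like 1-2).
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    if token["feats"] == "_":
        # "_" marks an empty column in CoNLL-U.
        token["feats"] = None
    else:
        token["feats"] = OrderedDict(
            pair.split("=", 1) for pair in token["feats"].split("|")
        )
    return token


line = "1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _"
token = split_token_line(line)
print(token["form"], token["head"], token["feats"]["PronType"])
```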

Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.

>>> sentence.metadata
OrderedDict([
    ("text", "The quick brown fox jumps over the lazy dog."),
    ...
])
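
As a rough illustration of where the metadata comes from, the sketch below collects "# key = value" comment lines into an ordered dictionary. It is a simplified stand-in and not conllu's implementation; the helper name read_metadata is hypothetical, the sent_id line is added only for illustration, and storing None for a comment without "=" is an assumption.

```python
from collections import OrderedDict


def read_metadata(sentence_text):
    """Hypothetical helper: collect '# key = value' comment lines."""
    metadata = OrderedDict()
    for line in sentence_text.splitlines():
        if not line.startswith("#"):
            # Token lines are ignored here, even if they contain "=".
            continue
        body = line[1:].strip()
        if "=" in body:
            key, value = body.split("=", 1)
            metadata[key.strip()] = value.strip()
        elif body:
            # Assumption: a comment without "=" is kept with no value.
            metadata[body] = None
    return metadata


sample = """# sent_id = 1
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
"""
print(read_metadata(sample)["text"])
```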

If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:

>>> print(sentence.serialize())
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
...
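
If you're curious what turning a token back into a line involves, here is a minimal sketch that joins a token's values with tabs. It is not the library's serialize() code; the helper name serialize_token is hypothetical, and it assumes that None stands for an empty column ("_") and that a nested dictionary holds key=value pairs joined by "|".

```python
from collections import OrderedDict


def serialize_token(token):
    """Hypothetical helper: join a token's values into one tab-separated line."""
    def render(value):
        if value is None:
            # Empty columns are written as "_" in CoNLL-U.
            return "_"
        if isinstance(value, dict):
            return "|".join("{}={}".format(k, v) for k, v in value.items())
        return str(value)

    return "\t".join(render(value) for value in token.values())


token = OrderedDict([
    ("id", 1), ("form", "The"), ("lemma", "the"), ("upostag", "DET"),
    ("xpostag", "DT"),
    ("feats", OrderedDict([("Definite", "Def"), ("PronType", "Art")])),
    ("head", 4), ("deprel", "det"), ("deps", None), ("misc", None),
])
print(serialize_token(token))
```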

You can also convert a TokenList to a TokenTree by using to_tree:

>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>

That's it!

Use parse_tree() to parse into a list of dependency trees

Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.

>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
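
To show how a tree "hides" in the head column, here is a minimal sketch that nests tokens under the token their head points to. It is not how conllu builds a TokenTree; the helper name build_tree and the plain-dictionary node layout are hypothetical. The (id, form, head) triples are copied from the sample sentence above.

```python
from collections import defaultdict

# (id, form, head) triples taken from the sample sentence.
TOKENS = [
    (1, "The", 4), (2, "quick", 4), (3, "brown", 4), (4, "fox", 5),
    (5, "jumps", 0), (6, "over", 9), (7, "the", 9), (8, "lazy", 9),
    (9, "dog", 5), (10, ".", 5),
]


def build_tree(tokens):
    """Hypothetical helper: nest each token under the token its head names."""
    children_of = defaultdict(list)
    for token_id, form, head in tokens:
        children_of[head].append((token_id, form))

    def make_node(token_id, form):
        return {
            "id": token_id,
            "form": form,
            "children": [make_node(*child) for child in children_of[token_id]],
        }

    # head == 0 marks the root of the sentence.
    root_id, root_form = children_of[0][0]
    return make_node(root_id, root_form)


tree = build_tree(TOKENS)
print(tree["form"], [child["form"] for child in tree["children"]])
```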

Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_tree_incr() instead of parse_tree(). It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() on it to get the TokenTrees out. Here's how you would use it:

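from io import open
from conllu import parse_tree_incr

# io.open accepts encoding= on both Python 2 and Python 3.
# The with block closes the file once all sentences have been read.
with open("huge_file.conllu", "r", encoding="utf-8") as data_file:
    for tokentree in parse_tree_incr(data_file):
        print(tokentree)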

Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.

>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps, ...}, children=...>

To quickly visualize the tree structure you can call print_tree on a TokenTree.

>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]

To access the token corresponding to the current node in the tree, use token:

>>> root.token
OrderedDict([
    ('id', 5),
    ('form', 'jumps'),
    ('lemma', 'jump'),
    ...
])

To start walking down the children of the current node, use the children attribute:

>>> children = root.children
>>> children
[
    TokenTree<token={id=4, form=fox, ...}, children=...>,
    TokenTree<token={id=9, form=dog, ...}, children=...>,
    TokenTree<token={id=10, form=., ...}, children=...>,
]
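
Walking the whole tree is then a matter of recursing into children. The sketch below uses a small stand-in Node type so that it runs without conllu installed; Node and walk are hypothetical names, and the only assumption about a real TokenTree is that it exposes the token and children attributes shown above.

```python
from collections import namedtuple

# Stand-in for a tree node with the same two attribute names
# (token, children) that are shown for TokenTree.
Node = namedtuple("Node", ["token", "children"])


def walk(node, depth=0):
    """Yield (depth, token) pairs depth-first, parents before children."""
    yield depth, node.token
    for child in node.children:
        for item in walk(child, depth + 1):
            yield item


fox = Node({"id": 4, "form": "fox"}, [Node({"id": 1, "form": "The"}, [])])
root = Node({"id": 5, "form": "jumps"}, [fox, Node({"id": 10, "form": "."}, [])])

for depth, token in walk(root):
    # Indent each form by its depth in the tree.
    print("    " * depth + token["form"])
```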

Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.

>>> root.metadata
OrderedDict([
    ("text", "The quick brown fox jumps over the lazy dog."),
    ...
])

If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:

>>> print(root.serialize())
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
...

You can read about the CoNLL-U format at the Universal Dependencies project.

Develop locally and run the tests

Make a fork of the repository to your own GitHub account.

Clone the repository locally on your computer:

git clone git@github.com:YOURUSERNAME/conllu.git conllu
cd conllu

Install the library used for running the tests:
```
pip install tox
```
Now you can run the tests:
```
tox
```
This runs tox across all supported versions of Python, and also runs checks for code coverage, syntax errors, and import sorting.
(Alternative) If you only have one version of Python installed and don't want to go through the hassle of installing multiple versions of Python (hint: install pyenv and pyenv-tox), it's fine to run tox with just one version of Python:
```
tox -e py36
```
Make a pull request. Here's a good guide on PRs from GitHub.

Thanks for helping conllu become a better library!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

conllu-1.3.2.tar.gz (11.3 kB)

Uploaded Aug 9, 2019 Source

Built Distribution

conllu-1.3.2-py2.py3-none-any.whl (9.3 kB)

Uploaded Aug 9, 2019 Python 2 Python 3

Hashes for conllu-1.3.2.tar.gz

Algorithm	Hash digest
SHA256	`fcb1538001242a154f0f8f50ea0f17bec221afbdb2972f18457eb0c320c5b68b`
MD5	`cc8b083ba4a8cf395d1178dbb8616a28`
BLAKE2b-256	`58bb2f16eceda9d1692a813ed7dd7a0b557e95f06c5b1d4dbc0d03085a985dcd`

Hashes for conllu-1.3.2-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`e4bfea6888258b0b6f81b6bce7844303a6357e7d8ae518190d59f3136674eced`
MD5	`741620de598f26a94f13325a1d1db58b`
BLAKE2b-256	`56eb82b67b01903cc1ff5fae4404bfc53b63ad3ff34ec72a3008591768aee8bb`