CoNLL-U Parser
CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.
Why should you use conllu?
- It's simple. ~150 lines of code (including whitespace).
- Works with both Python 2 and Python 3
- It has no dependencies
- Nice set of tests with CI setup
- It has 100% test coverage
Installation
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Notes on updating from 0.1 to 0.2
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 0.2 might require code changes. Here's a guide on how to upgrade to 0.2.
Example usage
At the top level, conllu provides two functions, parse and parse_tree. The first parses sentences and returns a flat list. The second returns a nested tree structure. Let's go through them one by one.
Use parse() to parse into a list of sentences
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
"""
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, ...>]
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
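To illustrate why the result is always a list, here is a minimal sketch of how a CoNLL-U string can be divided into sentences: the format separates sentences with a blank line. This is not conllu's actual implementation; split_sentences and the two-sentence sample text are made up for illustration.

```python
import re


def split_sentences(text):
    """Split CoNLL-U text into one block of lines per sentence.

    Hypothetical helper: sentences are assumed to be separated by
    one or more blank lines, as in the CoNLL-U format.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]


# Made-up sample with two short sentences; fields are tab-separated.
text = (
    "# text = Hello.\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# text = Bye.\n"
    "1\tBye\tbye\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
)

blocks = split_sentences(text)
print(len(blocks))
```

Even a string holding a single sentence yields a one-element list under this scheme, which matches the behaviour described above.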
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, ...>
The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:
>>> token = sentence[0]
>>> token
OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
...
])
>>> token["form"]
'The'
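As a rough picture of what such a token dictionary holds, the sketch below splits one tab-separated CoNLL-U line into an ordered mapping and turns the feats column into a nested dictionary. It is an illustration under assumptions, not conllu's own code: parse_token_line and parse_feats are hypothetical helpers, the key names other than those shown in this README are assumed, and only plain integer ids are handled.

```python
from collections import OrderedDict

# The ten CoNLL-U columns. "upostag" follows the tree printout in this
# README; the remaining key names are assumptions made for this sketch.
FIELDS = ("id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc")


def parse_feats(value):
    """Turn 'A=1|B=2' into an OrderedDict; '_' marks an empty field."""
    if value == "_":
        return None
    return OrderedDict(pair.split("=", 1) for pair in value.split("|"))


def parse_token_line(line):
    """Split one tab-separated token line into an ordered mapping."""
    values = line.split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    token = OrderedDict(zip(FIELDS, values))
    # Simplification: ranges such as "1-2" and empty nodes such as "1.1"
    # are valid CoNLL-U ids but are not handled here.
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    token["feats"] = parse_feats(token["feats"])
    return token


line = "\t".join(["1", "The", "the", "DET", "DT",
                  "Definite=Def|PronType=Art", "4", "det", "_", "_"])
token = parse_token_line(line)
print(token["form"])
print(token["feats"]["PronType"])
```

Building the line with "\t".join makes the tab separators explicit, which is easy to lose when copying CoNLL-U data between editors.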
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
OrderedDict([
("text", "The quick brown fox jumps over the lazy dog."),
...
])
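The mapping from comment lines to metadata can be sketched as follows, assuming comments of the form "# key = value". This is a simplified illustration rather than the library's parser; parse_metadata is a hypothetical helper and the sent_id value is invented for the example.

```python
from collections import OrderedDict


def parse_metadata(sentence_text):
    """Collect '# key = value' comment lines into an ordered mapping."""
    metadata = OrderedDict()
    for line in sentence_text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue  # token lines are not metadata
        body = line[1:].strip()
        if "=" not in body:
            continue  # free-form comment without a key/value pair
        key, value = body.split("=", 1)
        metadata[key.strip()] = value.strip()
    return metadata


# The sent_id below is a made-up value used only for this example.
sample = "\n".join([
    "# sent_id = example-1",
    "# text = The quick brown fox jumps over the lazy dog.",
    "\t".join(["1", "The", "the", "DET", "DT",
               "Definite=Def|PronType=Art", "4", "det", "_", "_"]),
])

metadata = parse_metadata(sample)
print(metadata["text"])
```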
That's it!
Use parse_tree() to parse into a list of dependency trees
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.
>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps, ...}, children=...>
To quickly see the tree structure you can call print_tree() on a TokenTree.
>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]
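The way a tree falls out of the head column can be sketched with a few lines of plain Python: group tokens by their head id, then walk down from head 0 (the root marker). This is an illustration of the idea only, not how conllu builds a TokenTree; build_children and format_tree are hypothetical names and the output format is a shortened version of the printout above.

```python
from collections import defaultdict

# (id, form, head, deprel) taken from the example sentence; head 0 marks the root.
TOKENS = [
    (1, "The", 4, "det"),
    (2, "quick", 4, "amod"),
    (3, "brown", 4, "amod"),
    (4, "fox", 5, "nsubj"),
    (5, "jumps", 0, "root"),
    (6, "over", 9, "case"),
    (7, "the", 9, "det"),
    (8, "lazy", 9, "amod"),
    (9, "dog", 5, "nmod"),
    (10, ".", 5, "punct"),
]


def build_children(tokens):
    """Map each head id to the list of tokens that depend on it."""
    children = defaultdict(list)
    for token in tokens:
        children[token[2]].append(token)
    return children


def format_tree(children, head=0, depth=0, indent="    "):
    """Return one line per token, indented by its depth in the tree."""
    lines = []
    for token_id, form, _, deprel in children.get(head, []):
        lines.append("%s(deprel:%s) form:%s [%d]" % (indent * depth, deprel, form, token_id))
        lines.extend(format_tree(children, token_id, depth + 1, indent))
    return lines


tree_lines = format_tree(build_children(TOKENS))
print("\n".join(tree_lines))
```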
To access the token corresponding to the current node in the tree, use the token attribute:
>>> root.token
OrderedDict([
('id', 5),
('form', 'jumps'),
('lemma', 'jump'),
...
])
To start walking down the children of the current node, use the children attribute:
>>> children = root.children
>>> children
[
TokenTree<token={id=4, form=fox, ...}, children=...>,
TokenTree<token={id=9, form=dog, ...}, children=...>,
TokenTree<token={id=10, form=., ...}, children=...>,
]
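Walking a tree through token and children usually ends up as a small traversal. The sketch below collects the word forms in a subtree and restores sentence order by token id. It uses a hypothetical Node stand-in with only the two attributes described above, so it runs without conllu installed; the same loop should apply to any object exposing token and children in that shape.

```python
from collections import namedtuple

# Hypothetical stand-in for a tree node, with just the two attributes used above.
Node = namedtuple("Node", ["token", "children"])


def leaf(token_id, form):
    """Build a node without children."""
    return Node({"id": token_id, "form": form}, [])


root = Node({"id": 5, "form": "jumps"}, [
    Node({"id": 4, "form": "fox"}, [leaf(1, "The"), leaf(2, "quick"), leaf(3, "brown")]),
    Node({"id": 9, "form": "dog"}, [leaf(6, "over"), leaf(7, "the"), leaf(8, "lazy")]),
    leaf(10, "."),
])


def subtree_forms(node):
    """Return the forms under `node`, sorted back into sentence order by id."""
    collected = []
    stack = [node]
    while stack:
        current = stack.pop()
        collected.append((current.token["id"], current.token["form"]))
        stack.extend(current.children)
    return [form for _, form in sorted(collected)]


print(subtree_forms(root.children[0]))
```

Called on the first child of the root, this gathers the subject phrase "The quick brown fox".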
Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.
>>> root.metadata
OrderedDict([
("text", "The quick brown fox jumps over the lazy dog."),
...
])
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> root.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
...
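Conceptually, serializing is the reverse of parsing: metadata becomes "# key = value" comment lines, each token becomes one tab-separated line, and empty fields are written as an underscore. The sketch below shows that round trip for plain dictionaries. It is an assumption-laden illustration, not the body of conllu's serialize(); serialize_field and serialize_sentence are hypothetical helpers.

```python
from collections import OrderedDict


def serialize_field(value):
    """Render one column: None becomes '_', a mapping becomes 'A=1|B=2'."""
    if value is None:
        return "_"
    if isinstance(value, dict):
        return "|".join("%s=%s" % (key, val) for key, val in value.items())
    return str(value)


def serialize_sentence(metadata, tokens):
    """Write comment lines, then one tab-separated line per token."""
    lines = ["# %s = %s" % (key, value) for key, value in metadata.items()]
    for token in tokens:
        lines.append("\t".join(serialize_field(value) for value in token.values()))
    # A blank line terminates a sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


metadata = OrderedDict([("text", "The")])
token = OrderedDict([
    ("id", 1), ("form", "The"), ("lemma", "the"),
    ("upostag", "DET"), ("xpostag", "DT"),
    ("feats", OrderedDict([("Definite", "Def"), ("PronType", "Art")])),
    ("head", 4), ("deprel", "det"), ("deps", None), ("misc", None),
])

output = serialize_sentence(metadata, [token])
print(output)
```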
You can read about the CoNLL-U format at the Universal Dependencies project.
Develop locally and run the tests
git clone git@github.com:EmilStenstrom/conllu.git
cd conllu
Now you can run the tests:
python runtests.py
To check that all code really has tests, I use a library called coverage. It runs through all code and checks for things that do NOT have tests. This project requires 100% test coverage, and you can easily check if you missed something using this command:
coverage run --source conllu runtests.py; coverage report -m
Finally, make sure you follow this project's coding standard by running flake8 on all code.
flake8 conllu tests
All three of these checks will be run on your finished pull request, and will tell you if something went wrong.
Thanks for helping conllu become a better library!