CoNLL-U Parser
CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.
Why should you use conllu?
- It's simple. ~150 lines of code (including whitespace).
- Works with both Python 2 and Python 3
- It has no dependencies
- Nice set of tests with CI setup
- It has 100% test coverage
Installation
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Notes on updating from 0.1 to 1.0
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.
Example usage
At the top level, conllu provides two functions, parse() and parse_tree(). The first one parses sentences and returns a flat list. The other returns a nested tree structure. Let's go through them one by one.
Use parse() to parse into a list of sentences
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _
"""
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, ...>]
Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_incr() instead of parse(). It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenLists out. Here's how you would use it:
from io import open
from conllu import parse_incr
# Sentences are read from the opened file one at a time as the loop advances.
data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)
For most files, parse() works fine.
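The idea behind incremental parsing can be illustrated without the library: CoNLL-U separates sentences with a blank line, so a generator can hand back one sentence at a time while reading the file line by line. The following is a minimal sketch of that idea using only the standard library, not conllu's actual implementation. The helper name iter_sentence_blocks is hypothetical, the sample lines are shortened to three columns, and an in-memory io.StringIO stands in for an opened file.

```python
import io


def iter_sentence_blocks(handle):
    """Yield each blank-line-separated sentence as a list of lines.

    Hypothetical helper for illustration only; it is not part of conllu.
    """
    block = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current sentence.
            yield block
            block = []
    if block:
        # The last sentence may not be followed by a blank line.
        yield block


# io.StringIO stands in for a file opened with open(..., encoding="utf-8").
sample = io.StringIO(
    "# text = Hi there.\n"
    "1\tHi\thi\n"
    "2\tthere\tthere\n"
    "\n"
    "# text = Bye.\n"
    "1\tBye\tbye\n"
)

blocks = iter_sentence_blocks(sample)
first = next(blocks)      # only the first sentence has been consumed so far
remaining = list(blocks)  # list() drains the rest of the generator
print(len(first), len(remaining))
```

Because nothing is read until the generator is advanced, memory use stays proportional to one sentence rather than to the whole file, which is the trade-off the paragraph above describes.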
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, ...>
The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:
>>> token = sentence[0]
>>> token
OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
...
])
>>> token["form"]
'The'
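To make the shape of a token concrete, here is a minimal sketch of how one tab-separated CoNLL-U line could be turned into an ordered dictionary. It uses only the standard library and is not conllu's implementation. FIELDS and line_to_token are hypothetical names; the keys id, form, lemma, upostag, and deprel appear in this README's output, while the remaining key names are assumptions based on the ten CoNLL-U columns. The handling of "_" and of the feats column is simplified.

```python
from collections import OrderedDict

# Assumed names for the ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def line_to_token(line):
    """Turn one tab-separated token line into an OrderedDict (illustrative only)."""
    token = OrderedDict(zip(FIELDS, line.split("\t")))
    # Simplification: real files can also contain ids such as "1-2" or "1.1",
    # which int() would reject.
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    # "_" marks an empty column; otherwise feats is a |-separated list of Key=Value.
    if token["feats"] == "_":
        token["feats"] = None
    else:
        token["feats"] = OrderedDict(
            pair.split("=", 1) for pair in token["feats"].split("|")
        )
    return token


token = line_to_token("1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_")
print(token["form"], token["head"], token["feats"]["PronType"])
```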
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
OrderedDict([
("text", "The quick brown fox jumps over the lazy dog."),
...
])
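As a rough illustration of where this metadata comes from, comment lines of the form "# key = value" that precede the tokens can be collected into an ordered dictionary. The sketch below uses only the standard library and shows the general idea under that assumption; it does not reproduce conllu's actual parsing rules. comments_to_metadata is a hypothetical helper, and the sent_id comment is made-up example data that is not part of the sentence above.

```python
from collections import OrderedDict


def comments_to_metadata(lines):
    """Collect '# key = value' comment lines into an OrderedDict (illustrative only)."""
    metadata = OrderedDict()
    for line in lines:
        if not line.startswith("#"):
            continue
        # Drop the leading '#' and split on the first '=' only,
        # so values may themselves contain '='.
        key, sep, value = line[1:].partition("=")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


lines = [
    "# sent_id = example-1",  # made-up comment for illustration
    "# text = The quick brown fox jumps over the lazy dog.",
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_",
]
metadata = comments_to_metadata(lines)
print(metadata["text"])
```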
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> sentence.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
...
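As a rough picture of what writing a token back involves, the sketch below joins a token's column values with tabs, writes "_" for empty columns, and joins nested features with "|". It uses only the standard library and is an assumption about the general approach, not the library's serialize() implementation. token_to_line is a hypothetical helper, and the xpostag, feats, head, deps, and misc key names are assumed.

```python
from collections import OrderedDict


def token_to_line(token):
    """Join a token's column values with tabs (illustrative only)."""
    cells = []
    for value in token.values():
        if value is None:
            # Empty columns are written as "_".
            cells.append("_")
        elif isinstance(value, dict):
            # Nested Key=Value features are joined with "|".
            cells.append("|".join("{}={}".format(k, v) for k, v in value.items()))
        else:
            cells.append(str(value))
    return "\t".join(cells)


token = OrderedDict([
    ("id", 4), ("form", "fox"), ("lemma", "fox"), ("upostag", "NOUN"),
    ("xpostag", "NN"), ("feats", OrderedDict([("Number", "Sing")])),
    ("head", 5), ("deprel", "nsubj"), ("deps", None), ("misc", None),
])
token["form"] = "Fox"  # change something before writing it back
line = token_to_line(token)
print(line)
```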
You can also convert a TokenList to a TokenTree by using to_tree():
>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>
That's it!
Use parse_tree() to parse into a list of dependency trees
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree() to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_tree_incr() instead of parse_tree(). It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenTrees out. Here's how you would use it:
from io import open
from conllu import parse_tree_incr
# Trees are built from the opened file one sentence at a time as the loop advances.
data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokentree in parse_tree_incr(data_file):
    print(tokentree)
Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.
>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps, ...}, children=...>
To quickly see the tree structure you can call print_tree() on a TokenTree.
>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]
To access the token corresponding to the current node in the tree, use token:
>>> root.token
OrderedDict([
('id', 5),
('form', 'jumps'),
('lemma', 'jump'),
...
])
To start walking down the children of the current node, use the children attribute:
>>> children = root.children
>>> children
[
TokenTree<token={id=4, form=fox, ...}, children=...>,
TokenTree<token={id=9, form=dog, ...}, children=...>,
TokenTree<token={id=10, form=., ...}, children=...>,
]
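The nesting comes from the head column: each token's head is the id of its parent, and 0 marks the root. The following sketch rebuilds that structure for the example sentence with plain dictionaries, using only the standard library. It is not how conllu builds a TokenTree. build_children and walk are hypothetical helpers, only the id, form, and head columns are kept, and the code assumes Python 3 because of "yield from".

```python
from collections import defaultdict

# (id, form, head) triples taken from the example sentence.
TOKENS = [
    (1, "The", 4), (2, "quick", 4), (3, "brown", 4), (4, "fox", 5),
    (5, "jumps", 0), (6, "over", 9), (7, "the", 9), (8, "lazy", 9),
    (9, "dog", 5), (10, ".", 5),
]


def build_children(tokens):
    """Map each head id to the ids of its direct dependents (illustrative only)."""
    children = defaultdict(list)
    for token_id, _form, head in tokens:
        children[head].append(token_id)
    return children


def walk(node_id, children, forms, depth=0):
    """Yield (depth, form) pairs in depth-first order, starting at node_id."""
    yield depth, forms[node_id]
    for child_id in children.get(node_id, []):
        yield from walk(child_id, children, forms, depth + 1)


children = build_children(TOKENS)
forms = {token_id: form for token_id, form, _head in TOKENS}
root_id = children[0][0]  # the token whose head is 0

for depth, form in walk(root_id, children, forms):
    print("    " * depth + form)
```

The direct dependents of the root computed here (ids 4, 9, and 10) match the three children listed above.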
Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.
>>> root.metadata
OrderedDict([
("text", "The quick brown fox jumps over the lazy dog."),
...
])
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> root.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
...
You can read about the CoNLL-U format at the Universal Dependencies project.
Develop locally and run the tests
- Make a fork of the repository to your own GitHub account.
- Clone the repository locally on your computer:
    git clone git@github.com:YOURUSERNAME/conllu.git conllu
    cd conllu
- Install the library used for running the tests:
    pip install tox
- Now you can run the tests:
    tox
  This runs tox across all supported versions of Python, and also runs checks for code coverage, syntax errors, and how imports are sorted.
- (Alternative) If you just have one version of Python installed, and don't want to go through the hassle of installing multiple versions of Python (hint: install pyenv and pyenv-tox), it's fine to run tox with just one version of Python:
    tox -e py36
- Make a pull request. Here's a good guide on PRs from GitHub.
Thanks for helping conllu become a better library!