CoNLL-U Parser
CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.
Why should you use conllu?
- It's simple. ~150 lines of code (including whitespace).
- Works with both Python 2 and Python 3
- It has no dependencies
- Nice set of tests with CI setup
- It has 100% test coverage
Installation
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Example usage
>>> from conllu import parse, parse_tree
>>> data = """
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
"""
>>> # GitHub replaces tab characters with spaces so for this code to be copy-pastable
>>> # I've added the following two lines. You don't need them in your code
>>> import re
>>> data = re.sub(r" +", r"\t", data)
>>> parse(data)
[[
OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
('upostag', 'DET'),
('xpostag', 'DT'),
('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
('head', 4),
('deprel', 'det'),
('deps', None),
('misc', None)
]),
OrderedDict([
('id', 2),
('form', 'quick'),
('lemma', 'quick'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
...
OrderedDict([
('id', 10),
('form', '.'),
('lemma', '.'),
('upostag', 'PUNCT'),
('xpostag', '.'),
('feats', None),
('head', 5),
('deprel', 'punct'),
('deps', None),
('misc', None)
])
]]
>>> parse_tree(data)
[[
TreeNode(
data=OrderedDict([
('id', 5),
('form', 'jumps'),
('lemma', 'jump'),
('upostag', 'VERB'),
('xpostag', 'VBZ'),
('feats', OrderedDict([
('Mood', 'Ind'),
('Number', 'Sing'),
('Person', '3'),
('Tense', 'Pres'),
('VerbForm', 'Fin')
])),
('head', 0),
('deprel', 'root'),
('deps', None),
('misc', None)]),
children=[
TreeNode(
data=OrderedDict([
('id', 4),
('form', 'fox'),
('lemma', 'fox'),
('upostag', 'NOUN'),
('xpostag', 'NN'),
('feats', OrderedDict([('Number', 'Sing')])),
('head', 5),
('deprel', 'nsubj'),
('deps', None),
('misc', None)
]),
children=[
TreeNode(
data=OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
('upostag', 'DET'),
('xpostag', 'DT'),
('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
('head', 4),
('deprel', 'det'),
('deps', None),
('misc', None)
]),
children=[]
),
TreeNode(
data=OrderedDict([
('id', 2),
('form', 'quick'),
('lemma', 'quick'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
children=[]
),
TreeNode(
data=OrderedDict([
('id', 3),
('form', 'brown'),
('lemma', 'brown'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
children=[]
)
]
),
...
TreeNode(
data=OrderedDict([
('id', 10),
('form', '.'),
('lemma', '.'),
('upostag', 'PUNCT'),
('xpostag', '.'),
('feats', None),
('head', 5),
('deprel', 'punct'),
('deps', None),
('misc', None)
]),
children=[]
)
]
)
]]
>>> from conllu import print_tree
>>> for tree in parse_tree(data): print_tree(tree)
...
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]
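Each sentence returned by parse() is a list of dict-like tokens keyed by the field names shown above, so plain Python is enough to work with it. The sketch below assumes that shape and uses a small hand-written stand-in (plain dicts, three tokens) instead of calling the library; `tokens`, `form_by_id` and `head_form` are illustrative names, not part of conllu.

```python
# Stand-in for one sentence, shaped like parse(data)[0]. Only the fields
# used below are included; real tokens carry all ten fields.
tokens = [
    {"id": 4, "form": "fox", "head": 5, "deprel": "nsubj"},
    {"id": 5, "form": "jumps", "head": 0, "deprel": "root"},
    {"id": 9, "form": "dog", "head": 5, "deprel": "nmod"},
]

# Map token id -> form so that a head id can be turned back into a word.
form_by_id = {token["id"]: token["form"] for token in tokens}


def head_form(token):
    """Return the form of the token's head, or 'ROOT' when head is 0."""
    return form_by_id.get(token["head"], "ROOT")


# (dependent, relation, head) triples for every token in the sentence.
pairs = [(token["form"], token["deprel"], head_form(token)) for token in tokens]

for form, deprel, head in pairs:
    print("{} --{}--> {}".format(form, deprel, head))
```

The id-to-form lookup is built once per sentence because `head` values refer to token ids within the same sentence.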
NOTE: TreeNode is a namedtuple so you can loop over it as a normal tuple.
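As a rough illustration of what that means in practice, the sketch below walks a tree by unpacking each node like a tuple. It does not import conllu: `TreeNode` here is a local stand-in with the same two fields (`data`, `children`) seen in the output above, and `collect_forms` is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-in mirroring the two fields of the TreeNode shown above.
TreeNode = namedtuple("TreeNode", ["data", "children"])

tree = TreeNode(
    data={"id": 5, "form": "jumps"},
    children=[
        TreeNode(
            data={"id": 4, "form": "fox"},
            children=[TreeNode(data={"id": 1, "form": "The"}, children=[])],
        ),
        TreeNode(data={"id": 10, "form": "."}, children=[]),
    ],
)


def collect_forms(node, depth=0):
    """Return (depth, form) pairs in depth-first order."""
    data, children = node  # tuple unpacking works on a namedtuple
    result = [(depth, data["form"])]
    for child in children:
        result.extend(collect_forms(child, depth + 1))
    return result


forms = collect_forms(tree)
for depth, form in forms:
    print("    " * depth + form)
```

Attribute access (`node.data`, `node.children`) works equally well; unpacking is used here only to show the tuple behaviour.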
You can read about the CoNLL-U format at the Universal Dependencies project.
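To make the format concrete, here is a minimal sketch of how a single token line maps onto the ten fields used in the examples above. It is not the library's implementation: `FIELDS` and `split_line` are hypothetical names, and the sketch assumes plain integer ids and heads (no ranges or empty nodes) and leaves `deps` and `misc` as raw strings.

```python
# The ten tab-separated columns, using the key names from the examples above.
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def split_line(line):
    """Split one tab-separated token line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    # An underscore marks an empty field; store it as None.
    token = {key: (None if value == "_" else value)
             for key, value in zip(FIELDS, values)}
    # Assumption: id and head are simple integers in this sketch.
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    # feats looks like "Name=Value|Name=Value"; turn it into a dict.
    if token["feats"] is not None:
        token["feats"] = dict(
            pair.split("=", 1) for pair in token["feats"].split("|")
        )
    return token


line = "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_"
token = split_line(line)
print(token["form"], token["head"], token["feats"])
```

For real data, parse() remains the better choice, since it handles whole documents and the cases this sketch ignores.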
Develop locally and run the tests
git clone git@github.com:EmilStenstrom/conllu.git
cd conllu
pip install tox
tox