# CoNLL-U Parser

**CoNLL-U Parser** parses a [CoNLL-U formatted](http://universaldependencies.org/format.html) string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.

## Why should you use conllu?

- It's simple: about 150 lines of code (including whitespace)
- It works with both Python 2 and Python 3
- It has no dependencies
- Nice set of tests with CI setup: ![Build status on Travis](https://api.travis-ci.org/EmilStenstrom/conllu.svg?branch=master)
- It has [lots of downloads](http://pepy.tech/project/conllu)

## Installation

```bash
pip install conllu
```

## Example usage

```python
>>> from conllu import parse, parse_tree
>>> data = """
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _

"""

>>> # GitHub replaces tab characters with spaces so for this code to be copy-pastable
>>> # I've added the following two lines. You don't need them in your code
>>> import re
>>> data = re.sub(r" +", r"\t", data)

>>> parse(data)
[[
OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
('upostag', 'DET'),
('xpostag', 'DT'),
('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
('head', 4),
('deprel', 'det'),
('deps', None),
('misc', None)
]),
OrderedDict([
('id', 2),
('form', 'quick'),
('lemma', 'quick'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
...
OrderedDict([
('id', 10),
('form', '.'),
('lemma', '.'),
('upostag', 'PUNCT'),
('xpostag', '.'),
('feats', None),
('head', 5),
('deprel', 'punct'),
('deps', None),
('misc', None)
])
]]

>>> parse_tree(data)
[[
TreeNode(
data=OrderedDict([
('id', 5),
('form', 'jumps'),
('lemma', 'jump'),
('upostag', 'VERB'),
('xpostag', 'VBZ'),
('feats', OrderedDict([
('Mood', 'Ind'),
('Number', 'Sing'),
('Person', '3'),
('Tense', 'Pres'),
('VerbForm', 'Fin')
])),
('head', 0),
('deprel', 'root'),
('deps', None),
('misc', None)]),
children=[
TreeNode(
data=OrderedDict([
('id', 4),
('form', 'fox'),
('lemma', 'fox'),
('upostag', 'NOUN'),
('xpostag', 'NN'),
('feats', OrderedDict([('Number', 'Sing')])),
('head', 5),
('deprel', 'nsubj'),
('deps', None),
('misc', None)
]),
children=[
TreeNode(
data=OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
('upostag', 'DET'),
('xpostag', 'DT'),
('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
('head', 4),
('deprel', 'det'),
('deps', None),
('misc', None)
]),
children=[]
),
TreeNode(
data=OrderedDict([
('id', 2),
('form', 'quick'),
('lemma', 'quick'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
children=[]
),
TreeNode(
data=OrderedDict([
('id', 3),
('form', 'brown'),
('lemma', 'brown'),
('upostag', 'ADJ'),
('xpostag', 'JJ'),
('feats', OrderedDict([('Degree', 'Pos')])),
('head', 4),
('deprel', 'amod'),
('deps', None),
('misc', None)
]),
children=[]
)
]
),
...
TreeNode(
data=OrderedDict([
('id', 10),
('form', '.'),
('lemma', '.'),
('upostag', 'PUNCT'),
('xpostag', '.'),
('feats', None),
('head', 5),
('deprel', 'punct'),
('deps', None),
('misc', None)
]),
children=[]
)
]
)
]]

>>> from conllu import print_tree
>>> for tree in parse_tree(data): print_tree(tree)
...
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]
```

NOTE: TreeNode is a namedtuple so you can loop over it as a normal tuple.
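
To illustrate the note above, here is a minimal sketch of walking a tree by tuple unpacking. It does not import conllu: `TreeNode` below is a local stand-in that assumes only the two fields (`data`, `children`) visible in the output above, and `iter_forms` is a hypothetical helper name, not part of the library.

```python
from collections import OrderedDict, namedtuple

# Local stand-in for conllu's TreeNode (assumption: fields are data, children)
TreeNode = namedtuple("TreeNode", ["data", "children"])


def iter_forms(node, depth=0):
    """Yield (depth, form) pairs in depth-first order."""
    # Tuple unpacking works because TreeNode is a namedtuple
    data, children = node
    yield depth, data["form"]
    for child in children:
        for item in iter_forms(child, depth + 1):
            yield item


# Hand-built fragment of the example sentence, trimmed to two keys per token
tree = TreeNode(
    data=OrderedDict([("id", 4), ("form", "fox")]),
    children=[
        TreeNode(data=OrderedDict([("id", 1), ("form", "The")]), children=[]),
        TreeNode(data=OrderedDict([("id", 2), ("form", "quick")]), children=[]),
    ],
)

forms = list(iter_forms(tree))
```

With a real tree from `parse_tree(data)`, the same unpacking should apply to each node, since only the `data` and `children` fields are used.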

You can read about the CoNLL-U format at the [Universal Dependencies project](http://universaldependencies.org/format.html).
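
To make the format concrete, the sketch below shows how one tab-separated token line could map onto the ten keys seen in the example output. This is a hand-rolled illustration, not conllu's actual implementation: `FIELDS`, `parse_feats` and `parse_token_line` are hypothetical names, and the sketch ignores multiword-token ids such as `1-2`, empty-node ids such as `1.1`, and any further parsing of `deps` and `misc`.

```python
from collections import OrderedDict

# The ten CoNLL-U columns, using the key names from the example output
FIELDS = ("id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc")


def parse_feats(value):
    """Turn 'A=x|B=y' into an OrderedDict; '_' means no features."""
    if value == "_":
        return None
    return OrderedDict(pair.split("=", 1) for pair in value.split("|"))


def parse_token_line(line):
    """Map one tab-separated token line to an OrderedDict (simplified)."""
    values = line.rstrip("\n").split("\t")
    token = OrderedDict(zip(FIELDS, values))
    # Assumption: plain integer ids and heads only
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    token["feats"] = parse_feats(token["feats"])
    # '_' marks an empty column; other values are kept as raw strings here
    for key in ("deps", "misc"):
        if token[key] == "_":
            token[key] = None
    return token


token = parse_token_line(
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_"
)
```

For real data, prefer `parse` from conllu; the sketch only shows where each column ends up.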

## Develop locally and run the tests

```bash
git clone git@github.com:EmilStenstrom/conllu.git
cd conllu
python setup.py test
```
