Skip to main content

CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary

Project description

CoNLL-U Parser

CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.

Why should you use conllu?

  • It's simple. ~150 lines of code (including whitespace).
  • Works with both Python 2 and Python 3
  • It has no dependencies
  • Nice set of tests with CI setup: Build status on Travis
  • It has 100% test coverage
  • It has lots of downloads

Installation

pip install conllu

Or, if you are using conda:

conda install -c conda-forge conllu

Example usage

>>> from conllu import parse, parse_tree
>>> data = """
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""

>>> # GitHub replaces tab characters with spaces so for this code to be copy-pastable
>>> # I've added the following two lines. You don't need them in your code
>>> import re
>>> data = re.sub(r" +", r"\t", data)

>>> parse(data)
[[
    OrderedDict([
        ('id', 1),
        ('form', 'The'),
        ('lemma', 'the'),
        ('upostag', 'DET'),
        ('xpostag', 'DT'),
        ('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
        ('head', 4),
        ('deprel', 'det'),
        ('deps', None),
        ('misc', None)
    ]),
    OrderedDict([
        ('id', 2),
        ('form', 'quick'),
        ('lemma', 'quick'),
        ('upostag', 'ADJ'),
        ('xpostag', 'JJ'),
        ('feats', OrderedDict([('Degree', 'Pos')])),
        ('head', 4),
        ('deprel', 'amod'),
        ('deps', None),
        ('misc', None)
    ]),
    ...
    OrderedDict([
        ('id', 10),
        ('form', '.'),
        ('lemma', '.'),
        ('upostag', 'PUNCT'),
        ('xpostag', '.'),
        ('feats', None),
        ('head', 5),
        ('deprel', 'punct'),
        ('deps', None),
        ('misc', None)
    ])
]]

>>> parse_tree(data)
[[
    TreeNode(
        data=OrderedDict([
            ('id', 5),
            ('form', 'jumps'),
            ('lemma', 'jump'),
            ('upostag', 'VERB'),
            ('xpostag', 'VBZ'),
            ('feats', OrderedDict([
                ('Mood', 'Ind'),
                ('Number', 'Sing'),
                ('Person', '3'),
                ('Tense', 'Pres'),
                ('VerbForm', 'Fin')
            ])),
            ('head', 0),
            ('deprel', 'root'),
            ('deps', None),
            ('misc', None)]),
        children=[
            TreeNode(
                data=OrderedDict([
                    ('id', 4),
                    ('form', 'fox'),
                    ('lemma', 'fox'),
                    ('upostag', 'NOUN'),
                    ('xpostag', 'NN'),
                    ('feats', OrderedDict([('Number', 'Sing')])),
                    ('head', 5),
                    ('deprel', 'nsubj'),
                    ('deps', None),
                    ('misc', None)
                ]),
                children=[
                    TreeNode(
                        data=OrderedDict([
                            ('id', 1),
                            ('form', 'The'),
                            ('lemma', 'the'),
                            ('upostag', 'DET'),
                            ('xpostag', 'DT'),
                            ('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
                            ('head', 4),
                            ('deprel', 'det'),
                            ('deps', None),
                            ('misc', None)
                        ]),
                        children=[]
                    ),
                    TreeNode(
                        data=OrderedDict([
                            ('id', 2),
                            ('form', 'quick'),
                            ('lemma', 'quick'),
                            ('upostag', 'ADJ'),
                            ('xpostag', 'JJ'),
                            ('feats', OrderedDict([('Degree', 'Pos')])),
                            ('head', 4),
                            ('deprel', 'amod'),
                            ('deps', None),
                            ('misc', None)
                        ]),
                        children=[]
                    ),
                    TreeNode(
                        data=OrderedDict([
                            ('id', 3),
                            ('form', 'brown'),
                            ('lemma', 'brown'),
                            ('upostag', 'ADJ'),
                            ('xpostag', 'JJ'),
                            ('feats', OrderedDict([('Degree', 'Pos')])),
                            ('head', 4),
                            ('deprel', 'amod'),
                            ('deps', None),
                            ('misc', None)
                        ]),
                        children=[]
                    )
                ]
            ),
            ...
            TreeNode(
                data=OrderedDict([
                    ('id', 10),
                    ('form', '.'),
                    ('lemma', '.'),
                    ('upostag', 'PUNCT'),
                    ('xpostag', '.'),
                    ('feats', None),
                    ('head', 5),
                    ('deprel', 'punct'),
                    ('deps', None),
                    ('misc', None)
                ]),
                children=[]
            )
        ]
    )
]]

>>> from conllu import print_tree
>>> for tree in parse_tree(data): print_tree(tree)
...
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]


>>> # Parse CoNLL-U formatted string with metadata in comments
>>> from conllu import parse_with_comments
>>> data = """
# title = text1
# text = The quick brown fox jumps over the lazy 
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

""" 

>>> parse_with_comments(syntaxnet_text.decode('utf-8'))
[OrderedDict([
    ('metadata', OrderedDict([
        ('sent_id', '1'),
        ('text', 'The quick brown fox jumps over the lazy dog.')
    ])),
    ('lemmas', [
        OrderedDict([
            ('id', 1),
            ('form', 'The'),
            ('lemma', 'the'),
            ('upostag', 'DET'),
            ('xpostag', 'DT'),
            ('feats', OrderedDict([('Definite', 'Def'), ('PronType', 'Art')])),
            ('head', 4),
            ('deprel', 'det'),
            ('deps', None),
            ('misc', None)
        ]),
        OrderedDict([
            ('id', 2),
            ('form', 'quick'),
            ('lemma', 'quick'),
            ('upostag', 'ADJ'),
            ('xpostag', 'JJ'),
            ('feats', OrderedDict([('Degree', 'Pos')])),
            ('head', 4),
            ('deprel', 'amod'),
            ('deps', None),
            ('misc', None)
        ]),
        ...
        OrderedDict([
            ('id', 10),
            ('form', '.'),
            ('lemma', '.'),
            ('upostag', 'PUNCT'),
            ('xpostag', '.'),
            ('feats', None),
            ('head', 5),
            ('deprel', 'punct'),
            ('deps', None),
            ('misc', None)
        ])
    ]),

])]

NOTE: TreeNode is a namedtuple so you can loop over it as a normal tuple.

You can read about the CoNLL-U format at the Universial Dependencies project.

Develop locally and run the tests

git clone git@github.com:EmilStenstrom/conllu.git
cd conllu
pip install tox
./tox

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

conllu-0.11.tar.gz (9.2 kB view hashes)

Uploaded Source

Built Distribution

conllu-0.11-py2.py3-none-any.whl (6.8 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page