pyconll

Read and manipulate CoNLL files

These details have been verified by PyPI

Project links

source

GitHub Statistics

Maintainers

matgrioni

Project description

pyconll

Easily work with CoNLL files using the familiar syntax of python.

Installation

As with most Python packages, simply use pip to install from PyPI.

pip install pyconll

pyconll is also usually available as a conda package on the pyconll channel. Packages in the version range [2.2.0, 4.0.0a0) are on conda. Version 4.0.0 will be added once conda adds full support for Python 3.14.

conda install -c pyconll pyconll

pyconll 4.0 and newer require Python 3.14 or newer. Earlier releases cover other LTS Python versions, but the large changes introduced in 4.0 benefited from the latest features in Python. As this release addresses nearly all pending issues (and significantly increases flexibility), I do not expect another breaking release anytime soon, and typical LTS support should be expected again moving forward.
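
A script that has to run on several interpreters could check this requirement before choosing which pyconll API style to use. The following is a minimal sketch; the helper name `supports_pyconll_4` is hypothetical and is not part of pyconll.

```python
import sys

# Minimum interpreter version for pyconll 4.0, as stated above.
MIN_PYTHON_FOR_PYCONLL_4 = (3, 14)


def supports_pyconll_4(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version can run pyconll 4.0+.

    Hypothetical helper for illustration; not part of pyconll.
    """
    return tuple(version_info[:2]) >= MIN_PYTHON_FOR_PYCONLL_4


if supports_pyconll_4():
    print("This interpreter can use pyconll 4.0 or newer.")
else:
    print("This interpreter needs a pyconll release older than 4.0.")
```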

Use

This tool is intended to be a minimal, low-level, expressive and pragmatic library in a widely used programming language. pyconll creates a thin API on top of raw CoNLL annotations that is simple and intuitive.
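
To make "raw CoNLL annotations" concrete: in CoNLL-U, each token is one line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` marking an unspecified value. The sketch below splits one such line with plain Python, without pyconll, to show the kind of bookkeeping the library takes care of. The sample line and the `split_conllu_line` helper are made up for illustration, and mapping every underscore to `None` is a simplification.

```python
from typing import Optional

# Column names of the CoNLL-U format, in order.
CONLLU_COLUMNS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def split_conllu_line(line: str) -> dict[str, Optional[str]]:
    """Split one raw CoNLL-U token line into a column-name -> value dict.

    Illustrative helper only; this is not pyconll's parser.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    # Simplification: every underscore is treated as "no value".
    return {
        name: (None if value == "_" else value)
        for name, value in zip(CONLLU_COLUMNS, values)
    }


sample = "2\tchien\tchien\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_"
token = split_conllu_line(sample)
print(token["form"], token["upos"], token["head"])
```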

It offers the following features:

Regular CI testing and validation against all UD versions to ensure compatibility and correctness.
A typed API for better development experience and well defined semantics.
A focus on usability and simplicity in design (zero runtime dependencies).
The standard conllu parser is fast (usually 100-200% faster than other Python competitors), and allows for flexibility in choosing the runtime and memory tradeoffs via fast_conllu, conllu (and a future compact_conllu).
A flexible schema system for working with custom tabular formats.

See the following code example to understand the basics of the API.

# This snippet finds sentences where a token marked with part of speech 'AUX' is
# governed by a NOUN. For example, in French this is a less common construction
# and we may want to validate these examples because we have previously found some
# problematic examples of this construction.
from pyconll.conllu import conllu

train = conllu.load_from_file('./ud/train.conllu')

review_sentences = []

# The loaded data is a list of Sentences, and sentences contain tokens.
# Sentences also de/serialize comment information.
for sentence in train:
    # Build a token index for lookups
    token_by_id = {t.id: t for t in sentence.tokens}

    for token in sentence.tokens:
        # Tokens have attributes such as upos, head, id, deprel, etc.
        # We must check that the token is not the root token.
        if token.upos == 'AUX' and token.head != '0':
            head_token = token_by_id.get(token.head)
            if head_token and head_token.upos == 'NOUN':
                review_sentences.append(sentence)
                break

print('Review the following sentences:')
for sent in review_sentences:
    print(sent.meta['sent_id'])

The full API can be found in the documentation, and the quick start guide has more examples.

Migrating to Version 4.0

Version 4.0 introduces significant architectural improvements that require some code changes. Here's how to migrate from earlier versions to 4.0:

Import Changes

Before:

import pyconll

corpus = pyconll.load_from_file('train.conllu')

After:

from pyconll.conllu import conllu

corpus = conllu.load_from_file('train.conllu')

Return Type Changes

The load_from_file and similar methods return list[Sentence[Token]] instead of a Conll object. The only purpose of Conll was to be a sentence container and to serialize the entire corpus, which on reflection did not seem worth the abstraction and is now better handled by the Format class.

Before:

corpus = pyconll.load_from_file('train.conllu')  # Returns Conll object
for sentence in corpus:  # Conll is MutableSequence
    pass

After:

corpus = conllu.load_from_file('train.conllu')  # Returns list[Sentence]
for sentence in corpus:  # Standard Python list
    pass
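
Because the result is an ordinary list, the usual list tools (slicing, comprehensions, `sorted`) apply directly. The sketch below uses a stand-in `FakeSentence` dataclass rather than pyconll's `Sentence`, so that it runs without a corpus file; only the `meta` and `tokens` attribute names are taken from this README.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSentence:
    """Stand-in for a loaded sentence; not pyconll's Sentence class."""
    meta: dict = field(default_factory=dict)
    tokens: list = field(default_factory=list)


# Pretend this list came from conllu.load_from_file(...).
corpus = [
    FakeSentence(meta={"sent_id": "s1"}, tokens=["Le", "chien", "dort"]),
    FakeSentence(meta={"sent_id": "s2"}, tokens=["Bonjour"]),
    FakeSentence(meta={"sent_id": "s3"}, tokens=["Il", "pleut"]),
]

# Slicing, filtering and sorting all work as on any other list.
first_two = corpus[:2]
short = [s for s in corpus if len(s.tokens) <= 2]
by_length = sorted(corpus, key=lambda s: len(s.tokens))

print([s.meta["sent_id"] for s in short])
print([s.meta["sent_id"] for s in by_length])
```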

Sentence Changes

Sentences no longer support indexing tokens by ID. The motivation here was avoiding an extra data structure that may not be needed for the use case. This mapping can be easily created via a one-line dict comprehension so there was not a huge complexity benefit either.

Before:

for sentence in corpus:
    for token in sentence:
        if token.head != '0':
            head_token = sentence[token.head]  # Direct ID lookup

After:

for sentence in corpus:
    token_by_id = {t.id: t for t in sentence.tokens}
    for token in sentence.tokens:
        if token.head != '0':
            head_token = token_by_id[token.head]
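
The same index also makes it easy to follow a chain of heads. Below is a minimal sketch that walks from a token up to the root. It uses a stand-in `FakeToken` class with string `id` and `head` values, mirroring the comparison with `'0'` above; the class and the `path_to_root` helper are hypothetical and not part of pyconll.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Stand-in token with the attributes used in this README."""
    id: str
    form: str
    head: str


def path_to_root(token, token_by_id):
    """Return the forms from `token` up to the root of the sentence."""
    path = [token.form]
    seen = {token.id}
    while token.head != "0":
        token = token_by_id[token.head]
        if token.id in seen:  # guard against cyclic annotations
            raise ValueError("cycle in head references")
        seen.add(token.id)
        path.append(token.form)
    return path


tokens = [
    FakeToken("1", "Le", "2"),
    FakeToken("2", "chien", "3"),
    FakeToken("3", "dort", "0"),
]
token_by_id = {t.id: t for t in tokens}

print(path_to_root(tokens[0], token_by_id))
```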

Metadata access also changed. The special id and text fields on a sentence no longer exist, since their semantics on modification were unclear. The general principles of singleton keys and whitespace trimming remain the same. The meta member on a sentence supports MutableMapping, so metadata changes can be written directly to this object.

Before:

sentence_id = sentence.id
sentence_text = sentence.text

After:

sentence_id = sentence.meta['sent_id']
sentence_text = sentence.meta['text']
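
As an illustration of the two principles mentioned above, the sketch below parses CoNLL-U comment lines with plain Python. It assumes that a singleton key is a comment without `=` (such as `# newpar`) and stores it with a value of `None`, and that whitespace around keys and values is trimmed. This is an assumption about the semantics, not pyconll's actual implementation, and `parse_comment` is a hypothetical helper.

```python
from typing import Optional


def parse_comment(line: str) -> tuple[str, Optional[str]]:
    """Parse one '# key = value' comment line into a (key, value) pair."""
    body = line.lstrip("#").strip()
    if "=" not in body:
        # Singleton key: no value is attached to it.
        return body, None
    key, value = body.split("=", 1)
    return key.strip(), value.strip()


meta = dict(
    parse_comment(line)
    for line in [
        "# sent_id = fr-ud-train_00001",
        "#text=Le chien dort.",
        "# newpar",
    ]
)

# A plain dict stands in here for the MutableMapping interface of `meta`.
meta["reviewed"] = "yes"
print(meta)
```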

Serialization Changes

Serialization no longer uses .conll() methods. Use WriteFormat methods instead:

Before:

# Serialize to string
conll_string = corpus.conll()

# Write to file
with open('output.conllu', 'w') as f:
    corpus.write(f)

After:

# Import the conllu format object
from pyconll.conllu import conllu

# Write to file
with open('output.conllu', 'w', encoding='utf-8') as f:
    conllu.write_corpus(corpus, f)
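
The earlier `corpus.conll()` call returned a string. One way to get a string in 4.0 might be to write into an in-memory buffer. This sketch assumes that `write_corpus` accepts any text stream, which this README does not state (it only shows a file opened with `open`). To keep the example runnable without pyconll, the writer is passed in as a parameter and a stand-in writer is used for the demonstration.

```python
import io
from typing import Callable, Iterable, TextIO


def corpus_to_string(write_corpus: Callable[[Iterable, TextIO], None],
                     corpus: Iterable) -> str:
    """Serialize `corpus` to a string using the given writer function.

    With pyconll this might be called as
    corpus_to_string(conllu.write_corpus, corpus), assuming write_corpus
    works with an io.StringIO buffer.
    """
    buffer = io.StringIO()
    write_corpus(corpus, buffer)
    return buffer.getvalue()


def fake_write_corpus(corpus, stream):
    """Stand-in writer: each item is already a string, followed by a blank line."""
    for sentence in corpus:
        stream.write(sentence + "\n\n")


text = corpus_to_string(fake_write_corpus, ["1\tBonjour\t_", "1\tSalut\t_"])
print(repr(text))
```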

Custom Formats (New in 4.0)

Version 4.0 introduces a flexible schema system that lets you define custom token formats beyond CoNLL-U. This makes it possible to work with CoNLL-X, CoNLL 2006, or any other column-based format by defining your own token schema and creating a Format instance. Since I do not currently have much experience with these other formats, I have not pre-added them to this library, but my expectation is that these definitions will be added over time. That way, just as there is from pyconll.conllu import conllu, there will also be from pyconll.conllx import conllx.

from typing import Optional
from pyconll.format import Format
from pyconll.schema import field, nullable, tokenspec, unique_array
from pyconll.shared import Sentence

@tokenspec
class MyToken:
    id: int
    form: Optional[str] = field(nullable(str, "_"))
    lemma: Optional[str] = field(nullable(str, "_"))
    pos: str
    head: int
    deprel: str
    feats: set[str] = field(unique_array(str, ",", "_"))

my_format = Format(MyToken, Sentence[MyToken], delimiter='\t')

token_line = "3\ttest\t_\tNOUN\t2\tAUX\tfeat1,feat2"
first_token: MyToken = my_format.parse_token(token_line)
assert ((first_token.id, first_token.form, first_token.lemma, first_token.feats) == (3, 'test', None, {"feat1", "feat2"}))

empty_feats_token_line = "4\tanother\t_\tNOUN\t2\tAUX\t_"
second_token: MyToken = my_format.parse_token(empty_feats_token_line)
assert ((second_token.id, second_token.form, second_token.lemma, second_token.feats) == (4, 'another', None, set()))

sentences: list[Sentence[MyToken]] = my_format.load_from_file('data.conll')

For more details, see the documentation or samples in the examples folder.
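
To clarify what the `nullable` and `unique_array` descriptors in the example above are expected to do, here is a plain-Python sketch of the same parsing rules, inferred from the assertions in that example: `_` in a nullable column becomes `None`, and a delimited column becomes a set, with `_` giving the empty set. The helpers below are hypothetical stand-ins; they do not use or reproduce pyconll's schema module.

```python
from dataclasses import dataclass
from typing import Optional


def parse_nullable(value: str, empty: str = "_") -> Optional[str]:
    """Map the empty marker to None, otherwise keep the string."""
    return None if value == empty else value


def parse_unique_array(value: str, delimiter: str = ",",
                       empty: str = "_") -> set[str]:
    """Split a delimited column into a set; the empty marker gives set()."""
    return set() if value == empty else set(value.split(delimiter))


@dataclass
class PlainToken:
    """Plain dataclass with the same columns as MyToken above."""
    id: int
    form: Optional[str]
    lemma: Optional[str]
    pos: str
    head: int
    deprel: str
    feats: set[str]


def parse_plain_token(line: str) -> PlainToken:
    """Parse one tab-delimited line into a PlainToken."""
    id_, form, lemma, pos, head, deprel, feats = line.split("\t")
    return PlainToken(
        id=int(id_),
        form=parse_nullable(form),
        lemma=parse_nullable(lemma),
        pos=pos,
        head=int(head),
        deprel=deprel,
        feats=parse_unique_array(feats),
    )


token = parse_plain_token("3\ttest\t_\tNOUN\t2\tAUX\tfeat1,feat2")
print(token.lemma, sorted(token.feats))
```

Note that the empty case compares against `set()`: in Python, `{}` is an empty dict and is never equal to an empty set.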

Contributing

Contributions to this project are welcome and encouraged! If you are unsure how to contribute, here is a guide from GitHub explaining the basic workflow. After cloning this repo, please run pip install -r requirements.txt to set up locally.

Release Checklist

The following enumerates the general release process explicitly. This section is for internal use, and most people do not have to worry about it. Note first that the dev branch is always a direct extension of master with the latest changes since the last release. That is, it is essentially a staging release branch.

Change the version in pyconll/_version appropriately.
Merge dev into master locally. GitHub does not offer a fast-forward merge and explicitly uses --no-ff, so to keep the history linear, merge locally with a fast-forward. This assumes that the dev branch looks good on CI tests, which do not run automatically in this situation.
Push the master branch. This should start some CI tests specifically for master. After validating these results, create a tag corresponding to the next version number and push the tag.
Create a new release from this tag on the Releases page. Creating this release starts two workflows: one releases to PyPI and the other releases to conda.
Validate that these workflows pass and that the package is properly released on both platforms.
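
The git side of this checklist could be scripted roughly as follows. This is a dry-run sketch: it only builds the command lists and does not run them. The remote name `origin` and the `release_commands` helper are assumptions, and the tag is assumed to be the bare version number (as with the 4.0.0 tag).

```python
def release_commands(version: str, remote: str = "origin") -> list[list[str]]:
    """Return the git commands for a release, in order, without running them."""
    return [
        ["git", "checkout", "master"],
        # --ff-only refuses to create a merge commit, keeping history linear.
        ["git", "merge", "--ff-only", "dev"],
        ["git", "push", remote, "master"],
        # Create and push the tag only after CI on master has been checked.
        ["git", "tag", version],
        ["git", "push", remote, version],
    ]


for command in release_commands("4.0.0"):
    print(" ".join(command))
```

Each inner list could be handed to `subprocess.run(command, check=True)` once the CI results between the steps have been reviewed by hand.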

Release history

This version

4.0.0

Dec 5, 2025

4.0.0b3 pre-release

Dec 3, 2025

4.0.0b2 pre-release

Nov 14, 2025

4.0.0b1 pre-release

Nov 13, 2025

3.3.1

Oct 11, 2025

3.3.0

Oct 11, 2025

3.2.0

Jun 21, 2023

3.1.0

Jun 3, 2021

3.1.0.dev3 pre-release

Oct 25, 2021

3.1.0.dev2 pre-release

Oct 24, 2021

3.1.0.dev1 pre-release

Oct 24, 2021

3.0.5

May 30, 2021

3.0.4

Feb 25, 2021

3.0.3

Feb 24, 2021

3.0.2

Feb 24, 2021

3.0.0

Feb 24, 2021

2.3.3

Oct 26, 2020

2.3.1

Oct 6, 2020

2.2.1

Nov 18, 2019

2.2.0

Oct 2, 2019

2.1.1

Sep 5, 2019

2.1.0

Aug 31, 2019

2.0.0

May 10, 2019

1.1.4

Apr 16, 2019

1.1.3

Jan 4, 2019

1.1.2

Dec 28, 2018

1.1.1

Dec 11, 2018

1.1.0

Nov 12, 2018

1.0.1

Sep 15, 2018

1.0

Sep 15, 2018

0.3.1

Aug 9, 2018

0.3

Jul 28, 2018

0.2.3

Jul 24, 2018

0.2.2

Jul 24, 2018

0.2.1

Jul 18, 2018

0.2

Jul 17, 2018

0.1.1

Jul 15, 2018

0.1

Jul 5, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyconll-4.0.0.tar.gz (28.9 kB view details)

Uploaded Dec 5, 2025 Source

Built Distribution

pyconll-4.0.0-py3-none-any.whl (28.1 kB view details)

Uploaded Dec 5, 2025 Python 3

File details

Details for the file pyconll-4.0.0.tar.gz.

File metadata

Download URL: pyconll-4.0.0.tar.gz
Upload date: Dec 5, 2025
Size: 28.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyconll-4.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8d9fd27d3d7f3d92bc51ccb289403f6773ec8438a891cb20dc9f6bc050c0efd8`
MD5	`8d5c986cdee8f8de26be71040ba93c2d`
BLAKE2b-256	`8f044ba09959a7524ecbd7a5c87e1cc1de47c5ab20afaf216977432802ba6229`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyconll-4.0.0.tar.gz:

Publisher: pypi-release.yaml on pyconll/pyconll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyconll-4.0.0.tar.gz
- Subject digest: 8d9fd27d3d7f3d92bc51ccb289403f6773ec8438a891cb20dc9f6bc050c0efd8
- Sigstore transparency entry: 743647824
- Sigstore integration time: Dec 5, 2025
Source repository:
- Permalink: pyconll/pyconll@e44468922fac390e7593e2d54cecfd8b3bf9e509
- Branch / Tag: refs/tags/4.0.0
- Owner: https://github.com/pyconll
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-release.yaml@e44468922fac390e7593e2d54cecfd8b3bf9e509
- Trigger Event: release

File details

Details for the file pyconll-4.0.0-py3-none-any.whl.

File metadata

Download URL: pyconll-4.0.0-py3-none-any.whl
Upload date: Dec 5, 2025
Size: 28.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyconll-4.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3f2c08ff80c50b275ad47fd021cdcb99309c830a251fe1123d94a454722be07e`
MD5	`e0a3009866daa5a14c9f965b3ab97c3b`
BLAKE2b-256	`7de70339ba22bd524c3fdc6bebbb35da4282275a4c09c51ee5032e6aa30ad841`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyconll-4.0.0-py3-none-any.whl:

Publisher: pypi-release.yaml on pyconll/pyconll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyconll-4.0.0-py3-none-any.whl
- Subject digest: 3f2c08ff80c50b275ad47fd021cdcb99309c830a251fe1123d94a454722be07e
- Sigstore transparency entry: 743647825
- Sigstore integration time: Dec 5, 2025
Source repository:
- Permalink: pyconll/pyconll@e44468922fac390e7593e2d54cecfd8b3bf9e509
- Branch / Tag: refs/tags/4.0.0
- Owner: https://github.com/pyconll
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-release.yaml@e44468922fac390e7593e2d54cecfd8b3bf9e509
- Trigger Event: release
