Helper for converting CoNLL-U files and uploading the corpus to the LiRI Corpus Platform (LCP)

Project description

LCP CLI module

Command-line tool for converting CoNLL-U files and uploading the corpus to LCP

Installation

Make sure you have Python 3.11+ with pip installed in your local environment, then run:

pip install lcpcli==0.2.8
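
For example, to install inside a fresh virtual environment (the environment name below is just a placeholder):

python3 -m venv lcp-env
source lcp-env/bin/activate
pip install lcpcli==0.2.8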

Usage

Example:

Corpus conversion:

lcpcli -i ~/conll_ext/ -o ~/upload/

Data upload:

lcpcli -c ~/upload/ -k $API_KEY -s $API_SECRET -p my_project --live

Including --live points the upload to the live instance of LCP. Leave it out if you want to add a corpus to an instance of LCP running on localhost.

Help:

lcpcli --help

lcpcli takes a corpus of CoNLL-U (Plus) files and imports it into a project created in an LCP instance, such as catchphrase.

Besides the standard token-level CoNLL-U fields (form, lemma, upos, xpos, feats, head, deprel, deps) one can also provide document- and sentence-level annotations using comment lines in the files (see the CoNLL-U Format section).

Example corpus

lcpcli ships with an example one-video "corpus": the video is an excerpt from the CC-BY 3.0 "Big Buck Bunny" video ((c) copyright 2008, Blender Foundation / www.bigbuckbunny.org) and the "transcription" is a sample of the Universal Declaration of Human Rights.

To populate a folder with the example data, use this command:

lcpcli --example /destination/folder/

This will create a subfolder named free_video_corpus in /destination/folder which, itself, contains two subfolders: input and output. The input subfolder contains four files:

  • doc.conllu is a CoNLL-U Plus file that contains the textual data, with time alignments in seconds at the token- (start and end in the MISC column), segment- (# start = and # end = comments) and document-level (#newdoc start = and #newdoc end =)
  • namedentity.csv is a comma-separated value lookup file that contains information about the named entities, where each row associates an ID reported in the namedentity token cells of doc.conllu with two attributes, type and form
  • shot.csv is a comma-separated value file that defines time-aligned annotations about the shots in the video in the view column, where the start and end columns are timestamps, in seconds, relative to the document referenced in the doc_id column
  • meta.json is a JSON file that defines the structure of the corpus, used both for pre-processing the data before upload, and when adding the data to the LCP database. Read on for information on the definitions in this file
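
For illustration, the two lookup files could contain rows along the following lines; the headers and values are invented here, so refer to the shipped example files for the exact format lcpcli expects:

namedentity.csv

namedentity_id,form,type
ne1,United Nations,ORG
ne2,Big Buck Bunny,PER

shot.csv

doc_id,start,end,view
1,0.0,12.5,wide
1,12.5,18.2,close-up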

CoNLL-U Format

The CoNLL-U format is documented at: https://universaldependencies.org/format.html

The LCP CLI converter will treat all the comments that start with # newdoc KEY = VALUE as document-level attributes. This means that if a CoNLL-U file contains the line # newdoc author = Jane Doe, then in LCP all the sentences from this file will be associated with a document whose meta attribute will contain author: 'Jane Doe'.

All other comment lines following the format # key = value will add an entry to the meta attribute of the segment corresponding to the sentence below that line (i.e. not at the document level).

The key-value pairs in the MISC column of a token line will go into the meta attribute of the corresponding token, with the exception of these key-value combinations (illustrated in the excerpt after this list):

  • SpaceAfter=Yes vs. SpaceAfter=No (case-sensitive) controls whether the token will be represented with a trailing space character in the database
  • start=n.m|end=o.p (case-sensitive) will align tokens, segments (sentences) and documents along a temporal axis, where n.m and o.p should be floating-point values in seconds
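
As a minimal, hypothetical excerpt combining these conventions (all annotations and timestamps are invented for illustration):

# newdoc id = doc1
# newdoc author = Jane Doe
# newdoc start = 0
# newdoc end = 60
# sent_id = 1
# speaker = narrator
# start = 0.5
# end = 2.5
1   All     all     DET   DT  _           3  det    _  start=0.50|end=0.71
2   human   human   ADJ   JJ  _           3  amod   _  start=0.71|end=1.02
3   beings  being   NOUN  NNS Number=Plur 5  nsubj  _  start=1.02|end=1.48
4   are     be      AUX   VBP _           5  cop    _  start=1.48|end=1.70
5   free    free    ADJ   JJ  _           0  root   _  start=1.70|end=2.30|SpaceAfter=No
6   .       .       PUNCT .   _           5  punct  _  start=2.30|end=2.50

Here the # newdoc comments end up in the document's meta attribute, the # speaker comment in the segment's meta attribute, and the start/end pairs in the MISC column time-align the individual tokens.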

See below how to report all the attributes in the template .json file.

CoNLL-U Plus

CoNLL-U Plus is an extension to the CoNLL-U format documented at: https://universaldependencies.org/ext-format.html

If your files start with a comment line of the form # global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC, lcpcli will treat them as CoNLL-U Plus files and process the columns according to the names you set in that line.
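
For instance, the example corpus adds a column holding named-entity IDs; a header along these lines (the name of the extra column is illustrative here, check the shipped doc.conllu for the exact header it uses) tells lcpcli how to interpret the additional cells:

# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC NAMEDENTITY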

Media files

If your corpus includes media files, your .json template should report them under a mediaSlots key in meta, e.g.:

"meta": {
    "name": "Free Single-Video Corpus",
    "author": "LiRI",
    "date": "2024-06-13",
    "version": 1,
    "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
    "mediaSlots": {
        "video": {
            "mediaType": "video",
            "isOptional": false
        }
    }
},

Your CoNLL-U file(s) should accordingly report each document's media file's name in a comment, like so:

# newdoc video = bunny.mp4

The .json template should also define a main key named tracks to control what annotations will be represented along the time axis. For example, the following will tell the interface to display separate timeline tracks for the shot, named entity and segment annotations, with the latter being subdivided into as many tracks as there are distinct values for the attribute speaker of the segments:

"tracks": {
    "layers": {
        "Shot": {},
        "NamedEntity": {},
        "Segment": {
            "split": [
                "speaker"
            ]
        }
    }
}

Finally, your output corpus folder should include a subfolder named media in which all the referenced media files have been placed.
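
For the example corpus, the output folder would thus look roughly like this (the converted files are generated by lcpcli and only sketched here):

output/
├── ... (converted .csv files and corpus metadata)
└── media/
    └── bunny.mp4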

Attribute types

The values of each attribute (on tokens, segments, documents or at any other level) have a type; the most common types are text, number or categorical. The attributes must be reported in the template .json file, along with their type (you can see an example in the section Convert and Upload).

  • text vs. categorical: while both types correspond to alpha-numerical values, categorical is meant for attributes that have a limited number of possible values (typically fewer than 100 distinct values) of a limited length (as a rule of thumb, each value can have up to 50 characters). There are no such limits on values of attributes of type text. When a user starts typing a constraint on an attribute of type categorical, the DQD editor will offer autocompletion suggestions. Attributes of type text have their values listed in a dedicated table (lcpcli's conversion step produces corresponding .csv files), so a query that expresses a constraint on an attribute will be slower if that attribute is of type text rather than categorical

  • the type labels (with an s at the end) corresponds to a set of labels that users will be able to constrain in DQD using the contain keyword: for example, if an attribute named genre is of type labels, the user could write a constraint like genre contain 'drama' or genre !contain 'comedy'. The values of attributes of type labels should be one-line strings, with each value separated by a comma (,) character (as in, e.g., # newdoc genre = drama, romance, coming of age, fiction); as a consequence, no label can itself contain a comma.

  • the type dict corresponds to key-value pairs as represented in JSON

  • the type date requires values to be formatted in a way that can be parsed by PostgreSQL
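
As a sketch, attributes of these types could be declared in the template .json as follows (the attribute names genre and publication_date are hypothetical):

"genre": {
    "isGlobal": false,
    "type": "labels",
    "nullable": true
},
"publication_date": {
    "isGlobal": false,
    "type": "date",
    "nullable": true
}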

Convert and Upload

  1. Create a directory in which you have all your properly formatted CoNLL-U files.

  2. In the same directory, create a template .json file that describes your corpus structure (see above about the attributes key on Document and Segment), for example:

{
    "meta": {
        "name": "Free Single-Video Corpus",
        "author": "LiRI",
        "date": "2024-06-13",
        "version": 1,
        "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
        "mediaSlots": {
            "video": {
                "mediaType": "video",
                "isOptional": false
            }
        }
    },
    "firstClass": {
        "document": "Document",
        "segment": "Segment",
        "token": "Token"
    },
    "layer": {
        "Token": {
            "abstract": false,
            "layerType": "unit",
            "anchoring": {
                "location": false,
                "stream": true,
                "time": true
            },
            "attributes": {
                "form": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": true
                },
                "lemma": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": false
                },
                "upos": {
                    "isGlobal": true,
                    "type": "categorical",
                    "nullable": true
                },
                "xpos": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": true
                },
                "ufeat": {
                    "isGlobal": false,
                    "type": "dict",
                    "nullable": true
                }
            }
        },
        "DepRel": {
            "abstract": true,
            "layerType": "relation",
            "attributes": {
                "udep": {
                    "type": "categorical",
                    "isGlobal": true,
                    "nullable": false
                },
                "source": {
                    "name": "dependent",
                    "entity": "Token",
                    "nullable": false
                },
                "target": {
                    "name": "head",
                    "entity": "Token",
                    "nullable": true
                },
                "left_anchor": {
                    "type": "number",
                    "nullable": false
                },
                "right_anchor": {
                    "type": "number",
                    "nullable": false
                }
            }
        },
        "NamedEntity": {
            "abstract": false,
            "layerType": "span",
            "contains": "Token",
            "anchoring": {
                "location": false,
                "stream": true,
                "time": false
            },
            "attributes": {
                "form": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": false
                },
                "type": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": true
                }
            }
        },
        "Shot": {
            "abstract": false,
            "layerType": "span",
            "anchoring": {
                "location": false,
                "stream": false,
                "time": true
            },
            "attributes": {
                "view": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": false
                }
            }
        },
        "Segment": {
            "abstract": false,
            "layerType": "span",
            "contains": "Token",
            "attributes": {
                "meta": {
                    "text": {
                        "type": "text"
                    },
                    "start": {
                        "type": "text"
                    },
                    "end": {
                        "type": "text"
                    }
                }
            }
        },
        "Document": {
            "abstract": false,
            "contains": "Segment",
            "layerType": "span",
            "attributes": {
                "meta": {
                    "audio": {
                        "type": "text",
                        "isOptional": true
                    },
                    "video": {
                        "type": "text",
                        "isOptional": true
                    },
                    "start": {
                        "type": "number"
                    },
                    "end": {
                        "type": "number"
                    },
                    "name": {
                        "type": "text"
                    }
                }
            }
        }
    },
    "tracks": {
        "layers": {
            "Shot": {},
            "Segment": {},
            "NamedEntity": {}
        }
    }
}
  3. If your corpus defines a character-anchored entity type such as named entities, make sure you also include a properly named and formatted CSV file for it in the directory.

  4. Visit an LCP instance (e.g. catchphrase) and create a new project if you don't already have one where your corpus should go.

  5. Retrieve the API key and secret for your project by clicking on the button that says: "Create API Key".

  6. Once you have your API key and secret, you can start converting and uploading your corpus by running the following command:

lcpcli -i $CONLLU_FOLDER -o $OUTPUT_FOLDER -k $API_KEY -s $API_SECRET -p $PROJECT_NAME --live
  • $CONLLU_FOLDER should point to the folder that contains your CoNLL-U files
  • $OUTPUT_FOLDER should point to another folder that will be used to store the converted files to be uploaded
  • $API_KEY is the key you copied from your project on LCP (still visible when you visit the page)
  • $API_SECRET is the secret you copied from your project on LCP (only visible upon API Key creation)
  • $PROJECT_NAME is the name of the project exactly as displayed on LCP -- it is case-sensitive, and space characters should be escaped
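
For instance, with placeholder values (the paths, key, secret and project name below are all made up):

lcpcli -i ~/my_corpus/conllu -o ~/my_corpus/output -k 12ab34cd -s 56ef78gh -p My\ Project --live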

