Helper for converting CONLLU files and uploading the corpus to LiRI Corpus Platform (LCP)

LCP CLI module

Command-line tool for converting CoNLL-U files and uploading the resulting corpus to LCP

Installation

Make sure you have Python 3.11+ with pip installed in your local environment, then run:

pip install lcpcli

Usage

Examples:

Conversion of a CoNLL-U (Plus) corpus:

lcpcli -i ~/conll_ext/ -o ~/upload/

Data upload:

lcpcli -c ~/upload/ -k $API_KEY -s $API_SECRET -p "my project" --live

Including --live points the upload to the live instance of LCP. Leave it out if you want to add a corpus to an instance of LCP running on localhost.

Help:

lcpcli --help

lcpcli can take a corpus of CoNLL-U (Plus) files and import it into a collection created on LCP.

Besides the standard token-level CoNLL-U fields (form, lemma, upos, xpos, feats, head, deprel, deps) one can also provide document-, paragraph- and sentence-level annotations using comment lines in the files (see the CoNLL-U Format section).

CoNLL-U Format

The CoNLL-U format is documented at: https://universaldependencies.org/format.html

The LCP CLI converter treats all comments that start with # newdoc KEY = VALUE as document-level attributes, and all comments that start with # newpar KEY = VALUE as paragraph-level attributes. All other comment lines of the form # key = value are treated as sentence-level attributes.
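The rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described here, not the converter's actual code; the function name classify_comment and the regular expression are hypothetical.

```python
import re

# Illustrative only: mirrors the comment convention described above,
# not lcpcli's real implementation.
_COMMENT = re.compile(r"^#\s*(?:(newdoc|newpar)\s+)?([^=\s][^=]*?)\s*=\s*(.*)$")


def classify_comment(line):
    """Return (level, key, value) for an annotation comment, or None."""
    match = _COMMENT.match(line.strip())
    if match is None:
        return None  # not a "# key = value" style comment
    prefix, key, value = match.groups()
    level = {"newdoc": "document", "newpar": "paragraph", None: "sentence"}[prefix]
    return level, key.strip(), value.strip()


sample = [
    "# newdoc author = Jane Doe",
    "# newpar id = p1",
    "# sent_id = 1",
    "# text = Hello world.",
]
for comment in sample:
    print(classify_comment(comment))
```

The first two sample lines come out as document- and paragraph-level attributes, and the last two as sentence-level attributes.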

The key-value pairs in the FEATS and MISC columns of a token line will be mapped to corresponding attributes in the LCP corpus. Additionally, if the MISC cell includes SpaceAfter=Yes or SpaceAfter=No (case-sensitive), the token will be stored in the database with or without a trailing space character, respectively.
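As a rough illustration of how such cells break down, the sketch below parses a FEATS or MISC cell into a dictionary and applies the SpaceAfter rule. The helper names parse_pairs and surface_form are hypothetical, and the fallback to a trailing space when SpaceAfter is absent is an assumption based on the usual CoNLL-U convention, not something confirmed for lcpcli.

```python
def parse_pairs(cell):
    """Parse a FEATS or MISC cell such as 'Case=Nom|Number=Sing' into a dict."""
    if cell in ("", "_"):  # "_" marks an empty cell in CoNLL-U
        return {}
    pairs = {}
    for item in cell.split("|"):
        key, sep, value = item.partition("=")
        if sep:  # skip items that are not key=value pairs
            pairs[key] = value
    return pairs


def surface_form(form, misc_cell):
    """Return the token form with or without a trailing space."""
    misc = parse_pairs(misc_cell)
    # Assumption: a token is followed by a space unless MISC says SpaceAfter=No.
    # The comparison is case-sensitive, as noted above.
    return form if misc.get("SpaceAfter") == "No" else form + " "


print(parse_pairs("Case=Nom|Number=Sing"))
print(repr(surface_form("Hello", "SpaceAfter=No")))
print(repr(surface_form("Hello", "_")))
```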

CoNLL-U Plus

CoNLL-U Plus is an extension to the CoNLL-U format documented at: https://universaldependencies.org/ext-format.html

If your files start with a comment line of the form # global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC, lcpcli will treat them as CoNLL-U Plus files and process the columns according to the names you set in that line.
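To make the column mapping concrete, here is a minimal sketch of reading the header line and pairing its names with the tab-separated cells of a token line. It only illustrates the CoNLL-U Plus convention; read_global_columns is a hypothetical helper, and the sample header with a custom NER column is invented for the example.

```python
def read_global_columns(first_line):
    """Return the column names declared in a '# global.columns = ...' line, or None."""
    prefix = "# global.columns ="
    line = first_line.strip()
    if not line.startswith(prefix):
        return None  # plain CoNLL-U file: the ten standard columns apply
    return line[len(prefix):].split()


# Hypothetical header with a reduced, custom column set
header = "# global.columns = ID FORM LEMMA UPOS NER"
token_line = "1\tZurich\tZurich\tPROPN\tB-LOC"

columns = read_global_columns(header)
token = dict(zip(columns, token_line.split("\t")))
print(token)
```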

CoNLL-U conversion and upload

  1. Create a directory that contains all your properly formatted CoNLL-U files.

  2. Visit an LCP instance (e.g. catchphrase) and create a new collection if you don't already have one where your corpus should go.

  3. Retrieve the API key and secret for your project by clicking on the button that says: "Create API Key".

  4. Once you have your API key and secret, you can start converting and uploading your corpus by running the following command:

lcpcli -i $CONLLU_FOLDER -o $OUTPUT_FOLDER -k $API_KEY -s $API_SECRET -p $PROJECT_NAME --live
  • $CONLLU_FOLDER should point to the folder that contains your CoNLL-U files
  • $OUTPUT_FOLDER should point to another folder that will be used to store the converted files to be uploaded
  • $API_KEY is the key you copied from your project on LCP (still visible when you visit the page)
  • $API_SECRET is the secret you copied from your project on LCP (only visible upon API Key creation)
  • $PROJECT_NAME is the name of the project exactly as displayed on LCP -- it is case-sensitive, and space characters should be escaped
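If you drive lcpcli from a script, one way to sidestep shell escaping of the project name is to assemble the arguments as a list. The sketch below only builds and prints the command using the flags shown above; build_lcpcli_command is a hypothetical helper, and the paths and credentials are placeholders.

```python
import shlex


def build_lcpcli_command(input_dir, output_dir, api_key, api_secret, project, live=True):
    """Assemble the lcpcli argument list for a combined conversion and upload."""
    cmd = [
        "lcpcli",
        "-i", input_dir,
        "-o", output_dir,
        "-k", api_key,
        "-s", api_secret,
        "-p", project,  # passed as one argument, so spaces need no escaping
    ]
    if live:
        cmd.append("--live")  # omit to target an LCP instance on localhost
    return cmd


# Placeholder values for illustration only
cmd = build_lcpcli_command("conll_ext", "upload", "KEY", "SECRET", "my project")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Printing with shlex.join shows the shell-quoted equivalent, which is handy for checking the command before running it.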

Other input formats, rich data

Previous versions of lcpcli defined procedures to include rich annotations in CoNLL-U files, including time-anchored media files, in combination with annex non-CoNLL-U files. These methods are no longer supported -- use an older version of lcpcli if you require those features.

lcpcli now ships with a Python module called lcpcli.builder that you can use to convert any input format. The default CoNLL-U converter included in lcpcli uses lcpcli.builder under the hood.

You can find a short tutorial on how to use the module in BUILDER.md. Further information can be found in the LCP documentation.
