
A CLI for retrieving a corpus annotated with named entities from INCEpTION to an archived, reusable and versionable corpus.


inception2corpus-CLI

A CLI for retrieving a corpus annotated with named entities from an INCEpTION instance and turning it into an archived, reusable corpus for any NER project.

This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.


The CLI launches a linear process, called a "pipeline", which executes the components in the following order:

  • Fetch curated documents (XMI format) from an INCEpTION instance (check each document's state in the INCEpTION "Monitoring" window);

  • Preprocess curated documents (retokenize, remove unprintable characters, etc.);
  • Convert XMI to CONLL files (inception2corpus uses the xmi2conll CLI as a module);
  • Merge the CONLL files into one;
  • Produce a report containing statistics and metadata about the corpus;
  • Reduce the dataset (keep only annotated sentences and reject the others) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test) according to a ratio defined by the user.
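The last step of the pipeline (reduce, then split by ratio) can be illustrated with a minimal Python sketch. This is not the tool's own implementation. It assumes a whitespace-separated CONLL layout with the token in the first column and the tag in the last, blank lines between sentences, and "O" as the tag for tokens outside any entity. The function names `parse_conll`, `reduce_to_annotated` and `split_dataset` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Split CONLL text into sentences (assumed separated by blank lines)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) < 2:
            # Skip lines that carry no tag column (assumption).
            continue
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def reduce_to_annotated(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only sentences that contain at least one non-"O" tag."""
    return [s for s in sentences if any(tag != "O" for _, tag in s)]


def split_dataset(
    sentences: List[Sentence], ratios: Sequence[float], seed: int = 42
) -> List[List[Sentence]]:
    """Shuffle and split into len(ratios) sets, e.g. (0.8, 0.2) or (0.8, 0.1, 0.1)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    total = len(shuffled)
    splits: List[List[Sentence]] = []
    start = 0
    for ratio in ratios[:-1]:
        end = start + int(round(total * ratio))
        splits.append(shuffled[start:end])
        start = end
    # The last set receives whatever remains, so no sentence is lost.
    splits.append(shuffled[start:])
    return splits
```

With these helpers, `split_dataset(reduce_to_annotated(parse_conll(text)), (0.8, 0.1, 0.1))` would return train, dev and test lists. The fixed seed only makes the sketch reproducible; the CLI's own shuffling and ratio handling may differ.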

When the program finishes, an output_annotated_corpus folder/ directory is created at the root of the working directory; for more details, see the "Full output folder description" section.

🛠️ Installation (easy way)

  1. You need Python 3.7 or higher installed (install it first if needed).

  2. First, create a new directory and set up a virtual environment with virtualenv and the correct Python version, following the steps for your OS:

    macOS / Linux

    virtualenv --python=/usr/bin/python3.7 venv
    

    Then activate the new environment with:

    source venv/bin/activate
    

    Windows

    py -m venv venv
    

    Then activate the new environment with:

    .\venv\Scripts\activate
    
  3. Finally, install inception2corpus CLI via pip with:

    pip install inception2corpus
    

🛠️ Installation (for developers only)

# 1. Clone the git repository
git clone https://github.com/NER4Archives-project/inception2corpus-CLI.git
# 2. Go to the repository and create a new virtual env (follow the steps in the easy-way installation)
cd inception2corpus-CLI
# 3. Install the packages
# (on macOS/Linux):
pip install -r requirements.txt
# (on Windows):
pip install -r .\requirements.txt

▶️ Usage

  1. The inception2corpus CLI takes a YAML file as its argument; this file specifies the INCEpTION host information, corpus metadata, CONLL format, serialization options, etc. You can start from the USER_VAR_ENV.yml template and update it.

  2. Once the YAML configuration file is complete, run:

    inception2corpus ./USER_VAR_ENV.yml
    
  3. At the end of the process, a new output directory (./output_annotated_corpus folder/) is created at the root of the working directory; it contains your final corpus, ready for training. A temp_files/ folder is also created at the root; you can keep it or delete it.
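If the pipeline needs to be launched from a script (for example, to re-export the corpus on a schedule), the command above can be wrapped with Python's standard `subprocess` module. The sketch below is only an illustration: `build_command` and `run_pipeline` are hypothetical helper names, and the only thing taken from this README is the command form `inception2corpus <config.yml>`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(config_path: str) -> List[str]:
    """Build the argument list for the documented command line."""
    return ["inception2corpus", config_path]


def run_pipeline(config_path: str, workdir: str = ".") -> int:
    """Run the CLI in workdir and return its exit code.

    Raises FileNotFoundError if the YAML file does not exist, and
    subprocess.CalledProcessError if the CLI exits with an error.
    """
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"Configuration file not found: {config}")
    # Use an absolute path so the file is still found when workdir differs.
    command = build_command(str(config.resolve()))
    completed = subprocess.run(command, cwd=workdir, check=True)
    return completed.returncode
```

Calling `run_pipeline("./USER_VAR_ENV.yml")` from the activated virtual environment would then behave like the shell command in step 2, with the output directory created under `workdir`.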

📁 Full output folder description

./output_annotated_corpus folder/
 |
 |- output_annotated_corpus folder.zip/
 |           |
 |           |- data_split_n2/ : The all_reduced.conll divided into 2 sets (train, dev)
 |           |
 |           |- data_split_n3/ : The all_reduced.conll divided into 3 sets (train, dev, test)
 |           |
 |           |- data_split_n3_idx/ : The all_reduced.conll divided into 3 sets (train, dev, test) with sentences ID
 |           |
 |           |- data_split_n2_idx/ : The all_reduced.conll divided into 2 sets (train, dev) with sentences ID
 |           |
 |           |- XMI_curated/ : Original XMI to import into INCEpTION
 |           |
 |           |- all.conll : All documents in CONLL format
 |           |- all_reduced.conll : All documents in CONLL format reduced to only annotated sentences
 |
 |- meta_corpus.json : corpus metadata and statistics
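Once the corpus is exported, a quick sanity check is to count how many entities of each label a CONLL file contains. The following is a minimal sketch, assuming IOB2-style tags in the last column (an entity starts with a `B-` tag); the actual tag scheme depends on the CONLL format chosen in the YAML configuration. `count_entity_labels` is a hypothetical helper and not part of inception2corpus.

```python
from collections import Counter


def count_entity_labels(conll_text: str) -> Counter:
    """Count entities per label, counting one entity per "B-" tag (IOB2 assumption)."""
    counts: Counter = Counter()
    for line in conll_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            # Blank lines (sentence boundaries) or lines without a tag column.
            continue
        tag = parts[-1]
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts


# Small inline sample standing in for the content of a file such as all_reduced.conll.
sample = "Jean B-PER\nDupont I-PER\nvit O\nà O\nParis B-LOC\n"
print(count_entity_labels(sample))
```

To apply it to the real output, read a file such as all_reduced.conll with `Path(...).read_text(encoding="utf-8")` and pass the text to the function. The meta_corpus.json report can be loaded with the standard `json.load`; its keys are not described here, so inspect them before relying on any of them.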
