
A CLI that exports a corpus annotated with named entities from INCEpTION into an archived, reusable and versionable corpus.


inception2corpus-CLI

A CLI that retrieves a corpus annotated with named entities from an INCEpTION instance and turns it into an archived and reusable corpus, in the context of any NER project.

This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.

License: MIT

The CLI launches a linear process, called a "pipeline", which executes the components in the following order:

  • Fetch the curated documents from the INCEpTION instance as XMI (check the state of each document in the INCEpTION "Monitoring" window);
  • Re-tokenize the curated documents;
  • Convert the XMI files to CONLL files;
  • Merge the CONLL files into one;
  • Produce a report containing statistics and metadata about the corpus;
  • Reduce the dataset (keep only the annotated sentences and reject the others) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test) according to a ratio defined by the user.
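
The last step of the pipeline (reduce, then split) can be illustrated with a minimal sketch. It is not the tool's actual implementation: it assumes a whitespace-separated CONLL layout with the token in the first column and the label in the last, "O" as the label for non-entity tokens, blank lines between sentences, and a seeded random shuffle before splitting. The function names `read_conll`, `reduce_sentences` and `split_dataset` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def read_conll(text: str) -> List[Sentence]:
    """Parse CONLL text; blank lines separate sentences (assumed layout)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: token is the first column, label is the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def reduce_sentences(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only sentences that contain at least one non-"O" label."""
    return [s for s in sentences if any(label != "O" for _, label in s)]


def split_dataset(
    sentences: List[Sentence], ratios: Sequence[float], seed: int = 42
) -> List[List[Sentence]]:
    """Shuffle and split, e.g. ratios=(0.8, 0.2) or (0.8, 0.1, 0.1)."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last split takes whatever remains, so no sentence is lost.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        splits.append(shuffled[start:end])
        start = end
    return splits
```

With this sketch, `split_dataset(reduced, (0.8, 0.2))` would give the train/dev pair and `split_dataset(reduced, (0.8, 0.1, 0.1))` the train/dev/test triple; fixing the seed keeps the splits reproducible between runs.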

At the end of the run, an output_annotated_corpus folder/ is created in the tool's root folder; for more details, see the "Full output folder description" section.

🛠️ Installation

macOS / Linux

  1. In the ./inception2corpus-CLI/ location, open a terminal

  2. Check that Python 3.7 or higher is installed

python --version

If it is not, install it before continuing.

  3. Create a code environment with virtualenv and the correct Python version
virtualenv --python=/usr/bin/python3.7 venv
  4. Activate this code environment
source venv/bin/activate
  5. Finally, install the required packages
pip install -r requirements.txt

Windows

  1. In the ./inception2corpus-CLI/ location, open a terminal (PowerShell)

  2. Check that Python 3.7 or higher is installed

python --version

If it is not, install it before continuing.

  3. Create a code environment with the venv module
py -m venv venv
  4. Activate this code environment
.\venv\Scripts\activate
  5. Finally, install the required packages
pip install -r .\requirements.txt

⚠️ Configuration before launching the tool

  • Do not delete the temp_files/ folder
  • Do not delete the i2c_lib/ folder
  • Open the USER_VAR_ENV.yml file and fill it in with the correct information.
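
The checklist above can be verified before launch with a small pre-flight script. This is only a sketch and is not part of the tool: the helper names `missing_items` and `python_ok` are hypothetical, and the script only checks that the two folders and the configuration file exist and that the interpreter is Python 3.7 or higher. It does not validate the contents of USER_VAR_ENV.yml.

```python
import sys
from pathlib import Path
from typing import List

# Names taken from the checklist above.
REQUIRED_DIRS = ("temp_files", "i2c_lib")
REQUIRED_FILES = ("USER_VAR_ENV.yml",)
MIN_PYTHON = (3, 7)


def missing_items(root: Path) -> List[str]:
    """Return the required folders and files that are absent under root."""
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing


def python_ok(version=sys.version_info) -> bool:
    """True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    # Run from the tool's root folder (./inception2corpus-CLI/).
    problems = missing_items(Path("."))
    if not python_ok():
        problems.append("Python 3.7 or higher")
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Ready to launch.")
```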

▶️ Usage

First activate the code environment (see the Installation section), then use one of the following methods:

Method 1) In a terminal, run:

python inception2corpus.py

Method 2) In a terminal (macOS / Linux), make the script executable:

chmod +x inception2corpus.py

then run it directly:

./inception2corpus.py

📁 Full output folder description

./output_annotated_corpus folder/
 |
 |- output_annotated_corpus folder.zip/
 |           |
 |           |- data_split_n2/ : The all_reduced.conll divided into 2 sets (train, dev)
 |           |
 |           |- data_split_n3/ : The all_reduced.conll divided into 3 sets (train, dev, test)
 |           |
 |           |- data_split_n3_idx/ : The all_reduced.conll divided into 3 sets (train, dev, test) with sentence IDs
 |           |
 |           |- data_split_n2_idx/ : The all_reduced.conll divided into 2 sets (train, dev) with sentence IDs
 |           |
 |           |- XMI_curated/ : Original XMI to import into INCEpTION
 |           |
 |           |- all.conll : All documents in CONLL format
 |           |- all_reduced.conll : All documents in CONLL format reduced to only annotated sentences
 |
 |- meta_corpus.json : corpus metadata and statistics
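
To give an idea of the kind of statistics a report such as meta_corpus.json could hold, here is a minimal sketch that counts sentences, tokens and entities in a CONLL file such as all.conll. It does not reproduce the tool's report: the function name `corpus_stats` and the keys of the returned dictionary are hypothetical, and the sketch assumes IOB2-style labels in the last column (each entity starts with a "B-" label) with blank lines between sentences.

```python
import json
from collections import Counter


def corpus_stats(conll_text: str) -> dict:
    """Count sentences, tokens and entities per label in CONLL text."""
    n_sentences = 0
    n_tokens = 0
    entity_counts: Counter = Counter()
    in_sentence = False
    for line in conll_text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if not in_sentence:
            n_sentences += 1
            in_sentence = True
        n_tokens += 1
        label = parts[-1]
        # Assumption: IOB2 tagging, so one "B-" label marks one entity.
        if label.startswith("B-"):
            entity_counts[label[2:]] += 1
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "entities": dict(entity_counts),
    }


if __name__ == "__main__":
    sample = "Paris B-LOC\nis O\n\nJean B-PER\nDupont I-PER\n"
    print(json.dumps(corpus_stats(sample), indent=2))
```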

Download files

Source distribution: inception2corpus-0.0.1.tar.gz (16.0 kB), uploaded via twine/4.0.1 CPython/3.10.7

Algorithm    Hash digest
SHA256       b716a985fff27cfaa0c49744ceba6b9bc5e20434eaddf84afba9306a54eab263
MD5          f30f3253f028b1046fd5e6ed4bae720c
BLAKE2b-256  cdfc9d78d479d1a5acf8f9529a71d50f39dcb6b6f5dc37b4a6d468a26fbf701d

Built distribution: inception2corpus-0.0.1-py3-none-any.whl (17.7 kB), Python 3

Algorithm    Hash digest
SHA256       38357b46d2b1de7596da351c0a4d8a243217d82753215b9e046e22fa43f10b63
MD5          22ed06fd2d627bf65129e370d0708405
BLAKE2b-256  6319b26d322e517d4b0bd5a144d00b7fb3d0f150669dc0f68e13df016161986d