A CLI for retrieving a corpus annotated with named entities from INCEpTION to an archived, reusable and versionable corpus.

inception2corpus-CLI

A CLI for retrieving a corpus annotated with named entities from an INCEpTION instance and turning it into an archived, reusable corpus for any NER project.

This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.

The CLI launches a linear process, called a "pipeline", which executes the components in the following order:

  • Fetch curated documents from an INCEpTION instance as XMI (check each document's state in INCEpTION's "Monitoring" window);

  • Re-tokenize curated documents;
  • Convert XMI to CONLL files;
  • Merge CONLL files in one;
  • Provide a report containing statistics and metadata about the corpus;
  • Reduce the dataset (keep only the annotated sentences and discard the rest) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test), according to a ratio defined by the user.
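
The merge, reduce and split steps above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names (`merge_conll`, `parse_conll`, `reduce_sentences`, `split_dataset`) are hypothetical, and the sketch assumes whitespace-separated CONLL lines with the token in the first column, the tag in the last column, `O` for unannotated tokens, and blank lines between sentences.

```python
import random
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def merge_conll(texts: Sequence[str]) -> str:
    """Concatenate several CONLL documents, separated by one blank line."""
    return "\n\n".join(t.strip("\n") for t in texts) + "\n"


def parse_conll(text: str) -> List[Sentence]:
    """Split CONLL text into sentences (assumption: blank lines are separators)."""
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def reduce_sentences(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only sentences that contain at least one non-'O' tag."""
    return [s for s in sentences if any(tag != "O" for _, tag in s)]


def split_dataset(items: list, ratios: Sequence[float], seed: int = 42) -> List[list]:
    """Shuffle and split items according to ratios, e.g. (0.8, 0.1, 0.1).

    The last split receives whatever remains after the others are cut.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    splits, start = [], 0
    for ratio in ratios[:-1]:
        end = start + int(n * ratio)
        splits.append(shuffled[start:end])
        start = end
    splits.append(shuffled[start:])
    return splits
```

With this sketch, `split_dataset(reduced, (0.8, 0.2))` would give a train/dev pair and `split_dataset(reduced, (0.8, 0.1, 0.1))` a train/dev/test triple. The fixed seed is a design choice made here so that the same split can be reproduced later.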

When the program finishes, an output_annotated_corpus folder/ is created in the tool's root folder; see the "Full output folder description" section for details.

🛠️ Installation

macOS / Linux

  1. Open a terminal in the ./inception2corpus-CLI/ folder.

  2. Check that Python 3.7 or higher is installed:

python --version

If it is not, install it before continuing.

  3. Create a virtual environment with virtualenv and the correct Python version (adjust the interpreter path to your system):
virtualenv --python=/usr/bin/python3.7 venv
  4. Activate the environment:
source venv/bin/activate
  5. Finally, install the required packages:
pip install -r requirements.txt

Windows

  1. Open a terminal (PowerShell) in the ./inception2corpus-CLI/ folder.

  2. Check that Python 3.7 or higher is installed:

python --version

If it is not, install it before continuing.

  3. Create a virtual environment with venv:
py -m venv venv
  4. Activate the environment:
.\venv\Scripts\activate
  5. Finally, install the required packages:
pip install -r .\requirements.txt
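
On either platform, the version requirement from step 2 can also be checked from inside Python once the environment is active. The helper below is a minimal sketch; `python_is_supported` and `MIN_VERSION` are names introduced here for illustration and are not part of the tool.

```python
import sys

# Minimum version stated in the installation steps above.
MIN_VERSION = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is currently running.
if not python_is_supported():
    print("Python 3.7 or higher is required; found", sys.version.split()[0])
```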

⚠️ Configuration before launching the tool

  • Do not delete the temp_files/ folder.
  • Do not delete the i2c_lib/ folder.
  • Open the USER_VAR_ENV.yml file and fill it in with the correct information.
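
A quick check before launch can confirm that the three items listed above are still in place. This is a hedged sketch rather than something shipped with the tool: `missing_items` is a hypothetical helper, and it only verifies that the folders and the YAML file exist, not that the file's contents are valid.

```python
from pathlib import Path

# Names taken from the configuration notes above.
REQUIRED_DIRS = ("temp_files", "i2c_lib")
REQUIRED_FILES = ("USER_VAR_ENV.yml",)


def missing_items(root):
    """Return the required folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_items(".")` from the tool's root folder returns an empty list when everything is in place.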

▶️ Usage

First activate the virtual environment (see the Installation section), then use one of the following methods.

Method 1) In a terminal, run:

python inception2corpus.py

Method 2) In a terminal (macOS / Linux), make the script executable:

chmod +x inception2corpus.py

then run it:

./inception2corpus.py

📁 Full output folder description

./output_annotated_corpus folder/
 |
 |- output_annotated_corpus folder.zip/
 |           |
 |           |- data_split_n2/ : The all_reduced.conll divided into 2 sets (train, dev)
 |           |
 |           |- data_split_n3/ : The all_reduced.conll divided into 3 sets (train, dev, test)
 |           |
 |           |- data_split_n3_idx/ : The all_reduced.conll divided into 3 sets (train, dev, test) with sentence IDs
 |           |
 |           |- data_split_n2_idx/ : The all_reduced.conll divided into 2 sets (train, dev) with sentence IDs
 |           |
 |           |- XMI_curated/ : Original XMI to import into INCEpTION
 |           |
 |           |- all.conll : All documents in CONLL format
 |           |- all_reduced.conll : All documents in CONLL format reduced to only annotated sentences
 |
 |- meta_corpus.json : corpus metadata and statistics
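
To get a rough idea of what all.conll or all_reduced.conll contains, the file can be summarized with a short script. The sketch below is an assumption-laden illustration: `corpus_stats` is a hypothetical function, it assumes IOB2-style tags in the last column (`B-LABEL` opening each entity) and blank lines between sentences, and its output does not reproduce the actual structure of meta_corpus.json.

```python
from collections import Counter


def corpus_stats(conll_text: str) -> dict:
    """Count sentences, tokens and entity mentions per label in CONLL text."""
    n_sentences = 0
    n_tokens = 0
    entities = Counter()
    in_sentence = False
    for line in conll_text.splitlines():
        parts = line.split()
        if not parts:
            # Assumption: a blank line closes the current sentence.
            in_sentence = False
            continue
        if not in_sentence:
            n_sentences += 1
            in_sentence = True
        n_tokens += 1
        tag = parts[-1]
        # Assumption: each entity mention starts with a B- tag (IOB2).
        if tag.startswith("B-"):
            entities[tag[2:]] += 1
    return {"sentences": n_sentences, "tokens": n_tokens, "entities": dict(entities)}
```

To use it, read the file as text (for example with `Path("all_reduced.conll").read_text(encoding="utf-8")`) and pass the result to `corpus_stats`.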
