
A CLI that retrieves a corpus annotated with named entities from INCEpTION and turns it into an archived, reusable, and versionable corpus.

Project description

inception2corpus-CLI

A CLI that retrieves a corpus annotated with named entities from an INCEpTION instance and turns it into an archived and reusable corpus, for use in any NER project.

This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.

License: MIT

The CLI launches a linear process, called a "pipeline", which executes the components in the following order:

  • Fetch curated documents from the INCEpTION instance (in XMI format; check each document's state in the INCEpTION "Monitoring" window);


  • Re-tokenize the curated documents;
  • Convert the XMI files to CONLL files;
  • Merge the CONLL files into one;
  • Produce a report containing statistics and metadata about the corpus;
  • Reduce the dataset (keep only the annotated sentences and reject the others) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test), according to a ratio defined by the user.
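The last step above (reduce, then split by ratio) can be sketched as follows. This is only an illustration, not the tool's actual code: it assumes blank-line-separated sentences, whitespace-separated columns with the tag in the last column, and "O" as the tag for non-entity tokens. The function names `parse_conll`, `reduce_to_annotated`, and `split_dataset`, and the fixed shuffle seed, are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL text into sentences (assumed: blank lines separate sentences)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumed layout: token in the first column, tag in the last one.
        tag = parts[-1] if len(parts) > 1 else "O"
        current.append((parts[0], tag))
    if current:
        sentences.append(current)
    return sentences


def reduce_to_annotated(sentences: Sequence[Sentence]) -> List[Sentence]:
    """Keep only sentences that contain at least one non-"O" tag."""
    return [s for s in sentences if any(tag != "O" for _, tag in s)]


def split_dataset(items: Sequence, ratios: Sequence[float], seed: int = 42) -> List[list]:
    """Shuffle items and cut them into len(ratios) consecutive parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes whatever remains, so no item is lost to rounding.
        is_last = i == len(ratios) - 1
        end = len(shuffled) if is_last else start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts
```

With this sketch, `split_dataset(reduced, (0.8, 0.2))` would give a train/dev split and `split_dataset(reduced, (0.8, 0.1, 0.1))` a train/dev/test split; the ratios shown are examples, not the tool's defaults.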

At the end of the execution, an output_annotated_corpus folder/ is created in the tool's root folder; for more details, see the "Full output folder description" section.

🛠️ Installation

macOS / Linux

  1. In the ./inception2corpus-CLI/ location, open a terminal

  2. Check that Python 3.7 or higher is installed

python --version

If not, install it before continuing.

  3. Create a code environment with virtualenv and the correct Python version
virtualenv --python=/usr/bin/python3.7 venv
  4. Activate this code environment
source venv/bin/activate
  5. Finally, install the required packages
pip install -r requirements.txt

Windows

  1. In the ./inception2corpus-CLI/ location, open a terminal (PowerShell)

  2. Check that Python 3.7 or higher is installed

python --version

If not, install it before continuing.

  3. Create a code environment with venv and the correct Python version
py -m venv venv
  4. Activate this code environment
.\venv\Scripts\activate
  5. Finally, install the required packages
pip install -r .\requirements.txt
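On either platform, the Python 3.7 requirement can also be checked from inside the interpreter that the environment actually uses. The helper below is a small sketch; `python_is_supported` and `MIN_VERSION` are hypothetical names and not part of the tool.

```python
import sys

MIN_VERSION = (3, 7)  # minimum version stated in the installation steps


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `python_is_supported()` with no arguments checks the running interpreter.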

⚠️ Configuration before launching the tool

  • Do not delete the temp_files/ folder
  • Do not delete the i2c_lib/ folder
  • Go to the USER_VAR_ENV.yml file and fill it with the correct information.
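The three points above can be verified before launch with a short preflight check. This is a sketch under assumptions: it only tests that the two folders exist and that USER_VAR_ENV.yml exists and is not empty; it does not validate the file's contents, whose keys are not described here. The function name `missing_prerequisites` is hypothetical.

```python
from pathlib import Path
from typing import List

REQUIRED_DIRS = ("temp_files", "i2c_lib")   # folders that must not be deleted
REQUIRED_FILES = ("USER_VAR_ENV.yml",)      # configuration file to fill in


def missing_prerequisites(root: Path) -> List[str]:
    """List required folders/files that are absent (or empty) under `root`."""
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    for name in REQUIRED_FILES:
        path = root / name
        # An empty configuration file is treated as missing.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, `missing_prerequisites(Path("."))` run from the tool's root folder returns an empty list when everything is in place.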

▶️ Usage

First activate the code environment (see the Installation section), then use one of the following methods:

Method 1) In a terminal, run:

python inception2corpus.py

Method 2) In a terminal, run:

chmod +x inception2corpus.py

then

./inception2corpus.py

📁 Full output folder description

./output_annotated_corpus folder/
 |
 |- output_annotated_corpus folder.zip/
 |           |
 |           |- data_split_n2/ : The all_reduced.conll divided into 2 sets (train, dev)
 |           |
 |           |- data_split_n3/ : The all_reduced.conll divided into 3 sets (train, dev, test)
 |           |
 |           |- data_split_n3_idx/ : The all_reduced.conll divided into 3 sets (train, dev, test) with sentence IDs
 |           |
 |           |- data_split_n2_idx/ : The all_reduced.conll divided into 2 sets (train, dev) with sentence IDs
 |           |
 |           |- XMI_curated/ : Original XMI to import into INCEpTION
 |           |
 |           |- all.conll : All documents in CONLL format
 |           |- all_reduced.conll : All documents in CONLL format reduced to only annotated sentences
 |
 |- meta_corpus.json : corpus metadata and statistics
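Once the output is available, a file such as all.conll can be inspected with a few lines of Python. The sketch below counts entity mentions per label; it assumes whitespace-separated columns with an IOB2-style tag (for example B-LOC, I-LOC, O) in the last column, which may not match the exact layout the tool writes. `count_entity_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_entity_labels(lines: Iterable[str]) -> Counter:
    """Count entity mentions per label, one per "B-" tag (assumed IOB2 scheme)."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:  # blank or malformed line: skip
            continue
        tag = parts[-1]
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts
```

A possible usage is `with open("all.conll", encoding="utf-8") as f: print(count_entity_labels(f))`, with the path adjusted to where the archive was extracted.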

