A CLI for retrieving a corpus annotated with named entities from INCEpTION to an archived, reusable and versionable corpus.
inception2corpus-CLI
A CLI for retrieving a corpus annotated with named entities from an INCEpTION instance and turning it into an archived, reusable corpus in the context of any NER project.
This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.
The CLI launches a linear process, called a "pipeline", which executes the components in the following order:
- Fetch curated documents from the INCEpTION instance as XMI (check each document's state in the INCEpTION "Monitoring" window);
- Re-tokenize the curated documents;
- Convert the XMI files to CONLL files;
- Merge the CONLL files into one;
- Produce a report containing statistics and metadata about the corpus;
- Reduce the dataset (keep only annotated sentences and discard the others) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test) according to a ratio defined by the user.
At the end of the run, an output_annotated_corpus folder/ directory is created in the tool's root folder; for more details, see the "Full output folder description" section.
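The last pipeline step (reduce, then split by ratio) can be pictured with a minimal sketch. This is not the tool's actual code: the function names, the fixed seed, and the assumed CONLL layout (one token per line, tag in the last column, "O" for unannotated tokens, blank line between sentences) are all assumptions made for illustration.

```python
import random


def read_sentences(conll_text):
    """Split CONLL text into sentences, each a list of (token, tag) pairs.

    Assumes one token per line, the tag in the last column, and a blank
    line between sentences (hypothetical layout, not confirmed by the tool).
    """
    sentences, current = [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def reduce_sentences(sentences):
    """Keep only sentences that carry at least one tag other than "O"."""
    return [s for s in sentences if any(tag != "O" for _, tag in s)]


def split_dataset(sentences, ratios, seed=42):
    """Shuffle the sentences and cut them into len(ratios) consecutive parts.

    The last part receives whatever remains, so no sentence is lost.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    total = len(shuffled)
    parts, start = [], 0
    for index, ratio in enumerate(ratios):
        is_last = index == len(ratios) - 1
        end = total if is_last else start + int(total * ratio)
        parts.append(shuffled[start:end])
        start = end
    return parts
```

Calling split_dataset(reduced, (0.8, 0.2)) would give a train/dev pair, and split_dataset(reduced, (0.8, 0.1, 0.1)) a train/dev/test triple; the ratios themselves are only example values.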
🛠️ Installation
macOS / Linux
- In the ./inception2corpus-CLI/ location, open a terminal
- Check if Python 3.7 or higher is installed
python --version
if not, install it first
- Create a code environment with virtualenv and the correct Python version
virtualenv --python=/usr/bin/python3.7 venv
- Activate this code environment
source venv/bin/activate
- Finally, install the required packages
pip install -r requirements.txt
Windows
- In the ./inception2corpus-CLI/ location, open a terminal (PowerShell)
- Check if Python 3.7 or higher is installed
python --version
if not, install it first
- Create a code environment with venv and the correct Python version
py -m venv venv
- Activate this code environment
.\venv\Scripts\activate
- Finally, install the required packages
pip install -r .\requirements.txt
⚠️ Configuration before launching the tool
- Do not delete the temp_files/ folder
- Do not delete the i2c_lib/ folder
- Open the USER_VAR_ENV.yml file and fill it in with the correct information.
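The checklist above can be verified before launching with a small pre-flight script. This is only a sketch and not part of the tool: the function name missing_items is hypothetical, and it checks nothing beyond the presence of the two folders and the configuration file named above.

```python
from pathlib import Path

# Names taken from the configuration checklist.
REQUIRED_DIRS = ("temp_files", "i2c_lib")
CONFIG_FILE = "USER_VAR_ENV.yml"


def missing_items(root="."):
    """Return the required folders/files that are absent under root."""
    root = Path(root)
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    if not (root / CONFIG_FILE).is_file():
        missing.append(CONFIG_FILE)
    return missing
```

Run from the tool's root folder, missing_items() returns an empty list when the layout is complete. It does not validate the content of USER_VAR_ENV.yml.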
▶️ Usage
First activate the code environment (cf. the Installation section), then use one of the following methods:
Method 1) In a terminal, run:
python inception2corpus.py
Method 2) In a terminal (macOS / Linux), make the script executable:
chmod +x inception2corpus.py
then run it:
./inception2corpus.py
📁 Full output folder description
./output_annotated_corpus folder/
|
|- output_annotated_corpus folder.zip/
| |
| |- data_split_n2/ : The all_reduced.conll divided into 2 sets (train, dev)
| |
| |- data_split_n3/ : The all_reduced.conll divided into 3 sets (train, dev, test)
| |
| |- data_split_n3_idx/ : The all_reduced.conll divided into 3 sets (train, dev, test) with sentences ID
| |
| |- data_split_n2_idx/ : The all_reduced.conll divided into 2 sets (train, dev) with sentences ID
| |
| |- XMI_curated/ : Original XMI to import into INCEpTION
| |
| |- all.conll : All documents in CONLL format
| |- all_reduced.conll : All documents in CONLL format reduced to only annotated sentences
|
|- meta_corpus.json : corpus metadata and statistics
File details
Details for the file inception2corpus-0.0.1.tar.gz (source distribution).
File metadata
- Download URL: inception2corpus-0.0.1.tar.gz
- Size: 16.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | b716a985fff27cfaa0c49744ceba6b9bc5e20434eaddf84afba9306a54eab263
MD5 | f30f3253f028b1046fd5e6ed4bae720c
BLAKE2b-256 | cdfc9d78d479d1a5acf8f9529a71d50f39dcb6b6f5dc37b4a6d468a26fbf701d
File details
Details for the file inception2corpus-0.0.1-py3-none-any.whl (built distribution).
File metadata
- Download URL: inception2corpus-0.0.1-py3-none-any.whl
- Size: 17.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 38357b46d2b1de7596da351c0a4d8a243217d82753215b9e046e22fa43f10b63
MD5 | 22ed06fd2d627bf65129e370d0708405
BLAKE2b-256 | 6319b26d322e517d4b0bd5a144d00b7fb3d0f150669dc0f68e13df016161986d
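A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the files were saved locally under their published names.

```python
import hashlib

# SHA256 digests copied from the "File hashes" tables.
EXPECTED_SHA256 = {
    "inception2corpus-0.0.1.tar.gz":
        "b716a985fff27cfaa0c49744ceba6b9bc5e20434eaddf84afba9306a54eab263",
    "inception2corpus-0.0.1-py3-none-any.whl":
        "38357b46d2b1de7596da351c0a4d8a243217d82753215b9e046e22fa43f10b63",
}


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Return True if the file at path has the digest listed for filename."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, matches_published_hash("inception2corpus-0.0.1.tar.gz", "inception2corpus-0.0.1.tar.gz") should return True for an intact download placed in the current directory.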