# EIS1600 project tools and utilities
## EIS1600 Tools
### File Preparation

- Convert from mARkdown to EIS1600TMP with `convert_mARkdown_to_EIS1600`.
- Check the `.EIS1600TMP` file and correct the tagged structure.
- Mark the file as ready in the Google Spreadsheet (this includes the file in our processing pipeline).
- Optional: run `ids_insert_or_update` on the checked `.EIS1600TMP` file (or run `incorporate_newly_prepared_files_in_corpus`, which will add IDs for all files listed as ready or double-checked).
If you need to change the tagged structure in an `.EIS1600` file, make those changes in Simple Markdown. Then run `ids_insert_or_update` to convert the Simple Markdown changes to EIS1600 mARkdown.
### Processing Workflow

- Run `incorporate_newly_prepared_files_in_corpus`. This script downloads the Google Sheet and processes all ready and double-checked files:
  - Ready files are converted from EIS1600TMP to EIS1600 and IDs are added;
  - The formatting of ready files (now EIS1600 files) and double-checked files is checked;
  - IDs are updated if necessary.

  Files are now finalized and ready to be processed by the pipeline.
- Run `analyse_all_on_cluster`. This script analyses all files prepared by the previous step:
  - Each file is disassembled into MIUs;
  - The analysis routine is run for each MIU;
  - Results are returned as one JSON file per text, containing the annotated text, the populated YAML header, and the analysis results (as a DataFrame).

  The JSON files are ready to be imported into our database.
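Once the pipeline has run, the per-file JSONs can be consumed programmatically. A minimal reader sketch follows; note that the field names `text`, `yml`, and `analysis` are illustrative assumptions, not the package's actual schema:

```python
import json


def load_result(path):
    """Load one pipeline result file and return its three parts.

    Assumed (hypothetical) layout: {"text": ..., "yml": ..., "analysis": ...}.
    Missing keys come back as None instead of raising.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data.get("text"), data.get("yml"), data.get("analysis")
```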
## Installation

You can either do the complete local setup and have everything installed on your machine, or use the Docker image, which can execute all commands from the EIS1600-pkg.
### Docker Installation

Install Docker Desktop: https://docs.docker.com/desktop/install/mac-install/

This also installs Docker Engine, which can be used through the command line interface (CLI).

To run a script from the EIS1600-pkg with Docker, pass the command to Docker through the CLI:

```
$ docker run <--gpus all> -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg <EIS1600-pkg-command and its params>
```
Explanation:

- `docker run` starts the image;
- `-it` propagates CLI input to the image;
- `--gpus all` is optional and runs Docker with GPUs;
- `-v` mounts a directory from your system into the Docker image: `</path/to/EIS1600>` on your machine is mapped to `/EIS1600` inside the image. Give the absolute path to your `EIS1600` parent directory, making sure to replace `</path/to/EIS1600>` with the correct path on your machine! The part before the colon is the source on your machine; the part after the colon is the destination inside the Docker image (this one is fixed);
- `eis1600-pkg` is the repository name on Docker Hub from which the image will be downloaded;
- last comes the command from the package you want to execute, including all parameters required by that command.
E.g., to run `q_tags_to_bio` for toponym descriptions through Docker:

```
$ docker run -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg q_tags_to_bio Topo_Data/MIUs/ TOPONYM_DESCRIPTION_DETECTION/toponym_description_training_data TOPD
```
To run the annotation pipeline:

```
$ docker run --gpus all -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg analyse_all_on_cluster
```

You may need to add `-D` as a parameter to `analyse_all_on_cluster`, because parallel processing does not work with GPUs.
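If you script these invocations, it can help to assemble the `docker run` argument list programmatically. A sketch in Python; the helper name `docker_cmd` and the example host path are our own, not part of the package:

```python
from pathlib import Path


def docker_cmd(host_dir: str, pkg_command: list, gpus: bool = False) -> list:
    """Build the argv for running an EIS1600-pkg command in the Docker image.

    host_dir is the absolute path to your EIS1600 parent directory; it is
    mounted at the fixed destination /EIS1600 inside the image.
    """
    cmd = ["docker", "run"]
    if gpus:
        cmd += ["--gpus", "all"]
    cmd += ["-it", "-v", f"{Path(host_dir)}:/EIS1600", "eis1600-pkg"]
    return cmd + pkg_command
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with paths that contain spaces.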
### Local Setup

After creating and activating the eis1600_env (see Set Up Virtual Environment below), use:

```
$ pip install eis1600
```

In case you have an older version installed, use:

```
$ pip install --upgrade eis1600
```

The package comes with different options. To install camel-tools, use the following command; check their installation instructions as well, because at the moment they require additional packages: https://camel-tools.readthedocs.io/en/latest/getting_started.html#installation

```
$ pip install 'eis1600[NER]'
```
If you want to run the annotation pipeline, you also need to download the camel-tools data:

```
$ camel_data -i disambig-mle-calima-msa-r13
```

To run the annotation pipeline with GPU support, use this command:

```
$ pip install 'eis1600[EIS]'
```

Note: you can use `pip freeze` to check the versions of all installed packages, including `eis1600`.
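To check from inside Python whether a package resolved correctly after installation, a small stdlib helper can be used (the function name `is_installed` is ours, not part of the package):

```python
import importlib.util


def is_installed(pkg: str) -> bool:
    """Return True if top-level package `pkg` is importable in this environment."""
    return importlib.util.find_spec(pkg) is not None

# After `pip install eis1600`, is_installed("eis1600") should return True.
```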
### Common Error Messages

You need to download all the models ONE BY ONE from Google Drive. Something breaks if you try to download the whole folder, and you get this error:

```
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory EIS1600_Pretrained_Models/camelbert-ca-finetuned
```

It is better to sync `EIS1600_Pretrained_Models` with our Nextcloud.

If you want to install `eis1600-pkg` from source, you have to add the data modules for `gazetteers` and `helper` manually. You can find these modules in our Nextcloud.
### Set Up Virtual Environment and Install the EIS1600 PKG there

To avoid interfering with other Python installations, we recommend installing the package in a virtual environment. To create a new virtual environment with Python, run:

```
$ python3 -m venv eis1600_env
```

NB: when creating your new virtual environment, you must use Python 3.7 or 3.8, as these are the versions required by CAMeL-Tools.

After creation, the environment can be activated with:

```
$ source eis1600_env/bin/activate
```

The environment is now activated, and the eis1600 package can be installed into it with pip:

```
$ pip install eis1600
```

This command installs all dependencies as well, so you should see lots of other libraries being installed. If you do not, you probably used the wrong version of Python when creating your virtual environment.
You can now use the commands listed in this README.
To use the environment, you have to activate it for every session:

```
$ source eis1600_env/bin/activate
```

After successful activation, your prompt is prefixed with `(eis1600_env)`.

You probably want to create an alias for the source command by adding the following line to your alias file:

```
alias eis="source eis1600_env/bin/activate"
```
Alias files:

- on Linux: `.bash_aliases`;
- on Mac: `.zshrc` if you use `zsh` (the default in recent versions of macOS).
## Structure of the working directory

The working directory is always the main `EIS1600` directory, which is a parent to all the different repositories. The `EIS1600` directory has the following structure:

```
|
|---| eis_env
|---| EIS1600_JSONs
|---| EIS1600_Pretrained_Models (for annotation, sync from Nextcloud)
|---| gazetteers
|---| Master_Chronicle
|---| OpenITI_EIS1600_Texts
|---| Training_Data
```

Path variables are defined in the module `eis1600/helper/repo`.
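A quick sanity check that the working directory actually has the layout above can save confusing pipeline errors later. A sketch (the helper `missing_repos` is ours; `eis_env` is omitted because the virtual environment may live elsewhere):

```python
from pathlib import Path

# Subdirectory names taken from the tree above.
EXPECTED = ["EIS1600_JSONs", "EIS1600_Pretrained_Models", "gazetteers",
            "Master_Chronicle", "OpenITI_EIS1600_Texts", "Training_Data"]


def missing_repos(root: Path) -> list:
    """Return the expected subdirectories missing under the EIS1600 parent dir."""
    return [name for name in EXPECTED if not (root / name).is_dir()]

# e.g. missing_repos(Path.cwd()) when run from the EIS1600 parent directory
```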
## Usage

All commands must be run from the parent directory `EIS1600`!

See also Processing Workflow.
### Annotation Pipeline

Use the `-D` flag to run annotation of MIUs in sequence; otherwise the annotation is run in parallel, and it will eat up ALL resources.

```
$ analyse_all_on_cluster
```
### Convert mARkdown to EIS1600 files

Converts a mARkdown file to EIS1600TMP (without inserting UIDs). The .EIS1600TMP file will be created next to the .mARkdown file (you can input .inProcess or .completed files as well).

This command can be run from anywhere within the text repo; use auto-complete (tab) to get the correct path to the file. Alternatively, open a command line in the folder containing the file to be converted.

```
$ convert_mARkdown_to_EIS1600TMP <uri>.mARkdown
```
### Batch processing of mARkdown files

Run from the parent directory `EIS1600`. Use the `-e` option to convert all files from the EIS1600 repo.

```
$ convert_mARkdown_to_EIS1600 -e <EIS1600_repo>
```
### EIS1600TMP to EIS1600

EIS1600TMP files do not contain IDs yet. To insert IDs, run `ids_insert_or_update` on the `.EIS1600TMP` file. Use auto-complete (tab) to get the correct path to the file.

```
$ ids_insert_or_update <OpenITI_EIS1600_Text/data/path/to/file>.EIS1600TMP
```

Additionally, this routine updates IDs if you run it on an `.EIS1600` file. Updating IDs means inserting missing UIDs and updating SubIDs.

```
$ ids_insert_or_update <OpenITI_EIS1600_Text/data/path/to/file>.EIS1600
```
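The idea of "inserting missing UIDs" can be illustrated with a toy sketch: blocks that lack an ID receive a fresh, unused number. This is only an illustration of the concept; the real UID scheme and logic of `ids_insert_or_update` differ:

```python
import itertools


def fill_uids(blocks):
    """Toy UID filler. blocks: list of (uid_or_None, text) pairs.

    Returns the same list with every None uid replaced by the smallest
    unused positive integer, preserving already-assigned uids.
    """
    used = {uid for uid, _ in blocks if uid is not None}
    counter = itertools.count(1)
    out = []
    for uid, text in blocks:
        if uid is None:
            uid = next(n for n in counter if n not in used)
            used.add(uid)
        out.append((uid, text))
    return out
```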
### Batch processing

See also Processing Workflow. Use `incorporate_newly_prepared_files_in_corpus` to add IDs to all ready files from the Google Sheet.

```
$ incorporate_newly_prepared_files_in_corpus
```
### Annotation

NER annotation for persons, toponyms, and miscellaneous entities, as well as dates, the beginning and end of onomastic information (NASAB), and onomastic information itself.

Note: this can only be run if the package was installed with the NER flag AND the ML models are in the EIS1600_Pretrained_Models directory.

If no input is given, annotation is run for the whole repository. Can be used with the `-p` option for parallelization. Run from the parent directory `EIS1600` (the internally used path starts with `EIS1600_MIUs/`).

```
$ annotate_mius -p
```

To annotate all MIU files of a text, give the IDs file as argument. Can be used with the `-p` option to run in parallel.

```
$ annotate_mius <uri>.IDs
```

To annotate an individual MIU file, give the MIU file as argument.

```
$ annotate_mius <uri>/MIUs/<uri>.<UID>.EIS1600
```
### Only Onomastic Annotation

Only for test purposes!

Can be run with `-D` to process one file at a time; otherwise it runs in parallel. Can be run with `-T` to use gold-standard data as input. Run from the parent directory `EIS1600`.

```
$ onomastic_annotation
```
### Get training data from Q annotations

This script transforms Q-tags from EIS1600-mARkdown into BIO labels. It operates on a directory of MIUs and writes a JSON file with annotated MIUs in BIO training format. Parameters are:

- path to the directory containing the annotated MIUs;
- filename or path inside the RESEARCH_DATA repo for the JSON output file;
- BIO_main_class, optional, defaults to 'Q'; try to use something more meaningful and distinguishable.

```
$ q_tags_to_bio <path/to/MIUs/> <q_training_data> <bio_main_class>
```

For toponym definitions/descriptions:

```
$ q_tags_to_bio Topo_Data/MIUs/ TOPONYM_DESCRIPTION_DETECTION/toponym_description_training_data TOPD
```
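BIO labelling itself is simple: the first token of a tagged span gets a `B-` label, the rest of the span gets `I-`, and everything else is `O`. The sketch below illustrates the scheme only; it is NOT the package's actual Q-tag parser, and the span representation (token index pairs) is our assumption:

```python
def q_spans_to_bio(tokens, spans, main_class="Q"):
    """Illustrative BIO labeller.

    tokens: list of token strings.
    spans: list of (start, end) token index pairs, end exclusive.
    Returns one BIO label per token, e.g. B-Q / I-Q / O.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = f"B-{main_class}"
        for i in range(start + 1, end):
            labels[i] = f"I-{main_class}"
    return labels
```

Choosing a distinctive `main_class` (e.g. `TOPD` instead of the default `Q`) keeps the labels distinguishable when several training sets are merged.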
### MIU revision

Run the following command from the root of the MIU repo to revise automatically annotated files:

```
$ miu_random_revisions
```

On first run, the file file_picker.yml is added to the root of the MIU repository. Make sure to specify your operating system and to set your initials and the path/command for Kate in this YAML file:

```
system: ... # options: mac, lin, win;
reviewer: eis1600researcher # change this to your name;
path_to_kate: kate # add absolute path to Kate on your machine, or a working alias (kate should already work)
```

Optionally, you can specify a path from which to open files; e.g., if you only want to open training data, set:

```
miu_main_path: ./training_data/
```
When revising files, remember to change

```
reviewed : NOT REVIEWED
```

to

```
reviewed : REVIEWED
```
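If you want to flip that flag programmatically for a batch of files, a one-line string replacement is enough, since the flag is plain text in the header. A sketch (the helper name `mark_reviewed` is ours, and it assumes the exact spelling shown above):

```python
def mark_reviewed(text: str) -> str:
    """Replace the first 'reviewed : NOT REVIEWED' flag with 'reviewed : REVIEWED'.

    Returns the text unchanged if the flag is absent or already set.
    """
    return text.replace("reviewed : NOT REVIEWED", "reviewed : REVIEWED", 1)
```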