Sentence annotation with Institutional Grammar 2.0 syntax with natural language processing for document analysis

Institutional Grammar 2.0 annotator

About

A Python tool for tagging sentences with IG 2.0 syntax, with additional tools for text cleaning, preprocessing and postprocessing.

Manual

Installation

  1. Create a virtual environment:
python -m venv .env
  2. Activate the virtual environment:
source .env/bin/activate
  3. Install the package:
python -m pip install --upgrade pip
python -m pip install ig-tagger

ig-cli

Chain of command-line tools ig-cli

Possible tasks are executed as shell commands on files:

ig-cli <task_type> <input_file_path> -o output_file_path --some-additional-option
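
The three tasks described in the sections below (atomize, classify, tag) can be chained from a script. The following is a minimal sketch, not part of the package: the intermediate file names and the combination of -o with --format tsv are assumptions, and build_pipeline and run_pipeline are hypothetical helper names.

```python
import shutil
import subprocess
from typing import List


def build_pipeline(input_txt: str) -> List[List[str]]:
    """Return the three ig-cli invocations as argument lists."""
    return [
        # 1. Split the raw text into sentences, written as a .tsv file.
        ["ig-cli", "atomize", input_txt, "-o", "sentences.tsv", "--format", "tsv"],
        # 2. Label every sentence as regulative or constitutive.
        ["ig-cli", "classify", "sentences.tsv", "-o", "classified_sentences.tsv"],
        # 3. Tag the classified sentences with IG 2.0 syntax.
        ["ig-cli", "tag", "classified_sentences.tsv", "-o", "tagged_sentences.tsv"],
    ]


def run_pipeline(input_txt: str) -> None:
    """Run the steps in order, stopping at the first failing command."""
    if shutil.which("ig-cli") is None:
        raise RuntimeError("ig-cli not found; activate the virtual environment first")
    for command in build_pipeline(input_txt):
        subprocess.run(command, check=True)
```

Running the steps unattended skips the manual review recommended after classification, so in practice it may be preferable to pause after the second command.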

Help

To show information about the available commands, arguments and options, execute:

ig-cli -h

Split text document into sentences

Input:

Plain .txt file with text.

Output:

Plain .txt file with sentences separated by empty lines, or .tsv file with ['sentence no.', 'sentence_type', 'text'] columns (with the optional parameter --format=tsv).

Command:

ig-cli atomize input_text.txt
ig-cli atomize input_text.txt -o sentences.txt --split_type ml
ig-cli atomize input_text.txt --format txt

Optional parameters

  • --format (txt/tsv)
  • --output_file_path -o
  • --split_type (ml/rule_based)

About:

Complex sentences with enumerations are split into atomic sentences where possible (“xxx xxx (a) ccc, (b) vvv” -> “xxx xxx ccc”, “xxx xxx vvv”).

Possible values of --split_type: ‘ml’ and ‘rule_based’. The ML variant uses the spaCy library to recognize the beginnings and ends of sentences in the text. The rule-based variant uses regular expressions that match a capital letter at the start of a sentence and a period at its end.

The two approaches can give different results. rule_based is the basic option; ml can handle lower-quality text better because it considers the whole sentence structure, not only periods and capital letters.
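
To illustrate why the results can differ, here is a minimal sketch of a rule-based splitter. It is not the package's implementation: the regular expression and the name split_rule_based are assumptions chosen only to show the "period, then capital letter" idea.

```python
import re
from typing import List

# A sentence boundary: a period, then whitespace, then a capital letter.
_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def split_rule_based(text: str) -> List[str]:
    """Split text at every period that is followed by a capitalised word."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


text = "The employer shall pay wages. Dr. Smith may object. no capital here."
for sentence in split_rule_based(text):
    print(sentence)
```

On this input the sketch breaks after the abbreviation "Dr." and fails to break before the uncapitalised "no", which is the kind of case where a model that looks at the whole sentence structure may do better.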

Both split types recognize enumerations marked a, b, c… or 1, 2, 3… and use them to split longer sentences into shorter ones. The splitter matches such expressions (“xxx xxx (a) ccc, (b) vvv”) in the sentence, splits at the markers, and constructs new sentences from the extracted parts (“xxx xxx ccc”, “xxx xxx vvv”).

For example:

  1. The employee is subject to (1) a Federal quarantine order related to COVID-19 (2) a Federal isolation order related to COVID-19.

  2. The employee is subject to a Federal quarantine order related to COVID-19.

  3. The employee is subject to a Federal isolation order related to COVID-19.

Sentences 2-3 are extracted from sentence 1 based on the (1) (2) pattern.
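
The match, split and reconstruct steps can be sketched as follows. This is an illustration under assumptions, not the package's code: the marker pattern, the handling of separators and the name split_enumeration are all hypothetical.

```python
import re
from typing import List

# Enumeration markers such as (a), (b), (1), (2).
_MARKER = re.compile(r"\((?:[a-z]|\d+)\)")


def split_enumeration(sentence: str) -> List[str]:
    """Expand 'xxx (a) ccc, (b) vvv' into ['xxx ccc.', 'xxx vvv.']."""
    parts = _MARKER.split(sentence)
    if len(parts) < 3:  # fewer than two markers: nothing to expand
        return [sentence]
    prefix = parts[0].strip()
    atomic = []
    for item in parts[1:]:
        # Drop separators left around each item: commas, semicolons, periods.
        item = item.strip().strip(",;.").strip()
        # Drop a trailing conjunction such as '... order, or'.
        item = re.sub(r"[,;]?\s*\b(?:and|or)$", "", item).strip()
        if item:
            atomic.append(f"{prefix} {item}.")
    return atomic


example = (
    "The employee is subject to (1) a Federal quarantine order related to "
    "COVID-19 (2) a Federal isolation order related to COVID-19."
)
for atomic_sentence in split_enumeration(example):
    print(atomic_sentence)
```

A pattern this simple also matches text such as "employee(s)", so a real splitter would need extra checks before treating a bracketed letter as an enumeration marker.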


Assign sentence type

Input:

Plain .txt file with sentences separated by two newlines (\n\n), or .tsv file with 3 columns ['sentence no.', 'sentence_type', 'text']. The format is chosen based on the file extension.

Output:

.tsv file with 3 columns: ['sentence no.', 'sentence_type', 'text'].
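
For readers preparing input by hand, the two file layouts can be sketched with the standard library. The header row, numbering from 1 and the empty sentence_type column are assumptions about the exact layout; read_sentences and to_tsv are hypothetical helpers, not functions of the package.

```python
import csv
import io
from typing import List

COLUMNS = ["sentence no.", "sentence_type", "text"]


def read_sentences(txt: str) -> List[str]:
    """Split the content of a .txt file on blank lines (two newlines)."""
    return [block.strip() for block in txt.split("\n\n") if block.strip()]


def to_tsv(sentences: List[str]) -> str:
    """Serialise sentences as tab-separated rows with an empty sentence_type."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for number, text in enumerate(sentences, start=1):
        writer.writerow([number, "", text])
    return buffer.getvalue()


print(to_tsv(read_sentences("Employers must pay wages.\n\nA week is seven days.\n")))
```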

Command:

ig-cli classify sentences.tsv

Optional parameters

  • --output_file_path -o

About:

Sentences are classified as regulative (r) or constitutive (c) by a simple ML model trained on a small annotated dataset. The output file should be reviewed and corrected manually.


IG tagging

Input:

.tsv file with 3 columns ['sentence no.', 'sentence_type', 'text'], compatible with the output of the classify command.

Output:

.tsv file with tagged sentences

Command:

ig-cli tag classified_sentences.tsv -o tagged_sentences.tsv

Optional parameters

  • --output_file_path -o

About:

Tagging relies on natural language processing: linguistic features are recognized in each sentence, and rules map those features to Institutional Grammar tags. Every sentence is analysed this way, and the results are saved with a tag for each word token.
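
The idea of mapping linguistic features to tags can be illustrated with a toy rule table. Everything here is an assumption made for illustration: the package's actual rules are not documented on this page, the dependency labels are written by hand rather than produced by a parser, and RULES and tag_tokens are hypothetical names. The tag symbols follow the usual IG 2.0 abbreviations (A for Attribute, D for Deontic, I for Aim, Bdir for Direct object).

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: dependency label -> IG 2.0 component.
RULES: Dict[str, str] = {
    "nsubj": "A",     # grammatical subject -> Attribute
    "aux": "D",       # auxiliary verb (modal here) -> Deontic
    "ROOT": "I",      # main verb -> Aim
    "dobj": "Bdir",   # direct object -> Direct object
}


def tag_tokens(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Attach an IG tag to each (word, dependency) pair; '' when no rule applies."""
    return [(word, RULES.get(dep, "")) for word, dep in tokens]


# Dependency labels written by hand here; a parser would normally supply them.
parsed = [("Employers", "nsubj"), ("must", "aux"), ("pay", "ROOT"), ("wages", "dobj")]
print(tag_tokens(parsed))
```

A one-to-one table like this is far too coarse for real text (not every auxiliary is a deontic, for instance), which is why the output of the tag command is best treated as a draft to review.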


Conversion to horizontal Excel format of IG document (in the future)

Input:

Output:

Command:

About:


Comparison of results

Files can be compared (e.g. for quality or error assessment) with other tools, such as diff (a command-line tool; see diff --help for detailed instructions) or diffchecker (a web tool).
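
The same kind of comparison can also be done from Python with the standard difflib module. This is a minimal sketch: the file names in the diff header and the two sample rows are made up for illustration, and diff_lines is a hypothetical helper.

```python
import difflib
from typing import List


def diff_lines(expected: List[str], actual: List[str]) -> List[str]:
    """Return unified-diff lines between two lists of file lines."""
    return list(
        difflib.unified_diff(
            expected, actual, fromfile="corrected.tsv", tofile="tagged.tsv", lineterm=""
        )
    )


expected = ["1\tr\tEmployers must pay wages.", "2\tc\tA week is seven days."]
actual = ["1\tr\tEmployers must pay wages.", "2\tr\tA week is seven days."]
for line in diff_lines(expected, actual):
    print(line)
```

Lines prefixed with - come from the manually corrected file and lines prefixed with + from the tool output, so each pair marks one disagreement.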

Technical information

Update of models

  • Sentence type classification - The ML model can be replaced or retrained by supplying a new file containing a serialized Python object that has a .predict(self, sentences: List[str]) -> List[bool] method returning True for regulative sentences. Manually corrected files can be collected to build a better classifier.
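
A replacement model only has to honour that predict interface. The class below is a toy stand-in, assuming nothing about the shipped model: KeywordClassifier and its word list are hypothetical, and the serialization format and the location of the model file are not stated here, so they are left out.

```python
from typing import List


class KeywordClassifier:
    """Toy stand-in: a sentence is regulative if it contains a deontic word."""

    DEONTICS = ("must", "shall", "may", "should")

    def predict(self, sentences: List[str]) -> List[bool]:
        # True marks a regulative sentence, False a constitutive one.
        return [
            any(word in sentence.lower().split() for word in self.DEONTICS)
            for sentence in sentences
        ]


model = KeywordClassifier()
print(model.predict(["Employers must pay wages.", "A week is seven days."]))
```

A trained classifier would replace the keyword check, but the input and output types would stay the same.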

Programming interface

The package can be imported with import igtagger; object-oriented operations are in igtagger.backend and file operations in igtagger.frontend.

from igtagger import backend
backend.get_annotated_sentences(df)

Detailed documentation is in the PDF.

Contributions

Project leaders: Anna Wróblewska, Bartosz Pieliński

The tool is based on the results of previous work on Institutional Grammar annotation:

  1. Group project for the previous version of IG syntax and Polish language - link
  2. Work by Aleksandra Wichrowska on developing rules for English language and new IG 2.0 syntax - link
  3. Work by Karolina Seweryn on ML models: constitutive/regulative classification
