ACCORD-NLP: Transformer/language model-based information extraction from regulatory text

ACCORD-NLP Framework

ACCORD-NLP is a Natural Language Processing (NLP) framework developed as part of the Horizon Europe project Automated Compliance Checks for Construction, Renovation or Demolition Works (ACCORD) to facilitate Automated Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector.

Compliance checking plays a pivotal role in the AEC sector, ensuring the safety, reliability, stability, and usability of building designs. Traditionally, this process relied on manual approaches, which are resource-intensive and time-consuming, so attention has shifted towards automated methods to streamline compliance checks. Automating these processes requires transforming building regulations, which are written as text aimed at domain experts, into machine-processable formats. However, this has been challenging, primarily due to the inherent complexity and unstructured nature of natural language. Moreover, regulatory texts often exhibit domain-specific characteristics, ambiguities, and intricate clausal structures, further complicating the task.

ACCORD-NLP offers data, AI models and workflows developed using state-of-the-art NLP techniques to extract rules from textual data, supporting ACC.

Installation

As the initial step, PyTorch needs to be installed. The recommended PyTorch version is 2.0.1. Please refer to the PyTorch installation page for the installation command specific to your platform.

Once PyTorch has been installed, accord-nlp can be installed either from source or as a Python package via pip. The latter approach is recommended.

From Source

git clone https://github.com/Accord-Project/accord-nlp.git
cd accord-nlp
pip install -r requirements.txt

From pip

pip install accord-nlp

Features

  1. Data Augmentation
  2. Entity Classification
  3. Relation Classification
  4. Information Extraction

Data Augmentation

Data augmentation supports the synthetic oversampling of relation-annotated data within a domain-specific context. It can be applied using the following code. The original experiment script is available here.

from accord_nlp.data_augmentation import RelationDA

entities = ['object', 'property', 'quality', 'value']
rda = RelationDA(entity_categories=entities)

relations_path = '<.csv file path to original relation-annotated data>'
entities_path = '<.csv file path to entity samples per category>'
output_path = '<.csv file path to save newly created data>'
rda.replace_entities(relations_path, entities_path, output_path, n=12)
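Conceptually, the entity-replacement strategy behind this augmentation can be illustrated with a minimal, self-contained sketch. The function name, sentence template and entity pools below are illustrative assumptions, not the framework's actual implementation:

```python
import random

def augment_by_entity_replacement(sentence, entity, category, pools, n, seed=0):
    """Create up to n synthetic sentences by swapping the tagged entity
    with other samples drawn from the same entity category."""
    rng = random.Random(seed)
    candidates = [e for e in pools[category] if e != entity]
    rng.shuffle(candidates)
    return [sentence.replace(entity, c) for c in candidates[:n]]

# Hypothetical entity samples per category.
pools = {'object': ['passageway', 'ramp', 'corridor'],
         'property': ['gradient', 'width']}

synthetic = augment_by_entity_replacement(
    'The gradient of the passageway should not exceed five per cent.',
    'passageway', 'object', pools, n=2)
for s in synthetic:
    print(s)
```

Because the replacement entity comes from the same category, the relation annotation attached to the original sentence remains valid for each synthetic copy, which is what makes this a form of relation-preserving oversampling.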

Available Datasets

The data augmentation approach was applied to the relation-annotated training data in the CODE-ACCORD corpus. It generated 2,912 synthetic data samples, resulting in a training set of 6,375 relations. Our paper, listed below, provides more details about the data statistics.

The augmented training dataset can be loaded into a Pandas DataFrame using the following code.

from datasets import load_dataset

data_files = {"augmented_train": "augmented.csv"}
augmented_train = load_dataset("ACCORD-NLP/CODE-ACCORD-Relations", data_files=data_files, split="augmented_train").to_pandas()

Entity Classification

We adapted the transformer-based sequence labelling architecture to fine-tune the entity classifier, given its remarkable results across the NLP domain. The general transformer architecture was modified by adding an individual softmax layer per output token to support entity classification.
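The per-token softmax idea can be sketched in plain Python: each output token independently receives a probability distribution over entity labels, and the argmax gives that token's predicted category. The label set and logit values below are made up for illustration:

```python
import math

LABELS = ['O', 'B-object', 'B-property']  # illustrative label set

def softmax(logits):
    """Numerically stable softmax over one token's logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_tokens(token_logits):
    """Apply an independent softmax per token and pick the argmax label."""
    preds = []
    for logits in token_logits:
        probs = softmax(logits)
        preds.append(LABELS[probs.index(max(probs))])
    return preds

# One logit vector per token of 'The gradient of passageway ...'.
token_logits = [[2.0, 0.1, 0.3],   # 'The'
                [0.2, 0.4, 3.1],   # 'gradient'
                [1.9, 0.2, 0.1],   # 'of'
                [0.1, 2.8, 0.4]]   # 'passageway'
print(classify_tokens(token_logits))  # ['O', 'B-property', 'O', 'B-object']
```

In the actual model, the logit vectors come from the transformer's final hidden states; the sketch only shows how a per-token head turns them into entity labels.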

Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations.

Available Models

We fine-tuned four pre-trained transformer models (i.e. BERT, ELECTRA, ALBERT and RoBERTa) for entity classification. All the fine-tuned models are available on Hugging Face and can be accessed using the following code.

from accord_nlp.text_classification.ner.ner_model import NERModel

model = NERModel('roberta', 'ACCORD-NLP/ner-roberta-large')
predictions, raw_outputs = model.predict(['The gradient of the passageway should not exceed five per cent.'])
print(predictions)

Relation Classification

Relation classification aims to predict the semantic relationship between two entities within a context. We introduced four special tokens (i.e. <e1>, </e1>, <e2> and </e2>) to format the input text with an entity pair to facilitate relation classification. <e1> and </e1> mark the start and end of the first entity in the selected text sequence, while <e2> and </e2> mark the start and end of the second entity. The transformer outputs corresponding to <e1> and <e2> were passed through a softmax layer to predict the relation category.
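Building this marked input from raw text can be sketched with a small helper. The function name and the (start, end) character-offset interface are assumptions for illustration; the framework constructs these inputs internally:

```python
def mark_entity_pair(sentence, span1, span2):
    """Wrap two non-overlapping (start, end) character spans with
    <e1>...</e1> and <e2>...</e2> markers. The sentence is rebuilt
    from slices so all offsets refer to the original string."""
    (s1, e1), (s2, e2) = span1, span2
    if s1 > s2:
        raise ValueError('span1 must precede span2')
    return (sentence[:s1] + '<e1>' + sentence[s1:e1] + '</e1>'
            + sentence[e1:s2] + '<e2>' + sentence[s2:e2] + '</e2>'
            + sentence[e2:])

sentence = 'The gradient of the passageway should not exceed five per cent.'
marked = mark_entity_pair(sentence, (4, 12), (49, 62))
print(marked)
# The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.
```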

Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations.

Available Models

We fine-tuned three pre-trained transformer models (i.e. BERT, ALBERT and RoBERTa) for relation classification. All the fine-tuned models are available on Hugging Face and can be accessed using the following code.

from accord_nlp.text_classification.relation_extraction.re_model import REModel

model = REModel('roberta', 'ACCORD-NLP/re-roberta-large')
predictions, raw_outputs = model.predict(['The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.'])
print(predictions)

Information Extraction

Our information extraction pipeline aims to transform a regulatory sentence into a machine-processable output (i.e., a knowledge graph of entities and relations). It utilises the entity and relation classifiers mentioned above to sequentially extract information from the text to build the final graph.
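The sequential flow — entity classification, entity pairing, relation classification, graph assembly — can be sketched with stand-in predictors. The stub outputs and relation labels below are illustrative assumptions that mimic the two classifiers; the real pipeline delegates to the fine-tuned models:

```python
from itertools import combinations

def stub_ner(sentence):
    # Stand-in for the entity classifier: (entity text, category) pairs.
    return [('gradient', 'property'), ('passageway', 'object'),
            ('five per cent', 'value')]

def stub_re(sentence, e1, e2):
    # Stand-in for the relation classifier over a marked entity pair.
    table = {('gradient', 'passageway'): 'part-of',
             ('gradient', 'five per cent'): 'less-equal'}
    return table.get((e1, e2), 'none')

def sentence_to_triples(sentence):
    """Run NER, pair up the entities, classify each pair, and keep
    every pair with a real (non-'none') relation as a graph edge."""
    entities = stub_ner(sentence)
    triples = []
    for (t1, _), (t2, _) in combinations(entities, 2):
        rel = stub_re(sentence, t1, t2)
        if rel != 'none':
            triples.append((t1, rel, t2))
    return triples

print(sentence_to_triples(
    'The gradient of the passageway should not exceed five per cent.'))
```

The resulting (entity, relation, entity) triples are exactly the edges of the knowledge graph the pipeline produces.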

Our paper, listed below, provides more details about the pipeline, including its individual components.

The default pipeline configuration uses the best-performing entity and relation classification models, and the default pipeline can be accessed using the following code.

from accord_nlp.information_extraction.ie_pipeline import InformationExtractor

sentence = 'The gradient of the passageway should not exceed five per cent.'

ie = InformationExtractor()
ie.sentence_to_graph(sentence)

The following code can be used to run the pipeline with different configurations. Please refer to ie_pipeline.py for more details about the input parameters.

from accord_nlp.information_extraction.ie_pipeline import InformationExtractor

sentence = 'The gradient of the passageway should not exceed five per cent.'

ner_model_info = ('roberta', 'ACCORD-NLP/ner-roberta-large')
re_model_info = ('roberta', 'ACCORD-NLP/re-roberta-large')
ie = InformationExtractor(ner_model_info=ner_model_info, re_model_info=re_model_info, debug=True)
ie.sentence_to_graph(sentence)

Also, a live demo of the Information Extractor is available on Hugging Face.

Reference

Please note that the corresponding paper for this work is currently in progress and will be made available soon. Thank you for your patience and interest.
