ACCORD-NLP: Transformer/language model-based information extraction from regulatory text
ACCORD-NLP Framework
ACCORD-NLP is a Natural Language Processing (NLP) framework developed as part of the Horizon Europe project for Automated Compliance Checks for Construction, Renovation or Demolition Works (ACCORD) to facilitate Automated Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector.
Compliance checking plays a pivotal role in the AEC sector, ensuring the safety, reliability, stability, and usability of building designs. Traditionally, this process has relied on manual approaches, which are resource-intensive and time-consuming. Attention has therefore shifted towards automated methods to streamline compliance checks. Automating these processes requires transforming building regulations, written in text aimed at domain experts, into machine-processable formats. This has been challenging, primarily due to the inherent complexity and unstructured nature of natural language. Moreover, regulatory texts often exhibit domain-specific characteristics, ambiguities, and intricate clausal structures, further complicating the task.
ACCORD-NLP offers data, AI models and workflows developed using state-of-the-art NLP techniques to extract rules from textual data, supporting ACC.
Installation
As the initial step, PyTorch needs to be installed. The recommended PyTorch version is 2.0.1. Please refer to the PyTorch installation page for the specific installation command for your platform.
Once PyTorch has been installed, accord-nlp can be installed either from the source or as a Python package via pip. The latter approach is recommended.
From Source
git clone https://github.com/Accord-Project/accord-nlp.git
cd accord-nlp
pip install -r requirements.txt
From pip
pip install accord-nlp
Features
Data Augmentation
Data augmentation supports the synthetic oversampling of relation-annotated data within a domain-specific context. It can be applied using the following code. The original experiment script is available here.
from accord_nlp.data_augmentation import RelationDA
entities = ['object', 'property', 'quality', 'value']
rda = RelationDA(entity_categories=entities)
relations_path = '<.csv file path to original relation-annotated data>'
entities_path = '<.csv file path to entity samples per category>'
output_path = '<.csv file path to save newly created data>'
rda.replace_entities(relations_path, entities_path, output_path, n=12)
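To make the idea of entity replacement more concrete, the following is a minimal sketch that swaps the first tagged entity of a relation sample with alternatives from the same category. It is not the RelationDA implementation: the function name `augment_sample`, the in-memory sample format, and the example entities are all hypothetical and chosen only for illustration.

```python
import random


def augment_sample(tagged_sentence, e1_alternatives, n=2, seed=0):
    """Create up to n variants by swapping the text inside <e1>...</e1>.

    Hypothetical helper for illustration; not part of accord_nlp.
    """
    start_tag, end_tag = "<e1>", "</e1>"
    start = tagged_sentence.index(start_tag) + len(start_tag)
    end = tagged_sentence.index(end_tag)
    original = tagged_sentence[start:end]

    # Only consider replacements that differ from the original entity.
    candidates = [e for e in e1_alternatives if e != original]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    chosen = rng.sample(candidates, k=min(n, len(candidates)))

    return [tagged_sentence[:start] + c + tagged_sentence[end:] for c in chosen]


sample = "The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>."
property_samples = ["gradient", "width", "height", "slope"]  # made-up entity samples

variants = augment_sample(sample, property_samples, n=2)
for variant in variants:
    print(variant)
```

Each variant keeps the second entity and the surrounding context unchanged, so the relation label of the original sample can plausibly be reused for the synthetic one.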
Available Datasets
The data augmentation approach was applied to the relation-annotated training data in the CODE-ACCORD corpus. It generated 2,912 synthetic data samples, resulting in a training set of 6,375 relations. Our paper, listed below, provides more details about the data statistics.
The augmented training dataset can be loaded into a Pandas DataFrame using the following code.
from datasets import load_dataset
data_files = {"augmented_train": "augmented.csv"}
# Load the augmented split, then convert it to a Pandas DataFrame
dataset = load_dataset("ACCORD-NLP/CODE-ACCORD-Relations", data_files=data_files, split="augmented_train")
augmented_train = dataset.to_pandas()
Entity Classification
We fine-tuned the entity classifier using the transformer sequence labelling architecture, which has achieved strong results across NLP tasks. The general transformer architecture was modified by adding a softmax layer over each output token to support entity classification.
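The per-token softmax step can be sketched in plain Python as follows. This is only an illustration of how one label is chosen per token from a vector of scores; the label set, the logit values, and the function names are hypothetical, and the fine-tuned models may use a different tagging scheme.

```python
import math

# Hypothetical, simplified label set for illustration only.
LABELS = ["O", "object", "property", "quality", "value"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(token_logits):
    """Apply an independent softmax to each token and pick the most likely label."""
    predictions = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(LABELS[best])
    return predictions


tokens = ["The", "gradient", "of", "the", "passageway"]
# Made-up scores standing in for the transformer's per-token outputs.
token_logits = [
    [2.0, 0.1, 0.3, 0.0, 0.1],
    [0.2, 0.4, 3.1, 0.1, 0.0],
    [1.8, 0.2, 0.1, 0.3, 0.0],
    [2.2, 0.1, 0.0, 0.2, 0.1],
    [0.1, 2.9, 0.5, 0.2, 0.0],
]

labels = classify_tokens(token_logits)
print(list(zip(tokens, labels)))
```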
Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations.
Available Models
We fine-tuned four pre-trained transformer models (i.e. BERT, ELECTRA, ALBERT and RoBERTa) for entity classification. All the fine-tuned models are available on Hugging Face and can be accessed using the following code.
from accord_nlp.text_classification.ner.ner_model import NERModel
model = NERModel('roberta', 'ACCORD-NLP/ner-roberta-large')
predictions, raw_outputs = model.predict(['The gradient of the passageway should not exceed five per cent.'])
print(predictions)
Relation Classification
Relation classification aims to predict the semantic relationship between two entities within a context. We introduced four special tokens (i.e. <e1>, </e1>, <e2> and </e2>) to format the input text with an entity pair to facilitate relation classification. <e1> and </e1> mark the start and end of the first entity in the selected text sequence, while <e2> and </e2> mark the start and end of the second entity. The transformer outputs corresponding to <e1> and <e2> are passed through a softmax layer to predict the relation category.
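The entity-marker input format described above can be produced with a small helper such as the one below. This is a sketch under stated assumptions: `span_of` and `mark_entities` are hypothetical names, not functions from accord_nlp, and entities are located by simple substring search.

```python
def span_of(text, phrase):
    """Return the (start, end) character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return start, start + len(phrase)


def mark_entities(text, e1_span, e2_span):
    """Wrap two non-overlapping character spans in <e1>/<e2> marker tokens."""
    # Order the spans by position so the text can be rebuilt left to right.
    spans = sorted([(e1_span, "e1"), (e2_span, "e2")])
    (first_start, first_end), first_tag = spans[0]
    (second_start, second_end), second_tag = spans[1]
    if first_end > second_start:
        raise ValueError("entity spans must not overlap")

    return (
        text[:first_start]
        + f"<{first_tag}>" + text[first_start:first_end] + f"</{first_tag}>"
        + text[first_end:second_start]
        + f"<{second_tag}>" + text[second_start:second_end] + f"</{second_tag}>"
        + text[second_end:]
    )


sentence = "The gradient of the passageway should not exceed five per cent."
marked = mark_entities(
    sentence,
    span_of(sentence, "gradient"),
    span_of(sentence, "five per cent"),
)
print(marked)
# The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.
```

Because the spans are sorted by position, the helper also handles the case where the first entity appears after the second one in the sentence.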
Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations.
Available Models
We fine-tuned three pre-trained transformer models (i.e. BERT, ALBERT and RoBERTa) for relation classification. All the fine-tuned models are available on Hugging Face and can be accessed using the following code.
from accord_nlp.text_classification.relation_extraction.re_model import REModel
model = REModel('roberta', 'ACCORD-NLP/re-roberta-large')
predictions, raw_outputs = model.predict(['The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.'])
print(predictions)
Information Extraction
Our information extraction pipeline aims to transform a regulatory sentence into a machine-processable output (i.e., a knowledge graph of entities and relations). It utilises the entity and relation classifiers mentioned above to sequentially extract information from the text to build the final graph.
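The sequential flow described above (entities first, then relations between entity pairs, then a graph) can be sketched as follows. This is not the InformationExtractor implementation: the stub classifier, the relation labels, the pairing strategy, and the node/edge representation are assumptions made purely for illustration.

```python
from itertools import combinations


def build_graph(entities, classify_relation):
    """Build a simple graph from classified entities and pairwise relations.

    entities: list of (text, category) pairs from an entity classifier.
    classify_relation: callable returning a relation label for (head, tail).
    """
    nodes = dict(entities)
    edges = []
    # Query the relation classifier once per entity pair, in sentence order.
    for (head, _), (tail, _) in combinations(entities, 2):
        label = classify_relation(head, tail)
        if label != "none":  # "none" is a hypothetical no-relation label
            edges.append((head, label, tail))
    return nodes, edges


# Made-up classifier outputs standing in for the fine-tuned models.
entities = [
    ("gradient", "property"),
    ("passageway", "object"),
    ("five per cent", "value"),
]
stub_relations = {
    ("gradient", "passageway"): "part-of",
    ("gradient", "five per cent"): "less-equal",
}


def stub_classifier(head, tail):
    """Look up a hard-coded relation; a real pipeline would call a model here."""
    return stub_relations.get((head, tail), "none")


nodes, edges = build_graph(entities, stub_classifier)
print(nodes)
print(edges)
```

Passing the relation classifier in as a callable keeps the graph-building step independent of any particular model, which is convenient when comparing classifier configurations.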
Our paper, listed below, provides more details about the pipeline, including its individual components.
The default pipeline configuration uses the best-performing entity and relation classification models. The default pipeline can be accessed using the following code.
from accord_nlp.information_extraction.ie_pipeline import InformationExtractor
sentence = 'The gradient of the passageway should not exceed five per cent.'
ie = InformationExtractor()
ie.sentence_to_graph(sentence)
The following code can be used to access the pipeline with a different configuration. Please refer to ie_pipeline.py for more details about the input parameters.
from accord_nlp.information_extraction.ie_pipeline import InformationExtractor
sentence = 'The gradient of the passageway should not exceed five per cent.'
ner_model_info = ('roberta', 'ACCORD-NLP/ner-roberta-large')
re_model_info = ('roberta', 'ACCORD-NLP/re-roberta-large')
ie = InformationExtractor(ner_model_info=ner_model_info, re_model_info=re_model_info, debug=True)
ie.sentence_to_graph(sentence)
A live demo of the Information Extractor is also available on Hugging Face.
Reference
Please note that the corresponding paper for this work is currently in progress and will be made available soon. Thank you for your patience and interest.