Knowledge Graph Extraction for SFM dataset
Knowledge Graph Extraction
We designed a pipeline that extracts a specific kind of knowledge graph: it recognizes each person's name and links it to that person's rank, role, title and organization. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step that reduces the manual work of combing through a large number of documents to extract this knowledge.
The pipeline consists of two major components: Named Entity Recognition (NER) and Relation Extraction (RE). The NER component uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. The RE component then relates each name to the corresponding rank, role, title or organization.
Dependencies
Tensorflow 2.2.0
Tensorflow-addons
SpaCy
NumPy
DyNet
Pathlib
Install
Package: https://pypi.org/project/extract-sfm/
$ pip install extract_sfm
Usage
Method 1
Create a python file and write:
import extract_sfm
extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")
Then run the python file. This may take a while to finish.
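A slightly more defensive variant of the script above, as a sketch: it resolves the input directory to an absolute path and checks that it exists before calling extract_sfm.extract. The note under Method 2 asks for an absolute path, and this sketch assumes the same applies to Method 1. The helper names resolve_input_dir and run are hypothetical and not part of the package.

```python
from pathlib import Path


def resolve_input_dir(raw_path):
    """Return raw_path as an absolute path string, or raise if it is not a directory.

    Hypothetical helper for illustration; not part of extract_sfm.
    """
    path = Path(raw_path).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"Input directory not found: {path}")
    return str(path)


def run(raw_path):
    """Validate the directory, then hand it to the pipeline."""
    # Imported lazily so the path check can be used without the package installed.
    import extract_sfm

    extract_sfm.extract(resolve_input_dir(raw_path))
```

Calling run("~/sfm_docs") would then expand the path and fail early with a clear message if the directory is missing, rather than partway through the pipeline.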
Method 2
Download this GitHub repository. Under the project root directory, run the Python script:
$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES
Note: Use an absolute path.
Website
- Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
- Install npm dependencies under the "SERVER" directory: express, path, multer
$ npm install <package name>
- Run the server by typing in:
$ node server.js
Environment Setup
tensorflow 2.2.0
pip install tensorflow==2.2.0
pip install tensorflow-addons
spaCy (macOS)
pip install -U spacy
python3 -m spacy download en_core_web_sm
DyNet
pip install dynet
pathlib
pip install pathlib
NER Documentation
TRAINING
Dataset:
1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
2. CONLL2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
3. A set of known organizations from the starter dataset
Note: Title and role were collapsed into one class
Usage:
1) Prepare data
$ python process.py
$ cd SFM_STARTER
$ python build_vocab.py
$ python build_glove.py
$ cd ..
2) Train model
$ python train.py
3) Make predictions
$ python pred.py
4) Evaluate model
$ python eval.py
$ python eval_class.py
Files:
process.py: 1) preprocesses the dataset by recording information in dicts,
which are saved in two pickle files: dataset_labels.pickle and dataset_sentences.pickle
2) converts the SFM starter dataset into a format the model can use,
written to the files {}.words.txt and {}.tags.txt, where {} is train, valid or test
pred.py: generates predictions using the trained model
eval.py: evaluates the predictions made by the model, which are generated by running pred.py
eval_class.py: computes precision, recall and F1 score for each class
Other files are from https://github.com/guillaumegenthial/tf_ner
train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
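To make the per-class metrics concrete, here is a minimal token-level sketch. It is not the code in eval_class.py: the function name and the tag names in the usage note are hypothetical, and the real script may score whole entity spans rather than individual tokens.

```python
from collections import Counter


def per_class_prf(gold_tags, pred_tags, ignore="O"):
    """Token-level precision, recall and F1 for each tag other than `ignore`."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_tags, pred_tags):
        if gold == pred:
            if gold != ignore:
                tp[gold] += 1  # correct prediction for this class
        else:
            if pred != ignore:
                fp[pred] += 1  # predicted class was wrong
            if gold != ignore:
                fn[gold] += 1  # gold class was missed
    scores = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        precision = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        recall = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

For example, per_class_prf(["PER", "O", "ORG", "ORG"], ["PER", "ORG", "ORG", "O"]) scores PER at 1.0 on all three metrics and ORG at 0.5 on all three, since ORG has one correct token, one spurious token and one missed token.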
PREDICTING
Usage:
$ python ner.py <doc_id>.txt
File:
ner.py: produces predictions in BRAT format for a text file.
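For readers unfamiliar with BRAT standoff annotations, the sketch below formats one entity line and one relation line and parses an entity line back. The line layout (an ID, then the type with character offsets, then the covered text, separated by tabs) is the standard BRAT convention. The type names Person, Rank and has_rank are placeholders: the labels that ner.py and the relation scripts actually write are not listed in this document. Only contiguous spans are handled.

```python
def entity_line(index, label, start, end, text):
    """Format a BRAT text-bound annotation: ID, 'Type start end', covered text."""
    return f"T{index}\t{label} {start} {end}\t{text}"


def relation_line(index, label, arg1, arg2):
    """Format a BRAT binary relation between two entity IDs such as 'T1' and 'T2'."""
    return f"R{index}\t{label} Arg1:{arg1} Arg2:{arg2}"


def parse_entity_line(line):
    """Parse a contiguous-span entity line back into a dict."""
    ann_id, type_and_span, text = line.rstrip("\n").split("\t")
    label, start, end = type_and_span.split(" ")
    return {"id": ann_id, "label": label, "start": int(start), "end": int(end), "text": text}
```

With the text "General John Smith", entity_line(1, "Person", 8, 18, "John Smith") yields a line whose offsets select exactly "John Smith" from the document, and relation_line(1, "has_rank", "T1", "T2") links it to a second entity.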
RE Documentation
jPTDP:
Before running any of the three methods below, you first need to run a dependency parser, which some of the methods rely on.
Usage: Go to the jPTDP directory and run
$ python fast_parse.py <path_to_txt>.txt
The output is placed alongside the input text file, in a directory with the same name as the text file.
--- METHOD 1: nearest person
Assign each non-person named entity to the nearest person that is behind it.
Usage:
1. To extract relations from a single text file:
(extracted relations will be appended to the .ann file)
$ python relation_np.py <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>:
Set "output_dir" in pipeline.sh to <directory> and run:
$ source pipeline.sh
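The nearest-person heuristic of Method 1 can be sketched as follows. This is an illustration, not the code in relation_np.py. It assumes entities are dicts with an id, a label and character offsets, and it reads "behind" as "later in the text" by default, which is an interpretation; side="before" gives the opposite reading. All function and label names are hypothetical.

```python
def nearest_person(entity, persons, side="after"):
    """Return the person closest to `entity` on the given side, or None.

    side="after" considers persons that start at or after the entity's end
    (an assumed reading of "behind"); side="before" considers persons that
    end at or before the entity's start.
    """
    if side == "after":
        candidates = [p for p in persons if p["start"] >= entity["end"]]
        return min(candidates, key=lambda p: p["start"] - entity["end"], default=None)
    if side == "before":
        candidates = [p for p in persons if p["end"] <= entity["start"]]
        return min(candidates, key=lambda p: entity["start"] - p["end"], default=None)
    raise ValueError("side must be 'after' or 'before'")


def assign_to_nearest_person(entities, person_label="Person", side="after"):
    """Pair every non-person entity with its nearest person, as (entity_id, person_id)."""
    persons = [e for e in entities if e["label"] == person_label]
    pairs = []
    for entity in entities:
        if entity["label"] == person_label:
            continue
        person = nearest_person(entity, persons, side)
        if person is not None:  # entities with no person on that side stay unassigned
            pairs.append((entity["id"], person["id"]))
    return pairs
```

On "General John Smith met Colonel Jane Doe", the default setting pairs "General" with "John Smith" and "Colonel" with "Jane Doe".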
--- METHOD 2: dependency parsing
Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
Constraint: Restricting the choice to the two persons that appear immediately to the left and right of the entity could improve the results, but the drawback is equally clear: any person further away can never be chosen.
Usage:
1. To extract relations from a single text file:
(extracted relations will be appended to the .ann file)
$ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>:
$ source pipeline.sh <directory>
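The distance used by Method 2 can be illustrated with a breadth-first search over the dependency tree, treated as an undirected graph. This is a sketch under assumptions, not the code in relation_dep.py: it assumes the parse has already been read into a list of head indices (reading jPTDP's output is not shown), it represents each entity by a single token, and the function names are hypothetical.

```python
from collections import deque


def dependency_distance(heads, source, target):
    """Number of edges on the path between two tokens in a dependency tree.

    `heads[i]` is the 0-based index of token i's head, or -1 for the root.
    """
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two tokens are not connected


def closest_person(heads, entity_token, person_tokens):
    """Return the person token with the shortest dependency path to the entity."""
    distances = {p: dependency_distance(heads, entity_token, p) for p in person_tokens}
    reachable = {p: d for p, d in distances.items() if d is not None}
    # On a tie, the person listed first in person_tokens wins.
    return min(reachable, key=reachable.get) if reachable else None
```

For "General Smith met Colonel Doe" with heads [1, 2, -1, 4, 2], "General" is one edge from "Smith" and three edges from "Doe", so it is assigned to "Smith".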
--- METHOD 3: neural networks
Use the dependency path and its length as features to predict which person in the sentence is the best match.
The best model is saved in "model_86.h5"
Usage:
Predictions are made on the files in "pred_path" and are written in place. "pred_path" can be set in config.py.
$ python pred.py