Knowledge Graph Extraction for SFM dataset
Project description
Knowledge Graph Extraction
We designed a pipeline that extracts a specific kind of knowledge graph: a person's name is recognized, and his or her rank, role, title and organization are linked to that person. The pipeline is not expected to perform perfectly, i.e. to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step toward reducing the manual work involved in extracting such knowledge by combing through a large number of documents.
The pipeline consists of two major components: Named Entity Recognition (NER) and Relation Extraction (RE). NER uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. Relation Extraction then links each name to its corresponding rank, role, title or organization.
Example:
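A hypothetical illustration (the sentence, names and labels below are invented, not taken from the SFM dataset). Given the input

Colonel John Doe, commander of the 3rd Infantry Brigade, attended the ceremony.

the pipeline would recognize "John Doe" as a person, "Colonel" as a rank, "commander" as a title/role and "3rd Infantry Brigade" as an organization, and relate all three to "John Doe":

(John Doe, Rank, Colonel)
(John Doe, Title, commander)
(John Doe, Organization, 3rd Infantry Brigade)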
Dependencies
TensorFlow 2.2.0
TensorFlow Addons
spaCy
NumPy
DyNet
pathlib
Install
Package: https://pypi.org/project/extract-sfm/
$ pip install extract_sfm
Usage
Method 1
Create a Python file containing:

import extract_sfm

# run the full NER + relation extraction pipeline over a directory of input text files
extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")

Then run the file. This may take a while to finish.
Method 2
Download this GitHub repository. Then, from the project root directory, run the Python script:
$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES
Note: use an absolute path.
Website
- Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
- Install the npm dependencies (express, path, multer) under the "SERVER" directory:
$ npm install express path multer
- Run the server:
$ node server.js
Environment Setup
TensorFlow 2.2.0
pip install tensorflow==2.2.0
pip install tensorflow-addons
spaCy (macOS)
pip install -U spacy
python3 -m spacy download en_core_web_sm
DyNet
pip install dynet
pathlib (part of the standard library on Python 3.4+; the pip package is only needed on older interpreters)
pip install pathlib
NER Documentation
TRAINING
Dataset:
1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
2. CoNLL-2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
3. A set of known organizations from the starter dataset
Note: Title and role were collapsed into one class
Usage:
1) Prepare data
$ python process.py
$ cd SFM_STARTER
$ python build_vocab.py
$ python build_glove.py
$ cd ..
2) Train model
$ python train.py
3) Make predictions
$ python pred.py
4) Evaluate model
$ python eval.py
$ python eval_class.py
Files:
process.py: 1) preprocesses the dataset by recording its information in dicts,
which are saved in two pickle files: dataset_labels.pickle and dataset_sentences.pickle;
2) converts the SFM starter dataset into the format used by the model:
{}.words.txt and {}.tags.txt files, where {} is train, valid or test (see the sketch after this list).
pred.py: generates predictions using the trained model
eval.py: evaluates the predictions made by the model (generated by running pred.py)
eval_class.py: computes precision, recall and F1 score for each class
Other files (train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py) are from https://github.com/guillaumegenthial/tf_ner
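A minimal sketch of the {}.words.txt / {}.tags.txt format: each line of {}.words.txt is one tokenized, space-separated sentence, and the matching line of {}.tags.txt holds one tag per token. The sentence and the exact tag names below are illustrative; the real tag set follows the SFM starter dataset classes (person, rank, role/title, organization).

train.words.txt:  Colonel John Doe commands the 3rd Brigade
train.tags.txt:   B-Rank B-Person I-Person O O B-Organization I-Organization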
PREDICTING
Usage:
$ python ner.py <doc_id>.txt
File:
ner.py: produces BRAT-format predictions (a .ann file) for a text file.
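A hypothetical fragment of such a .ann file for the invented sentence "Colonel John Doe commands the 3rd Brigade" (BRAT standoff format: entity ID, type, character offsets, surface text; the type names are illustrative):

T1	Rank 0 7	Colonel
T2	Person 8 16	John Doe
T3	Organization 30 41	3rd Brigade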
RE Documentation
jPTDP:
Before running the three methods below, you need to run the dependency parser, which some of the methods rely on.
Usage: Go to the jPTDP directory and run
$ python fast_parse.py <path_to_txt>.txt
The output is placed alongside the input text file, in a directory named after the text file.
--- METHOD 1: nearest person
Assigns each non-person named entity to the nearest person entity that follows it in the text.
Usage:
1. To extract relations from a single text file
(extracted relations will be appended to the .ann file)
$ python relation_np.py <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>
Set "output_dir" in pipeline.sh to <directory> and run:
$ source pipeline.sh
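A minimal sketch of the nearest-person heuristic (not the actual relation_np.py code; the entity representation and the reading of "behind" as "following in the text" are assumptions):

# entities: list of (start, end, label, text) tuples sorted by start offset
def nearest_person_relations(entities):
    relations = []
    for i, (start, end, label, text) in enumerate(entities):
        if label == "Person":
            continue
        # link this rank/role/title/organization to the first person entity after it
        for _, _, other_label, other_text in entities[i + 1:]:
            if other_label == "Person":
                relations.append((other_text, label, text))
                break
    return relations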
--- METHOD 2: dependency parsing
Assigns each non-person named entity to the closest person entity, where distance is the length of the dependency path between the entity and the person.
Constraint: if the candidates are restricted to the two persons that appear immediately to the left and right of the entity, the results improve, but this restriction has obvious drawbacks.
Usage:
1. To extract relations from a single text file
(extracted relations will be appended to the .ann file)
$ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>
$ source pipeline.sh <directory>
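A minimal sketch of the dependency-path distance used by this method. The project obtains parses from jPTDP; here spaCy (already a dependency) stands in for illustration, and the helper name dep_path_length is invented:

import spacy

nlp = spacy.load("en_core_web_sm")

def dep_path_length(doc, i, j):
    # walk from a token up to the root, collecting the chain of heads
    def chain_to_root(tok):
        chain = [tok]
        while tok.head is not tok:
            tok = tok.head
            chain.append(tok)
        return chain
    a = chain_to_root(doc[i])
    b = chain_to_root(doc[j])
    common = {t.i for t in a} & {t.i for t in b}
    # distance = steps from each token up to their lowest common ancestor
    up_a = next(k for k, t in enumerate(a) if t.i in common)
    up_b = next(k for k, t in enumerate(b) if t.i in common)
    return up_a + up_b

doc = nlp("Colonel John Doe commands the 3rd Brigade")
print(dep_path_length(doc, 0, 2))  # dependency-path length between "Colonel" and "Doe"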
--- METHOD 3: neural networks
Uses the dependency path and its length as features to predict which person in the sentence is the best match.
The best model is saved in "model_86.h5"
Usage:
Predictions are made on the files under "pred_path" and are written in place; "pred_path" can be set in config.py.
$ python pred.py
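The saved "model_86.h5" file suggests a Keras/TensorFlow model, but that is an assumption; a minimal sketch of loading it and scoring one candidate, with an invented feature layout:

import numpy as np
from tensorflow.keras.models import load_model

# load the best saved model (assumed to be a Keras HDF5 model)
model = load_model("model_86.h5")

# feature vector for one candidate person (the feature layout here is invented for illustration)
features = np.zeros((1, model.input_shape[1]))
scores = model.predict(features)
print(scores)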
Project details
Download files
File details
Details for the file extract_sfm-2.0.tar.gz.
File metadata
- Download URL: extract_sfm-2.0.tar.gz
- Upload date:
- Size: 31.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | b84fa13c4119f3dd9774cdf2d6a6e0fd851d949cccc50af4e64c6f96f3d5e3a8
MD5 | e5f15c78bdb0b9a93fb508ee6339b77e
BLAKE2b-256 | 19e08e9514910a2ca5ef596291cfa4ae7cfc08ce7e0809057ad2c259bfcf554e
File details
Details for the file extract_sfm-2.0-py3-none-any.whl.
File metadata
- Download URL: extract_sfm-2.0-py3-none-any.whl
- Upload date:
- Size: 129.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 538611bb5e880944f46c2bec94537de04eab310aa0041780db39814ef3a69474
MD5 | 5e68fc1705b14fa7a14a4516ee27e4a1
BLAKE2b-256 | d2d751bbfaa10b9defac81e072ce320d24a9255f02222a7f20b2afbe1b0958e9