
Knowledge Graph Extraction for SFM dataset

Project description

Knowledge Graph Extraction

We designed a pipeline that extracts a specific kind of knowledge graph: a person's name is recognized, and his or her rank, role, title, and organization are linked to that person. The pipeline is not expected to perform perfectly, recognizing every relevant person and excluding every irrelevant one. Rather, it is a first step toward reducing the manual effort involved in extracting such knowledge by combing through a large number of documents.

This pipeline consists of two major components: Named Entity Recognition and Relation Extraction. Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles, and organizations in raw text files. Relation Extraction then links each name to its corresponding rank, role, title, or organization.
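To make the two stages concrete, here is a purely illustrative sketch of the kind of structure the pipeline produces; the tuple layout and label names are hypothetical, not the package's actual API:

```python
# Stage 1 (NER): labeled spans recognized in raw text (illustrative labels).
entities = [
    ("PER", "John Smith"),
    ("RANK", "General"),
    ("ORG", "3rd Infantry Division"),
]

# Stage 2 (RE): each non-person entity is linked to a person. Here the
# person is hard-coded since the toy sentence mentions only one.
relations = [(label, text, "John Smith")
             for label, text in entities if label != "PER"]
```

The relations then describe edges of the knowledge graph, e.g. ("RANK", "General", "John Smith").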


Dependencies

Tensorflow 2.2.0
Tensorflow-addons
SpaCy
NumPy
DyNet
Pathlib

Install

Package: https://pypi.org/project/extract-sfm/

$ pip install extract_sfm

Usage

Method 1

Create a python file and write:

import extract_sfm

extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")

Then run the python file. This may take a while to finish.

Method 2

Download this GitHub repository. Then, under the project root directory, run the python script:

$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES

Note: use an absolute path.

Website

  1. Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
  2. Install npm dependencies under the "SERVER" directory: express, path, multer
  $ npm install <package name>
  3. Run the server by typing in:
  $ node server.js


Environment Setup

tensorflow 2.2.0
  pip install tensorflow
  pip install tensorflow-addons

spaCy (macOS)
  pip install -U spacy
  python3 -m spacy download en_core_web_sm

DyNet
  pip install dynet

pathlib
  pip install pathlib

NER Documentation

TRAINING
  Dataset:
    1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
    2. CONLL2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
    3. A set of known organizations from the starter dataset
    Note: Title and role were collapsed into one class

  Usage:
    1) Prepare data
      $ python process.py
      $ cd SFM_STARTER
      $ python build_vocab.py
      $ python build_glove.py
      $ cd ..

    2) Train model
      $ python train.py

    3) Make predictions
      $ python pred.py

    4) Evaluate model
      $ python eval.py
      $ python eval_class.py

  Files:
    process.py: 1) preprocesses the dataset by recording info in dicts,
                      which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
                2) converts the SFM starter dataset to a format that can be used by the model,
                      written to files {}.words.txt and {}.tags.txt, where {} can be train, valid or test.
    pred.py: generates predictions using the trained model
    eval.py: evaluates the predictions made by the model (generated by running pred.py)
    eval_class.py: gets precision, recall and F1 score for each class

    Other files are from https://github.com/guillaumegenthial/tf_ner
      train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
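The {}.words.txt / {}.tags.txt layout follows the tf_ner convention: one space-separated sentence per line in the words file, and a matching line of tags in the tags file. A minimal sketch (the sentence and tag names are illustrative):

```python
# One line from train.words.txt and the corresponding line from
# train.tags.txt; the two files must stay token-aligned.
words_line = "General John Smith commands the division"
tags_line = "B-RANK B-PER I-PER O O O"

words, tags = words_line.split(), tags_line.split()
assert len(words) == len(tags)  # alignment check
pairs = list(zip(words, tags))
```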

PREDICTING
  Usage:
    $ python ner.py <doc_id>.txt

  File:
    ner.py: generates a BRAT-format prediction for a text file.
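BRAT standoff entity annotations are tab-separated lines of the form `ID <TAB> label start end <TAB> surface text`. A minimal parser sketch (the label set shown is illustrative, not necessarily what ner.py emits):

```python
# Parse one BRAT standoff entity line into its components.
def parse_brat_entity(line):
    ent_id, span, text = line.rstrip("\n").split("\t")
    label, start, end = span.split(" ")
    return ent_id, label, int(start), int(end), text

line = "T1\tPerson 0 10\tJohn Smith"
parse_brat_entity(line)  # ('T1', 'Person', 0, 10, 'John Smith')
```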

RE Documentation

jPTDP:
  Before running the following three methods, you first need to run a dependency parser, on which some of the methods rely.
  Usage: Go to the jPTDP directory and run
    $ python fast_parse.py <path_to_txt>.txt
  The output will be placed alongside the input text file, in a directory whose name is the same as that of the text file.



--- METHOD 1: nearest person:
    Assign each non-person named entity to the nearest person that follows it in the text.

    Usage:
      1. To extract relations from a single text file
        (extracted relations will be appended to the .ann file):
        $ python relation_np.py <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>,
        set "output_dir" in pipeline.sh to <directory> and run:
        $ source pipeline.sh
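The nearest-person heuristic can be sketched as follows; this simplification picks the person closest by character offset, and relation_np.py's exact direction and tie-breaking rules may differ:

```python
# Entities are (label, start_offset, text) triples; assign every
# non-person entity to the person whose offset is closest.
def nearest_person(entities):
    persons = [(start, text) for lbl, start, text in entities if lbl == "PER"]
    relations = []
    for lbl, start, text in entities:
        if lbl == "PER" or not persons:
            continue
        _, person = min(persons, key=lambda p: abs(p[0] - start))
        relations.append((text, lbl, person))
    return relations

ents = [("RANK", 0, "General"), ("PER", 8, "John Smith"),
        ("ORG", 40, "3rd Division")]
nearest_person(ents)
```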



--- METHOD 2: dependency parsing
    Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
    Constraint: if we restricted the choice to one of the two persons that appear immediately to the left and right, the results could be improved, but the drawbacks are also obvious.

    Usage:
      1. To extract relations from a single text file
        (extracted relations will be appended to the .ann file):
        $ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        $ source pipeline.sh <directory>
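The distance measure used by Method 2 is a path length in the dependency tree. A self-contained sketch, using a toy head-index array (not relation_dep.py's actual data structures):

```python
from collections import deque

# heads[i] is the head index of token i (-1 for the root). The path length
# between two tokens is found by BFS over the undirected tree.
def dep_path_length(heads, a, b):
    adj = {i: set() for i in range(len(heads))}
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].add(h)
            adj[h].add(i)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # disconnected (should not happen in a tree)

# Toy parse of "General Smith commands the division", rooted at "commands".
heads = [1, 2, -1, 4, 2]
dep_path_length(heads, 0, 4)  # path General -> Smith -> commands -> division
```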



--- METHOD 3: neural networks
    Use the dependency path and its length as features to predict which person in the sentence is the best match.
    The best model is saved in "model_86.h5"

    Usage:
      Predictions are made on files in "pred_path" and are written in place; "pred_path" can be set in config.py
      $ python pred.py
      $ python pred.py
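A hedged sketch of the feature idea behind Method 3: for each candidate person, build features from the dependency-path length and the surface (token) distance to the entity, then let a scoring model pick the best candidate. The feature names here are assumptions; the internals of the trained network in "model_86.h5" are not documented here.

```python
# Hypothetical per-candidate features: dependency-path length plus
# token distance between the entity and the candidate person.
def candidate_features(entity_idx, person_idx, dep_dist):
    return {
        "dep_path_len": dep_dist,
        "token_dist": abs(person_idx - entity_idx),
    }

feats = candidate_features(entity_idx=0, person_idx=3, dep_dist=2)
```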

Project details


Release history

This version

2.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

extract_sfm-2.0.tar.gz (31.9 kB)

Uploaded Source

Built Distribution

extract_sfm-2.0-py3-none-any.whl (129.5 MB)

Uploaded Python 3

File details

Details for the file extract_sfm-2.0.tar.gz.

File metadata

  • Download URL: extract_sfm-2.0.tar.gz
  • Upload date:
  • Size: 31.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.0

File hashes

Hashes for extract_sfm-2.0.tar.gz:
  • SHA256: b84fa13c4119f3dd9774cdf2d6a6e0fd851d949cccc50af4e64c6f96f3d5e3a8
  • MD5: e5f15c78bdb0b9a93fb508ee6339b77e
  • BLAKE2b-256: 19e08e9514910a2ca5ef596291cfa4ae7cfc08ce7e0809057ad2c259bfcf554e


File details

Details for the file extract_sfm-2.0-py3-none-any.whl.

File metadata

  • Download URL: extract_sfm-2.0-py3-none-any.whl
  • Upload date:
  • Size: 129.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.6.0

File hashes

Hashes for extract_sfm-2.0-py3-none-any.whl:
  • SHA256: 538611bb5e880944f46c2bec94537de04eab310aa0041780db39814ef3a69474
  • MD5: 5e68fc1705b14fa7a14a4516ee27e4a1
  • BLAKE2b-256: d2d751bbfaa10b9defac81e072ce320d24a9255f02222a7f20b2afbe1b0958e9

