
Knowledge Graph Extraction for SFM dataset


Knowledge Graph Extraction

We designed a pipeline that extracts a specific kind of knowledge graph: a person's name is recognized, and his/her rank, role, title, and organization are related to that person. The pipeline is not expected to perform perfectly, recognizing every relevant person and excluding every irrelevant one. Rather, it is intended as a first step to reduce the manual workload of extracting such knowledge by combing through a large number of documents.

This pipeline consists of two major components: Named Entity Recognition and Relation Extraction. Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles, and organizations in raw text files. Relation Extraction then relates each name to its corresponding rank, role, title, or organization.

Example:
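For illustration, consider a constructed sentence (hypothetical, not taken from the SFM dataset): "General John Smith, commander of the 3rd Battalion, attended the meeting." The pipeline would aim to recognize "John Smith" as a person and relate him to the rank "General", the role/title "commander", and the organization "3rd Battalion".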

Dependencies

TensorFlow 2.2.0
TensorFlow Addons
spaCy
NumPy
DyNet
pathlib

Install

Package: https://pypi.org/project/extract-sfm/

$ pip install extract_sfm

Usage

Method 1

Create a Python file and write:

import extract_sfm

extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")

Then run the Python file. This may take a while to finish.

Method 2

Download this GitHub repository. Under the project root directory, run the Python script:

$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES

Note: Use an absolute path.

Website

  1. Copy NER_v2, RE, and pipeline.py into the "SERVER/KGE" directory.
  2. Install the npm dependencies (express, path, multer) under the "SERVER" directory:
  $ npm install express path multer
  3. Run the server:
  $ node server.js

Example

Environment Setup

TensorFlow 2.2.0
  pip install tensorflow==2.2.0
  pip install tensorflow-addons

spaCy (macOS)
  pip install -U spacy
  python3 -m spacy download en_core_web_sm

DyNet
  pip install dynet

pathlib
  pip install pathlib

NER Documentation

TRAINING
  Dataset:
    1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
    2. CoNLL-2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
    3. A set of known organizations from the starter dataset
    Note: Title and role were collapsed into one class

  Usage:
    1) Prepare data
      $ python process.py
      $ cd SFM_STARTER
      $ python build_vocab.py
      $ python build_glove.py
      $ cd ..

    2) Train model
      $ python train.py

    3) Make predictions
      $ python pred.py

    4) Evaluate model
      $ python eval.py
      $ python eval_class.py

  Files:
    process.py: 1) preprocesses the dataset by recording its info in dicts,
                      which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
                2) converts the SFM starter dataset to the format used by the model,
                      written to {}.words.txt and {}.tags.txt files, where {} can be train, valid or test
                      (see the sketch after this list)
    pred.py: generates predictions using the trained model
    eval.py: evaluates the predictions made by the model, which are generated by running pred.py
    eval_class.py: gets precision, recall and F1 score for each class

    Other files are from https://github.com/guillaumegenthial/tf_ner
      train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
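
    The {}.words.txt / {}.tags.txt layout follows the tf_ner convention: one sentence per line
    in the words file and a parallel line of space-separated tags in the tags file. A minimal
    sketch of reading a split back (assuming the files produced by process.py), not the
    project's actual loading code:

      from pathlib import Path

      def load_split(prefix):
          """Yield (tokens, tags) pairs from parallel <prefix>.words.txt / <prefix>.tags.txt files."""
          with Path(f"{prefix}.words.txt").open() as fw, Path(f"{prefix}.tags.txt").open() as ft:
              for words_line, tags_line in zip(fw, ft):
                  tokens = words_line.strip().split()
                  tags = tags_line.strip().split()
                  assert len(tokens) == len(tags), "words/tags lines must stay aligned"
                  yield tokens, tags

      # e.g. iterate over the training split
      for tokens, tags in load_split("train"):
          pass  # feed into the model's input pipeline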

PREDICTING
  Usage:
    $ python ner.py <doc_id>.txt

  File:
    ner.py: generates BRAT-format predictions for a text file.
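
    In BRAT standoff format, each entity is a tab-separated "T" line with a type, character
    offsets and the surface text; the relations added later by the RE step appear as "R" lines.
    Using the constructed sentence from the example above (entity and relation type names are
    illustrative, not necessarily the exact labels used by this project):

      T1	Rank 0 7	General
      T2	Person 8 18	John Smith
      T3	Organization 37 50	3rd Battalion
      R1	HasRank Arg1:T2 Arg2:T1
      R2	MemberOf Arg1:T2 Arg2:T3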

RE Documentation

jPTDP:
  Before running the following three methods, you need to run a dependency parser first; some of the methods rely on its output.
  Usage: Go to the jPTDP directory and run
    $ python fast_parse.py <path_to_txt>.txt
  The output will be placed alongside the input text file, in a directory whose name is the same as that of the text file.
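  For example, running fast_parse.py on /data/docs/report.txt (an illustrative path only) would leave the input file untouched and write the parser output into a /data/docs/report/ directory next to it.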



--- METHOD 1: nearest person:
    Assigns each non-person named entity to the nearest person entity that follows it in the text (a minimal sketch follows the usage below).

    Usage:
      1. To extract relations in a single text file
        (extracted relations will be appended to the .ann file):
        $ python relation_np.py <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>,
        set "output_dir" in pipeline.sh to <directory> and run:
        $ source pipeline.sh
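
    A minimal sketch of the nearest-person heuristic, assuming named entities are available as
    (label, start, end) character spans with an assumed "Person" label for persons (the actual
    relation_np.py may differ):

      def nearest_person_relations(entities):
          """Link each non-person span to the nearest person span that follows it in the text.

          entities: list of (label, start, end) tuples, e.g. ("Rank", 0, 7).
          """
          persons = [e for e in entities if e[0] == "Person"]
          relations = []
          for ent in entities:
              if ent[0] == "Person":
                  continue
              following = [p for p in persons if p[1] >= ent[2]]  # persons starting after this span ends
              if following:
                  nearest = min(following, key=lambda p: p[1] - ent[2])
                  relations.append((ent, nearest))
          return relations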



--- METHOD 2: dependency parsing
    Assigns each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person (a minimal sketch follows the usage below).
    Constraint: if the choice is restricted to one of the two persons that appear immediately to the left and to the right of the entity, the results improve, but the drawbacks are also obvious.

    Usage:
      1. To extract relations in a single text file
        (extracted relations will be appended to the .ann file):
        $ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        $ source pipeline.sh <directory>
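
    A minimal sketch of the distance computation, assuming the parser output has been reduced to
    a list of head indices per token (root = -1); the actual relation_dep.py may compute this
    differently:

      def path_to_root(idx, heads):
          """Token indices from idx up to the root; heads[i] is the head of token i."""
          path = [idx]
          while heads[idx] != -1:
              idx = heads[idx]
              path.append(idx)
          return path

      def dep_path_length(a, b, heads):
          """Length of the dependency path between tokens a and b."""
          pa, pb = path_to_root(a, heads), path_to_root(b, heads)
          shared = set(pa) & set(pb)
          lca = next(node for node in pa if node in shared)  # lowest common ancestor
          return pa.index(lca) + pb.index(lca)

    Each non-person entity would then be assigned to the person with the smallest dependency-path
    length to it.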



--- METHOD 3: neural networks
    Uses the dependency path and its distance as features to predict which person in the sentence is the best option (an illustrative sketch follows below).
    The best model is saved in "model_86.h5".

    Usage:
      Predictions are made on the files in "pred_path" and are written in place; "pred_path" can be set in config.py.
      $ python pred.py
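
    The ".h5" file suggests a Keras model; as an illustration only (not the project's actual
    architecture), a scorer over candidate (entity, person) pairs built from assumed
    dependency-path and token-distance features might look like this:

      import numpy as np
      import tensorflow as tf

      # two assumed features per candidate pair: [dependency-path length, token distance]
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
          tf.keras.layers.Dense(1, activation="sigmoid"),  # probability that the pair is correct
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy")

      # score the candidate persons for one entity and keep the best one
      candidates = np.array([[2.0, 5.0], [4.0, 1.0]], dtype="float32")
      best = int(np.argmax(model.predict(candidates, verbose=0)))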
