
Knowledge Graph Extraction for SFM dataset

Project description

Knowledge Graph Extraction

We designed a pipeline that extracts a specific kind of knowledge graph: each person's name is recognized and linked to that person's rank, role, title and organization. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step toward reducing the workload of manually extracting this knowledge by combing through a large number of documents.

This pipeline consists of two major components: Named Entity Recognition and Relation Extraction. Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. Relation Extraction then relates each name to its corresponding rank, role, title or organization.
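As an illustration of the kind of graph described above, the following is a minimal sketch of one possible in-memory representation. The names `PersonNode`, `build_graph` and the attribute labels are hypothetical and are not part of the extract_sfm package.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Attribute types named in the description; the string labels are assumptions.
ATTRIBUTE_TYPES = ("rank", "role", "title", "organization")


@dataclass
class PersonNode:
    """One person and the attribute values that have been related to them."""
    name: str
    attributes: Dict[str, List[str]] = field(
        default_factory=lambda: {t: [] for t in ATTRIBUTE_TYPES}
    )

    def relate(self, attr_type: str, value: str) -> None:
        """Attach an attribute value to this person, ignoring exact duplicates."""
        if attr_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attr_type}")
        if value not in self.attributes[attr_type]:
            self.attributes[attr_type].append(value)


def build_graph(relations):
    """Group (person, attr_type, value) triples into one node per person name."""
    graph: Dict[str, PersonNode] = {}
    for person, attr_type, value in relations:
        graph.setdefault(person, PersonNode(person)).relate(attr_type, value)
    return graph
```

In this sketch the relation extraction step would supply the triples, and named entity recognition would supply the names and values they refer to.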



TensorFlow 2.2.0



$ pip install extract_sfm


Method 1

Create a Python file and write:

import extract_sfm


Then run the Python file. This may take a while to finish.

Method 2

Download this GitHub repository. Under the project root directory, run the Python script.


Note: Use absolute paths.


  1. Copy NER_v2 and RE into the "SERVER/KGE" directory
  2. Install the npm dependencies under the "SERVER" directory: express, path, multer
  $ npm install <package name>
  3. Run the server by typing:
  $ node server.js


Environment Setup

tensorflow 2.2.0
  pip install tensorflow
  pip install tensorflow-addons

spaCy (macOS)
  pip install -U spacy
  python3 -m spacy download en_core_web_sm

  pip install dynet

  pip install pathlib

NER Documentation

    1. SFM starter dataset:
    2. CONLL2003:
    3. A set of known organizations from the starter dataset
    Note: Title and role were collapsed into one class
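    Collapsing title and role into one class can be sketched as a simple tag mapping. The BIO-style tag names and the merged label `TITLE_ROLE` below are assumptions for illustration; the actual label set used in the datasets may differ.

```python
# Hypothetical tag names in BIO style; the real label set may differ.
COLLAPSED = {"TITLE": "TITLE_ROLE", "ROLE": "TITLE_ROLE"}


def collapse_tag(tag: str) -> str:
    """Map e.g. 'B-ROLE' or 'I-TITLE' onto the merged class, keeping the BIO prefix."""
    if "-" not in tag:  # 'O' and other prefix-free tags pass through unchanged
        return tag
    prefix, label = tag.split("-", 1)
    return f"{prefix}-{COLLAPSED.get(label, label)}"
```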

    1) Prepare data
      $ python
      $ cd SFM_STARTER
      $ python
      $ python
      $ cd ..

    2) Train model
      $ python

    3) Make predictions
      $ python

    4) Evaluate model
      $ python
      $ python
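    The evaluation step reports precision, recall and F1 score for each class. The sketch below shows how such per-class scores can be computed from exact-match entity spans. It is an illustrative stand-in, not the project's evaluation script, and the `(start, end, label)` span format is an assumption.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} from (start, end, label) spans.

    A predicted span counts as correct only if the same span with the same
    label appears in the gold set (exact match).
    """
    gold, pred = set(gold), set(pred)
    tp = Counter(label for _, _, label in gold & pred)
    n_gold = Counter(label for _, _, label in gold)
    n_pred = Counter(label for _, _, label in pred)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = tp[label] / n_pred[label] if n_pred[label] else 0.0
        r = tp[label] / n_gold[label] if n_gold[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```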

  Files: 1) preprocess the dataset by recording info in dicts,
                      which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
                2) convert the SFM starter dataset to a format that can be used by the model,
                      saved in files {}.words.txt and {}.tags.txt, where {} is train, valid or test
                generate predictions using the trained model
                evaluate the predictions made by the model in the prediction step
                get precision, recall and F1 score for each class
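  The `{}.words.txt` and `{}.tags.txt` pair can be illustrated with a short sketch. It assumes one sentence per line, with tokens and tags separated by single spaces and aligned by position. That layout and the helper name `write_split` are assumptions, not taken from the project's code.

```python
from pathlib import Path


def write_split(sentences, out_dir, split):
    """Write `{split}.words.txt` and `{split}.tags.txt`, one sentence per line.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Assumed layout: tokens separated by single spaces, tags aligned by position.
    """
    if split not in ("train", "valid", "test"):
        raise ValueError(f"unexpected split name: {split}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    words_path = out_dir / f"{split}.words.txt"
    tags_path = out_dir / f"{split}.tags.txt"
    with open(words_path, "w", encoding="utf-8") as wf, \
         open(tags_path, "w", encoding="utf-8") as tf:
        for sentence in sentences:
            wf.write(" ".join(word for word, _ in sentence) + "\n")
            tf.write(" ".join(tag for _, tag in sentence) + "\n")
    return words_path, tags_path
```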

    Other files are from,, SFM_STARTER/, SFM_STARTER/

    $ python <doc_id>.txt

  File: get BRAT format prediction for a text file.
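  For readers unfamiliar with BRAT, the sketch below formats entity spans as BRAT standoff text-bound annotations, the kind of lines an .ann file contains. The helper name and the entity labels in the usage are hypothetical, and the project's own script may produce its output differently.

```python
def to_brat_entities(text, spans):
    """Format (start, end, label) character spans as BRAT text-bound annotations.

    Each line is: T<id> <TAB> <label> <start> <end> <TAB> <covered text>,
    where `end` is exclusive and IDs are assigned in order of position.
    """
    lines = []
    for i, (start, end, label) in enumerate(sorted(spans), start=1):
        lines.append(f"T{i}\t{label} {start} {end}\t{text[start:end]}")
    return "\n".join(lines)
```

  For example, `to_brat_entities("Colonel A. Example leads the unit.", [(8, 18, "Person"), (0, 7, "Rank")])` yields one `T1` line for "Colonel" and one `T2` line for "A. Example".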

RE Documentation

  Before running any of the following 3 methods, you need to run a dependency parser, which some of the methods rely on.
  Usage: Go to the jPTDP directory and run
    $ python <path_to_txt>.txt
  The output is placed alongside the input text file, in a directory with the same name as the text file.

--- METHOD 1: nearest person
    Assign each non-person named entity to the nearest person that is behind it.

      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        Set "output_dir" in to <directory> and run:
        $ source
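    The nearest-person heuristic can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: entities are `(start, end, label)` character spans, the label "Person" is an assumed name, and "behind" is read as the person mention that follows the entity (pass `side="before"` for the opposite reading).

```python
def nearest_person(entities, side="after"):
    """Pair each non-person entity with the closest person on one side.

    `entities` is a list of (start, end, label) character spans; the label
    "Person" is an assumed name. Returns (entity_span, person_span) pairs;
    entities with no person on the chosen side are skipped.
    """
    if side not in ("after", "before"):
        raise ValueError("side must be 'after' or 'before'")
    persons = sorted(e for e in entities if e[2] == "Person")
    pairs = []
    for ent in sorted(e for e in entities if e[2] != "Person"):
        if side == "after":
            # Persons starting at or after the end of the entity.
            candidates = [p for p in persons if p[0] >= ent[1]]
            best = min(candidates, key=lambda p: p[0] - ent[1], default=None)
        else:
            # Persons ending at or before the start of the entity.
            candidates = [p for p in persons if p[1] <= ent[0]]
            best = min(candidates, key=lambda p: ent[0] - p[1], default=None)
        if best is not None:
            pairs.append((ent, best))
    return pairs
```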

--- METHOD 2: dependency parsing
    Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
    Constraint: If we choose only between the two persons that appear immediately to the left and right of the entity, the results could improve, but the drawbacks are also obvious.

      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        $ source <directory>
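    The dependency-path distance used by this method can be sketched with a breadth-first search over the parse tree. The encoding below (a list of 0-based head indices, with -1 for the root) is an assumption chosen for illustration and is not the jPTDP output format. The function names are hypothetical.

```python
from collections import deque


def path_length(heads, src, dst):
    """Number of dependency arcs between tokens `src` and `dst`.

    `heads[i]` is the index of token i's head, or -1 for the root
    (0-based indices; an assumed encoding, not the parser's output format).
    """
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    seen, queue = {src}, deque([(src, 0)])
    while queue:  # breadth-first search over the undirected tree
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # unreachable, e.g. tokens in different trees


def closest_person(heads, entity, persons):
    """Pick the person token with the shortest dependency path to `entity`.

    Ties are broken in favour of the lower token index.
    """
    scored = [(path_length(heads, entity, p), p) for p in persons]
    scored = [(d, p) for d, p in scored if d is not None]
    return min(scored)[1] if scored else None
```

    With the hypothetical parse `heads = [1, 2, -1, 4, 2]` for the tokens "Colonel Example met Major Sample", token 0 ("Colonel") is one arc from token 1 and three arcs from token 4, so `closest_person(heads, 0, [1, 4])` selects token 1.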

--- METHOD 3: neural networks
    Use the dependency path and its length as features to predict which person in the sentence is the best option.
    The best model is saved in "model_86.h5"

      Predictions are made on files in "pred_path" and are written in place. "pred_path" can be set in
      $ python

Download files


Source Distribution: extract_sfm-2.0.tar.gz (31.9 kB)

Built Distribution: extract_sfm-2.0-py3-none-any.whl (129.5 MB, Python 3)
