A learnable pipeline for knowledge acquisition

KAPipe

KAPipe is a learnable pipeline for knowledge acquisition, with a particular focus on (semi-)automatically complementing knowledge bases in specialized domains.

Features

  • KAPipe provides trained pipelines for end-to-end knowledge graph construction from text.
  • A pipeline is designed as a cascade of the following task components:
    • Named Entity Recognition (NER): Extracting entity mention spans and their entity types from the input text.
    • Entity Disambiguation - Retrieval (ED-Retrieval): Retrieving a set of candidate entity IDs for each given mention in the text, based on a knowledge-base entity pool.
    • Entity Disambiguation - Reranking (ED-Reranking): Reranking the retrieved entity IDs and selecting the most likely entity ID for each given mention in the text.
    • Document-level Relation Extraction (DocRE): Extracting a set of relational triples (head entity, relation, tail entity) for a given entity set.
  • It is possible to use only specific task components.
  • KAPipe uses state-of-the-art models for each task component.
  • KAPipe also supports training of the pipeline (or specific task components) for new domains, entity types, relation labels, and knowledge bases.
  • This repository also contains the source code for experiments on custom models, including BERT-based supervised learning models and Large Language Model (LLM)-based in-context learning. The following customizable models are implemented for each task:

Installation

python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install kapipe

Data Format: Document

The pipeline uses a common dictionary format, called a "Document", for both input and output. Because the pipeline is a cascade of task components, the input and output of each task component are also Documents. Each component either passes the information in its input Document through to the output Document unchanged or updates it.

Note: Because a Document is just a dictionary, you can add your own metadata to it (for example, the correspondence between each word and its position in the source PDF), and that metadata is retained in the pipeline's output.
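As a minimal sketch of the note above (this is not KAPipe code): a Document is a plain dict, so user-defined keys can travel alongside the standard ones. The `pdf_positions` key and the `toy_component` function are hypothetical, chosen only to illustrate a component that adds a field while leaving unknown keys untouched.

```python
import copy

# A Document is a plain dict. "doc_key" and "sentences" follow the format
# described above; "pdf_positions" is a hypothetical user-defined key.
document = {
    "doc_key": "example-0",
    "sentences": ["Lithium carbonate causes cyanosis ."],
    "pdf_positions": [{"page": 1, "sentence_index": 0}],
}


def toy_component(doc):
    """Stand-in for a task component (not part of KAPipe).

    Returns a copy of the input with a "mentions" field added; every other
    key, including user-defined ones, is carried over unchanged.
    """
    result = copy.deepcopy(doc)
    result["mentions"] = [
        {"span": [0, 1], "name": "Lithium carbonate", "entity_type": "Chemical"}
    ]
    return result


output = toy_component(document)
print(sorted(output.keys()))
```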

Input:

{
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
        "This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .",
        "Sixty - three percent of these infants had tricuspid valve involvement .",
        "Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .",
        "It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery ."
    ]
}

Output:

{
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
        "This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .",
        "Sixty - three percent of these infants had tricuspid valve involvement .",
        "Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .",
        "It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery ."
    ],
    "mentions": [
        {
            "span": [
                0,
                2
            ],
            "name": "Tricuspid valve regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                4,
                5
            ],
            "name": "lithium carbonate",
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "span": [
                6,
                6
            ],
            "name": "toxicity",
            "entity_type": "Disease",
            "entity_id": "D064420"
        },
        {
            "span": [
                16,
                17
            ],
            "name": "tricuspid regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                19,
                20
            ],
            "name": "atrial flutter",
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "span": [
                22,
                24
            ],
            "name": "congestive heart failure",
            "entity_type": "Disease",
            "entity_id": "D006333"
        },
        {
            "span": [
                30,
                30
            ],
            "name": "lithium",
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "span": [
                43,
                44
            ],
            "name": "tricuspid regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                46,
                47
            ],
            "name": "atrial flutter",
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "span": [
                55,
                56
            ],
            "name": "cardiac disease",
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "span": [
                61,
                61
            ],
            "name": "lithium",
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "span": [
                82,
                83
            ],
            "name": "Lithium carbonate",
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "span": [
                93,
                95
            ],
            "name": "congenital heart disease",
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "span": [
                105,
                106
            ],
            "name": "neurologic depression",
            "entity_type": "Disease",
            "entity_id": "D003866"
        },
        {
            "span": [
                108,
                108
            ],
            "name": "cyanosis",
            "entity_type": "Disease",
            "entity_id": "D003490"
        },
        {
            "span": [
                111,
                112
            ],
            "name": "cardiac arrhythmia",
            "entity_type": "Disease",
            "entity_id": "D001145"
        }
    ],
    "entities": [
        {
            "mention_indices": [
                0,
                3,
                7
            ],
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "mention_indices": [
                1,
                11
            ],
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "mention_indices": [
                2
            ],
            "entity_type": "Disease",
            "entity_id": "D064420"
        },
        {
            "mention_indices": [
                4,
                8
            ],
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "mention_indices": [
                5
            ],
            "entity_type": "Disease",
            "entity_id": "D006333"
        },
        {
            "mention_indices": [
                6,
                10
            ],
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "mention_indices": [
                9,
                12
            ],
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "mention_indices": [
                13
            ],
            "entity_type": "Disease",
            "entity_id": "D003866"
        },
        {
            "mention_indices": [
                14
            ],
            "entity_type": "Disease",
            "entity_id": "D003490"
        },
        {
            "mention_indices": [
                15
            ],
            "entity_type": "Disease",
            "entity_id": "D001145"
        }
    ],
    "relations": [
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 7
        },
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 8
        },
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 9
        }
    ]
}
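
Reading the example, `span` appears to hold inclusive token offsets counted over all sentences joined together and split on whitespace, `mention_indices` points into `mentions`, and `arg1`/`arg2` point into `entities`. This reading is inferred from the sample above, not from a format specification. Under that assumption, a short sketch for turning the indices back into readable text (the helper names are hypothetical):

```python
def mention_text(document, mention):
    """Return the surface text covered by a mention's inclusive token span."""
    tokens = " ".join(document["sentences"]).split()
    begin, end = mention["span"]
    return " ".join(tokens[begin:end + 1])


def readable_triples(document):
    """Resolve each relation's entity indices to (head ID, relation, tail ID)."""
    entities = document["entities"]
    return [
        (entities[r["arg1"]]["entity_id"], r["relation"], entities[r["arg2"]]["entity_id"])
        for r in document.get("relations", [])
    ]


# A cut-down version of the Output example above. The single relation is
# made up for illustration and does not appear in the original example.
document = {
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
    ],
    "mentions": [
        {"span": [0, 2], "name": "Tricuspid valve regurgitation", "entity_type": "Disease", "entity_id": "D014262"},
        {"span": [4, 5], "name": "lithium carbonate", "entity_type": "Chemical", "entity_id": "D016651"},
        {"span": [16, 17], "name": "tricuspid regurgitation", "entity_type": "Disease", "entity_id": "D014262"},
    ],
    "entities": [
        {"mention_indices": [0, 2], "entity_type": "Disease", "entity_id": "D014262"},
        {"mention_indices": [1], "entity_type": "Chemical", "entity_id": "D016651"},
    ],
    "relations": [{"arg1": 1, "relation": "CID", "arg2": 0}],
}

for mention in document["mentions"]:
    print(mention["span"], mention_text(document, mention))
print(readable_triples(document))
```

If the span text ever disagrees with the mention's `name`, the offset convention assumed here does not match your data and should be checked before relying on it.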

Downloading Trained Pipelines

Trained pipelines can be downloaded from https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp.

Download the latest archive, release.YYYYMMDD.tar.gz, and extract it in the ~/.kapipe directory as follows:

mkdir -p ~/.kapipe
mv release.YYYYMMDD.tar.gz ~/.kapipe
cd ~/.kapipe
tar -zxvf release.YYYYMMDD.tar.gz

Loading and Using Pipeline

The easiest way to apply the knowledge acquisition pipeline (i.e., the cascade of NER, ED, and DocRE tasks) to an input document is to load the pipeline using kapipe.load() and just apply it to the document.

import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")
document = ka(document)

The above code loads and uses models that have already been trained for specific domains, entity types (in NER), knowledge bases (in ED), and relation labels (in DocRE). Specifically, the identifier "cdr_biaffinener_blink_atlop" above indicates that the Biaffine-NER, BLINK, and ATLOP models trained on the CDR dataset (biomedical abstracts, Chemical and Disease entity types, entity IDs based on the MeSH ontology, and Chemical-Induce-Disease relation label) are used for NER, ED, and DocRE, respectively.
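
The `document` passed to `ka(document)` above is a Document dictionary as described under Data Format. The sentences in that example are space-separated tokens (punctuation stands alone), so raw text may need light preprocessing first. A minimal sketch, assuming a simple regex tokenizer is acceptable; `make_document` is a hypothetical helper, not part of KAPipe, and the tokenization the trained models actually expect may differ:

```python
import re


def make_document(doc_key, raw_sentences):
    """Build an input Document from raw sentences.

    Each sentence is split into runs of word characters and single
    punctuation marks, then re-joined with single spaces.
    """
    sentences = [
        " ".join(re.findall(r"\w+|[^\w\s]", sentence))
        for sentence in raw_sentences
    ]
    return {"doc_key": doc_key, "sentences": sentences}


document = make_document(
    "6794356",
    ["Sixty-three percent of these infants had tricuspid valve involvement."],
)
print(document["sentences"][0])
# prints: Sixty - three percent of these infants had tricuspid valve involvement .
```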

It is also possible to apply specific tasks by directly calling the task components. For example, if you would like to perform only NER and ED, please do the following.

import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")

# NER
document = ka.ner(document)
# ED-Retrieval
document, candidate_entities = ka.ed_ret(document, num_candidate_entities=10)
# ED-Reranking
document = ka.ed_rank(document, candidate_entities)

Similarly, if mentions and entities have already been annotated (by humans or external systems) and you would like to perform only DocRE, do the following. Note that the input document must already contain the mentions and entities.

import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")

# DocRE
document = ka.docre(document_with_gold_mentions_and_entities)
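
In the Output example under Data Format, each entry in `entities` groups the mentions that share an `entity_id`, in order of first appearance. If your annotations come as mentions only, a grouping step along these lines could produce the `entities` field before calling DocRE. `group_mentions_into_entities` is a hypothetical helper, and the grouping rule is inferred from that example rather than from a specification:

```python
def group_mentions_into_entities(mentions):
    """Group mentions by entity_id, keeping first-appearance order."""
    entities = {}
    for index, mention in enumerate(mentions):
        entity = entities.setdefault(
            mention["entity_id"],
            {
                "mention_indices": [],
                "entity_type": mention["entity_type"],
                "entity_id": mention["entity_id"],
            },
        )
        entity["mention_indices"].append(index)
    # dicts preserve insertion order, so entities come out in the order
    # in which their first mention appeared.
    return list(entities.values())


mentions = [
    {"span": [0, 2], "name": "Tricuspid valve regurgitation", "entity_type": "Disease", "entity_id": "D014262"},
    {"span": [4, 5], "name": "lithium carbonate", "entity_type": "Chemical", "entity_id": "D016651"},
    {"span": [16, 17], "name": "tricuspid regurgitation", "entity_type": "Disease", "entity_id": "D014262"},
]
entities = group_mentions_into_entities(mentions)
print(entities)
```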

Available Trained Pipelines

The following pipelines are currently available.

identifier: cdr_biaffinener_blink_atlop
  • NER Model and Dataset (Entity Types): Biaffine-NER on CDR (Chemical, Disease)
  • ED-Retrieval Model and Dataset with Knowledge Base: BLINK Bi-Encoder on CDR + MeSH (2015)
  • ED-Reranking Model and Dataset with Knowledge Base: BLINK Cross-Encoder on CDR + MeSH (2015)
  • DocRE Model and Dataset (Relation Labels): ATLOP on CDR (Chemical-Induce-Disease)

Training

If the trained pipelines do not cover your target domain, entity types, knowledge base, or relation labels, please train each task component in the pipeline on your dataset. Once you have trained the pipeline, please save it for future reuse. You can set your own identifier.

import kapipe

ka = kapipe.blank(gpu_map={"ner":0, "ed_retrieval":1, "ed_reranking": 2, "docre": 3})

# NER
ka.ner.fit(
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "allow_nested_entities": True,
        "max_epoch": 10,
    }
)

# ED-Retrieval
ka.ed_ret.fit(
    entity_dict,
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
    }
)

# ED-Reranking
train_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in train_documents
]
dev_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents
]
ka.ed_rank.fit(
    entity_dict,
    train_documents, train_candidate_entities,
    dev_documents, dev_candidate_entities,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
    }
)

# DocRE
ka.docre.fit(
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
        "possible_head_entity_types": ["Chemical"], # or None
        "possible_tail_entity_types": ["Disease"], # or None
    }
)
 
ka.save("your favorite identifier")
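
The `fit()` calls above take separate `train_documents` and `dev_documents` lists. If your annotated Documents come as a single list, a reproducible split could look like the sketch below. `split_documents` and the 10% development ratio are assumptions for illustration, not KAPipe functions or defaults:

```python
import random


def split_documents(documents, dev_ratio=0.1, seed=0):
    """Shuffle a copy of the documents and split it into (train, dev)."""
    shuffled = list(documents)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder Documents; real ones would carry sentences and annotations.
documents = [{"doc_key": str(i), "sentences": []} for i in range(20)]
train_documents, dev_documents = split_documents(documents, dev_ratio=0.1)
print(len(train_documents), len(dev_documents))
```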

If you would like to train only specific task components (for example, if you would like to use models trained on CDR for NER and DocRE, and train a new model on a different version of MeSH for ED), please do the following.

import kapipe

ka = kapipe.load("cdr_biaffinener_blink_atlop")

# ED-Retrieval
ka.ed_ret.fit(
    entity_dict,
    train_documents,
    dev_documents
)

# ED-Reranking
train_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in train_documents
]
dev_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents
]
ka.ed_rank.fit(
    entity_dict,
    train_documents, train_candidate_entities,
    dev_documents, dev_candidate_entities
)

ka.save("your favorite identifier")

When using the saved pipeline, please load it by specifying the identifier.

import kapipe

ka = kapipe.load("your favorite identifier")

Experiments on Custom Models

The pipeline is a top-level wrapper class that consists of a cascade of task components, and each task component is in turn a black-box class that wraps a specific model (e.g., Biaffine-NER, ATLOP). For training, evaluating, and analyzing a specific method, it may be more intuitive to instantiate the method (hereafter referred to as a "system") directly rather than going through the pipeline.

The systems are the core of KAPipe; the pipeline is just a wrapper that makes them easy to use with minimal coding. If you are comfortable with coding and your goal is to develop the methods rather than only apply the KA pipeline, work directly with the systems instead of the pipeline.

The fastest way to find out how to initialize, train and evaluate each system is to look at the experiments/codes/run_* scripts.

