OntoGPT

Generation of Ontologies and Knowledge Bases using GPT

A knowledge extraction tool that uses a large language model to extract semantic information from text.

This makes use of so-called instruction prompts in Large Language Models (LLMs) such as GPT-4.

Currently there are two different pipelines implemented:

SPIRES: Structured Prompt Interrogation and Recursive Extraction of Semantics
- Zero-shot learning approach to extracting nested semantic structures from text
- Inputs: LinkML schema + text
- Outputs: JSON, YAML, RDF, or OWL that conforms to the schema
- Uses text-davinci-003

HALO: HAllucinating Latent Ontologies
- Few-shot learning approach to generating/hallucinating a domain ontology given a few examples
- Uses code-davinci-002

SPIRES: Usage

Given a short text abstract.txt with content such as:

The cGAS/STING-mediated DNA-sensing signaling pathway is crucial for interferon (IFN) production and host antiviral responses

... [snip] ...

The underlying mechanism was the interaction of US3 with β-catenin and its hyperphosphorylation of β-catenin at Thr556 to block its nuclear translocation ... ...

(see full input)

We can extract this into the GO pathway datamodel:

ontogpt extract -t gocam.GoCamAnnotations abstract.txt

This gives schema-compliant YAML such as:

genes:
- HGNC:2514
- HGNC:21367
- HGNC:27962
- US3
- FPLX:Interferon
- ISG
gene_gene_interactions:
- gene1: US3
  gene2: HGNC:2514
gene_localizations:
- gene: HGNC:2514
  location: Nuclear
gene_functions:
- gene: HGNC:2514
  molecular_activity: Transcription
- gene: HGNC:21367
  molecular_activity: Production
...

See full output

Note that the grounding above is very preliminary and can be improved. Ungrounded NamedEntities appear as plain text.

How it works

You provide an arbitrary data model describing the structure you want to extract text into
- this can be nested (but see limitations below)
You provide your preferred annotations for grounding NamedEntity fields
ontogpt will:
- generate a prompt
- feed the prompt to a language model (currently OpenAI)
- parse the results into a dictionary structure
- ground the results using a preferred annotator
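
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OntoGPT's actual implementation: `TEMPLATE`, `LEXICON`, `fake_llm`, and all helper names are hypothetical, the prompt wording is invented, and the language model call is replaced with a canned response so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical template: field name -> hint shown to the model.
TEMPLATE: Dict[str, str] = {
    "genes": "semicolon-separated list of genes",
    "locations": "semicolon-separated list of cellular locations",
}

# Hypothetical lexicon standing in for a real annotator.
LEXICON: Dict[str, str] = {"beta-catenin": "HGNC:2514", "nucleus": "GO:0005634"}


def generate_prompt(text: str, template: Dict[str, str]) -> str:
    """Ask for one 'field: <hint>' line per attribute, then append the text."""
    fields = "\n".join(f"{name}: <{hint}>" for name, hint in template.items())
    return f"Extract the following fields:\n\n{fields}\n\nText:\n{text}\n"


def parse_response(response: str, template: Dict[str, str]) -> Dict[str, List[str]]:
    """Parse 'field: a; b' lines into a dictionary, ignoring unknown fields."""
    result: Dict[str, List[str]] = {}
    for line in response.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in template:
            result[name] = [v.strip() for v in value.split(";") if v.strip()]
    return result


def ground(values: List[str]) -> List[str]:
    """Replace labels with identifiers where known; keep the rest as text."""
    return [LEXICON.get(v.lower(), v) for v in values]


def extract(text: str, llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Prompt, call the model, parse the reply, then ground each value."""
    prompt = generate_prompt(text, TEMPLATE)
    parsed = parse_response(llm(prompt), TEMPLATE)
    return {name: ground(values) for name, values in parsed.items()}


def fake_llm(prompt: str) -> str:
    """Canned completion so the sketch runs without an API key."""
    return "genes: beta-catenin; US3\nlocations: nucleus"


result = extract("US3 blocks nuclear translocation of beta-catenin.", fake_llm)
print(result)
```

As in the real output, a label with no match in the lexicon (here `US3`) is left as text rather than dropped.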

Pre-requisites

Python 3.9+
an OpenAI account
a BioPortal account (optional, for grounding)

You will need to set both API keys using OAK (which is a dependency of this project):

poetry run runoak set-apikey -e openai <your openai api key>
poetry run runoak set-apikey -e bioportal <your bioportal api key>

How to define your own extraction data model

Step 1: Define a schema

See src/ontogpt/templates/ for examples.

Define a schema (using a subset of LinkML) that describes the structure you want to extract from your text.

classes:
  MendelianDisease:
    attributes:
      name:
        description: the name of the disease
        examples:
          - value: peroxisome biogenesis disorder
        identifier: true  ## needed for inlining
      description:
        description: a description of the disease
        examples:
          - value: >-
             Peroxisome biogenesis disorders, Zellweger syndrome spectrum (PBD-ZSS) is a group of autosomal recessive disorders affecting the formation of functional peroxisomes, characterized by sensorineural hearing loss, pigmentary retinal degeneration, multiple organ dysfunction and psychomotor impairment
      synonyms:
        multivalued: true
        examples:
          - value: Zellweger syndrome spectrum
          - value: PBD-ZSS
      subclass_of:
        multivalued: true
        range: MendelianDisease
        examples:
          - value: lysosomal disease
          - value: autosomal recessive disorder
      symptoms:
        range: Symptom
        multivalued: true
        examples:
          - value: sensorineural hearing loss
          - value: pigmentary retinal degeneration
      inheritance:
        range: Inheritance
        examples:
          - value: autosomal recessive
      genes:
        range: Gene
        multivalued: true
        examples:
          - value: PEX1
          - value: PEX2
          - value: PEX3

  Gene:
    is_a: NamedThing
    id_prefixes:
      - HGNC
    annotations:
      annotators: gilda:, bioportal:hgnc-nr

  Symptom:
    is_a: NamedThing
    id_prefixes:
      - HP
    annotations:
      annotators: sqlite:obo:hp

  Inheritance:
    is_a: NamedThing
    annotations:
      annotators: sqlite:obo:hp

the schema is defined in LinkML
prompt hints can be specified using the prompt annotation (otherwise description is used)
multivalued fields are supported
the default range is string; string fields (e.g. disease name, synonyms) are not grounded
define a class for each NamedEntity
for any NamedEntity, you can specify a preferred annotator using the annotators annotation
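
The rule that the prompt annotation takes precedence over description can be illustrated with a short sketch. It assumes the attributes have already been loaded into plain dictionaries; `ATTRIBUTES`, `prompt_hint`, and `prompt_lines` are hypothetical names, the prompt value on `genes` is made up, and falling back to the attribute name when neither is present is an assumption rather than documented behaviour.

```python
from typing import Any, Dict, List

# Attribute definitions as they might look after loading the schema YAML.
# The "prompt" annotation on `genes` is a made-up example value.
ATTRIBUTES: Dict[str, Dict[str, Any]] = {
    "name": {"description": "the name of the disease"},
    "synonyms": {"multivalued": True},
    "genes": {
        "multivalued": True,
        "annotations": {"prompt": "semicolon-separated list of gene symbols"},
    },
}


def prompt_hint(name: str, attribute: Dict[str, Any]) -> str:
    """Prefer the prompt annotation, then the description, then the name."""
    annotations = attribute.get("annotations") or {}
    return annotations.get("prompt") or attribute.get("description") or name


def prompt_lines(attributes: Dict[str, Dict[str, Any]]) -> List[str]:
    """Render one 'field: <hint>' line per attribute."""
    return [f"{name}: <{prompt_hint(name, attr)}>" for name, attr in attributes.items()]


lines = prompt_lines(ATTRIBUTES)
print("\n".join(lines))
```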

We recommend following an established schema like Biolink, but you can define your own.

Step 2: Compile the schema

Run the make command at the top level. This will compile the schema to Pydantic.

Step 3: Run the command line

e.g.

ontogpt extract -t mendelian_disease.MendelianDisease marfan-wikipedia.txt

Web Application

There is a bare-bones web application:

poetry run web-ontogpt

Note that the agent running uvicorn must have the API key set, so for obvious reasons don't host this publicly without authentication, unless you want your credits drained.

Features

Multiple levels of nesting

Currently no more than two levels of nesting are recommended.

If a field has a range which is itself a class and not a primitive, ontogpt will attempt to nest.

E.g. the gocam schema has an attribute:

  attributes:
      ...
      gene_functions:
        description: semicolon-separated list of gene to molecular activity relationships
        multivalued: true
        range: GeneMolecularActivityRelationship

Because GeneMolecularActivityRelationship is inlined, it will nest.

The generated prompt is:

gene_functions : <semicolon-separated list of gene to molecular activity relationships>

The output of this is then passed through further SPIRES iterations.
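
A rough sketch of this recursion is shown below. It is an illustration only: the in-memory `SCHEMA`, the prompt layout, and `fake_llm` (a canned offline stand-in for the model) are all assumptions, and the real implementation works from the compiled LinkML schema rather than a dictionary.

```python
from typing import Callable, Dict

# Hypothetical in-memory schema: class name -> {field name: range}.
# A range that names another class in SCHEMA is treated as nested.
SCHEMA: Dict[str, Dict[str, str]] = {
    "GoCamAnnotations": {"gene_functions": "GeneMolecularActivityRelationship"},
    "GeneMolecularActivityRelationship": {
        "gene": "string",
        "molecular_activity": "string",
    },
}


def build_prompt(text: str, cls: str) -> str:
    """One 'field: <field>' line per attribute, followed by the text."""
    fields = "\n".join(f"{name}: <{name}>" for name in SCHEMA[cls])
    return f"{fields}\n\nText:\n{text}"


def parse(response: str) -> Dict[str, str]:
    """Turn 'field: value' lines into a flat dictionary."""
    pairs = (line.partition(":") for line in response.splitlines())
    return {name.strip(): value.strip() for name, sep, value in pairs if sep}


def extract(text: str, cls: str, llm: Callable[[str], str]) -> dict:
    raw = parse(llm(build_prompt(text, cls)))
    result: dict = {}
    for field, rng in SCHEMA[cls].items():
        value = raw.get(field, "")
        if rng in SCHEMA:
            # Nested class: each fragment goes through another iteration.
            result[field] = [
                extract(fragment.strip(), rng, llm)
                for fragment in value.split(";")
                if fragment.strip()
            ]
        else:
            result[field] = value
    return result


def fake_llm(prompt: str) -> str:
    """Canned, offline stand-in for the language model."""
    if "gene_functions" in prompt:
        return "gene_functions: CTNNB1 transcription; IFNB1 production"
    text = prompt.split("Text:\n", 1)[1]
    gene, activity = text.split(" ", 1)
    return f"gene: {gene}\nmolecular_activity: {activity}"


nested = extract("placeholder abstract text", "GoCamAnnotations", fake_llm)
print(nested)
```

Each extra level of nesting costs one more round of model calls per fragment, which is one reason to keep schemas shallow.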

Text length limit

Currently SPIRES must use text-davinci-003, which has a total 4k token limit (prompt + completion).

You can pass in a parameter to split the text into chunks. Results will be recombined automatically, but more experiments are needed to determine how reliable this is.
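
One way chunking and recombination could work is sketched below. This is not the OntoGPT implementation: `split_into_chunks` and `merge_results` are hypothetical helpers, the budget is counted in characters as a crude proxy for tokens, and merging is a simple order-preserving union of list values.

```python
import re
from typing import Dict, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def merge_results(results: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Union the per-chunk lists for each field, keeping first-seen order."""
    merged: Dict[str, List[str]] = {}
    for result in results:
        for field, values in result.items():
            bucket = merged.setdefault(field, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged


chunks = split_into_chunks("A b. C d. E f.", max_chars=9)
merged = merge_results(
    [{"genes": ["A", "B"]}, {"genes": ["B", "C"], "locations": ["X"]}]
)
print(chunks, merged)
```

Relationships that span a chunk boundary are lost with this kind of split, which is part of why reliability needs further testing.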

Schema Tips

It helps to have an understanding of the LinkML schema language, but it should be possible to define your own schemas using the examples as a guide.

OntoGPT-specific extensions are specified as annotations.

You can specify a set of annotators for a field using the annotators annotation:

  Gene:
    is_a: NamedThing
    id_prefixes:
      - HGNC
    annotations:
      annotators: gilda:, bioportal:hgnc-nr, obo:pr

The annotators are applied in order.
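
A minimal sketch of ordered application, assuming the first annotator that returns a match wins and later ones are only consulted as fallbacks. The dictionary-backed annotators and the `EXAMPLE:` identifiers are made up for illustration; real annotators such as gilda or BioPortal are not called here.

```python
from typing import Callable, Dict, List, Optional

# An annotator maps a label to an identifier, or None when it has no match.
Annotator = Callable[[str], Optional[str]]


def dict_annotator(lexicon: Dict[str, str]) -> Annotator:
    """Build a toy annotator backed by a case-insensitive lookup table."""
    def annotate(label: str) -> Optional[str]:
        return lexicon.get(label.lower())
    return annotate


def ground_label(label: str, annotators: List[Annotator]) -> str:
    """Try each annotator in order; fall back to the original text."""
    for annotate in annotators:
        identifier = annotate(label)
        if identifier is not None:
            return identifier
    return label


annotators = [
    dict_annotator({"ctnnb1": "HGNC:2514"}),
    dict_annotator({"ctnnb1": "EXAMPLE:0001", "us3": "EXAMPLE:0002"}),
]
grounded = [ground_label(label, annotators) for label in ["CTNNB1", "US3", "ISG"]]
print(grounded)
```

Under this assumption, listing the most specific or most trusted annotator first matters, because a later annotator never overrides an earlier match.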

Additionally, when performing grounding, the following measures can be taken to improve accuracy:

specify the valid set of ID prefixes using id_prefixes
some vocabularies have structural IDs that are amenable to regexes; you can specify these using pattern
you can make use of the values_from slot to specify a Dynamic Value Set
- for example, you can constrain the set of valid locations for a gene product to be subclasses of cellular_component in GO or cell in CL.

For example:

classes:
  ...
  GeneLocation:
    is_a: NamedEntity
    id_prefixes:
      - GO
      - CL
    annotations:
      annotators: "sqlite:obo:go, sqlite:obo:cl"
    slot_usage:
      id:
        values_from:
          - GOCellComponentType
          - CellType

enums:

  GOCellComponentType:
    reachable_from:
      source_ontology: obo:go
      source_nodes:
        - GO:0005575 ## cellular_component
  CellType:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000000 ## cell
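
The effect of id_prefixes and pattern on grounding can be pictured as a filter over candidate identifiers. The sketch below is illustrative only: `is_valid_id` is a hypothetical helper, the regex is an example of a structural ID pattern, and values_from (which needs an ontology lookup) is left out.

```python
import re
from typing import List, Optional


def is_valid_id(
    identifier: str,
    id_prefixes: List[str],
    pattern: Optional[str] = None,
) -> bool:
    """Accept a CURIE only if its prefix is allowed and it matches the pattern."""
    prefix, sep, local = identifier.partition(":")
    if not sep or not local:
        return False  # not a CURIE at all, e.g. an ungrounded label
    if id_prefixes and prefix not in id_prefixes:
        return False
    if pattern is not None and re.fullmatch(pattern, identifier) is None:
        return False
    return True


candidates = ["GO:0005634", "CL:0000000", "HGNC:2514", "GO:123", "Nuclear"]
accepted = [
    c for c in candidates if is_valid_id(c, ["GO", "CL"], r"(GO|CL):\d{7}")
]
print(accepted)
```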

OWL Exports

The extract command will let you export the results as OWL axioms, utilizing linkml-owl mappings in the schema.

For example:

ontogpt extract -t recipe recipe-spaghetti.txt -o recipe-spaghetti.owl -O owl

See src/ontogpt/templates/recipe.yaml for an example of a schema that uses linkml-owl mappings.

See the Makefile for a full pipeline that involves using robot to extract a subset of FOODON and merge in the extracted results.

HALO: Usage

TODO

OntoGPT Limitations

Non-deterministic

This relies on an existing LLM, and LLMs can be fickle in their responses.

Coupled to OpenAI

You will need an OpenAI account. In theory any LLM can be used, but in practice the parser is tuned for OpenAI.

Acknowledgements

This cookiecutter project was developed from the sphintoxetry-cookiecutter template and will be kept up-to-date using cruft.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ontogpt-0.2.0.tar.gz (945.5 kB)

Uploaded Mar 23, 2023 Source

Built Distribution

ontogpt-0.2.0-py3-none-any.whl (978.6 kB)

Uploaded Mar 23, 2023 Python 3

Hashes for ontogpt-0.2.0.tar.gz

Algorithm	Hash digest
SHA256	`71c56dcf05fe125cd0478c5b2cd19c8be73fdb614d09ace18b43752b31864e8d`
MD5	`986442c42c906b617f630d23fac4c129`
BLAKE2b-256	`26d6fb298b0cadb342c2d91f4fb3e0630f98502ab03d2712277e0ea30c5c75ec`

Hashes for ontogpt-0.2.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`e562f2b7e4a0b01424c8a69b209063f5f5a393480198fd5c477c53b5a3ec70a1`
MD5	`1bf50eea62bd13ff26079d4984dadd8a`
BLAKE2b-256	`fd18eec8e2ccfa667c15f8d474280092ec5dec74ab8cf149eb26bdd1150c110c`