
Extract a knowledge graph using LLMs from any text or messages array


kg-gen: Knowledge Graph Generation from Any Text

Welcome! kg-gen helps you extract knowledge graphs from any plain text using AI. It can process both small and large text inputs, and it can also handle messages in a conversation format.

Why generate knowledge graphs? kg-gen is great if you want to:

  • Create a graph to assist with RAG (Retrieval-Augmented Generation)
  • Create synthetic graph data for model training and testing
  • Structure any text into a graph
  • Analyze the relationships between concepts in your source text

We support API-based and local model providers via LiteLLM, including OpenAI, Ollama, Anthropic, Gemini, DeepSeek, and others. We also use DSPy for structured output generation.

Powered by a model of your choice

Pass in a model string to use a model of your choice. Model calls are routed through LiteLLM, which generally expects model strings in the format {model_provider}/{model_name}. See https://docs.litellm.ai/docs/providers for the exact format each provider requires.

Examples of models you can pass in:

  • openai/gpt-4o
  • gemini/gemini-2.0-flash
  • ollama_chat/deepseek-r1:14b
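To make the {model_provider}/{model_name} convention concrete, here is a minimal sketch that splits a model string at its first slash. The helper name split_model_string is hypothetical and is not part of kg-gen or LiteLLM; LiteLLM's own routing logic is more involved and is not reproduced here.

```python
from typing import Tuple


def split_model_string(model: str) -> Tuple[str, str]:
    """Split '{model_provider}/{model_name}' at the first slash.

    Hypothetical helper for illustration only; not a kg-gen or LiteLLM API.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '{{model_provider}}/{{model_name}}', got {model!r}")
    return provider, name


# The three example model strings listed above.
parsed = [
    split_model_string(m)
    for m in ("openai/gpt-4o", "gemini/gemini-2.0-flash", "ollama_chat/deepseek-r1:14b")
]
```

Splitting only at the first slash keeps any later characters, such as the colon in deepseek-r1:14b, inside the model name.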

You may specify a custom API base URL with base_url.

Quick start

Install the module:

pip install kg-gen

Then import and use kg-gen. You can provide your text input in one of two formats:

  1. A single string
  2. A list of message dicts (each with a role and content)

Below are some example snippets:

from kg_gen import KGGen

# Initialize KGGen with optional configuration
kg = KGGen(
  model="openai/gpt-4o",  # Default model
  temperature=0.0,        # Default temperature
  api_key="YOUR_API_KEY"  # Optional if set in environment or using a local model
)

# EXAMPLE 1: Single string with context
text_input = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
graph_1 = kg.generate(
  input_data=text_input,
  context="Family relationships"
)
# Output: 
# entities={'Linda', 'Ben', 'Andrew', 'Josh'} 
# edges={'is brother of', 'is father of', 'is mother of'} 
# relations={('Ben', 'is brother of', 'Josh'), 
#           ('Andrew', 'is father of', 'Josh'), 
#           ('Linda', 'is mother of', 'Josh')}
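Once a graph has been generated, its relations can be handled as plain (subject, predicate, object) triples. The sketch below assumes the relations are available as a set of 3-tuples shaped like the output above, and groups them by subject. The function name relations_by_subject is hypothetical and not part of kg-gen.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Relations copied from the Example 1 output above.
relations: Set[Triple] = {
    ("Ben", "is brother of", "Josh"),
    ("Andrew", "is father of", "Josh"),
    ("Linda", "is mother of", "Josh"),
}


def relations_by_subject(triples: Set[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (predicate, object) pairs under each subject entity."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    # Sort first so the result is deterministic despite set ordering.
    for subject, predicate, obj in sorted(triples):
        index[subject].append((predicate, obj))
    return dict(index)


by_subject = relations_by_subject(relations)
```

An index like this makes it easy to answer questions such as "what do we know about Linda?" without scanning every triple.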

Visualizing KGs

KGGen.visualize(graph, output_path, open_in_browser=True)


More Examples - chunking, clustering, passing in a messages array

# EXAMPLE 2: Large text with chunking and clustering
with open('large_text.txt', 'r') as f:
  large_text = f.read()
  
# Example input text:
# """
# Neural networks are a type of machine learning model. Deep learning is a subset of machine learning
# that uses multiple layers of neural networks. Supervised learning requires training data to learn
# patterns. Machine learning is a type of AI technology that enables computers to learn from data.
# AI, also known as artificial intelligence, is related to the broader field of artificial intelligence.
# Neural nets (NN) are commonly used in ML applications. Machine learning (ML) has revolutionized
# many fields of study.
# ...
# """

graph_2 = kg.generate(
  input_data=large_text,
  chunk_size=5000,  # Process text in chunks of 5000 chars
  cluster=True      # Cluster similar entities and relations
)
# Output:
# entities={'neural networks', 'deep learning', 'machine learning', 'AI', 'artificial intelligence', 
#          'supervised learning', 'unsupervised learning', 'training data', ...} 
# edges={'is type of', 'requires', 'is subset of', 'uses', 'is related to', ...} 
# relations={('neural networks', 'is type of', 'machine learning'),
#           ('deep learning', 'is subset of', 'machine learning'),
#           ('supervised learning', 'requires', 'training data'),
#           ('machine learning', 'is type of', 'AI'),
#           ('AI', 'is related to', 'artificial intelligence'), ...}
# entity_clusters={
#   'artificial intelligence': {'AI', 'artificial intelligence'},
#   'machine learning': {'machine learning', 'ML'},
#   'neural networks': {'neural networks', 'neural nets', 'NN'},
#   ...
# }
# edge_clusters={
#   'is type of': {'is type of', 'is a type of', 'is a kind of'},
#   'is related to': {'is related to', 'is connected to', 'is associated with'},
#   ...
# }

# EXAMPLE 3: Messages array
messages = [
  {"role": "user", "content": "What is the capital of France?"}, 
  {"role": "assistant", "content": "The capital of France is Paris."}
]
graph_3 = kg.generate(input_data=messages)
# Output: 
# entities={'Paris', 'France'} 
# edges={'has capital'} 
# relations={('France', 'has capital', 'Paris')}

# EXAMPLE 4: Combining multiple graphs
text1 = "Linda is Joe's mother. Ben is Joe's brother."

# Second input text: note that Joseph also goes by Joe
text2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."

graph4_a = kg.generate(input_data=text1)
graph4_b = kg.generate(input_data=text2)

# Combine the graphs
combined_graph = kg.aggregate([graph4_a, graph4_b])

# Optionally cluster the combined graph
clustered_graph = kg.cluster(
  combined_graph,
  context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Joe', 'Joseph', 'Judy'} 
# edges={'is mother of', 'is father of', 'is brother of', 'is sister of'} 
# relations={('Linda', 'is mother of', 'Joe'),
#           ('Ben', 'is brother of', 'Joe'),
#           ('Andrew', 'is father of', 'Joe'),
#           ('Judy', 'is sister of', 'Andrew')}
# entity_clusters={
#   'Joe': {'Joe', 'Joseph'},
#   ...
# }
# edge_clusters={ ... }

Install from this repository:

Clone this repository and install dependencies using pip install -e '.[dev]'.

You may verify that it works by running python tests/test_basic.py from the root directory. This will also generate a nice visualization in tests/test_basic.html.

Features

Chunking Large Texts

For large texts, you can specify a chunk_size parameter to process the text in smaller chunks:

graph = kg.generate(
  input_data=large_text,
  chunk_size=5000  # Process in chunks of 5000 characters
)
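To illustrate what a chunk_size of 5000 characters means, here is a naive fixed-size splitter. This is only a sketch under the assumption that chunks are measured in characters; how kg-gen actually picks chunk boundaries (for example, whether it respects sentence breaks) is not described here, and chunk_text is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Naive illustration only; kg-gen's real chunking may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 12,000-character input yields two full chunks and one partial chunk.
sample = "a" * 12000
chunks = chunk_text(sample, 5000)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk can then be processed independently, which is what keeps individual model calls within a manageable input size.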

Clustering Similar Entities and Relations

You can cluster similar entities and relations either during generation or afterwards:

# During generation
graph = kg.generate(
  input_data=text,
  cluster=True,
  context="Optional context to guide clustering"
)

# Or after generation
clustered_graph = kg.cluster(
  graph,
  context="Optional context to guide clustering"
)
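One way to picture the effect of clustering is as a rewrite of every alias to its cluster label. The sketch below assumes entity clusters map a canonical label to a set of aliases, as in the example outputs above; apply_entity_clusters is a hypothetical helper and does not show how kg-gen decides which entities belong together.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def apply_entity_clusters(
    relations: Set[Triple], entity_clusters: Dict[str, Set[str]]
) -> Set[Triple]:
    """Replace each aliased entity in the triples with its cluster label."""
    alias_to_label = {
        alias: label
        for label, aliases in entity_clusters.items()
        for alias in aliases
    }
    return {
        (alias_to_label.get(subj, subj), pred, alias_to_label.get(obj, obj))
        for subj, pred, obj in relations
    }


raw_relations = {
    ("Linda", "is mother of", "Joe"),
    ("Andrew", "is father of", "Joseph"),
}
clusters = {"Joe": {"Joe", "Joseph"}}
canonical_relations = apply_entity_clusters(raw_relations, clusters)
```

After the rewrite, both triples point at the same "Joe" node, so the two facts about him are connected in the graph.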

Aggregating Multiple Graphs

You can combine multiple graphs using the aggregate method:

graph1 = kg.generate(input_data=text1)
graph2 = kg.generate(input_data=text2)
combined_graph = kg.aggregate([graph1, graph2])
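Conceptually, combining graphs can be thought of as a set union over entities, edges, and relations. The sketch below uses a hypothetical SimpleGraph dataclass as a stand-in for kg-gen's Graph type; the field names mirror the example outputs above, but the real class and the real aggregate implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class SimpleGraph:
    """Hypothetical stand-in for a generated graph (not kg-gen's Graph)."""
    entities: Set[str] = field(default_factory=set)
    edges: Set[str] = field(default_factory=set)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)


def aggregate_graphs(graphs: Iterable[SimpleGraph]) -> SimpleGraph:
    """Union the entities, edges, and relations of all input graphs."""
    combined = SimpleGraph()
    for graph in graphs:
        combined.entities |= graph.entities
        combined.edges |= graph.edges
        combined.relations |= graph.relations
    return combined


g1 = SimpleGraph({"Linda", "Joe"}, {"is mother of"}, {("Linda", "is mother of", "Joe")})
g2 = SimpleGraph({"Ben", "Joe"}, {"is brother of"}, {("Ben", "is brother of", "Joe")})
merged = aggregate_graphs([g1, g2])
```

A plain union does not merge aliases such as "Joe" and "Joseph"; that is the job of a separate clustering step.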

Message Array Processing

When processing message arrays, kg-gen:

  1. Preserves the role information from each message
  2. Maintains message order and boundaries
  3. Can extract entities and relationships:
    • Between concepts mentioned in messages
    • Between speakers (roles) and concepts
    • Across multiple messages in a conversation

For example, given this conversation:

messages = [
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."}
]

The generated graph might include entities like:

  • "user"
  • "assistant"
  • "France"
  • "Paris"

And relations like:

  • ("user", "asks about", "France")
  • ("assistant", "states", "Paris")
  • ("Paris", "is capital of", "France")
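To show how role information and message order can survive into a single text input, here is a minimal sketch that flattens a messages array into role-prefixed lines. This is an assumption made for illustration; the way kg-gen serializes messages internally is not stated here, and messages_to_text is a hypothetical helper.

```python
from typing import Dict, List


def messages_to_text(messages: List[Dict[str, str]]) -> str:
    """Join messages into one string, one 'role: content' line per message.

    Hypothetical illustration; not kg-gen's internal serialization.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
conversation_text = messages_to_text(messages)
```

Keeping the role prefix on every line is what would let an extractor relate speakers such as "user" and "assistant" to the concepts they mention.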

API Reference

KGGen Class

Constructor Parameters

  • model: str = "openai/gpt-4o" - The model to use for generation
  • temperature: float = 0.0 - Temperature for model sampling
  • api_key: Optional[str] = None - API key for model access

generate() Method Parameters

  • input_data: Union[str, List[Dict]] - Text string or list of message dicts
  • model: Optional[str] - Override the default model
  • api_key: Optional[str] - Override the default API key
  • context: str = "" - Description of data context
  • chunk_size: Optional[int] - Size of text chunks to process
  • cluster: bool = False - Whether to cluster the graph after generation
  • temperature: Optional[float] - Override the default temperature
  • output_folder: Optional[str] - Path to save partial progress

cluster() Method Parameters

  • graph: Graph - The graph to cluster
  • context: str = "" - Description of data context
  • model: Optional[str] - Override the default model
  • temperature: Optional[float] - Override the default temperature
  • api_key: Optional[str] - Override the default API key

aggregate() Method Parameters

  • graphs: List[Graph] - List of graphs to combine

License

The MIT License.
