A simple tool to generate knowledge graphs from documents
Noosphere by Lepista Bioinformatics Lab
A powerful CLI tool for building knowledge graphs from unstructured text documents using Large Language Models (LLMs) and the LangChain LLM Graph Transformer. This project specializes in tool-based extraction for more accurate and consistent knowledge graph construction.
Note: In this project, Noosphere refers to the overall process of building a knowledge graph from unstructured text documents. The term itself describes the planetary "sphere of reason".
What is a Knowledge Graph?
A knowledge graph is a structured representation of information that connects entities (nodes) through relationships (edges). Unlike traditional databases that store data in tables, knowledge graphs represent information as a network of interconnected concepts, making it easier to understand complex relationships and answer multi-hop questions.
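To make the idea of a multi-hop question concrete, here is a minimal sketch in plain Python. It stores a graph as (source, relationship, target) triples and answers a two-hop question: which compounds are produced by microorganisms that act in a given plant? The entity names and the helper functions are made up for illustration; only the relationship types PRODUCES and ACTS_IN are taken from this README.

```python
from collections import defaultdict

# Illustrative triples; the entity names are invented for this example.
TRIPLES = [
    ("Bacillus subtilis", "PRODUCES", "Surfactin"),
    ("Bacillus subtilis", "ACTS_IN", "Tomato"),
    ("Trichoderma harzianum", "PRODUCES", "Chitinase"),
    ("Trichoderma harzianum", "ACTS_IN", "Maize"),
]


def build_index(triples):
    """Index edges in both directions so they can be followed either way."""
    outgoing = defaultdict(set)  # (source, relationship) -> targets
    incoming = defaultdict(set)  # (relationship, target) -> sources
    for source, relationship, target in triples:
        outgoing[(source, relationship)].add(target)
        incoming[(relationship, target)].add(source)
    return outgoing, incoming


def compounds_acting_in(plant, triples):
    """Two hops: plant <-ACTS_IN- microorganism -PRODUCES-> compound."""
    outgoing, incoming = build_index(triples)
    products = set()
    for microbe in incoming[("ACTS_IN", plant)]:
        products |= outgoing[(microbe, "PRODUCES")]
    return sorted(products)


print(compounds_acting_in("Tomato", TRIPLES))
```

A table-oriented store would need a join per hop for the same question; in a graph the hops are just edge traversals.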
Why Knowledge Graphs Matter
Knowledge graphs are particularly valuable for:
- Retrieval-Augmented Generation (RAG): While text embeddings work well for simple queries, knowledge graphs excel at answering complex, multi-hop questions that require understanding connections across multiple entities
- Structured Operations: Enable filtering, sorting, and aggregation operations that are challenging with unstructured text
- Relationship Discovery: Reveal hidden connections and patterns in data that might not be apparent from raw text
- Semantic Search: Provide more accurate and contextually relevant search results
How It Works
The Noosphere Knowledge Graph Builder leverages LangChain's LLM Graph Transformer with tool-based extraction to automatically extract structured information from unstructured text documents. Here's how the process works:
1. Document Processing
The system processes various document formats using the docling library:
- PDF manuscripts (automatically converted to markdown)
- Plain text documents
- Any document format supported by docling
2. Tool-Based LLM Extraction
The system uses tool-based extraction exclusively, which provides:
- Structured Output: Uses LLMs with function calling capabilities (like GPT-4) for more accurate extraction
- Consistent Results: Leverages predefined schemas for entities and relationships
- Property Extraction: Supports detailed property extraction for both nodes and relationships
- Validation: Ensures extracted data conforms to defined rules and constraints
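As a rough illustration of why function calling helps consistency, the sketch below builds a JSON-schema style tool definition whose `enum` fields restrict the model to the allowed node and relationship types. This is not the schema LangChain actually sends; the tool name, the field names, and the `build_extraction_tool` helper are all assumptions made for this example.

```python
import json


def build_extraction_tool(allowed_nodes, allowed_relationship_types):
    """Return a tool definition that constrains node and relationship types."""
    node_type = {"type": "string", "enum": sorted(allowed_nodes)}
    relationship_item = {
        "type": "object",
        "properties": {
            "source_id": {"type": "string"},
            "source_type": node_type,
            "relation": {
                "type": "string",
                "enum": sorted(allowed_relationship_types),
            },
            "target_id": {"type": "string"},
            "target_type": node_type,
        },
        "required": [
            "source_id", "source_type", "relation", "target_id", "target_type",
        ],
    }
    return {
        "name": "extract_relationships",  # hypothetical tool name
        "description": "Record the relationships found in the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "relationships": {"type": "array", "items": relationship_item},
            },
            "required": ["relationships"],
        },
    }


tool = build_extraction_tool({"Microorganism", "Compound"}, {"PRODUCES"})
print(json.dumps(tool, indent=2))
```

Because the allowed types are part of the schema, a model that honors the tool contract cannot return a node type outside the configured list.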
3. Configurable Graph Rules
The system uses YAML configuration files to define:
- Allowed Nodes: Specific entity types that can be extracted (e.g., Microorganism, Compound, Enzyme, Gene)
- Allowed Relationships: Valid connections between entities (e.g., "PRODUCES", "ACTS_IN", "AFFECTS")
- Node Properties: Attributes for entities (e.g., scientific_name, strain_name, description)
- Relationship Properties: Attributes for connections (e.g., production_date, reference_id)
- Text Processing: Chunk size, overlap, and other text splitting parameters
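The chunk size and overlap parameters interact in a way that is easy to get wrong, so here is a simplified sketch of overlapping chunking. It uses a plain character window; the splitter used by the project may instead break on separators such as paragraphs, so treat `split_text` as a hypothetical stand-in rather than the actual implementation.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of sending some text to the LLM twice.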
4. Graph Construction
The extracted information is structured into:
- Nodes: Entities defined in the configuration (e.g., microorganisms, compounds, enzymes)
- Relationships: Connections between entities with specific relationship types
- Properties: Additional attributes for nodes and relationships
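The structure above can be sketched with two small data classes plus a filter that drops anything outside the configured schema. The class names, the `filter_relationships` helper, and the exact filtering behavior are assumptions for illustration; the project's own graph document types and its handling of `strict_mode` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    source: Node
    type: str
    target: Node
    properties: dict = field(default_factory=dict)


def filter_relationships(relationships, allowed_nodes, allowed_triples):
    """Keep relationships whose endpoint types are allowed and whose
    (source type, relationship type, target type) pattern is allowed."""
    kept = []
    for rel in relationships:
        pattern = (rel.source.type, rel.type, rel.target.type)
        if (
            rel.source.type in allowed_nodes
            and rel.target.type in allowed_nodes
            and pattern in allowed_triples
        ):
            kept.append(rel)
    return kept


# Invented sample data for the example.
microbe = Node("Bacillus subtilis", "Microorganism")
compound = Node("Surfactin", "Compound")
person = Node("J. Doe", "Person")
extracted = [
    Relationship(microbe, "PRODUCES", compound, {"ReferenceId": "ref-1"}),
    Relationship(person, "STUDIES", microbe),
]
kept = filter_relationships(
    extracted,
    allowed_nodes={"Microorganism", "Compound"},
    allowed_triples={("Microorganism", "PRODUCES", "Compound")},
)
print([(r.source.id, r.type, r.target.id) for r in kept])
```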
5. Neo4j Database Storage
The constructed knowledge graph is stored in Neo4j, providing:
- Native graph operations and querying
- Built-in visualization capabilities
- Efficient traversal and relationship exploration
- Scalability for large knowledge graphs
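To give a feel for what ends up in the database, the sketch below builds a parameterized Cypher `MERGE` statement for one relationship. It only produces the query text and parameters and does not talk to a database; the project itself registers graph documents through its Neo4j integration, so this helper, the `id` property key, and the identifier check are assumptions for illustration. Cypher does not accept labels or relationship types as query parameters, which is why they are validated and interpolated while the values are passed as parameters.

```python
import re

# Conservative pattern for labels and relationship types.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    """Reject anything that is not a plain identifier before interpolating."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe Cypher identifier: {name!r}")
    return name


def merge_statement(source_type, source_id, rel_type, target_type, target_id):
    """Return (query, params) that idempotently creates two nodes and an edge."""
    query = (
        f"MERGE (a:{_check(source_type)} {{id: $source_id}}) "
        f"MERGE (b:{_check(target_type)} {{id: $target_id}}) "
        f"MERGE (a)-[:{_check(rel_type)}]->(b)"
    )
    return query, {"source_id": source_id, "target_id": target_id}


query, params = merge_statement(
    "Microorganism", "Bacillus subtilis", "PRODUCES", "Compound", "Surfactin"
)
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` means processing the same document twice does not duplicate nodes or edges.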
Getting Started
Prerequisites
- Python 3.12+
- Neo4j database (local or cloud instance)
- OpenAI API key (for GPT-4 tool-based extraction)
Installation
- Clone the repository:
  git clone <repository-url>
  cd noosphere
- Create a virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install the project using Poetry:
  poetry install
Configuration
- Neo4j Setup: Configure your Neo4j database connection in docker-compose.dev.yml or set the NEO4J_URL environment variable
- API Key: Set your OpenAI API key via the LLM_API_KEY environment variable
- Graph Rules: Create a YAML configuration file defining your knowledge graph schema
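If you script around the tool, it can help to check these variables before starting a long run. The snippet below is a minimal sketch of such a check; `load_settings` is a hypothetical helper and not part of the package, and only the variable names NEO4J_URL and LLM_API_KEY come from this README.

```python
import os


def load_settings(env=None):
    """Read the connection settings, failing early if any are missing."""
    env = os.environ if env is None else env
    required = ("NEO4J_URL", "LLM_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"neo4j_url": env["NEO4J_URL"], "llm_api_key": env["LLM_API_KEY"]}


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings(
    {"NEO4J_URL": "bolt://localhost:7687", "LLM_API_KEY": "your-openai-api-key"}
)
print(settings["neo4j_url"])
```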
Basic Usage
The project provides a CLI tool called noosphere-cli with the following commands:
Test Configuration
Test your graph rules configuration:
noosphere-cli test config --config-path path/to/config.yaml
Build Knowledge Graph
Build a knowledge graph from documents:
noosphere-cli kg build \
--documents-path /path/to/documents \
--config-path path/to/config.yaml \
--neo4j-url bolt://localhost:7687 \
--llm-api-key your-openai-api-key
Options:
- --documents-path: Folder containing documents to process
- --pattern: Optional file pattern (e.g., "*.pdf") to filter documents
- --config-path: Path to YAML configuration file with graph rules
- --neo4j-url: Neo4j database connection URL
- --llm-api-key: OpenAI API key for LLM access
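The way a documents folder and a file pattern combine can be sketched with the standard library. This is an assumption about the behavior, not the CLI's actual code: `find_documents` is a hypothetical helper, and this version only looks at the top level of the folder, whereas the real tool may search recursively.

```python
import tempfile
from pathlib import Path


def find_documents(documents_path, pattern="*"):
    """Return the files in documents_path matching pattern, sorted by path."""
    root = Path(documents_path)
    if not root.is_dir():
        raise NotADirectoryError(str(documents_path))
    return sorted(path for path in root.glob(pattern) if path.is_file())


# Demonstrate on a throwaway folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt", "c.pdf"):
        (Path(tmp) / name).write_text("placeholder")
    pdf_names = [path.name for path in find_documents(tmp, "*.pdf")]
    all_names = [path.name for path in find_documents(tmp)]

print(pdf_names)
print(all_names)
```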
Configuration File Example
Create a YAML file (e.g., config.yaml) with your graph rules:
strict_mode: true
allowed_nodes:
- Microorganism
- Bacteria
- Fungi
- Plant
- Compound
- Enzyme
- Gene
allowed_relationships:
- [Microorganism, PRODUCES, Compound]
- [Microorganism, PRODUCES, Enzyme]
- [Plant, RECRUITS, Microorganism]
- [Microorganism, ACTS_IN, Plant]
node_properties:
- ScientificName
- CommonName
- Description
relationship_properties:
- ReferenceId
- ProductionDate
prompt: |
  You are a helpful assistant that helps with building a graph of biological
  relationships. Focus on microbial and host entities, environment and
  conditions entities, microbial-to-microbial relationships, plant-to-microbial
  relationships, etc.
chunk_size: 10000
chunk_overlap: 200
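A configuration like this can be sanity-checked before any LLM calls are made. The sketch below mirrors part of the example as a Python dictionary (to avoid depending on a YAML parser) and checks two things: every relationship endpoint is an allowed node, and the chunking parameters are consistent. It is an illustration of the kind of check involved, not the logic behind `noosphere-cli test config`; `validate_rules` is a hypothetical helper.

```python
# Part of the example configuration, written as the dict a YAML parser would yield.
CONFIG = {
    "strict_mode": True,
    "allowed_nodes": [
        "Microorganism", "Bacteria", "Fungi", "Plant",
        "Compound", "Enzyme", "Gene",
    ],
    "allowed_relationships": [
        ["Microorganism", "PRODUCES", "Compound"],
        ["Microorganism", "PRODUCES", "Enzyme"],
        ["Plant", "RECRUITS", "Microorganism"],
        ["Microorganism", "ACTS_IN", "Plant"],
    ],
    "chunk_size": 10000,
    "chunk_overlap": 200,
}


def validate_rules(config):
    """Return a list of problems found; an empty list means the rules look fine."""
    problems = []
    nodes = set(config.get("allowed_nodes", []))
    for triple in config.get("allowed_relationships", []):
        if len(triple) != 3:
            problems.append(f"relationship {triple!r} must have three elements")
            continue
        source, _, target = triple
        for endpoint in (source, target):
            if endpoint not in nodes:
                problems.append(f"{endpoint!r} in {triple!r} is not an allowed node")
    size = config.get("chunk_size", 0)
    overlap = config.get("chunk_overlap", 0)
    if size <= 0:
        problems.append("chunk_size must be positive")
    elif not 0 <= overlap < size:
        problems.append("chunk_overlap must be at least 0 and below chunk_size")
    return problems


print(validate_rules(CONFIG))
```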
Key Features
Tool-Based Extraction
- Uses OpenAI's function calling for structured extraction
- Ensures consistent and validated output
- Supports complex entity and relationship schemas
Flexible Document Processing
- Automatic PDF to markdown conversion
- Configurable text chunking strategies
- Support for multiple document formats
Configurable Graph Schema
- YAML-based configuration for graph rules
- Validation of allowed nodes and relationships
- Customizable properties for entities and connections
Neo4j Integration
- Direct integration with Neo4j graph database
- Automatic graph document registration
- Support for source tracking and metadata
CLI Interface
- Simple command-line interface for all operations
- Environment variable support for configuration
- Comprehensive error handling and logging
Use Cases
- Scientific Literature: Extract research findings, chemical compounds, and biological relationships from research papers
- Biological Research: Map microorganism interactions, gene relationships, and metabolic pathways
- Agricultural Studies: Connect plant-microbe interactions and environmental conditions
- Medical Research: Identify drug-compound relationships and biological processes
- Academic Research: Build comprehensive knowledge bases from scientific documents
Architecture
Documents → Docling Converter → Text Chunks → LLM Graph Transformer → Structured Graph → Neo4j Database
- Documents: PDF/Text
- Docling Converter: Markdown Output with Metadata
- Text Chunks: Configurable Chunking
- LLM Graph Transformer: Tool-Based Extraction
- Structured Graph: Nodes & Edges with Properties
- Neo4j Database: Graph Storage
Contributing
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.
License
Acknowledgments
This project builds upon the excellent work of the LangChain community and the LLM Graph Transformer implementation. Special thanks to Tomaz Bratanic and the contributors who made this technology accessible to the broader community.
Resources
- LangChain Documentation
- Neo4j Documentation
- LLM Graph Transformer Guide
- Neo4j LLM Graph Builder (No-code alternative)