
A simple tool to generate knowledge graphs from documents

Noosphere by Lepista Bioinformatics Lab

A CLI tool for building knowledge graphs from unstructured text documents using Large Language Models (LLMs) and the LangChain LLM Graph Transformer. It relies on tool-based (function-calling) extraction, which aims for more accurate and consistent knowledge graph construction.

Note: The name Noosphere comes from the concept of the planetary "sphere of reason". In this project it refers to the overall process of building a knowledge graph from unstructured text documents.

What is a Knowledge Graph?

A knowledge graph is a structured representation of information that connects entities (nodes) through relationships (edges). Unlike traditional databases that store data in tables, knowledge graphs represent information as a network of interconnected concepts, making it easier to understand complex relationships and answer multi-hop questions.
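
As a minimal illustration of this idea (not part of Noosphere itself), a knowledge graph can be modeled as a list of (source, relationship, target) triples. The entity names below are made up for the example; the relationship types are borrowed from the configuration shown later in this README. The sketch answers a two-hop question by following edges:

```python
from collections import defaultdict

# Illustrative triples; the entity names are invented for this example.
TRIPLES = [
    ("Tomato", "RECRUITS", "Bacillus subtilis"),
    ("Bacillus subtilis", "PRODUCES", "Surfactin"),
    ("Bacillus subtilis", "ACTS_IN", "Tomato"),
]


def build_index(triples):
    """Index outgoing edges by (source, relationship) for fast traversal."""
    index = defaultdict(list)
    for source, relation, target in triples:
        index[(source, relation)].append(target)
    return index


def follow(index, start, relations):
    """Follow a chain of relationship types from a start node (multi-hop)."""
    frontier = [start]
    for relation in relations:
        frontier = [t for node in frontier for t in index.get((node, relation), [])]
    return frontier


index = build_index(TRIPLES)
# Two-hop question: which compounds are produced by microbes that Tomato recruits?
compounds = follow(index, "Tomato", ["RECRUITS", "PRODUCES"])
print(compounds)  # ['Surfactin']
```

A graph database such as Neo4j performs this kind of traversal natively; the dictionary index here only stands in for that.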

Why Knowledge Graphs Matter

Knowledge graphs are particularly valuable for:

  • Retrieval-Augmented Generation (RAG): While text embeddings work well for simple queries, knowledge graphs excel at answering complex, multi-hop questions that require understanding connections across multiple entities
  • Structured Operations: Enable filtering, sorting, and aggregation operations that are challenging with unstructured text
  • Relationship Discovery: Reveal hidden connections and patterns in data that might not be apparent from raw text
  • Semantic Search: Provide more accurate and contextually relevant search results

How It Works

The Noosphere Knowledge Graph Builder leverages LangChain's LLM Graph Transformer with tool-based extraction to automatically extract structured information from unstructured text documents. Here's how the process works:

1. Document Processing

The system processes various document formats using the docling library:

  • PDF manuscripts (automatically converted to markdown)
  • Plain text documents
  • Any document format supported by docling

2. Tool-Based LLM Extraction

The system uses tool-based extraction exclusively, which provides:

  • Structured Output: Uses LLMs with function-calling capabilities (such as GPT-4) for more accurate extraction
  • Consistent Results: Leverages predefined schemas for entities and relationships
  • Property Extraction: Supports detailed property extraction for both nodes and relationships
  • Validation: Ensures extracted data conforms to defined rules and constraints
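
To make schema-constrained output more concrete, here is a minimal sketch of a JSON-Schema-style tool definition whose `enum` fields restrict the model to the allowed types, together with a check on a returned payload. The field names (`source_type`, `relation`, `target_type`) and both helper functions are hypothetical; the schema that the LangChain LLM Graph Transformer actually sends to the model is different and is not reproduced here.

```python
def build_relationship_schema(allowed_nodes, allowed_relations):
    """Build a JSON-Schema-like dict restricting types to the allowed values."""
    return {
        "type": "object",
        "properties": {
            "source": {"type": "string"},
            "source_type": {"type": "string", "enum": list(allowed_nodes)},
            "relation": {"type": "string", "enum": list(allowed_relations)},
            "target": {"type": "string"},
            "target_type": {"type": "string", "enum": list(allowed_nodes)},
        },
        "required": ["source", "source_type", "relation", "target", "target_type"],
    }


def conforms(payload, schema):
    """Check required keys and enum membership (a tiny subset of JSON Schema)."""
    if any(key not in payload for key in schema["required"]):
        return False
    for key, rule in schema["properties"].items():
        value = payload[key]
        if not isinstance(value, str):
            return False
        if "enum" in rule and value not in rule["enum"]:
            return False
    return True


schema = build_relationship_schema(["Microorganism", "Compound"], ["PRODUCES"])
good = {
    "source": "Bacillus subtilis",
    "source_type": "Microorganism",
    "relation": "PRODUCES",
    "target": "Surfactin",
    "target_type": "Compound",
}
bad = dict(good, relation="INHIBITS")  # relation type outside the allowed set
print(conforms(good, schema), conforms(bad, schema))  # True False
```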

3. Configurable Graph Rules

The system uses YAML configuration files to define:

  • Allowed Nodes: Specific entity types that can be extracted (e.g., Microorganism, Compound, Enzyme, Gene)
  • Allowed Relationships: Valid connections between entities (e.g., "PRODUCES", "ACTS_IN", "AFFECTS")
  • Node Properties: Attributes for entities (e.g., ScientificName, CommonName, Description)
  • Relationship Properties: Attributes for connections (e.g., ReferenceId, ProductionDate)
  • Text Processing: Chunk size, overlap, and other text splitting parameters
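
The chunk size and overlap parameters can be pictured with a simple sliding window over characters. This README does not show how Noosphere splits text internally, so the function below is a simplified stand-in written for illustration, and `split_text` is a hypothetical name:

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so an entity mention that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.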

4. Graph Construction

The extracted information is structured into:

  • Nodes: Entities defined in the configuration (e.g., microorganisms, compounds, enzymes)
  • Relationships: Connections between entities with specific relationship types
  • Properties: Additional attributes for nodes and relationships
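
A minimal sketch of these three pieces, assuming plain dataclasses rather than the classes the project actually uses. `Node`, `Relationship`, and `filter_allowed` are hypothetical names, and filtering against a set of allowed triples is only one plausible reading of what a strict mode could do:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass
class Relationship:
    source: Node
    type: str
    target: Node
    properties: dict = field(default_factory=dict)


# Allowed (source type, relationship, target type) triples,
# in the spirit of the YAML example in this README.
ALLOWED = {
    ("Microorganism", "PRODUCES", "Compound"),
    ("Plant", "RECRUITS", "Microorganism"),
}


def filter_allowed(relationships, allowed):
    """Keep only relationships whose type triple is explicitly allowed."""
    return [
        rel for rel in relationships
        if (rel.source.type, rel.type, rel.target.type) in allowed
    ]


microbe = Node("Bacillus subtilis", "Microorganism")
compound = Node("Surfactin", "Compound")
extracted = [
    Relationship(microbe, "PRODUCES", compound, {"ReferenceId": "ref-1"}),
    Relationship(compound, "PRODUCES", microbe),  # direction not allowed
]
kept = filter_allowed(extracted, ALLOWED)
print(len(kept), kept[0].properties)  # 1 {'ReferenceId': 'ref-1'}
```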

5. Neo4j Database Storage

The constructed knowledge graph is stored in Neo4j, providing:

  • Native graph operations and querying
  • Built-in visualization capabilities
  • Efficient traversal and relationship exploration
  • Scalability for large knowledge graphs
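
For readers unfamiliar with Neo4j, the sketch below builds the text of a parameterized Cypher `MERGE` statement for one relationship. It only generates the query string; how Noosphere itself writes to Neo4j is not shown in this README, and `merge_statement` is a hypothetical helper. In practice the string would be run through a Neo4j driver session together with the `$source_id`, `$target_id`, and `$properties` parameters.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def merge_statement(source_label, relation, target_label):
    """Return a parameterized Cypher MERGE statement for one relationship."""
    for name in (source_label, relation, target_label):
        # Labels and relationship types are interpolated into the query
        # text here, so validate them first to avoid injection.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MERGE (a:{source_label} {{id: $source_id}}) "
        f"MERGE (b:{target_label} {{id: $target_id}}) "
        f"MERGE (a)-[r:{relation}]->(b) "
        "SET r += $properties"
    )


query = merge_statement("Microorganism", "PRODUCES", "Compound")
print(query)
```

Using `MERGE` rather than `CREATE` keeps repeated runs over the same documents from duplicating nodes and edges.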

Getting Started

Prerequisites

  • Python 3.12+
  • Neo4j database (local or cloud instance)
  • OpenAI API key (for GPT-4 tool-based extraction)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd noosphere
    
  2. Create a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install the project with Poetry:

    poetry install
    

Configuration

  1. Neo4j Setup: Configure your Neo4j database connection in docker-compose.dev.yml or set the NEO4J_URL environment variable
  2. API Key: Set your OpenAI API key via the LLM_API_KEY environment variable
  3. Graph Rules: Create a YAML configuration file defining your knowledge graph schema
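
The sketch below shows one way settings like these could be resolved in Python. The environment variable names `NEO4J_URL` and `LLM_API_KEY` come from this README; the function name, the returned dictionary keys, and the rule that explicit values win over environment variables are assumptions made for illustration.

```python
import os


def resolve_settings(cli_neo4j_url=None, cli_api_key=None, env=None):
    """Prefer explicit values, then fall back to environment variables."""
    env = os.environ if env is None else env
    neo4j_url = cli_neo4j_url or env.get("NEO4J_URL")
    api_key = cli_api_key or env.get("LLM_API_KEY")
    missing = [
        name
        for name, value in (("NEO4J_URL", neo4j_url), ("LLM_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {"neo4j_url": neo4j_url, "llm_api_key": api_key}


# A plain dict stands in for os.environ so the example is reproducible.
fake_env = {"NEO4J_URL": "bolt://localhost:7687", "LLM_API_KEY": "example-key"}
settings = resolve_settings(env=fake_env)
print(settings["neo4j_url"])  # bolt://localhost:7687
```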

Basic Usage

The project provides a CLI tool called noosphere-cli with the following commands:

Test Configuration

Test your graph rules configuration:

noosphere-cli test config --config-path path/to/config.yaml

Build Knowledge Graph

Build a knowledge graph from documents:

noosphere-cli kg build \
    --documents-path /path/to/documents \
    --config-path path/to/config.yaml \
    --neo4j-url bolt://localhost:7687 \
    --llm-api-key your-openai-api-key

Options:

  • --documents-path: Folder containing documents to process
  • --pattern: Optional file pattern (e.g., "*.pdf") to filter documents
  • --config-path: Path to YAML configuration file with graph rules
  • --neo4j-url: Neo4j database connection URL
  • --llm-api-key: OpenAI API key for LLM access
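
To illustrate how a folder and an optional file pattern can combine, here is a small sketch using `pathlib`. Whether the CLI searches subfolders, and how exactly it applies `--pattern`, is not stated in this README; the function below assumes a non-recursive glob, and `find_documents` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def find_documents(documents_path, pattern="*"):
    """Return files directly inside documents_path matching pattern, sorted."""
    root = Path(documents_path)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Demonstrate on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt", "c.pdf"):
        (Path(tmp) / name).write_text("placeholder")
    pdf_names = [p.name for p in find_documents(tmp, "*.pdf")]

print(pdf_names)  # ['a.pdf', 'c.pdf']
```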

Configuration File Example

Create a YAML file (e.g., config.yaml) with your graph rules:

strict_mode: true

allowed_nodes:
  - Microorganism
  - Bacteria
  - Fungi
  - Plant
  - Compound
  - Enzyme
  - Gene

allowed_relationships:
  - [Microorganism, PRODUCES, Compound]
  - [Microorganism, PRODUCES, Enzyme]
  - [Plant, RECRUITS, Microorganism]
  - [Microorganism, ACTS_IN, Plant]

node_properties:
  - ScientificName
  - CommonName
  - Description

relationship_properties:
  - ReferenceId
  - ProductionDate

prompt: |
  You are a helpful assistant that helps with building a graph of biological 
  relationships. Focus on microbial and host entities, environment and 
  conditions entities, microbial-to-microbial relationships, plant-to-microbial 
  relationships, etc.

chunk_size: 10000
chunk_overlap: 200
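
Before running a build, it can help to sanity-check a configuration like the one above. The checks that `noosphere-cli test config` actually performs are not documented here, so the following is only an illustrative sketch. It works on a dictionary as it would look after the YAML file has been parsed, and `check_config` is a hypothetical helper.

```python
def check_config(config):
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    nodes = set(config.get("allowed_nodes", []))
    if not nodes:
        problems.append("allowed_nodes is empty")
    for triple in config.get("allowed_relationships", []):
        if len(triple) != 3:
            problems.append(f"relationship must have 3 items: {triple}")
            continue
        source, _, target = triple
        for node_type in (source, target):
            if node_type not in nodes:
                problems.append(f"unknown node type: {node_type}")
    size = config.get("chunk_size", 0)
    overlap = config.get("chunk_overlap", 0)
    if size <= 0 or not 0 <= overlap < size:
        problems.append("chunk_size must be positive and chunk_overlap in [0, chunk_size)")
    return problems


config = {
    "allowed_nodes": ["Microorganism", "Compound"],
    "allowed_relationships": [
        ["Microorganism", "PRODUCES", "Compound"],
        ["Plant", "RECRUITS", "Microorganism"],  # Plant is not declared above
    ],
    "chunk_size": 10000,
    "chunk_overlap": 200,
}
problems = check_config(config)
print(problems)  # ['unknown node type: Plant']
```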

Key Features

Tool-Based Extraction

  • Uses OpenAI's function calling for structured extraction
  • Ensures consistent and validated output
  • Supports complex entity and relationship schemas

Flexible Document Processing

  • Automatic PDF to markdown conversion
  • Configurable text chunking strategies
  • Support for multiple document formats

Configurable Graph Schema

  • YAML-based configuration for graph rules
  • Validation of allowed nodes and relationships
  • Customizable properties for entities and connections

Neo4j Integration

  • Direct integration with Neo4j graph database
  • Automatic graph document registration
  • Support for source tracking and metadata

CLI Interface

  • Simple command-line interface for all operations
  • Environment variable support for configuration
  • Comprehensive error handling and logging

Use Cases

  • Scientific Literature: Extract research findings, chemical compounds, and biological relationships from research papers
  • Biological Research: Map microorganism interactions, gene relationships, and metabolic pathways
  • Agricultural Studies: Connect plant-microbe interactions and environmental conditions
  • Medical Research: Identify drug-compound relationships and biological processes
  • Academic Research: Build comprehensive knowledge bases from scientific documents

Architecture

Documents → Docling Converter → Text Chunks → LLM Graph Transformer → Structured Graph → Neo4j Database
    ↓               ↓                ↓                  ↓                    ↓                 ↓
PDF/Text    Markdown Output     Configurable  Tool-Based              Nodes & Edges      Graph Storage
            with Metadata       Chunking      Extraction              with Properties

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

License

Apache-2.0

Acknowledgments

This project builds upon the excellent work of the LangChain community and the LLM Graph Transformer implementation. Special thanks to Tomaz Bratanic and the contributors who made this technology accessible to the broader community.
