A simple tool to generate knowledge graphs from documents


Noosphere by Lepista Bioinformatics Lab

A CLI tool for building knowledge graphs from unstructured text documents using Large Language Models (LLMs) and the LangChain LLM Graph Transformer. The project relies on tool-based (function-calling) extraction, which aims for more accurate and consistent knowledge graph construction.

[!NOTE] The name Noosphere, described as the planetary "sphere of reason", refers here to the overall process of building a knowledge graph from unstructured text documents.

What is a Knowledge Graph?

A knowledge graph is a structured representation of information that connects entities (nodes) through relationships (edges). Unlike traditional databases that store data in tables, knowledge graphs represent information as a network of interconnected concepts, making it easier to understand complex relationships and answer multi-hop questions.
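
To make the idea of a multi-hop question concrete, here is a minimal sketch that stores a graph as (subject, relationship, object) triples and follows a chain of relationship types. The entity names and the helper functions `build_index` and `follow` are illustrative only and are not part of Noosphere.

```python
from collections import defaultdict

# Each edge is a (subject, relationship, object) triple; the data is made up.
TRIPLES = [
    ("Rice", "RECRUITS", "Bacillus subtilis"),
    ("Bacillus subtilis", "PRODUCES", "Surfactin"),
    ("Bacillus subtilis", "PRODUCES", "Subtilisin"),
]


def build_index(triples):
    """Index outgoing edges by (subject, relationship)."""
    index = defaultdict(list)
    for subject, relationship, obj in triples:
        index[(subject, relationship)].append(obj)
    return index


def follow(index, start, path):
    """Follow a sequence of relationship types starting from one node."""
    frontier = [start]
    for relationship in path:
        frontier = [
            target
            for node in frontier
            for target in index.get((node, relationship), [])
        ]
    return frontier


index = build_index(TRIPLES)
# Two-hop question: what is produced by the organisms that rice recruits?
print(follow(index, "Rice", ["RECRUITS", "PRODUCES"]))
```

A table-oriented store would need a join per hop for the same question; in a graph each hop is a direct edge lookup.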

Why Knowledge Graphs Matter

Knowledge graphs are particularly valuable for:

Retrieval-Augmented Generation (RAG): While text embeddings work well for simple queries, knowledge graphs excel at answering complex, multi-hop questions that require understanding connections across multiple entities
Structured Operations: Enable filtering, sorting, and aggregation operations that are challenging with unstructured text
Relationship Discovery: Reveal hidden connections and patterns in data that might not be apparent from raw text
Semantic Search: Provide more accurate and contextually relevant search results

How It Works

The Noosphere Knowledge Graph Builder leverages LangChain's LLM Graph Transformer with tool-based extraction to automatically extract structured information from unstructured text documents. Here's how the process works:

1. Document Processing

The system processes various document formats using the docling library:

PDF manuscripts (automatically converted to markdown)
Plain text documents
Any document format supported by docling
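
The conversion step might look roughly like the sketch below. It reads plain text directly and hands everything else to docling's `DocumentConverter`. The function name `load_as_markdown` and the set of suffixes treated as plain text are assumptions for illustration; Noosphere's internal code may differ.

```python
from pathlib import Path

# Assumption: these suffixes are treated as text that needs no conversion.
PLAIN_TEXT_SUFFIXES = {".txt", ".md"}


def load_as_markdown(path: Path) -> str:
    """Return the content of a document as markdown text."""
    if path.suffix.lower() in PLAIN_TEXT_SUFFIXES:
        return path.read_text(encoding="utf-8")
    # Imported lazily so plain-text inputs work without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(str(path))
    return result.document.export_to_markdown()
```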

2. Tool-Based LLM Extraction

The system uses tool-based extraction exclusively, which provides:

Structured Output: Uses LLMs with function calling capabilities (like GPT-4) for more accurate extraction
Consistent Results: Leverages predefined schemas for entities and relationships
Property Extraction: Supports detailed property extraction for both nodes and relationships
Validation: Ensures extracted data conforms to defined rules and constraints
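
As a rough sketch of how a schema can be passed to LangChain's `LLMGraphTransformer`, the snippet below maps a configuration dictionary onto the transformer's constructor arguments. The dictionary keys mirror the YAML example later in this README, but the mapping itself, and the helpers `to_relationship_tuples` and `build_transformer`, are assumptions rather than Noosphere's actual code.

```python
def to_relationship_tuples(rules):
    """Turn [source, TYPE, target] lists (as written in YAML) into 3-tuples."""
    tuples = []
    for rule in rules:
        if len(rule) != 3:
            raise ValueError(f"expected [source, TYPE, target], got {rule!r}")
        tuples.append(tuple(rule))
    return tuples


def build_transformer(llm, config):
    """Create an LLMGraphTransformer from a dict shaped like the YAML config."""
    # Lazy import: requires the langchain-experimental package.
    from langchain_experimental.graph_transformers import LLMGraphTransformer

    return LLMGraphTransformer(
        llm=llm,
        allowed_nodes=config["allowed_nodes"],
        allowed_relationships=to_relationship_tuples(config["allowed_relationships"]),
        node_properties=config.get("node_properties", False),
        relationship_properties=config.get("relationship_properties", False),
        strict_mode=config.get("strict_mode", True),
    )


# Usage sketch (needs langchain-openai and an API key, so it is left commented out):
# from langchain_core.documents import Document
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4", temperature=0)
# transformer = build_transformer(llm, config)
# graph_documents = transformer.convert_to_graph_documents(
#     [Document(page_content="Bacillus subtilis produces surfactin.")]
# )
```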

3. Configurable Graph Rules

The system uses YAML configuration files to define:

Allowed Nodes: Specific entity types that can be extracted (e.g., Microorganism, Compound, Enzyme, Gene)
Allowed Relationships: Valid connections between entities (e.g., "PRODUCES", "ACTS_IN", "AFFECTS")
Node Properties: Attributes for entities (e.g., scientific_name, strain_name, description)
Relationship Properties: Attributes for connections (e.g., production_date, reference_id)
Text Processing: Chunk size, overlap, and other text splitting parameters
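
One way such rules could be enforced is shown below: an edge is kept only if both node types are allowed and the typed (source, relationship, target) triple is listed. This is a simplified illustration with hypothetical helper names (`is_allowed`, `filter_edges`), not the check Noosphere or LangChain actually performs.

```python
ALLOWED_NODES = {"Microorganism", "Plant", "Compound", "Enzyme"}
ALLOWED_RELATIONSHIPS = {
    ("Microorganism", "PRODUCES", "Compound"),
    ("Microorganism", "PRODUCES", "Enzyme"),
    ("Plant", "RECRUITS", "Microorganism"),
    ("Microorganism", "ACTS_IN", "Plant"),
}


def is_allowed(source_type, relationship, target_type):
    """True if both node types and the typed relationship are permitted."""
    return (
        source_type in ALLOWED_NODES
        and target_type in ALLOWED_NODES
        and (source_type, relationship, target_type) in ALLOWED_RELATIONSHIPS
    )


def filter_edges(edges):
    """Keep only (source_type, relationship, target_type) edges that pass the rules."""
    return [edge for edge in edges if is_allowed(*edge)]


candidates = [
    ("Microorganism", "PRODUCES", "Compound"),  # listed, kept
    ("Compound", "PRODUCES", "Microorganism"),  # direction not listed, dropped
    ("Plant", "RECRUITS", "Gene"),              # Gene is not in ALLOWED_NODES here, dropped
]
print(filter_edges(candidates))
```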

4. Graph Construction

The extracted information is structured into:

Nodes: Entities defined in the configuration (e.g., microorganisms, compounds, enzymes)
Relationships: Connections between entities with specific relationship types
Properties: Additional attributes for nodes and relationships

5. Neo4j Database Storage

The constructed knowledge graph is stored in Neo4j, providing:

Native graph operations and querying
Built-in visualization capabilities
Efficient traversal and relationship exploration
Scalability for large knowledge graphs
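
Once the graph is in Neo4j, a multi-hop question becomes a single Cypher pattern. The sketch below builds such a pattern and runs it with the official `neo4j` Python driver. It assumes nodes carry an `id` property and are labeled with their entity type, which is the LangChain convention but is not confirmed for Noosphere; `two_hop_query` and `run_query` are hypothetical helpers.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def two_hop_query(first, rel1, middle, rel2, last):
    """Build a Cypher pattern (first)-[rel1]->(middle)-[rel2]->(last)."""
    for name in (first, rel1, middle, rel2, last):
        # Labels and relationship types cannot be query parameters, so validate them.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MATCH (a:{first})-[:{rel1}]->(b:{middle})-[:{rel2}]->(c:{last}) "
        "RETURN a.id AS source, b.id AS via, c.id AS target"
    )


def run_query(url, user, password, query):
    """Run a query and return the rows as dictionaries."""
    from neo4j import GraphDatabase  # lazy import: requires the neo4j driver

    with GraphDatabase.driver(url, auth=(user, password)) as driver:
        records, _summary, _keys = driver.execute_query(query)
        return [record.data() for record in records]


print(two_hop_query("Plant", "RECRUITS", "Microorganism", "PRODUCES", "Compound"))
```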

Getting Started

Prerequisites

Python 3.12+
Neo4j database (local or cloud instance)
OpenAI API key (for GPT-4 tool-based extraction)

Installation

Clone the repository:

```
git clone <repository-url>
cd noosphere
```

Create a virtual environment:

```
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install the project using Poetry:

```
poetry install
```

Configuration

Neo4j Setup: Configure your Neo4j database connection in docker-compose.dev.yml or set the NEO4J_URL environment variable
API Key: Set your OpenAI API key via the LLM_API_KEY environment variable
Graph Rules: Create a YAML configuration file defining your knowledge graph schema
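
The snippet below sketches how a setting could be resolved from either a command-line value or an environment variable such as `NEO4J_URL` or `LLM_API_KEY`. The precedence shown (explicit value first, then environment) is an assumption; this README does not state how noosphere-cli orders them, and `resolve_setting` is a hypothetical helper.

```python
import os


def resolve_setting(cli_value, env_name):
    """Prefer an explicit value, then fall back to the environment variable."""
    value = cli_value or os.environ.get(env_name)
    if not value:
        raise RuntimeError(
            f"missing setting: pass it on the command line or set {env_name}"
        )
    return value
```

For example, `resolve_setting(None, "NEO4J_URL")` returns the value of `NEO4J_URL` when it is set and raises otherwise.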

Basic Usage

The project provides a CLI tool called noosphere-cli with the following commands:

Test Configuration

Test your graph rules configuration:

```
noosphere-cli test config --config-path path/to/config.yaml
```

Build Knowledge Graph

Build a knowledge graph from documents:

```
noosphere-cli kg build \
    --documents-path /path/to/documents \
    --config-path path/to/config.yaml \
    --neo4j-url bolt://localhost:7687 \
    --llm-api-key your-openai-api-key
```

Options:

--documents-path: Folder containing documents to process
--pattern: Optional file pattern (e.g., "*.pdf") to filter documents
--config-path: Path to YAML configuration file with graph rules
--neo4j-url: Neo4j database connection URL
--llm-api-key: OpenAI API key for LLM access
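
To illustrate what a file pattern such as "*.pdf" selects, here is a small sketch using `pathlib` globbing. It assumes a non-recursive match and a default pattern of "*"; whether noosphere-cli behaves exactly this way is not stated in this README, and `find_documents` is a hypothetical helper.

```python
from pathlib import Path


def find_documents(documents_path, pattern="*"):
    """Return files directly inside documents_path that match the glob pattern."""
    folder = Path(documents_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"not a folder: {folder}")
    return sorted(path for path in folder.glob(pattern) if path.is_file())
```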

Configuration File Example

Create a YAML file (e.g., config.yaml) with your graph rules:

```
strict_mode: true

allowed_nodes:
  - Microorganism
  - Bacteria
  - Fungi
  - Plant
  - Compound
  - Enzyme
  - Gene

allowed_relationships:
  - [Microorganism, PRODUCES, Compound]
  - [Microorganism, PRODUCES, Enzyme]
  - [Plant, RECRUITS, Microorganism]
  - [Microorganism, ACTS_IN, Plant]

node_properties:
  - ScientificName
  - CommonName
  - Description

relationship_properties:
  - ReferenceId
  - ProductionDate

prompt: |
  You are a helpful assistant that helps with building a graph of biological
  relationships. Focus on microbial and host entities, environment and
  conditions entities, microbial-to-microbial relationships, plant-to-microbial
  relationships, etc.

chunk_size: 10000
chunk_overlap: 200
```
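
The `chunk_size` and `chunk_overlap` values control how text is split before extraction. The sketch below shows the general idea with a plain character-based sliding window, where each chunk repeats the last `chunk_overlap` characters of the previous one. This is a simplification for illustration: the splitter Noosphere actually uses is not specified here and may split on separators instead, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=10000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap gives an entity mentioned near a chunk boundary a chance to appear with its context in at least one chunk, at the cost of processing some text twice.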

Key Features

Tool-Based Extraction

Uses OpenAI's function calling for structured extraction
Ensures consistent and validated output
Supports complex entity and relationship schemas

Flexible Document Processing

Automatic PDF to markdown conversion
Configurable text chunking strategies
Support for multiple document formats

Configurable Graph Schema

YAML-based configuration for graph rules
Validation of allowed nodes and relationships
Customizable properties for entities and connections

Neo4j Integration

Direct integration with Neo4j graph database
Automatic graph document registration
Support for source tracking and metadata

CLI Interface

Simple command-line interface for all operations
Environment variable support for configuration
Comprehensive error handling and logging

Use Cases

Scientific Literature: Extract research findings, chemical compounds, and biological relationships from research papers
Biological Research: Map microorganism interactions, gene relationships, and metabolic pathways
Agricultural Studies: Connect plant-microbe interactions and environmental conditions
Medical Research: Identify drug-compound relationships and biological processes
Academic Research: Build comprehensive knowledge bases from scientific documents

Architecture

```
Documents → Docling Converter → Text Chunks → LLM Graph Transformer → Structured Graph → Neo4j Database
    ↓               ↓                ↓                  ↓                    ↓                 ↓
PDF/Text    Markdown Output     Configurable  Tool-Based              Nodes & Edges      Graph Storage
            with Metadata       Chunking      Extraction              with Properties
```

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

License

Apache-2.0

Acknowledgments

This project builds upon the excellent work of the LangChain community and the LLM Graph Transformer implementation. Special thanks to Tomaz Bratanic and the contributors who made this technology accessible to the broader community.

Resources

LangChain Documentation
Neo4j Documentation
LLM Graph Transformer Guide
Neo4j LLM Graph Builder (No-code alternative)


This version: 0.2.0a0 (pre-release), Sep 2, 2025


File metadata

Download URL: noosphere_kg-0.2.0a0-py3-none-any.whl
Upload date: Sep 2, 2025
Size: 13.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.3 Linux/6.14.0-29-generic

File hashes

Hashes for noosphere_kg-0.2.0a0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `7ac5e64d335337eee4ce1652e770bc61c1bb31fcdf075770b62a10842a244719` |
| MD5 | `349d6a000e7d5c4b1c68c78694ebd2e6` |
| BLAKE2b-256 | `0f8c3e712ea4e538f238287a9d714c8a36bd0e038b360b0633ab03d4de4f53c1` |
