
A simple and efficient RAG (Retrieval-Augmented Generation) library with Knowledge Graph support.



Easy Knowledge Retriever - The easiest RAG lib ever


Easy Knowledge Retriever is a powerful and flexible library for building Retrieval-Augmented Generation (RAG) systems with integrated Knowledge Graph support. It allows you to easily ingest documents, build a structured knowledge base (combining vector embeddings and graph relations), and perform advanced queries using Large Language Models (LLMs).

(Global flow diagram)

Features

  • Multimodal Ingestion: Parse and ingest PDF data containing images, tables, equations, and more. Based on MinerU.
  • Hybrid Retrieval: Combines vector similarity search with knowledge graph exploration for more context-aware answers.
  • Smart Graph Re-ranking: Uses local centrality algorithms (PageRank) to filter and prioritize the most semantically relevant graph edges for the user query.
  • Knowledge Graph Integration: Automatically extracts entities and relationships from your text documents.
  • Modular Storage: Supports various backends for Key-Value pairs, Vector Stores, and Graph Storage (e.g., JSON, NanoVectorDB, NetworkX, Neo4j, Milvus).
  • LLM Agnostic: Designed to work with OpenAI-compatible LLM APIs (OpenAI, Gemini via OpenAI adapter, etc.).
  • Async Support: Built with asyncio for high-performance ingestion and retrieval.
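The "Smart Graph Re-ranking" feature above can be illustrated with a small, self-contained sketch: a personalized PageRank seeded on the entities matched by the query, with each edge scored by the ranks of its endpoints. All names and data shapes here are hypothetical, for illustration only, and do not reflect the library's internal implementation.

```python
# Illustrative sketch (not the library's API): personalized PageRank
# over a toy knowledge graph, seeded on query-matched entities.

def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """adj: {node: [neighbors]}; seeds: entities matched by the query."""
    nodes = list(adj)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        # Restart mass goes only to the seed entities.
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

def rank_edges(adj, seeds):
    # Score each edge by the summed rank of its endpoints, best first.
    rank = personalized_pagerank(adj, seeds)
    edges = {(u, v) for u in adj for v in adj[u]}
    return sorted(edges, key=lambda e: rank[e[0]] + rank[e[1]], reverse=True)

graph = {
    "fire": ["forest", "smoke"],
    "forest": ["fire", "tree"],
    "smoke": ["fire"],
    "tree": ["forest"],
}
top = rank_edges(graph, seeds={"fire"})  # edges touching "fire" rank first
```

Seeding the teleport vector on query entities is what makes the centrality "local": edges far from the query's entities receive little rank mass and are filtered out.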

Installation

You can install the library via pip:

pip install easy-knowledge-retriever

Quick Start

This guide will show you how to build a database from PDF documents and then query it.

1. Build the Database (Ingestion)

During this step, documents are processed, chunked, and embedded, and entities/relations are extracted to build the Knowledge Graph.

import asyncio
import os
from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage

async def build_database():
    # 1. Configure Services
    # Replace with your actual API keys and endpoints
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1", # or compatible
        model="text-embedding-3-small",
        embedding_dim=1536
    )

    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1"
    )

    # 2. Initialize Retriever with specific storage backends
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )

    await rag.initialize_storages()
    
    try:
        # 3. Ingest Documents
        pdf_path = "./documents/example.pdf"
        if os.path.exists(pdf_path):
            print(f"Ingesting {pdf_path}...")
            await rag.ingest(pdf_path)
            print("Ingestion complete.")
        else:
            print("Please provide a valid PDF path.")
            
    finally:
        # Always finalize to save state
        await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(build_database())

2. Retrieve Information (Querying)

Once the database is built, you can query it.

import asyncio
from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.retrieval import MixRetrieval
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage

async def query_knowledge_base():
    # 1. Re-initialize Services (same config as build)
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1",
        model="text-embedding-3-small",
        embedding_dim=1536
    )
    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1"
    )

    # 2. Load the existing Retriever
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )

    await rag.initialize_storages()

    try:
        # 3. Perform a Query
        query_text = "What does the document say about forest fires?"
        
        # MixRetrieval combines vector search with knowledge graph retrieval
        
        print(f"Querying: {query_text}")
        result = await rag.aquery(query_text, retrieval=MixRetrieval())
        
        print("\nAnswer:")
        print(result)
        
    finally:
        await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(query_knowledge_base())

PDF Ingestion with MinerU (Images & Complex Layouts)

Easy Knowledge Retriever integrates MinerU (based on magic-pdf) to handle complex PDF documents, preserving layouts and extracting images.

Ingestion Pipeline Details

The ingestion process orchestrates several advanced steps to transform raw documents into a rich knowledge base:

(Ingestion flow diagram)

  1. Parsing with MinerU: The system uses MinerU (based on magic-pdf) to extract text, tables, and images with high structural fidelity.
  2. Multimodal Enrichment: Extracted images are processed by a Vision Language Model (VLM, e.g., GPT-4o). The VLM generates descriptive summaries which are injected directly into the text context, making visual data searchable.
  3. Page-Aware Chunking: The text is split into chunks using a sliding window approach that preserves the mapping to original page numbers for precise citations.
  4. Knowledge Graph Extraction: An LLM extracts entities and relationships from the chunks. It performs iterative gleaning to ensure no details are missed, building a structured graph of knowledge alongside vector embeddings.
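Step 3 (page-aware chunking) can be sketched in a few lines. The data shapes below — pages as `(page_number, text)` pairs, chunks as dicts with a `pages` list — are assumptions for illustration, not the library's internal representation.

```python
# Illustrative sketch of page-aware sliding-window chunking (assumed
# data shapes, not the library's implementation).

def chunk_pages(pages, chunk_size=200, overlap=50):
    """pages: list of (page_number, text). Returns chunks that remember
    which page numbers they span, enabling precise citations."""
    # Flatten pages into (char, page_number) units so every character
    # keeps a pointer back to its source page.
    chars = [(c, num) for num, text in pages for c in text]
    chunks, start = [], 0
    while start < len(chars):
        window = chars[start:start + chunk_size]
        chunks.append({
            "text": "".join(c for c, _ in window),
            "pages": sorted({p for _, p in window}),
        })
        if start + chunk_size >= len(chars):
            break
        start += chunk_size - overlap  # slide forward, keeping overlap
    return chunks

pages = [(1, "A" * 150), (2, "B" * 120)]
chunks = chunk_pages(pages, chunk_size=200, overlap=50)
# First chunk spans pages 1-2; the second cites only page 2.
```

The overlap between consecutive windows keeps sentences that straddle a chunk boundary retrievable from either side.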

Prerequisites

Ensure that the mineru dependencies are installed (included in requirements.txt) and that you are using a Vision-capable LLM model.

Example

The usage remains simple:

# ... Initialize RAG as shown in Quick Start ...

# Ingest a complex PDF with images
await rag.ingest("./documents/complex_report_with_charts.pdf")

# The system automatically handles image extraction and summarization.

Retrieval Workflow

The retrieval process employs a hybrid strategy that orchestrates parallel searches to capture both semantic meaning and explicit knowledge connections.

(Retrieval workflow diagram)

  1. Keyword Extraction: An LLM extracts both high-level concepts (for thematic search) and low-level entities (for specific details) from the user query.
  2. Parallel Search:
    • Local Search: Navigates the Knowledge Graph using low-level entities to find direct neighbors and details.
    • Global Search: Explores broader relationships in the Knowledge Graph using high-level concepts.
    • Vector Search: Finds semantically similar text chunks from the vector database using the query embedding.
  3. Fusion & Context Building: Results from all sources are merged, deduplicated, and mapped back to their original source text chunks. This comprehensive context is then provided to the LLM to generate an accurate, grounded response.
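The fusion step above can be sketched as a merge-and-deduplicate over the three parallel result sets. The `(chunk_id, score)` shape is an assumption for illustration; the library's actual result objects are richer.

```python
# Illustrative sketch of fusion & context building: merge results from
# the local, global, and vector searches, collapsing duplicate chunks
# and keeping each chunk's best score.

def fuse_results(*result_lists, top_k=5):
    """Each result list holds (chunk_id, score) pairs, higher is better."""
    best = {}
    for results in result_lists:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score  # deduplicate, keep best score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]

local_hits  = [("c1", 0.9), ("c2", 0.4)]   # graph: direct neighbors
global_hits = [("c2", 0.7), ("c3", 0.6)]   # graph: broad relations
vector_hits = [("c1", 0.8), ("c4", 0.5)]   # semantic similarity
context_ids = fuse_results(local_hits, global_hits, vector_hits, top_k=3)
# → ["c1", "c2", "c3"]
```

The surviving chunk IDs are then mapped back to their source text and handed to the LLM as grounded context.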

Available Retrieval Strategies

Easy Knowledge Retriever offers flexible retrieval strategies to suit different use cases:

  • Naive (naive): Standard Vector Search on text chunks. Best for simple exact matches.
  • Local (local): Entity-focused Graph Retrieval. Best for specific details about entities.
  • Global (global): Relation-focused Graph Retrieval. Best for broad thematic questions.
  • Hybrid (hybrid): Combines Local and Global Graph Retrieval.
  • Mix (mix): Combines Graph (Hybrid) and Vector (Naive). Recommended default for best performance.
  • HybridMix (hybrid_mix): Advanced Chunk Search combining Dense (Vector) and Sparse (BM25) search with Fusion.

For detailed workflows and comparisons, see Retrieval Strategies Documentation.
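One common way to combine dense (vector) and sparse (BM25) rankings, as HybridMix does, is reciprocal rank fusion (RRF). The sketch below shows the RRF idea only; whether the library uses RRF or another fusion scheme is not specified here, so treat this as a hedged illustration.

```python
# Reciprocal rank fusion (RRF) sketch: combine two rankings by summing
# 1 / (k + rank) per document. This is one standard fusion choice, not
# necessarily the one HybridMix implements.

def rrf(rankings, k=60):
    """rankings: lists of doc ids, best first. Returns the fused order."""
    scores = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking):
            # Documents ranked highly in any list accumulate more score.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]   # e.g. vector-similarity order
sparse = ["d1", "d4", "d2"]   # e.g. BM25 order
fused = rrf([dense, sparse])
# → ["d1", "d2", "d4", "d3"]
```

RRF rewards documents that appear near the top of either list without requiring the two score scales to be comparable, which is why it is popular for dense+sparse hybrids.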

Advanced Configuration

Storage Options

You can swap out storage implementations by creating instances of different classes from easy_knowledge_retriever.kg.*:

  • Vector Storage: NanoVectorDBStorage (local, lightweight), MilvusStorage (scalable).
  • Graph Storage: NetworkXStorage (in-memory/json, simple), Neo4jStorage (robust graph DB).
  • KV Storage: JsonKVStorage, RedisKVStorage (if available), etc.

Example for Neo4j:

from easy_knowledge_retriever.kg.neo4j_impl import Neo4jStorage

graph_storage = Neo4jStorage(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="password"
)

Service & Configuration Catalog

For a complete, up-to-date list of all services (LLM, Vector/KV/Graph/Doc Status) and their configuration options, see:

  • docs/ServiceCatalog.md

Development

To set up the project for development:

  1. Clone the repository.
  2. Install dependencies: pip install -r requirements.txt
  3. Install the package in editable mode: pip install -e .

Running Tests

Run the test suite with pytest:

pytest

Evaluation

The system is evaluated using the RAGAS framework on two distinct datasets to assess performance across different domains and document types.

Dataset 1: Text Files (General Knowledge)

This dataset consists of plain text files covering general topics such as Forest Fires and Childbirth. It tests the system's ability to retrieve information from unstructured text.

Results (Dataset 1):

  • Faithfulness: 0.99. Faithfulness measures how factually consistent a response is with the retrieved context; it ranges from 0 to 1, with higher scores indicating better consistency.

  • Context Recall: 1.0. Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved; higher recall means fewer relevant documents were left out.

  • Answer Relevancy: 0.78 (Gemini 2.0 Flash Lite) / 0.81 (Gemini 2.5 Flash Lite). Answer Relevancy assesses how pertinent the generated answer is to the given prompt; incomplete or redundant answers receive lower scores.

Dataset 2: Scientific PDFs (Technical Domain)

This dataset comprises scientific research papers in PDF format, specifically focusing on Deep Reinforcement Learning (DRL) for Autonomous Vehicle Intersection Management. It evaluates the system's capability to handle complex documents, scientific terminology, and multi-modal content.

Results (Dataset 2):

  • Faithfulness: 1.0 (in both cases)
  • Context Recall: 1.0 (in both cases)
  • Answer Relevancy:
    • 0.92 with Gemini 2.5 Flash Lite (Standard Retrieval without reranker).
    • 0.96 with Gemini 2.5 Flash Lite using a Reranker and HybridMixRetrieval (combining Hybrid Vector Search + Knowledge Graph + Query Decomposition).

References

This project draws inspiration and references from the following projects:

License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

  • You must give appropriate credit, provide a link to the license, and indicate if changes were made.
  • You may not use the material for commercial purposes.
  • If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Full legal text: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Summary (EN): https://creativecommons.org/licenses/by-nc-sa/4.0/
