A simple and efficient RAG (Retrieval-Augmented Generation) library with Knowledge Graph support.
Easy Knowledge Retriever - The easiest RAG lib ever
Easy Knowledge Retriever is a powerful and flexible library for building Retrieval-Augmented Generation (RAG) systems with integrated Knowledge Graph support. It allows you to easily ingest documents, build a structured knowledge base (combining vector embeddings and graph relations), and perform advanced queries using Large Language Models (LLMs).
Features
- Multimodal Ingestion: Parse and ingest PDF data containing images, tables, equations, and more, based on MinerU.
- Hybrid Retrieval: Combines vector similarity search with knowledge graph exploration for more context-aware answers.
- Smart Graph Re-ranking: Uses local centrality algorithms (PageRank) to filter and prioritize the most semantically relevant graph edges for the user query.
- Knowledge Graph Integration: Automatically extracts entities and relationships from your text documents.
- Modular Storage: Supports various backends for Key-Value pairs, Vector Stores, and Graph Storage (e.g., JSON, NanoVectorDB, NetworkX, Neo4j, Milvus).
- LLM Agnostic: Designed to work with OpenAI-compatible LLM APIs (OpenAI, Gemini via OpenAI adapter, etc.).
- Async Support: Built with `asyncio` for high-performance ingestion and retrieval.
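As an illustration of the PageRank-based re-ranking idea, here is a toy sketch using `networkx` (this is not the library's internal code; the example graph and seed entity are invented):

```python
import networkx as nx

def rerank_edges_by_pagerank(graph: nx.Graph, seed_entities: list[str], top_k: int = 3):
    """Score nodes with personalized PageRank seeded on the query entities,
    then rank edges by the combined score of their endpoints."""
    personalization = {n: (1.0 if n in seed_entities else 0.0) for n in graph.nodes}
    scores = nx.pagerank(graph, personalization=personalization)
    return sorted(graph.edges, key=lambda e: scores[e[0]] + scores[e[1]], reverse=True)[:top_k]

# Toy knowledge graph extracted from hypothetical documents
g = nx.Graph()
g.add_edges_from([
    ("forest fire", "drought"),
    ("forest fire", "lightning"),
    ("drought", "climate"),
    ("climate", "policy"),
])

# Edges touching the query entity float to the top of the ranking
top = rerank_edges_by_pagerank(g, seed_entities=["forest fire"])
```

Seeding the personalization vector on the query's entities biases the random walk toward the neighborhood the user is asking about, so semantically distant edges are filtered out.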
Installation
You can install the library via pip:
```shell
pip install easy-knowledge-retriever
```
Quick Start
This guide will show you how to build a database from PDF documents and then query it.
1. Build the Database (Ingestion)
During this step, documents are processed, chunked, and embedded, and entities/relations are extracted to build the Knowledge Graph.
```python
import asyncio
import os

from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage

async def build_database():
    # 1. Configure services
    # Replace with your actual API keys and endpoints
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1",  # or compatible
        model="text-embedding-3-small",
        embedding_dim=1536,
    )
    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1",
    )

    # 2. Initialize the retriever with specific storage backends
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )
    await rag.initialize_storages()

    try:
        # 3. Ingest documents
        pdf_path = "./documents/example.pdf"
        if os.path.exists(pdf_path):
            print(f"Ingesting {pdf_path}...")
            await rag.ingest(pdf_path)
            print("Ingestion complete.")
        else:
            print("Please provide a valid PDF path.")
    finally:
        # Always finalize to persist state
        await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(build_database())
```
2. Retrieve Information (Querying)
Once the database is built, you can query it.
```python
import asyncio

from easy_knowledge_retriever import EasyKnowledgeRetriever
from easy_knowledge_retriever.retrieval import MixRetrieval
from easy_knowledge_retriever.llm.service import OpenAILLMService, OpenAIEmbeddingService
from easy_knowledge_retriever.kg.json_kv_impl import JsonKVStorage
from easy_knowledge_retriever.kg.nano_vector_db_impl import NanoVectorDBStorage
from easy_knowledge_retriever.kg.networkx_impl import NetworkXStorage
from easy_knowledge_retriever.kg.json_doc_status_impl import JsonDocStatusStorage

async def query_knowledge_base():
    # 1. Re-initialize services (same config as the build step)
    embedding_service = OpenAIEmbeddingService(
        api_key="your-embedding-api-key",
        base_url="https://api.openai.com/v1",
        model="text-embedding-3-small",
        embedding_dim=1536,
    )
    llm_service = OpenAILLMService(
        model="gpt-4o",
        api_key="your-llm-api-key",
        base_url="https://api.openai.com/v1",
    )

    # 2. Load the existing retriever
    working_dir = "./rag_data"
    rag = EasyKnowledgeRetriever(
        working_dir=working_dir,
        llm_service=llm_service,
        embedding_service=embedding_service,
        kv_storage=JsonKVStorage(),
        vector_storage=NanoVectorDBStorage(cosine_better_than_threshold=0.2),
        graph_storage=NetworkXStorage(),
        doc_status_storage=JsonDocStatusStorage(),
    )
    await rag.initialize_storages()

    try:
        # 3. Perform a query with the MixRetrieval strategy,
        #    which combines vector search and the knowledge graph
        query_text = "What does the document say about forest fires?"
        print(f"Querying: {query_text}")
        result = await rag.aquery(query_text, retrieval=MixRetrieval())
        print("\nAnswer:")
        print(result)
    finally:
        await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(query_knowledge_base())
```
PDF Ingestion with MinerU (Images & Complex Layouts)
Easy Knowledge Retriever integrates MinerU (based on magic-pdf) to handle complex PDF documents, preserving layouts and extracting images.
Ingestion Pipeline Details
The ingestion process orchestrates several advanced steps to transform raw documents into a rich knowledge base:
- Parsing with MinerU: The system uses MinerU (based on magic-pdf) to extract text, tables, and images with high structural fidelity.
- Multimodal Enrichment: Extracted images are processed by a Vision Language Model (VLM, e.g., GPT-4o). The VLM generates descriptive summaries which are injected directly into the text context, making visual data searchable.
- Page-Aware Chunking: The text is split into chunks using a sliding window approach that preserves the mapping to original page numbers for precise citations.
- Knowledge Graph Extraction: An LLM extracts entities and relationships from the chunks. It performs iterative gleaning to ensure no details are missed, building a structured graph of knowledge alongside vector embeddings.
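The page-aware chunking step can be sketched as follows (a simplified stand-in; the window size, overlap, and chunk schema here are assumptions, not the library's actual parameters):

```python
def chunk_pages(pages: list[str], chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Split per-page texts with a sliding window, keeping each chunk's source page."""
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        start = 0
        while start < len(text):
            chunks.append({
                "page": page_no,                  # preserved for precise citations
                "text": text[start:start + chunk_size],
            })
            if start + chunk_size >= len(text):
                break
            start += chunk_size - overlap         # windows overlap to keep context

    return chunks

# Two toy "pages": 450 chars and 100 chars
chunks = chunk_pages(["A" * 450, "B" * 100])
```

Because each chunk records its page of origin, answers generated later can cite the exact page a passage came from.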
Prerequisites
Ensure that the mineru dependencies are installed (included in requirements.txt) and that you are using a Vision-capable LLM model.
Example
The usage remains simple:
```python
# ... initialize rag as shown in Quick Start ...

# Ingest a complex PDF with images
await rag.ingest("./documents/complex_report_with_charts.pdf")
# The system automatically handles image extraction and summarization.
```
Retrieval Workflow
The retrieval process employs a hybrid strategy that orchestrates parallel searches to capture both semantic meaning and explicit knowledge connections.
- Keyword Extraction: An LLM extracts both high-level concepts (for thematic search) and low-level entities (for specific details) from the user query.
- Parallel Search:
  - Local Search: Navigates the Knowledge Graph using low-level entities to find direct neighbors and details.
  - Global Search: Explores broader relationships in the Knowledge Graph using high-level concepts.
  - Vector Search: Finds semantically similar text chunks from the vector database using the query embedding.
- Fusion & Context Building: Results from all sources are merged, deduplicated, and mapped back to their original source text chunks. This comprehensive context is then provided to the LLM to generate an accurate, grounded response.
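As a toy illustration of this workflow (the search functions below are stubs standing in for the real graph and vector backends; all names and data are invented):

```python
import asyncio

async def local_search(entities):      # stub: graph neighbors of specific entities
    return [{"id": "c1", "text": "fires spread faster in drought"}]

async def global_search(concepts):     # stub: broader graph relationships
    return [{"id": "c2", "text": "climate drives fire seasons"}]

async def vector_search(query):        # stub: top chunks by embedding similarity
    return [{"id": "c1", "text": "fires spread faster in drought"},
            {"id": "c3", "text": "evacuation protocols"}]

async def retrieve(query: str) -> list[dict]:
    # 1. Keyword extraction would happen here (an LLM call, stubbed out)
    entities, concepts = ["forest fire"], ["climate"]

    # 2. Run the three searches concurrently
    results = await asyncio.gather(
        local_search(entities), global_search(concepts), vector_search(query),
    )

    # 3. Fuse: merge and deduplicate by chunk id, preserving first occurrence
    seen, fused = set(), []
    for hit in (h for batch in results for h in batch):
        if hit["id"] not in seen:
            seen.add(hit["id"])
            fused.append(hit)
    return fused

context = asyncio.run(retrieve("What about forest fires?"))
```

The deduplicated `context` list is what would then be formatted into the prompt for the answering LLM.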
Available Retrieval Strategies
Easy Knowledge Retriever offers flexible retrieval strategies to suit different use cases:
- Naive (`naive`): Standard Vector Search on text chunks. Best for simple exact matches.
- Local (`local`): Entity-focused Graph Retrieval. Best for specific details about entities.
- Global (`global`): Relation-focused Graph Retrieval. Best for broad thematic questions.
- Hybrid (`hybrid`): Combines Local and Global Graph Retrieval.
- Mix (`mix`): Combines Graph (Hybrid) and Vector (Naive). Recommended default for best performance.
- HybridMix (`hybrid_mix`): Advanced Chunk Search combining Dense (Vector) and Sparse (BM25) search with Fusion.
For detailed workflows and comparisons, see Retrieval Strategies Documentation.
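The library's exact fusion method is not documented here, but combining dense and sparse rankings of the kind `hybrid_mix` describes is commonly done with reciprocal rank fusion (RRF); a generic sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]   # e.g. vector-similarity order (invented ids)
sparse = ["d1", "d3", "d4"]   # e.g. BM25 order

# Documents ranked well by both retrievers rise to the top
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF rewards documents that appear near the top of multiple rankings without needing to calibrate the two score scales against each other.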
Advanced Configuration
Storage Options
You can swap out storage implementations by creating instances of different classes from easy_knowledge_retriever.kg.*:
- Vector Storage: `NanoVectorDBStorage` (local, lightweight), `MilvusStorage` (scalable).
- Graph Storage: `NetworkXStorage` (in-memory/JSON, simple), `Neo4jStorage` (robust graph DB).
- KV Storage: `JsonKVStorage`, `RedisKVStorage` (if available), etc.
Example for Neo4j:
```python
from easy_knowledge_retriever.kg.neo4j_impl import Neo4jStorage

graph_storage = Neo4jStorage(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="password",
)
```
Service & Configuration Catalog
For a complete, up-to-date list of all services (LLM, Vector/KV/Graph/Doc Status) and their configuration options, see:
- docs/ServiceCatalog.md
Development
To set up the project for development:
- Clone the repository.
- Install dependencies: `pip install -r requirements.txt`
- Install the package in editable mode: `pip install -e .`
Running Tests
Run the test suite with pytest:
```shell
pytest
```
Evaluation
The system is evaluated using the RAGAS framework on two distinct datasets to assess performance across different domains and document types.
Dataset 1: Text Files (General Knowledge)
This dataset consists of plain text files covering general topics such as Forest Fires and Childbirth. It tests the system's ability to retrieve information from unstructured text.
Results (Dataset 1):
- Faithfulness: 0.99. Faithfulness measures how factually consistent a response is with the retrieved context; it ranges from 0 to 1, with higher scores indicating better consistency.
- Context Recall: 1.0. Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved; higher recall means fewer relevant documents were left out.
- Answer Relevancy: 0.78 (Gemini 2.0 Flash Lite) / 0.81 (Gemini 2.5 Flash Lite). Answer Relevancy assesses how pertinent the generated answer is to the given prompt; incomplete or redundant answers receive lower scores.
Dataset 2: Scientific PDFs (Technical Domain)
This dataset comprises scientific research papers in PDF format, specifically focusing on Deep Reinforcement Learning (DRL) for Autonomous Vehicle Intersection Management. It evaluates the system's capability to handle complex documents, scientific terminology, and multi-modal content.
Results (Dataset 2):
- Faithfulness: 1.0 (in both cases)
- Context Recall: 1.0 (in both cases)
- Answer Relevancy:
- 0.92 with Gemini 2.5 Flash Lite (Standard Retrieval without reranker).
- 0.96 with Gemini 2.5 Flash Lite using a Reranker and HybridMixRetrieval (combining Hybrid Vector Search + Knowledge Graph + Query Decomposition).
License
This project is licensed under the Creative Commons Attribution–NonCommercial–ShareAlike 4.0 International (CC BY‑NC‑SA 4.0).
- You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- You may not use the material for commercial purposes.
- If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Full legal text: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Summary (EN): https://creativecommons.org/licenses/by-nc-sa/4.0/
File details
Details for the file easy_knowledge_retriever-1.2.4.tar.gz.
File metadata
- Download URL: easy_knowledge_retriever-1.2.4.tar.gz
- Upload date:
- Size: 253.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7d31fcdc4a0b5572cf98887da55cea90b66ffc4bcde8d0374aa26a001094ed8f` |
| MD5 | `1effa55888306432162b13e3c7e3326f` |
| BLAKE2b-256 | `4f3e099a3fdaa75fdf659a824c0eb3115f1aab0b5fd162e1bf76905b22969ca5` |
File details
Details for the file easy_knowledge_retriever-1.2.4-py3-none-any.whl.
File metadata
- Download URL: easy_knowledge_retriever-1.2.4-py3-none-any.whl
- Upload date:
- Size: 288.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9fbae5880a93897fffce90499938be7e2e5db4aa8d1b3312f3738b5ce95ca315` |
| MD5 | `513189f5faafc6397dab429a24ba2b3b` |
| BLAKE2b-256 | `91670c8b75abeef498e30d398957f1d0cb453cf59ab3c364e50ddfc230c5801e` |