A set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.

These details have not been verified by PyPI

Project description

Ragdoll

RAGdoll: A Flexible and Extensible RAG Framework

Welcome to RAGdoll 2.2! This release continues the evolution of the RAGdoll project with enhanced flexibility, extensibility, and maintainability. We've refactored the core architecture to make it easier than ever to adapt RAGdoll to your specific needs and integrate it with the broader LangChain ecosystem. Version 2.2 introduces advanced entity extraction controls, improved graph retrieval with embedding-based seed search, and comprehensive configuration options for fine-tuning your RAG pipeline.

🧭 Project Overview

RAGdoll 2 is an extensible framework for building Retrieval-Augmented Generation (RAG) applications. It provides a modular architecture that allows you to easily integrate various data sources, chunking strategies, embedding models, vector stores, large language models (LLMs), and graph stores. RAGdoll is designed to be flexible and fast while relying solely on open-source third-party libraries (LangChain, Chroma, spaCy, etc.). It's also designed to accomodate a broad array of file types without any initial dependency on third party hosted services using langchain-markitdown. The loaders can easily be swapped out with any compatible lanchain loader when ready for production.

Note that RAGdoll 2 is a complete overhaul of the initial RAGdoll project and is not backwards compatible in any respect.

How RAGdoll compares to GraphRAG-style tools

RAGdoll started as a learning project and has grown into a modular orchestrator. I've tried to focus the detail below on what is actually shipped in this repo (no performance claims or benchmarks are published yet) to help position RAGdoll in the broader landscape.

Scope: RAGdoll orchestrates loaders, chunkers, embeddings, vector stores, LLMs, and an optional graph layer. The GraphRAG family (GraphRAG, NanoGraphRAG, Fast GraphRAG) is primarily graph-forward; RAGdoll adds the rest of the RAG plumbing and a demo UI.
Graph building: Entities come from spaCy NER; relations come from prompt-based LLM extraction per chunk (configurable via YAML prompts/parsers). It stores a flat graph (JSON/NetworkX/Neo4j) and exposes a retriever—there is no community detection or hierarchical summarization step.
Retrieval: The default query path is vector-only. When you call ingest_with_graph, you can also obtain a graph retriever (simple in-memory or Neo4j) and combine it with vector search yourself for hybrid flows; no automatic community summaries are generated. A lightweight hybrid retriever (query(..., use_hybrid=True) or query_hybrid) now merges vector hits with graph nodes for a fast graph-aware context path.
Config and runtime: Everything is wired through YAML and LangChain abstractions. Defaults point at OpenAI models, but you can swap in local embeddings/LLMs and different stores (Chroma/FAISS/Neo4j) without code changes. Caching and monitoring are built in but optional.
Benchmarks: Cost/speed/quality numbers depend entirely on the models and stores you pick; we have not published comparisons against GraphRAG variants, so avoid quoting figures until you run your own measurements.

What's New

Enhanced Features in RAGdoll 2.1

This version of RAGdoll introduces significant performance and architectural improvements:

Parallel Execution (NEW in 2.1): Concurrent processing for embeddings and entity extraction with configurable rate limiting. Achieves 5-8x faster pipeline execution for typical workloads.
Embedding-based Graph Retrieval (NEW in 2.1): GraphRetriever now supports embedding-based seed node selection using vector store integration, dramatically improving retrieval accuracy over fuzzy text matching.
Vector ID Linkage (NEW in 2.1): Proper linking between graph nodes and vector embeddings ensures seamless hybrid retrieval without orphaned nodes.
Modular Retrieval Architecture (NEW in 2.1): Clean separation between VectorRetriever, GraphRetriever, and HybridRetriever with multiple combination strategies.
Caching: Store and reuse results from previous operations to avoid redundant computations.
Auto Loader Selection: Includes loaders for multiple file types with Langchain-Markitdown as default, configurable to any LangChain-compatible loader.
Monitoring: Track and understand the performance and behavior of your RAG applications over time.

# Enable monitoring in config
monitor:
  enabled: true

Quick Start Guide

Here's a quick example of how to get started with RAGdoll using the new LLM caller abstraction:

from ragdoll.ragdoll import Ragdoll
from ragdoll.llms import get_llm_caller

# Resolve whichever model is marked as default in config (or pass a model name).
llm_caller = get_llm_caller()

# Spin up the orchestrator with sensible defaults.
ragdoll = Ragdoll(llm_caller=llm_caller)

# Ingest a few local files (vector store + caches handled automatically).
ragdoll.ingest_data(["path/to/document.md", "path/to/notes.pdf"])

# Run a retrieval + answer round trip.
result = ragdoll.query("What is the capital of France?")
print(result["answer"])

Need finer control over loaders or paths? Use settings.get_app() (or bootstrap_app with overrides) to obtain the shared AppConfig, tweak its config, and pass component overrides into Ragdoll.

Demo Application

RAGdoll includes an interactive web-based demo application that showcases its capabilities. The demo provides a user-friendly interface to:

Configure RAGdoll settings using a YAML editor
Ingest documents from various sources (files, URLs, text)
Explore the ingestion pipeline and data transformations
Query the RAG system and view retrieval traces
Monitor performance metrics and caching

Running the Demo

To run the demo application:

# Ensure dependencies are installed
pip install -e .[all]

# Start the demo server
uvicorn demo_app.main:app --reload

Then open your browser to http://localhost:8000 to access the demo interface.

The demo uses FastAPI for the backend, HTMX and Alpine.js for dynamic interactions, and Tailwind CSS for styling, providing a modern, responsive experience.

Performance & Parallel Execution

RAGdoll 2.1+ includes comprehensive parallel execution optimizations for significantly faster ingestion and entity extraction:

Parallel Embeddings (Vector Store Layer)

BREAKING CHANGE in 2.1: Parallel embedding logic moved from IngestionPipeline to BaseVectorStore for better separation of concerns and reusability.

Concurrent batch processing: Processes multiple embedding batches simultaneously via add_documents_parallel()
Configuration: Set max_concurrent_embeddings in EmbeddingsConfig (YAML: embeddings.max_concurrent_embeddings, default: 3)
Automatic batching: Intelligently splits documents into batches for optimal throughput
Performance gain: 3-5x faster embedding creation compared to sequential processing
Retry logic: Automatically retries failed batches sequentially for robustness

New Async API:

from ragdoll.config import Config
from ragdoll.vector_stores import create_vector_store
from ragdoll.embeddings import get_embedding_model

config = Config()
embeddings = get_embedding_model(config_manager=config)
vector_store = create_vector_store("faiss", embedding=embeddings)

# Parallel processing with configured concurrency
max_concurrent = config.embeddings_config.max_concurrent_embeddings
ids = await vector_store.add_documents_parallel(
    documents=chunks,
    batch_size=10,
    max_concurrent=max_concurrent
)

Parallel Entity Extraction

Concurrent LLM calls: Processes multiple documents simultaneously with rate limiting (configurable via max_concurrent_llm_calls, default: 8)
Automatic parallelization: Enabled by default for document sets with 4+ documents
Smart fallback: Uses sequential processing for small batches to avoid overhead
Performance gain: 5-10x faster entity extraction (limited by API rate limits)

Configuration Example

# In default_config.yaml or app config
embeddings:
  default_model: openai
  max_concurrent_embeddings: 5 # NEW: Controls parallel embedding batches
  models:
    openai:
      provider: openai
      model: text-embedding-3-large

from ragdoll.pipeline import IngestionOptions

# High-speed processing (good API limits)
options = IngestionOptions(
    batch_size=20,
    max_concurrent_llm_calls=15,  # Note: max_concurrent_embeddings removed from here
    parallel_extraction=True  # Enabled by default
)

# Conservative (rate limit sensitive)
options = IngestionOptions(
    batch_size=10,
    max_concurrent_llm_calls=4
)

# Use with Ragdoll - embeddings concurrency comes from config
result = await ragdoll.ingest_with_graph(sources, options=options)

Expected improvements: 5-8x faster full pipeline execution for typical workloads. Actual speedups depend on your hardware, API rate limits, and document characteristics.

Performance Testing: Run pytest tests/test_parallel_performance.py -v -s to see detailed performance comparisons with metrics.

Modular Retrieval Architecture

RAGdoll 2.1+ features a completely refactored retrieval system with clean separation between graph building and graph querying. The new architecture provides three composable retrievers:

VectorRetriever

Semantic similarity search using vector embeddings:

from ragdoll import VectorRetriever

vector_retriever = VectorRetriever(
    vector_store=vector_store,
    top_k=5,
    search_type="mmr"  # or "similarity", "similarity_score_threshold"
)
docs = vector_retriever.get_relevant_documents("query")

GraphRetriever

Multi-hop graph traversal with BFS/DFS strategies and embedding-based seed search:

from ragdoll import GraphRetriever

graph_retriever = GraphRetriever(
    graph_store=graph_store,
    vector_store=vector_store,  # Optional: enables embedding-based seed search
    embedding_model=embedding_model,  # Optional: required if vector_store provided
    top_k=5,
    max_hops=2,
    traversal_strategy="bfs",  # or "dfs"
    include_edges=True,
    prebuild_index=False,  # Build embedding index during initialization
    hybrid_alpha=1.0,  # Weight for embedding similarity (1.0 = embedding only)
    enable_fallback=True,  # Fall back to fuzzy matching if embeddings unavailable
    log_fallback_warnings=True  # Log warnings when using fallback
)
docs = graph_retriever.get_relevant_documents("query")

Key Features:

Embedding-based seed search: When configured with vector_store and embedding_model, GraphRetriever can find seed nodes by embedding similarity rather than fuzzy text matching, significantly improving retrieval accuracy
Automatic deduplication: Handles multiple entities from the same document chunk that share vector IDs, ensuring efficient queries without duplicates
Intelligent fallback: Automatically falls back to fuzzy text matching when embeddings are unavailable, with configurable warnings
Vector store integration: Seamlessly integrates with any LangChain vector store (Chroma, FAISS) to leverage existing embeddings
Configurable index prebuilding: Option to build FAISS index during initialization for faster first queries

HybridRetriever

Combines vector and graph retrieval with multiple strategies:

from ragdoll import HybridRetriever

hybrid_retriever = HybridRetriever(
    vector_retriever=vector_retriever,
    graph_retriever=graph_retriever,
    mode="rerank",  # or "concat", "weighted", "expand"
    vector_weight=0.6,
    graph_weight=0.4
)
docs = hybrid_retriever.get_relevant_documents("query")

Complete Example with Graph Pipeline

import asyncio
from ragdoll import Ragdoll
from ragdoll.pipeline import ingest_from_vector_store, IngestionOptions

async def main():
    ragdoll = Ragdoll()

    # Configure parallel execution for best performance
    options = IngestionOptions(
        parallel_extraction=True,          # Enabled by default
        max_concurrent_embeddings=3,       # Process 3 embedding batches concurrently
        max_concurrent_llm_calls=8,        # Limit concurrent LLM calls
        batch_size=10                      # Documents per batch
    )

    # Method 1: Ingest documents and build knowledge graph in one step
    result = await ragdoll.ingest_with_graph(
        ["path/to/docs/manual.pdf"],
        options=options
    )

    print(result["stats"])        # Ingestion metrics
    print(result["graph"])        # Pydantic Graph object
    print(result["graph_store"])  # NetworkX/Neo4j/JSON graph store
    print(result["vector_store"]) # Vector store with document embeddings

    # Method 2: Build graph from existing vector store (preserves vector_ids)
    # This ensures graph nodes reference the same embeddings as the vector store
    result = await ingest_from_vector_store(
        vector_store=existing_vector_store,
        graph_store=graph_store,
        entity_service=entity_service
    )
    graph_retriever = result["graph_retriever"]  # Pre-configured with vector store

    # Use the automatically configured hybrid retriever
    answer = ragdoll.query_hybrid("How does the widget fail-safe work?")
    print(answer["answer"])
    print(f"Retrieved {len(answer['documents'])} documents")

asyncio.run(main())

The helper ingest_with_graph_sync() wraps asyncio.run() for scripts that are not already running an event loop.

Configuration: All retrieval and performance settings are now consolidated in your config:

# Retrieval configuration
retriever:
  vector:
    enabled: true
    top_k: 3
    search_type: "similarity"
  graph:
    enabled: true
    backend: "networkx" # Options: "networkx", "neo4j", "simple"
    top_k: 5
    max_hops: 2
    traversal_strategy: "bfs" # Options: "bfs", "dfs"
    include_edges: true
    min_score: 0.0
    prebuild_index: false # Build FAISS index during initialization
    hybrid_alpha: 1.0 # Weight for embedding similarity (1.0 = embedding only)
    enable_fallback: true # Fall back to fuzzy matching if embeddings unavailable
    log_fallback_warnings: true # Log warnings when using fallback
  hybrid:
    mode: "concat" # Options: "concat", "rerank", "weighted", "expand"
    vector_weight: 0.6
    graph_weight: 0.4
    deduplicate: true

# Performance settings (applied via IngestionOptions)
# These can also be configured programmatically
pipeline:
  batch_size: 10
  parallel_extraction: true
  max_concurrent_embeddings: 3
  max_concurrent_llm_calls: 8

See docs/retrieval.md for comprehensive documentation and examples/retrieval_examples.py for complete examples.

How Vector and Graph Stores Work Together

Ragdoll keeps both storage backends under the same orchestration surface:

Ragdoll.ingest_data(...) (or the lower-level IngestionPipeline) always loads documents, chunks them, embeds each chunk, and writes those embeddings into the configured vector store.
When entity_extraction.extract_entities (or entity_extraction.graph_retriever.enabled) is true, the same pipeline also fans out chunks to the entity extraction service, which generates a graph, persists it through the configured graph store, and can return a graph-aware retriever.
Both flows are coordinated inside IngestionPipeline: it receives the shared AppConfig, builds the ingestion service, embedding model, vector store, and optionally graph store, and emits stats/retrievers back through Ragdoll.

Building Graphs from Existing Vector Stores:

RAGdoll 2.1+ introduces EntityExtractionService.extract_from_vector_store() and the corresponding ingest_from_vector_store() pipeline function. This allows you to:

Extract documents directly from an existing vector store (Chroma, FAISS, or any LangChain vector store)
Build a knowledge graph that references the same vector IDs as the vector store
Create a GraphRetriever pre-configured with both the graph store and vector store for embedding-based seed search
Avoid vector ID mismatches between graph nodes and vector store documents

This is particularly useful when you want to add graph capabilities to an existing vector store without re-ingesting all documents. The vector_id in each graph node's metadata matches the document ID in the vector store, enabling seamless integration between vector and graph retrieval.

Deduplication Handling:

Multiple entities extracted from the same document chunk naturally share the same vector_id. RAGdoll automatically deduplicates these shared IDs when:

Building the embedding index (prevents duplicate ID errors from Chroma)
Querying by embedding (returns all nodes sharing top embeddings without redundancy)
Traversing the graph (standard graph traversal logic)

So even though ragdoll/vector_stores and ragdoll/graph_stores live in separate packages, their lifecycle is tied together via the pipeline entry points shown above.

Installation

To install RAGdoll, follow these steps:

Stable version install

pip install python-ragdoll

Latest version install

Clone the Repository:

    git clone https://github.com/nsasto/RAGdoll.git
    cd RAGdoll

Install Dependencies:

    pip install -e .

This will install the required dependencies, including Langchain and Pydantic.

Installation with optional features

RAGdoll supports optional dependency groups for different use cases:

# Base install (core functionality only)
pip install -e .

# Development tools (testing, linting, formatting)
pip install -e .[dev]

# Entity extraction and NLP features (spaCy, sentence transformers, PDF processing)
pip install -e .[entity]

# Graph database support (Neo4j, RDF)
pip install -e .[graph]

# All optional features combined
pip install -e .[all]

From PyPI (recommended for production)

# Base install
pip install python-ragdoll

# With optional features
pip install python-ragdoll[all]  # or [dev], [entity], [graph]

Architecture

RAGdoll's architecture is built around modular components and abstract base classes, making it highly extensible. Here's an overview of the key modules:

Modules

loaders: Responsible for loading data from various sources (e.g., directories, JSON files, web pages).
chunkers: Handles the splitting of large text documents into smaller chunks.
embeddings: Provides an interface for embedding models, allowing you to generate vector representations of text.
vector_stores: Manages the storage and retrieval of vector embeddings.
llms: Provides an interface to interact with different large language models.
graph_stores: Manages the storage and querying of knowledge graphs.
retrieval: Provides vector, graph, and hybrid retrieval components.
entity_extraction: Extracts entities and relationships from documents to build knowledge graphs.
pipeline: Orchestrates ingestion, chunking, embedding, and graph building workflows.

Abstract Base Classes

Each module has an abstract base class (BaseLoader, BaseChunker, BaseVectorStore, BaseGraphStore, BaseRetriever) or protocol (the BaseLLMCaller interface) that defines a standard contract for that component type. Embeddings use LangChain's Embeddings interface directly.

Default Implementations

RAGdoll provides default implementations for most components, allowing you to quickly get started without having to write everything from scratch:

Langchain-Markitdown: A default loader for most major file types. See docs/loader_registry.md for information on the loader registry and how to register custom loader classes under short names.
RecursiveCharacterTextSplitter: A default text splitter.
OpenAIEmbeddings: Default embeddings that use OpenAI's API.
LangChain VectorStore factory: Plug-and-play wrapper for any LangChain vector store (Chroma, FAISS, etc.); see docs/vector_stores.md.
OpenAILLM: A default OpenAI LLM.
BaseGraphStore: A BaseGraphStore, it needs to be implemented.

Key Design Decisions

RAGdoll 2.0 embraces LangChain's ecosystem for maximum flexibility and maintainability:

Embeddings: LangChain Embeddings Objects

Decision: Use LangChain Embeddings objects directly instead of creating custom embedding classes
Rationale: LangChain provides robust, well-tested embedding implementations. Creating custom wrappers adds unnecessary complexity and maintenance burden.
Benefits: Immediate access to all LangChain embedding providers (OpenAI, HuggingFace, etc.), automatic updates, consistent APIs.
Implementation: ragdoll.embeddings.get_embedding_model reads your config and returns a ready-to-use LangChain embedding instance.

Vector Stores: LangChain VectorStore Interface

Decision: Accept any LangChain VectorStore object directly instead of requiring custom adapters
Rationale: LangChain supports 40+ vector stores with consistent interfaces. Custom adapters create maintenance overhead and limit ecosystem integration.
Benefits: Plug-and-play compatibility with any LangChain vector store (Chroma, FAISS, Pinecone, Weaviate, etc.), zero adapter code needed, future-proof with LangChain updates.
Implementation: BaseVectorStore wraps LangChain VectorStore objects and delegates operations.

This design maximizes ecosystem compatibility while keeping RAGdoll's core orchestration logic clean and focused.

System Diagram

For a visual walkthrough of how the ingestion, knowledge build, and query-time pieces connect, see the architecture diagram below (also available in docs/architecture.md):

graph TD
    subgraph Shared_Config["Bootstrap & Shared Services"]
        CFG["AppConfig / Config Manager"]
        CACHE["CacheManager"]
        METRICS["MetricsManager"]
        CFG --> CACHE
        CFG --> METRICS
    end

    subgraph Ingestion["Ingestion & Index Build"]
        SRC["Sources<br/>(files, URLs, loader registry)"] --> LOADER["DocumentLoaderService<br/>(auto loaders + caching + metrics)"]
        CACHE -.-> LOADER
        METRICS -.-> LOADER
        CFG --> LOADER
        LOADER --> DOCS["LangChain Documents"]
        DOCS --> CHUNK["Chunkers<br/>(split_documents)"]
        CFG --> CHUNK
        CHUNK --> EMB["Embedding Resolver<br/>(get_embedding_model)"]
        CFG --> EMB
        EMB --> VSTORE[("Vector Store<br/>Chroma/FAISS")]
        CHUNK --> ENT["EntityExtractionService<br/>(spaCy + LLM prompts)"]
        CFG --> ENT
        ENT --> GPERSIST["GraphPersistenceService<br/>(JSON/NetworkX/Neo4j)"]
        GPERSIST --> GRAPHSTORE[("Graph Store<br/>NetworkX/Neo4j")]
    end

    subgraph Query["Query & Reasoning"]
        USER["User Query"] --> RAG["Ragdoll Orchestrator"]
        CFG --> RAG

        subgraph Retrievers
            direction LR
            VR["VectorRetriever"]
            GR["GraphRetriever"]
            HR["HybridRetriever"]
        end

        RAG -- "uses" --> VR
        RAG -- "uses" --> GR
        RAG -- "uses" --> HR

        VR --> VSTORE
        GR --> GRAPHSTORE
        HR --> VR
        HR --> GR

        VR --> CONTEXT["Retrieved Chunks"]
        GR --> CONTEXT
        HR --> CONTEXT

        CONTEXT --> LLM["BaseLLMCaller"]
        RAG --> LLM
        CFG --> LLM
        LLM --> ANSWER["Answer / Structured Output"]
    end

    classDef service fill:#f9f,stroke:#333,stroke-width:1.5px;
    classDef storage fill:#dbeafe,stroke:#333,stroke-width:1.5px;
    classDef data fill:#fef3c7,stroke:#333,stroke-width:1.5px;
    classDef io fill:#fde68a,stroke:#333,stroke-width:1.5px;

    class CFG,LOADER,CHUNK,ENT,GPERSIST,RAG,LLM,VR,GR,HR service;
    class CACHE,METRICS,GRAPHSTORE,VSTORE storage;
    class DOCS,CONTEXT data;
    class SRC,USER,ANSWER io;

Extensibility

RAGdoll is designed to be highly extensible. You can easily create custom components by following these steps:

Subclass the Base Class: Create a new class that inherits from the relevant base class (e.g., BaseLoader, BaseEmbeddings).
Implement Abstract Methods: Implement the abstract methods defined in the base class to provide your custom functionality.
Integrate into RAGdoll: Pass an instance of your custom component to the Ragdoll class when you create it.

Configuration

RAGdoll uses Pydantic to manage its configuration. This allows for:

Data Validation: Automatic validation of configuration values.
Type Hints: Clear type definitions for configuration settings.
Default Values: Convenient default values for configuration options.

You can create a Config object and pass it to the Ragdoll class.

from ragdoll import settings
from ragdoll.ragdoll import Ragdoll

# Grab the shared AppConfig (respects RAGDOLL_CONFIG_PATH when set)
app = settings.get_app()
config = app.config
vector_stores = config._config.setdefault("vector_stores", {})
vector_stores.setdefault("default_store", "chroma")
stores = vector_stores.setdefault("stores", {})
chroma_settings = stores.setdefault("chroma", {})
chroma_settings.setdefault("params", {})["persist_directory"] = "./my_vectors"

# Create Ragdoll with this configuration
ragdoll = Ragdoll(app_config=app)

Entity Extraction Controls

The entity_extraction section of default_config.yaml exposes comprehensive controls for graph-centric workflows:

Extraction Methods

coreference_resolution_method: Choose how to resolve entity mentions (rule_based, llm, or none)
entity_extraction_methods: List methods for entity extraction (e.g., ["ner"] for spaCy NER, or add "llm" for model-based extraction)
relationship_extraction_method: How to extract relationships (typically llm)

Quality and Coverage

gleaning_enabled: Enable iterative extraction passes to discover missed entities/relationships (default: true)
max_gleaning_steps: Number of gleaning iterations (default: 2)
entity_linking_enabled: Merge similar entities (default: true)
entity_linking_method: Method for entity consolidation (default: string_similarity)
entity_linking_threshold: Similarity threshold for linking (default: 0.8)
postprocessing_steps: List of cleanup operations (e.g., ["merge_similar_entities", "normalize_relations"])

Prompt and Parsing

relationship_parsing: Choose the preferred output format (json, markdown, auto), optionally supply a custom parser class or schema, and pass parser-specific kwargs. This lets you tighten validation for LLM responses (e.g., point at your own Pydantic schema).
relationship_prompts: Declare a default prompt template plus per-provider overrides (e.g., map "anthropic" to a Claude-specific prompt). The service picks the prompt whose provider matches the active BaseLLMCaller.

Chunking

chunking_strategy: Use 'default' to use the chunker config, 'none' to disable chunking
chunk_size: Size of text chunks for extraction (default: 1000)
chunk_overlap: Overlap between chunks (default: 50)

Example excerpt:

entity_extraction:
  coreference_resolution_method: "rule_based"
  entity_extraction_methods: ["ner"]
  relationship_extraction_method: "llm"
  gleaning_enabled: true
  max_gleaning_steps: 2
  entity_linking_enabled: true
  entity_linking_method: "string_similarity"
  entity_linking_threshold: 0.8
  postprocessing_steps: ["merge_similar_entities", "normalize_relations"]
  relationship_parsing:
    preferred_format: "auto"
  relationship_prompts:
    default: "relationship_extraction"
    providers:
      openai: "relationship_extraction_openai"
      anthropic: "relationship_extraction_claude"

See docs/configuration.md and docs/entity_extraction.md for the full field reference.

Comparison with GraphRAG-style projects

This project began as a learning exercise; use the table below as a rough orientation (not marketing). Descriptions of other tools are based on their public docs - check their repos for specifics.

Aspect	GraphRAG (Microsoft)	NanoGraphRAG	Fast GraphRAG	RAGdoll (this repo)
Scope	Graph-first pipeline with multi-level community graphs and summaries	Lightweight, graph-centric variants; usually skip heavy hierarchy	Flat graph focus; optimized ingestion and traversal	Full RAG orchestrator (loaders + chunkers + embeddings + LLM + optional graph layer)
Entity/Relation extraction	LLM-heavy entity + relation extraction; extensive prompts	Simplified or minimal LLM extraction	Hybrid/heuristic extraction; typically flat graph	spaCy NER + prompt-based relationship extraction per chunk; YAML prompt/parser controls
Graph structure	Hierarchical communities + reports	Typically flat or shallow	Flat graph; no hierarchy	Flat graph only; persisted to JSON/NetworkX/Neo4j via `GraphPersistenceService`
Summaries	Precomputed community summaries	Often none or single-level	Often skipped; sometimes on-demand	None precomputed; summaries come from query-time LLM calls if you add them
Retrieval	Combines vector + hierarchical graph summaries	Lightweight graph or vector	Flat graph traversal; vector when configured	Vector-first; optional simple/Neo4j graph retriever from `ingest_with_graph`
Defaults	Cloud LLMs (GPT-4/O) with hierarchical post-processing	Small/local-friendly LLMs	Async/flat graph tuned for speed	LangChain defaults; OpenAI models by default but fully swappable to local/open-source
Benchmarks	Published externally	Vary by implementation	Vary by implementation	None published here—measure with your models/stores

Contributing

Contributions to RAGdoll are welcome! To contribute:

Fork the repository.
Create a new branch for your changes.
Make your changes and write tests.
Submit a pull request.

License

RAGdoll is licensed under the MIT License.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

2.2.3

Nov 26, 2025

2.2.2

Nov 25, 2025

2.2.1

Nov 25, 2025

2.2.0

Nov 25, 2025

2.1.0

Nov 14, 2025

1.2.0

May 30, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python_ragdoll-2.2.3.tar.gz (143.9 kB view details)

Uploaded Nov 26, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

python_ragdoll-2.2.3-py3-none-any.whl (151.6 kB view details)

Uploaded Nov 26, 2025 Python 3

File details

Details for the file python_ragdoll-2.2.3.tar.gz.

File metadata

Download URL: python_ragdoll-2.2.3.tar.gz
Upload date: Nov 26, 2025
Size: 143.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for python_ragdoll-2.2.3.tar.gz
Algorithm	Hash digest
SHA256	`443042e3bfd42e4f81879106251f0ad717b133e6be84c679d141e84d99da25c9`
MD5	`10a410afd2e3228f19f21e56f88a2720`
BLAKE2b-256	`6d3e7d8c5be297218937347c0fe6793f9a07c882eca30e96efca3c3451a12337`

See more details on using hashes here.

Provenance

The following attestation bundles were made for python_ragdoll-2.2.3.tar.gz:

Publisher: python-publish.yml on nsasto/RAGdoll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: python_ragdoll-2.2.3.tar.gz
- Subject digest: 443042e3bfd42e4f81879106251f0ad717b133e6be84c679d141e84d99da25c9
- Sigstore transparency entry: 726760730
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: nsasto/RAGdoll@5c695cf7f17c91544256dfaf166f22e3cf652e0e
- Branch / Tag: refs/heads/release
- Owner: https://github.com/nsasto
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5c695cf7f17c91544256dfaf166f22e3cf652e0e
- Trigger Event: push

File details

Details for the file python_ragdoll-2.2.3-py3-none-any.whl.

File metadata

Download URL: python_ragdoll-2.2.3-py3-none-any.whl
Upload date: Nov 26, 2025
Size: 151.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for python_ragdoll-2.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8e3a2e93bdefabd808b6130b142df7fb0d3d14c9d28b24ac8002ecd93ed740c`
MD5	`a30a12e3fb48993e8f80fca32ff3dc32`
BLAKE2b-256	`b23cf1dac29beda252a7d99274652ebdc24063ead5c692359eca7d9ae52e503b`

See more details on using hashes here.

Provenance

The following attestation bundles were made for python_ragdoll-2.2.3-py3-none-any.whl:

Publisher: python-publish.yml on nsasto/RAGdoll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: python_ragdoll-2.2.3-py3-none-any.whl
- Subject digest: e8e3a2e93bdefabd808b6130b142df7fb0d3d14c9d28b24ac8002ecd93ed740c
- Sigstore transparency entry: 726760779
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: nsasto/RAGdoll@5c695cf7f17c91544256dfaf166f22e3cf652e0e
- Branch / Tag: refs/heads/release
- Owner: https://github.com/nsasto
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5c695cf7f17c91544256dfaf166f22e3cf652e0e
- Trigger Event: push

python-ragdoll 2.2.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

RAGdoll: A Flexible and Extensible RAG Framework

🧭 Project Overview

How RAGdoll compares to GraphRAG-style tools

What's New

Enhanced Features in RAGdoll 2.1

Quick Start Guide

Demo Application

Running the Demo

Performance & Parallel Execution

Parallel Embeddings (Vector Store Layer)

Parallel Entity Extraction

Configuration Example

Modular Retrieval Architecture

VectorRetriever

GraphRetriever

HybridRetriever

Complete Example with Graph Pipeline

How Vector and Graph Stores Work Together

Installation

Stable version install

Latest version install

Installation with optional features

From PyPI (recommended for production)

Architecture

Modules

Abstract Base Classes

Default Implementations

Key Design Decisions

Embeddings: LangChain Embeddings Objects

Vector Stores: LangChain VectorStore Interface

System Diagram

Extensibility

Configuration

Entity Extraction Controls

Extraction Methods

Quality and Coverage

Prompt and Parsing

Chunking

Comparison with GraphRAG-style projects

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance