A set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.
RAGdoll: A Flexible and Extensible RAG Framework
Welcome to RAGdoll 2.0! This release marks a significant overhaul of the RAGdoll project, focusing on enhanced flexibility, extensibility, and maintainability. We've completely refactored the core architecture to make it easier to adapt RAGdoll to your specific needs and integrate it with the broader LangChain ecosystem. This document outlines the major changes and improvements you'll find in this new version.
🧭 Project Overview
RAGdoll 2 is an extensible framework for building Retrieval-Augmented Generation (RAG) applications. It provides a modular architecture that allows you to easily integrate various data sources, chunking strategies, embedding models, vector stores, large language models (LLMs), and graph stores. RAGdoll is designed to be flexible and fast, and to accommodate a broad array of file types without any initial dependency on third-party hosted services, using langchain-markitdown. The loaders can easily be swapped out for any compatible LangChain loader when you are ready for production.
Note that RAGdoll 2 is a complete overhaul of the initial RAGdoll project and is not backwards compatible in any respect.
What's New
Enhanced Features in RAGdoll 2.0
This version of RAGdoll introduces several key features that improve the flexibility and usability of the framework:
- Caching: RAGdoll now supports caching, allowing you to store and reuse results from previous operations. This can significantly speed up the execution of your RAG applications by avoiding redundant computations.
- Auto Loader Selection: RAGdoll now includes loaders for multiple file types (not only PDF). The loader defaults to Langchain-Markitdown loaders, but can be configured to use any LangChain-compatible loader.
- Monitoring: A new monitoring capability has been added to RAGdoll. This allows you to track and understand the performance and behavior of your RAG applications over time.
# Enable monitoring in config
monitor:
  enabled: true
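To make the caching bullet above more concrete, here is a minimal sketch of the general idea: key an expensive result by a hash of its input and reuse it on later calls. This is not RAGdoll's cache implementation; ContentCache and fake_embed are hypothetical names used only for illustration.

```python
import hashlib
from typing import Callable, Dict, List


class ContentCache:
    """Hypothetical in-memory cache keyed by a SHA-256 hash of the input text."""

    def __init__(self, compute: Callable[[str], List[float]]) -> None:
        self._compute = compute
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            # Same content seen before: reuse the stored result.
            self.hits += 1
            return self._store[key]
        # First time we see this content: compute and remember it.
        self.misses += 1
        value = self._compute(text)
        self._store[key] = value
        return value


def fake_embed(text: str) -> List[float]:
    """Stand-in for an expensive embedding call (an assumption, not a real model)."""
    return [float(len(text)), float(text.count(" "))]


cache = ContentCache(fake_embed)
first = cache.get("hello world")
second = cache.get("hello world")  # served from the cache, no recomputation
```

Hashing the content rather than the file path means a renamed but unchanged document would still hit the cache in this sketch.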
Quick Start Guide
Here's a quick example of how to get started with RAGdoll using the new LLM caller abstraction:
from ragdoll.ragdoll import Ragdoll
from ragdoll.llms import get_llm_caller
# Resolve whichever model is marked as default in config (or pass a model name).
llm_caller = get_llm_caller()
# Spin up the orchestrator with sensible defaults.
ragdoll = Ragdoll(llm_caller=llm_caller)
# Ingest a few local files (vector store + caches handled automatically).
ragdoll.ingest_data(["path/to/document.md", "path/to/notes.pdf"])
# Run a retrieval + answer round trip.
result = ragdoll.query("What is the capital of France?")
print(result["answer"])
Need finer control over loaders or paths? Use settings.get_app() (or bootstrap_app with overrides) to obtain the shared AppConfig, tweak its config, and pass component overrides into Ragdoll.
Graph Retrieval Pipeline
When you enable entity_extraction.graph_retriever.enabled in your config, you can trigger the full ingestion pipeline (chunking, embeddings, entity extraction, graph persistence) and retrieve a knowledge-graph-aware retriever directly from the Ragdoll API:
import asyncio

from ragdoll.ragdoll import Ragdoll
from ragdoll.pipeline import IngestionOptions


async def main():
    ragdoll = Ragdoll()
    result = await ragdoll.ingest_with_graph(
        ["path/to/docs/manual.pdf"],
        options=IngestionOptions(parallel_extraction=False),
    )
    print(result["stats"])  # ingestion metrics
    print(result["graph"])  # pydantic Graph object
    retriever = result["graph_retriever"]
    answers = retriever.invoke("How does the widget fail-safe work?")


asyncio.run(main())
The helper ingest_with_graph_sync() wraps asyncio.run() for scripts that are not already running an event loop.
See examples/graph_retriever_example.py for a complete runnable script.
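The sync-helper pattern mentioned above can be sketched in a few lines. This only illustrates wrapping a coroutine with asyncio.run(); it is not the library's actual code, and ingest_async and ingest_sync are hypothetical stand-ins.

```python
import asyncio
from typing import Dict, List


async def ingest_async(paths: List[str]) -> Dict[str, int]:
    """Hypothetical async ingestion step; yields once to mimic I/O."""
    await asyncio.sleep(0)
    return {"documents": len(paths)}


def ingest_sync(paths: List[str]) -> Dict[str, int]:
    """Blocking wrapper for scripts that have no running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe to call.
        return asyncio.run(ingest_async(paths))
    raise RuntimeError(
        "An event loop is already running; await ingest_async() instead."
    )


stats = ingest_sync(["a.md", "b.pdf"])
```

Inside a notebook or an async web handler a loop is already running, which is why the wrapper in this sketch refuses to nest asyncio.run() and points callers at the coroutine instead.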
How Vector and Graph Stores Work Together
Ragdoll keeps both storage backends under the same orchestration surface:
- Ragdoll.ingest_data(...) (or the lower-level IngestionPipeline) always loads documents, chunks them, embeds each chunk, and writes those embeddings into the configured vector store.
- When entity_extraction.extract_entities (or entity_extraction.graph_retriever.enabled) is true, the same pipeline also fans out chunks to the entity extraction service, which generates a graph, persists it through the configured graph store, and can return a graph-aware retriever.
- Both flows are coordinated inside IngestionPipeline: it receives the shared AppConfig, builds the ingestion service, embedding model, vector store, and optionally graph store, and emits stats/retrievers back through Ragdoll.
So even though ragdoll/vector_stores and ragdoll/graph_stores live in separate packages, their lifecycle is tied together via the pipeline entry points shown above.
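As a rough illustration of that shared lifecycle, the toy pipeline below always writes chunks to a vector store and only fans out to a graph store when one is supplied. Everything here (ToyVectorStore, ToyGraphStore, toy_ingest, the one-number "embedding", and the capitalized-word "entity extraction") is a simplified assumption, not RAGdoll's IngestionPipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ToyVectorStore:
    """Hypothetical vector store: keeps (chunk, vector) pairs in a list."""
    rows: List[Tuple[str, List[float]]] = field(default_factory=list)

    def add(self, chunk: str, vector: List[float]) -> None:
        self.rows.append((chunk, vector))


@dataclass
class ToyGraphStore:
    """Hypothetical graph store: keeps directed edges in a set."""
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_edge(self, source: str, target: str) -> None:
        self.edges.add((source, target))


def toy_ingest(
    text: str,
    vector_store: ToyVectorStore,
    graph_store: Optional[ToyGraphStore] = None,
) -> Dict[str, int]:
    # Chunking: one chunk per non-empty line.
    chunks = [line for line in text.splitlines() if line.strip()]
    for chunk in chunks:
        # The vector path always runs; the "embedding" is a placeholder.
        vector_store.add(chunk, [float(len(chunk))])
        if graph_store is not None:
            # The graph path is optional: link consecutive capitalized words.
            entities = [w.strip(".,") for w in chunk.split() if w[:1].isupper()]
            for source, target in zip(entities, entities[1:]):
                graph_store.add_edge(source, target)
    edge_count = len(graph_store.edges) if graph_store is not None else 0
    return {"chunks": len(chunks), "edges": edge_count}


vectors, graph = ToyVectorStore(), ToyGraphStore()
stats = toy_ingest("Paris is in France.\nBerlin is in Germany.", vectors, graph)
```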
Installation
To install RAGdoll, follow these steps:
Stable version install
pip install python-ragdoll
Latest version install
- Clone the Repository:
git clone https://github.com/nsasto/RAGdoll.git
cd RAGdoll
- Install Dependencies:
pip install -e .
This will install the required dependencies, including LangChain and Pydantic.
Installation with optional features
RAGdoll supports optional dependency groups for different use cases:
# Base install (core functionality only)
pip install -e .
# Development tools (testing, linting, formatting)
# Extras are quoted so shells such as zsh do not expand the brackets.
pip install -e ".[dev]"
# Entity extraction and NLP features (spaCy, sentence transformers, PDF processing)
pip install -e ".[entity]"
# Graph database support (Neo4j, RDF)
pip install -e ".[graph]"
# All optional features combined
pip install -e ".[all]"
From PyPI (recommended for production)
# Base install
pip install python-ragdoll
# With optional features
pip install "python-ragdoll[all]" # or [dev], [entity], [graph]
Architecture
RAGdoll's architecture is built around modular components and abstract base classes, making it highly extensible. Here's an overview of the key modules:
Modules
- loaders: Responsible for loading data from various sources (e.g., directories, JSON files, web pages).
- chunkers: Handles the splitting of large text documents into smaller chunks.
- embeddings: Provides an interface for embedding models, allowing you to generate vector representations of text.
- vector_stores: Manages the storage and retrieval of vector embeddings.
- llms: Provides an interface to interact with different large language models.
- graph_stores: Manages the storage and querying of knowledge graphs.
- chains: Defines different types of chains, such as retrieval QA (not yet implemented).
Abstract Base Classes
Each module has an abstract base class (BaseLoader, BaseChunker, BaseEmbeddings, BaseVectorStore, BaseGraphStore, BaseChain) or protocol (the BaseLLMCaller interface) that defines a standard contract for that component type.
Default Implementations
RAGdoll provides default implementations for most components, allowing you to quickly get started without having to write everything from scratch:
- Langchain-Markitdown: A default loader for most major file types. See docs/loader_registry.md for information on the loader registry and how to register custom loader classes under short names.
- RecursiveCharacterTextSplitter: A default text splitter.
- OpenAIEmbeddings: Default embeddings that use OpenAI's API.
- LangChain VectorStore factory: Plug-and-play wrapper for any LangChain vector store (Chroma, FAISS, etc.); see docs/vector_stores.md.
- OpenAILLM: A default OpenAI LLM.
- BaseGraphStore: The abstract graph store base class; a concrete implementation must be supplied.
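The idea of registering loader classes under short names and picking one automatically can be sketched as follows. This is a generic registry pattern under assumed names (register_loader, resolve_loader, PlainTextLoader, MarkdownLoader); the real registry is described in docs/loader_registry.md and may work differently.

```python
from pathlib import Path
from typing import Callable, Dict

# Maps a short name (here: a file extension without the dot) to a loader class.
_LOADERS: Dict[str, type] = {}


def register_loader(short_name: str) -> Callable[[type], type]:
    """Class decorator that records a loader under a short name."""
    def decorator(cls: type) -> type:
        _LOADERS[short_name.lower()] = cls
        return cls
    return decorator


@register_loader("txt")
class PlainTextLoader:
    """Hypothetical loader used as the fallback."""
    def __init__(self, path: str) -> None:
        self.path = path


@register_loader("md")
class MarkdownLoader:
    """Hypothetical loader for Markdown files."""
    def __init__(self, path: str) -> None:
        self.path = path


def resolve_loader(path: str, default: str = "txt") -> type:
    """Pick a loader class from the file extension, falling back to a default."""
    suffix = Path(path).suffix.lstrip(".").lower()
    return _LOADERS.get(suffix, _LOADERS[default])


loader_cls = resolve_loader("notes.MD")
```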
Key Design Decisions
RAGdoll 2.0 embraces LangChain's ecosystem for maximum flexibility and maintainability:
Embeddings: LangChain Embeddings Objects
- Decision: Use LangChain Embeddings objects directly instead of creating custom embedding classes.
- Rationale: LangChain provides robust, well-tested embedding implementations. Creating custom wrappers adds unnecessary complexity and maintenance burden.
- Benefits: Immediate access to all LangChain embedding providers (OpenAI, HuggingFace, etc.), automatic updates, consistent APIs.
- Implementation: ragdoll.embeddings.get_embedding_model reads your config and returns a ready-to-use LangChain embedding instance.
Vector Stores: LangChain VectorStore Interface
- Decision: Accept any LangChain VectorStore object directly instead of requiring custom adapters.
- Rationale: LangChain supports 40+ vector stores with consistent interfaces. Custom adapters create maintenance overhead and limit ecosystem integration.
- Benefits: Plug-and-play compatibility with any LangChain vector store (Chroma, FAISS, Pinecone, Weaviate, etc.), zero adapter code needed, future-proof with LangChain updates.
- Implementation: BaseVectorStore wraps LangChain VectorStore objects and delegates operations.
This design maximizes ecosystem compatibility while keeping RAGdoll's core orchestration logic clean and focused.
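A wrap-and-delegate design of this kind might look roughly like the sketch below. InMemoryStore is a stand-in that only mimics the add_texts and similarity_search method names found on LangChain vector stores, and DelegatingVectorStore is a hypothetical wrapper, not RAGdoll's BaseVectorStore.

```python
from typing import List


class InMemoryStore:
    """Stand-in store exposing add_texts / similarity_search (assumed shape)."""

    def __init__(self) -> None:
        self._texts: List[str] = []

    def add_texts(self, texts: List[str]) -> List[str]:
        start = len(self._texts)
        self._texts.extend(texts)
        # Return simple positional ids for the newly added texts.
        return [str(start + i) for i in range(len(texts))]

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        # Naive ranking by keyword overlap; a real store compares embeddings.
        query_words = set(query.lower().split())
        ranked = sorted(
            self._texts,
            key=lambda text: -len(query_words & set(text.lower().split())),
        )
        return ranked[:k]


class DelegatingVectorStore:
    """Hypothetical thin wrapper: holds a store and forwards every call to it."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    def add(self, texts: List[str]) -> List[str]:
        return self._store.add_texts(texts)

    def search(self, query: str, k: int = 4) -> List[str]:
        return self._store.similarity_search(query, k=k)


wrapper = DelegatingVectorStore(InMemoryStore())
ids = wrapper.add(["cats purr", "dogs bark"])
top = wrapper.search("do dogs bark", k=1)
```

Because the wrapper only forwards calls, swapping the backing store requires no changes to the code that uses the wrapper, as long as the new store offers the same two methods.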
System Diagram
For a visual walkthrough of how the ingestion, knowledge build, and query-time pieces connect, see the architecture diagram below (also available in docs/architecture.md):
graph TD
%% Ingestion + Chunking
subgraph Ingestion
A["Input sources<br/>(files, URLs, loaders)"] --> B["Loader pipeline"]
B --> C["Chunking Service<br/>(BaseChunkingService + plugins)"]
end
C --> D["Chunks (GTChunk)"]
%% Knowledge Construction
subgraph Knowledge_Build
D --> E["Information Extraction<br/>(entities & relations)"]
E --> F["Knowledge Graph Upsert<br/>(policy interface)"]
E --> G["Embedding Pipeline<br/>(single pass)"]
G --> H["VectorStoreAdapter<br/>(dynamic class e.g. Chroma/Hnswlib)"]
F --> I(("Graph Storage<br/>(IGraphStorage, Neo4j, etc.)"))
H --> J(("Vector DB"))
end
%% Query + Reasoning
subgraph Query_Runtime
Q["User Query"] --> R["State Manager<br/>(retrieval orchestration)"]
R --> H
R --> I
R --> S["Context Assembly<br/>(chunks + KG facts)"]
S --> T["Prompt Builder"]
T --> U["LLM Caller<br/>(LangChain adapters)"]
U --> V["Answer"]
end
style A fill:#ccf,stroke:#333,stroke-width:2px
style B fill:#ccf,stroke:#333,stroke-width:2px
style C fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#fef3c7,stroke:#333,stroke-width:2px
style E fill:#f9f,stroke:#333,stroke-width:2px
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#f9f,stroke:#333,stroke-width:2px
style H fill:#d1fae5,stroke:#333,stroke-width:2px
style I fill:#dbeafe,stroke:#333,stroke-width:2px
style J fill:#dbeafe,stroke:#333,stroke-width:2px
style Q fill:#fde68a,stroke:#333,stroke-width:2px
style R fill:#f9f,stroke:#333,stroke-width:2px
style S fill:#fef3c7,stroke:#333,stroke-width:2px
style T fill:#f9f,stroke:#333,stroke-width:2px
style U fill:#e0e7ff,stroke:#333,stroke-width:2px
style V fill:#ccf,stroke:#333,stroke-width:2px
Extensibility
RAGdoll is designed to be highly extensible. You can easily create custom components by following these steps:
- Subclass the Base Class: Create a new class that inherits from the relevant base class (e.g., BaseLoader, BaseEmbeddings).
- Implement Abstract Methods: Implement the abstract methods defined in the base class to provide your custom functionality.
- Integrate into RAGdoll: Pass an instance of your custom component to the Ragdoll class when you create it.
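The first two steps above follow the standard Python abstract-base-class pattern, sketched here. ExampleBaseLoader is a stand-in defined only for this example; the real BaseLoader and its abstract methods may differ, so check the base class you are extending.

```python
from abc import ABC, abstractmethod
from typing import List


class ExampleBaseLoader(ABC):
    """Stand-in contract (an assumption): a loader turns a source into text records."""

    @abstractmethod
    def load(self, source: str) -> List[str]:
        """Return the records found in `source`."""


class LineLoader(ExampleBaseLoader):
    """Custom component: treats every non-empty line of the source as a record."""

    def load(self, source: str) -> List[str]:
        return [line.strip() for line in source.splitlines() if line.strip()]


loader = LineLoader()
records = loader.load("a,b\n\nc,d\n")
```

Python refuses to instantiate a subclass that leaves an abstract method unimplemented, so a missing method surfaces when the component is created rather than midway through ingestion.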
Configuration
RAGdoll uses Pydantic to manage its configuration. This allows for:
- Data Validation: Automatic validation of configuration values.
- Type Hints: Clear type definitions for configuration settings.
- Default Values: Convenient default values for configuration options.
You can obtain the shared AppConfig, adjust its config, and pass it to the Ragdoll class:
from ragdoll import settings
from ragdoll.ragdoll import Ragdoll
# Grab the shared AppConfig (respects RAGDOLL_CONFIG_PATH when set)
app = settings.get_app()
config = app.config
config._config["vector_store"]["params"]["persist_directory"] = "./my_vectors"
# Create Ragdoll with this configuration
ragdoll = Ragdoll(app_config=app)
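Nested overrides such as the persist_directory change above amount to merging a small dictionary into a larger one. The sketch below shows one way to do that without mutating the defaults. It assumes the configuration is a plain nested dict; deep_merge is a hypothetical helper, not part of RAGdoll, and the store_type value is illustrative only.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `overrides` is merged into `base`, level by level."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative defaults (assumed keys and values, not the real default_config.yaml).
defaults = {
    "vector_store": {
        "store_type": "chroma",
        "params": {"persist_directory": "./vectors"},
    }
}
custom = deep_merge(
    defaults,
    {"vector_store": {"params": {"persist_directory": "./my_vectors"}}},
)
```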
Entity Extraction Controls
The entity_extraction section of default_config.yaml now exposes several knobs for graph-centric workflows:
- relationship_parsing: choose the preferred output format (json, markdown, auto), optionally supply a custom parser class or schema, and pass parser-specific kwargs. This lets you tighten validation for LLM responses (e.g., point at your own Pydantic schema).
- relationship_prompts: declare a default prompt template plus per-provider overrides (e.g., map "anthropic" to a Claude-specific prompt). The service picks the prompt whose provider matches the active BaseLLMCaller.
- graph_retriever: enable creation of a graph retriever after entity extraction, select the backend (simple or neo4j/langchain_neo4j), and tune parameters like top_k or include_edges. When enabled, EntityExtractionService and IngestionPipeline expose a retriever you can plug into downstream chains.
Example excerpt:
entity_extraction:
  relationship_parsing:
    preferred_format: "markdown"
    schema: "my_project.schemas.RelationshipListV2"
  relationship_prompts:
    default: "relationship_extraction"
    providers:
      openai: "relationship_extraction_openai"
      anthropic: "relationship_extraction_claude"
  graph_retriever:
    enabled: true
    backend: "neo4j"
    top_k: 10
See docs/configuration.md for the full field reference.
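The per-provider prompt lookup described for relationship_prompts can be pictured as a dictionary lookup with a default fallback. The sketch below mirrors the YAML excerpt above; select_prompt is a hypothetical helper, and how the real service determines the active provider is not shown here.

```python
from typing import Any, Dict, Optional


def select_prompt(prompts_config: Dict[str, Any], provider: Optional[str]) -> str:
    """Return the provider-specific prompt name, or the default if none matches."""
    providers = prompts_config.get("providers", {})
    key = (provider or "").lower()
    return providers.get(key, prompts_config["default"])


# Same values as the relationship_prompts section of the excerpt.
relationship_prompts = {
    "default": "relationship_extraction",
    "providers": {
        "openai": "relationship_extraction_openai",
        "anthropic": "relationship_extraction_claude",
    },
}

claude_prompt = select_prompt(relationship_prompts, "Anthropic")
fallback_prompt = select_prompt(relationship_prompts, "mistral")
```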
Contributing
Contributions to RAGdoll are welcome! To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Make your changes and write tests.
- Submit a pull request.
License
RAGdoll is licensed under the MIT License.