A set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.

These details have not been verified by PyPI

Project description

Ragdoll

RAGdoll: A Flexible and Extensible RAG Framework

Welcome to RAGdoll 2.0! This release is a significant overhaul of the RAGdoll project, focused on flexibility, extensibility, and maintainability. The core architecture has been completely refactored to make it easier to adapt RAGdoll to your specific needs and to integrate it with the broader LangChain ecosystem. This document outlines the major changes and improvements in the new version.

🧭 Project Overview

RAGdoll 2 is an extensible framework for building Retrieval-Augmented Generation (RAG) applications. It provides a modular architecture that allows you to easily integrate various data sources, chunking strategies, embedding models, vector stores, large language models (LLMs), and graph stores. RAGdoll is designed to be flexible and fast. It is also designed to accommodate a broad array of file types without any initial dependency on third-party hosted services, using langchain-markitdown. The loaders can easily be swapped out for any compatible LangChain loader when you are ready for production.

Note that RAGdoll 2 is a complete overhaul of the initial RAGdoll project and is not backwards compatible in any respect.

What's New

Enhanced Features in RAGdoll 2.0

This version of RAGdoll introduces several key features that improve the flexibility and usability of the framework:

Caching: RAGdoll now supports caching, allowing you to store and reuse results from previous operations. This can significantly speed up the execution of your RAG applications by avoiding redundant computations.
Auto Loader Selection: RAGdoll now includes loaders for multiple file types (not only PDF). It defaults to the langchain-markitdown loaders, but can be configured to use any LangChain-compatible loader.
Monitoring: A new monitoring capability has been added to RAGdoll. This allows you to track and understand the performance and behavior of your RAG applications over time.

# Enable monitoring in config
monitor:
  enabled: true
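To make the caching feature more concrete, here is a minimal sketch of the general idea, not RAGdoll's actual cache. It keys results on a SHA-256 hash of the input so a repeated call skips the expensive step. The names `ResultCache` and `expensive_embed` are hypothetical and exist only for this illustration.

```python
import hashlib
from typing import Callable, Dict, List


class ResultCache:
    """Hypothetical in-memory cache keyed by a hash of the input text."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, text: str, compute: Callable[[str], List[float]]
    ) -> List[float]:
        # Hash the input so long documents produce short, fixed-size keys.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(text)
        self._store[key] = value
        return value


def expensive_embed(text: str) -> List[float]:
    """Stand-in for a slow embedding call (illustrative only)."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = ResultCache()
first = cache.get_or_compute("hello world", expensive_embed)
second = cache.get_or_compute("hello world", expensive_embed)  # served from cache
```

The second call returns the stored value without invoking `expensive_embed` again, which is where the speed-up for repeated operations comes from.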

Quick Start Guide

Here's a quick example of how to get started with RAGdoll using the new LLM caller abstraction:

from ragdoll.ragdoll import Ragdoll
from ragdoll.llms import get_llm_caller

# Resolve whichever model is marked as default in config (or pass a model name).
llm_caller = get_llm_caller()

# Spin up the orchestrator with sensible defaults.
ragdoll = Ragdoll(llm_caller=llm_caller)

# Ingest a few local files (vector store + caches handled automatically).
ragdoll.ingest_data(["path/to/document.md", "path/to/notes.pdf"])

# Run a retrieval + answer round trip.
result = ragdoll.query("What is the capital of France?")
print(result["answer"])

Need finer control over loaders or paths? Use settings.get_app() (or bootstrap_app with overrides) to obtain the shared AppConfig, tweak its config, and pass component overrides into Ragdoll.

Graph Retrieval Pipeline

When you enable entity_extraction.graph_retriever.enabled in your config, you can trigger the full ingestion pipeline (chunking, embeddings, entity extraction, graph persistence) and retrieve a knowledge-graph-aware retriever directly from the Ragdoll API:

import asyncio
from ragdoll.ragdoll import Ragdoll
from ragdoll.pipeline import IngestionOptions

async def main():
    ragdoll = Ragdoll()
    result = await ragdoll.ingest_with_graph(
        ["path/to/docs/manual.pdf"],
        options=IngestionOptions(parallel_extraction=False),
    )
    print(result["stats"])           # ingestion metrics
    print(result["graph"])           # pydantic Graph object
    retriever = result["graph_retriever"]
    answers = retriever.invoke("How does the widget fail-safe work?")
    print(answers)                   # results returned by the graph retriever

asyncio.run(main())

The helper ingest_with_graph_sync() wraps asyncio.run() for scripts that are not already running an event loop. See examples/graph_retriever_example.py for a complete runnable script.
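The pattern of wrapping an async entry point for synchronous scripts can be sketched as follows. This is an illustration of the general technique using only the standard library, not the source of `ingest_with_graph_sync()`; `ingest_async` and `ingest_sync` are hypothetical names.

```python
import asyncio
from typing import Any, Dict, List


async def ingest_async(paths: List[str]) -> Dict[str, Any]:
    """Hypothetical async ingestion step; yields to the loop instead of doing real work."""
    await asyncio.sleep(0)
    return {"stats": {"documents": len(paths)}}


def ingest_sync(paths: List[str]) -> Dict[str, Any]:
    """Blocking wrapper: asyncio.run() creates an event loop, runs the coroutine, then closes the loop."""
    return asyncio.run(ingest_async(paths))


result = ingest_sync(["a.md", "b.pdf"])
print(result["stats"])
```

Because `asyncio.run()` raises a `RuntimeError` when called from a thread that already has a running event loop, a wrapper like this suits plain scripts, while code already running inside a loop should await the async variant directly.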

How Vector and Graph Stores Work Together

Ragdoll keeps both storage backends under the same orchestration surface:

Ragdoll.ingest_data(...) (or the lower-level IngestionPipeline) always loads documents, chunks them, embeds each chunk, and writes those embeddings into the configured vector store.
When entity_extraction.extract_entities (or entity_extraction.graph_retriever.enabled) is true, the same pipeline also fans out chunks to the entity extraction service, which generates a graph, persists it through the configured graph store, and can return a graph-aware retriever.
Both flows are coordinated inside IngestionPipeline: it receives the shared AppConfig, builds the ingestion service, embedding model, vector store, and optionally graph store, and emits stats/retrievers back through Ragdoll.

So even though ragdoll/vector_stores and ragdoll/graph_stores live in separate packages, their lifecycle is tied together via the pipeline entry points shown above.
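The flow described above, where vector writes always happen and graph writes are optional, can be sketched with toy stand-ins. Everything here (`ToyVectorStore`, `ToyGraphStore`, `run_pipeline`, the fake embedding, and the capitalized-word "extraction") is a hypothetical simplification and does not reflect the real IngestionPipeline internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyVectorStore:
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def add(self, chunk: str, vector: List[float]) -> None:
        self.vectors[chunk] = vector


@dataclass
class ToyGraphStore:
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))


def chunk_text(text: str) -> List[str]:
    """Naive sentence chunking on periods, for illustration only."""
    return [s.strip() for s in text.split(".") if s.strip()]


def embed(chunk: str) -> List[float]:
    """Fake embedding: chunk length and vowel count."""
    return [float(len(chunk)), float(sum(c in "aeiou" for c in chunk.lower()))]


def run_pipeline(
    text: str,
    vector_store: ToyVectorStore,
    graph_store: Optional[ToyGraphStore] = None,
    extract_entities: bool = False,
) -> Dict[str, int]:
    chunks = chunk_text(text)
    for chunk in chunks:
        vector_store.add(chunk, embed(chunk))  # the vector write always happens
        if extract_entities and graph_store is not None:
            # Toy "entity extraction": link consecutive capitalized words.
            names = [w.strip(".,") for w in chunk.split() if w[:1].isupper()]
            for left, right in zip(names, names[1:]):
                graph_store.add_edge(left, "mentioned_with", right)
    edge_count = len(graph_store.edges) if graph_store is not None else 0
    return {"chunks": len(chunks), "edges": edge_count}


text = "Alice met Bob in Paris. Carol stayed home."
vectors, graph = ToyVectorStore(), ToyGraphStore()
stats_with_graph = run_pipeline(text, vectors, graph, extract_entities=True)
stats_vector_only = run_pipeline(text, ToyVectorStore())
```

Keeping both writes inside one function mirrors the point made above: the stores are separate objects, but a single entry point decides when each one is populated.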

Installation

To install RAGdoll, follow these steps:

Stable version install

pip install python-ragdoll

Latest version install

Clone the Repository:

    git clone https://github.com/nsasto/RAGdoll.git
    cd RAGdoll

Install Dependencies:

    pip install -e .

This installs the required dependencies, including LangChain and Pydantic.

Installation with optional features

RAGdoll supports optional dependency groups for different use cases:

# Base install (core functionality only)
pip install -e .

# Development tools (testing, linting, formatting)
# Quoting the extras keeps shells such as zsh from expanding the brackets.
pip install -e ".[dev]"

# Entity extraction and NLP features (spaCy, sentence transformers, PDF processing)
pip install -e ".[entity]"

# Graph database support (Neo4j, RDF)
pip install -e ".[graph]"

# All optional features combined
pip install -e ".[all]"

From PyPI (recommended for production)

# Base install
pip install python-ragdoll

# With optional features
pip install "python-ragdoll[all]"  # or [dev], [entity], [graph]

Architecture

RAGdoll's architecture is built around modular components and abstract base classes, making it highly extensible. Here's an overview of the key modules:

Modules

loaders: Responsible for loading data from various sources (e.g., directories, JSON files, web pages).
chunkers: Handles the splitting of large text documents into smaller chunks.
embeddings: Provides an interface for embedding models, allowing you to generate vector representations of text.
vector_stores: Manages the storage and retrieval of vector embeddings.
llms: Provides an interface to interact with different large language models.
graph_stores: Manages the storage and querying of knowledge graphs.
chains: Defines different types of chains, such as retrieval QA (not yet implemented).

Abstract Base Classes

Each module has an abstract base class (BaseLoader, BaseChunker, BaseEmbeddings, BaseVectorStore, BaseGraphStore, BaseChain) or protocol (the BaseLLMCaller interface) that defines a standard contract for that component type.
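The difference between an abstract base class and a protocol is that a protocol is satisfied by shape rather than inheritance. The sketch below illustrates this with `typing.Protocol`. The `LLMCallerLike` protocol and its `call` method are hypothetical and are not the actual BaseLLMCaller definition.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMCallerLike(Protocol):
    """Hypothetical caller protocol: anything with a matching `call` method qualifies."""

    def call(self, prompt: str) -> str:
        ...


class EchoCaller:
    """Does not inherit from the protocol; it only matches its shape."""

    def call(self, prompt: str) -> str:
        return f"echo: {prompt}"


def answer(caller: LLMCallerLike, question: str) -> str:
    # Accepts any object that structurally satisfies the protocol.
    return caller.call(question)


reply = answer(EchoCaller(), "ping")
print(reply)
```

With `@runtime_checkable`, `isinstance` only checks that the method exists, not its signature, so static type checking remains the stronger guarantee.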

Default Implementations

RAGdoll provides default implementations for most components, allowing you to quickly get started without having to write everything from scratch:

langchain-markitdown: A default loader for most major file types. See docs/loader_registry.md for information on the loader registry and how to register custom loader classes under short names.
RecursiveCharacterTextSplitter: A default text splitter.
OpenAIEmbeddings: Default embeddings that use OpenAI's API.
LangChain VectorStore factory: Plug-and-play wrapper for any LangChain vector store (Chroma, FAISS, etc.); see docs/vector_stores.md.
OpenAILLM: A default OpenAI LLM.
BaseGraphStore: The graph store base class. It is an interface only and must be implemented before use.

Key Design Decisions

RAGdoll 2.0 embraces LangChain's ecosystem for maximum flexibility and maintainability:

Embeddings: LangChain Embeddings Objects

Decision: Use LangChain Embeddings objects directly instead of creating custom embedding classes
Rationale: LangChain provides robust, well-tested embedding implementations. Creating custom wrappers adds unnecessary complexity and maintenance burden.
Benefits: Immediate access to all LangChain embedding providers (OpenAI, HuggingFace, etc.), automatic updates, consistent APIs.
Implementation: ragdoll.embeddings.get_embedding_model reads your config and returns a ready-to-use LangChain embedding instance.

Vector Stores: LangChain VectorStore Interface

Decision: Accept any LangChain VectorStore object directly instead of requiring custom adapters
Rationale: LangChain supports 40+ vector stores with consistent interfaces. Custom adapters create maintenance overhead and limit ecosystem integration.
Benefits: Plug-and-play compatibility with any LangChain vector store (Chroma, FAISS, Pinecone, Weaviate, etc.), zero adapter code needed, future-proof with LangChain updates.
Implementation: BaseVectorStore wraps LangChain VectorStore objects and delegates operations.

This design maximizes ecosystem compatibility while keeping RAGdoll's core orchestration logic clean and focused.
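The wrap-and-delegate approach can be sketched as follows. The method names `add_texts` and `similarity_search` mirror LangChain's VectorStore interface, but `InMemoryStore` and `StoreWrapper` are hypothetical toys: the toy store returns plain strings and ranks by shared words rather than embeddings, and the wrapper is not RAGdoll's BaseVectorStore.

```python
from typing import Any, List


class InMemoryStore:
    """Toy backend standing in for a real vector store object."""

    def __init__(self) -> None:
        self.texts: List[str] = []

    def add_texts(self, texts: List[str]) -> List[str]:
        start = len(self.texts)
        self.texts.extend(texts)
        return [str(i) for i in range(start, len(self.texts))]

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        # Naive ranking by shared-word count, not real vector similarity.
        words = set(query.lower().split())
        ranked = sorted(
            self.texts, key=lambda t: -len(words & set(t.lower().split()))
        )
        return ranked[:k]


class StoreWrapper:
    """Hypothetical thin wrapper that forwards each call to the wrapped store."""

    def __init__(self, store: Any) -> None:
        self._store = store

    def add_texts(self, texts: List[str]) -> List[str]:
        return self._store.add_texts(texts)

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        return self._store.similarity_search(query, k=k)


wrapper = StoreWrapper(InMemoryStore())
ids = wrapper.add_texts(["red apple pie", "blue ocean wave"])
hits = wrapper.similarity_search("apple pie recipe", k=1)
```

Because the wrapper holds no storage logic of its own, any object exposing the same two methods could be passed in without further adapter code.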

System Diagram

For a visual walkthrough of how the ingestion, knowledge build, and query-time pieces connect, see the architecture diagram below (also available in docs/architecture.md):

graph TD
    %% Ingestion + Chunking
    subgraph Ingestion
        A["Input sources<br/>(files, URLs, loaders)"] --> B["Loader pipeline"]
        B --> C["Chunking Service<br/>(BaseChunkingService + plugins)"]
    end
    C --> D["Chunks (GTChunk)"]

    %% Knowledge Construction
    subgraph Knowledge_Build
        D --> E["Information Extraction<br/>(entities & relations)"]
        E --> F["Knowledge Graph Upsert<br/>(policy interface)"]
        E --> G["Embedding Pipeline<br/>(single pass)"]
        G --> H["VectorStoreAdapter<br/>(dynamic class e.g. Chroma/Hnswlib)"]
        F --> I(("Graph Storage<br/>(IGraphStorage, Neo4j, etc.)"))
        H --> J(("Vector DB"))
    end

    %% Query + Reasoning
    subgraph Query_Runtime
        Q["User Query"] --> R["State Manager<br/>(retrieval orchestration)"]
        R --> H
        R --> I
        R --> S["Context Assembly<br/>(chunks + KG facts)"]
        S --> T["Prompt Builder"]
        T --> U["LLM Caller<br/>(LangChain adapters)"]
        U --> V["Answer"]
    end

    style A fill:#ccf,stroke:#333,stroke-width:2px
    style B fill:#ccf,stroke:#333,stroke-width:2px
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#fef3c7,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#f9f,stroke:#333,stroke-width:2px
    style H fill:#d1fae5,stroke:#333,stroke-width:2px
    style I fill:#dbeafe,stroke:#333,stroke-width:2px
    style J fill:#dbeafe,stroke:#333,stroke-width:2px
    style Q fill:#fde68a,stroke:#333,stroke-width:2px
    style R fill:#f9f,stroke:#333,stroke-width:2px
    style S fill:#fef3c7,stroke:#333,stroke-width:2px
    style T fill:#f9f,stroke:#333,stroke-width:2px
    style U fill:#e0e7ff,stroke:#333,stroke-width:2px
    style V fill:#ccf,stroke:#333,stroke-width:2px

Extensibility

RAGdoll is designed to be highly extensible. You can easily create custom components by following these steps:

Subclass the Base Class: Create a new class that inherits from the relevant base class (e.g., BaseLoader, BaseEmbeddings).
Implement Abstract Methods: Implement the abstract methods defined in the base class to provide your custom functionality.
Integrate into RAGdoll: Pass an instance of your custom component to the Ragdoll class when you create it.
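The three steps above can be sketched with a stand-alone example. The base class, its `split` method, and the `MiniOrchestrator` are hypothetical stand-ins; RAGdoll's real base classes may define different abstract methods, and the real Ragdoll constructor may use different parameter names.

```python
from abc import ABC, abstractmethod
from typing import List


class ChunkerBase(ABC):
    """Hypothetical base class defining the contract for a chunker."""

    @abstractmethod
    def split(self, text: str) -> List[str]:
        ...


class SentenceChunker(ChunkerBase):
    """Steps 1 and 2: subclass the base and implement its abstract method."""

    def split(self, text: str) -> List[str]:
        return [s.strip() + "." for s in text.split(".") if s.strip()]


class MiniOrchestrator:
    """Step 3: the orchestrator receives the custom component as an argument."""

    def __init__(self, chunker: ChunkerBase) -> None:
        self.chunker = chunker

    def ingest(self, text: str) -> List[str]:
        return self.chunker.split(text)


pieces = MiniOrchestrator(SentenceChunker()).ingest("One. Two. Three.")
print(pieces)
```

Python refuses to instantiate a class that leaves an abstract method unimplemented, so a forgotten method surfaces as a `TypeError` at construction time rather than later during ingestion.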

Configuration

RAGdoll uses Pydantic to manage its configuration. This allows for:

Data Validation: Automatic validation of configuration values.
Type Hints: Clear type definitions for configuration settings.
Default Values: Convenient default values for configuration options.

You can obtain the shared AppConfig, adjust its config, and pass it to the Ragdoll class.

from ragdoll import settings
from ragdoll.ragdoll import Ragdoll

# Grab the shared AppConfig (respects RAGDOLL_CONFIG_PATH when set)
app = settings.get_app()
config = app.config
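# Note: _config is a private attribute (leading underscore), so this access may change between releases.
config._config["vector_store"]["params"]["persist_directory"] = "./my_vectors"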

# Create Ragdoll with this configuration
ragdoll = Ragdoll(app_config=app)

Entity Extraction Controls

The entity_extraction section of default_config.yaml now exposes several knobs for graph-centric workflows:

relationship_parsing: choose the preferred output format (json, markdown, auto), optionally supply a custom parser class or schema, and pass parser-specific kwargs. This lets you tighten validation for LLM responses (e.g., point at your own Pydantic schema).
relationship_prompts: declare a default prompt template plus per-provider overrides (e.g., map "anthropic" to a Claude-specific prompt). The service picks the prompt whose provider matches the active BaseLLMCaller.
graph_retriever: enable creation of a graph retriever after entity extraction, select the backend (simple or neo4j/langchain_neo4j), and tune parameters like top_k or include_edges. When enabled, EntityExtractionService and IngestionPipeline expose a retriever you can plug into downstream chains.

Example excerpt:

entity_extraction:
  relationship_parsing:
    preferred_format: "markdown"
    schema: "my_project.schemas.RelationshipListV2"
  relationship_prompts:
    default: "relationship_extraction"
    providers:
      openai: "relationship_extraction_openai"
      anthropic: "relationship_extraction_claude"
  graph_retriever:
    enabled: true
    backend: "neo4j"
    top_k: 10

See docs/configuration.md for the full field reference.
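The per-provider prompt selection described for relationship_prompts can be sketched as a dictionary lookup. The `select_prompt` helper is hypothetical, and falling back to the default entry when no provider matches is an assumption about the behavior, not something confirmed here.

```python
from typing import Any, Dict


def select_prompt(prompt_config: Dict[str, Any], provider: str) -> str:
    """Return the provider-specific prompt name, or the default (assumed fallback)."""
    providers = prompt_config.get("providers", {})
    return providers.get(provider.lower(), prompt_config["default"])


# Mirrors the relationship_prompts section of the example excerpt.
relationship_prompts = {
    "default": "relationship_extraction",
    "providers": {
        "openai": "relationship_extraction_openai",
        "anthropic": "relationship_extraction_claude",
    },
}

claude_prompt = select_prompt(relationship_prompts, "anthropic")
fallback_prompt = select_prompt(relationship_prompts, "mistral")
```

Here "mistral" has no override, so the sketch returns the default template name.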

Contributing

Contributions to RAGdoll are welcome! To contribute:

Fork the repository.
Create a new branch for your changes.
Make your changes and write tests.
Submit a pull request.

License

RAGdoll is licensed under the MIT License.

Project details

Release history

2.2.3: Nov 26, 2025
2.2.2: Nov 25, 2025
2.2.1: Nov 25, 2025
2.2.0: Nov 25, 2025
2.1.0: Nov 14, 2025 (this version)
1.2.0: May 30, 2025

Download files

Source Distribution

python_ragdoll-2.1.0.tar.gz (88.5 kB view details)

Uploaded Nov 14, 2025 Source

Built Distribution

python_ragdoll-2.1.0-py3-none-any.whl (105.6 kB view details)

Uploaded Nov 14, 2025 Python 3

File details

Details for the file python_ragdoll-2.1.0.tar.gz.

File metadata

Download URL: python_ragdoll-2.1.0.tar.gz
Upload date: Nov 14, 2025
Size: 88.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for python_ragdoll-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`71f2a4f84b9731c9274d65eaa324b92dc1a9c330984befc6f075f7aa88c6f1bd`
MD5	`c3f3b666a4a5752f6b04edf2682c9e17`
BLAKE2b-256	`837f34e158109b7ba4bd9c17b37db6b4788b8c9909e1c8487920ed929b6dee89`

Provenance

The following attestation bundles were made for python_ragdoll-2.1.0.tar.gz:

Publisher: python-publish.yml on nsasto/RAGdoll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: python_ragdoll-2.1.0.tar.gz
- Subject digest: 71f2a4f84b9731c9274d65eaa324b92dc1a9c330984befc6f075f7aa88c6f1bd
- Sigstore transparency entry: 701199213
- Sigstore integration time: Nov 14, 2025
Source repository:
- Permalink: nsasto/RAGdoll@7991328f0f9507972004e8596d202602ba6ea613
- Branch / Tag: refs/heads/release
- Owner: https://github.com/nsasto
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@7991328f0f9507972004e8596d202602ba6ea613
- Trigger Event: push

File details

Details for the file python_ragdoll-2.1.0-py3-none-any.whl.

File metadata

Download URL: python_ragdoll-2.1.0-py3-none-any.whl
Upload date: Nov 14, 2025
Size: 105.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for python_ragdoll-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4886376ea3a96d9df27e57a7a9d1b8bee6c0d22c86fc0f8efc4bbd10dad8d784`
MD5	`d9d61d97d3046ebf39ae7a80eeebb07d`
BLAKE2b-256	`5b1300609e15d912277c9c813d3cf1fbf7e5b1dd4f5bc0fe20c568f8b2a2e0e1`

Provenance

The following attestation bundles were made for python_ragdoll-2.1.0-py3-none-any.whl:

Publisher: python-publish.yml on nsasto/RAGdoll

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: python_ragdoll-2.1.0-py3-none-any.whl
- Subject digest: 4886376ea3a96d9df27e57a7a9d1b8bee6c0d22c86fc0f8efc4bbd10dad8d784
- Sigstore transparency entry: 701199225
- Sigstore integration time: Nov 14, 2025
Source repository:
- Permalink: nsasto/RAGdoll@7991328f0f9507972004e8596d202602ba6ea613
- Branch / Tag: refs/heads/release
- Owner: https://github.com/nsasto
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@7991328f0f9507972004e8596d202602ba6ea613
- Trigger Event: push

python-ragdoll 2.1.0
