Enterprise-grade Python RAG stack — loaders, cleaning, chunking, embedding, and vector store in one composable library
RAGSTACK
Enterprise-grade Python RAG (Retrieval-Augmented Generation) toolkit.
RAGSTACK is a composable, open-source SDK for building document ingestion pipelines. It handles everything between a raw file and a vector database: loading, cleaning, chunking, embedding, and storing — each stage independently usable and swappable.
Built for engineers who want production-level RAG infrastructure without vendor lock-in.
What Problem Does It Solve?
When building AI applications that answer questions from documents (contracts, reports, manuals, etc.), you need a reliable pipeline to:
- Extract text from files (PDF, DOCX, CSV, etc.)
- Clean that text (strip noise, fix encoding, redact PII)
- Split it into chunks a model can process
- Convert chunks to vector embeddings
- Store and search those embeddings
Most tutorials wire this up with ad-hoc code. RAGSTACK gives you production-grade, tested building blocks for each stage — composable, type-safe, and pluggable.
Architecture Overview
┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌──────────────┐     ┌──────────────┐
│   LOADERS   │ --> │   CLEANERS   │ --> │   CHUNKERS    │ --> │  EMBEDDERS   │ --> │    STORES    │
│             │     │              │     │               │     │              │     │              │
│ PDF / DOCX  │     │ Strip HTML   │     │ Fixed-size    │     │ OpenAI API   │     │ pgvector     │
│ TXT / CSV   │     │ Fix encoding │     │ token-based   │     │ Local model  │     │ Qdrant       │
│ Excel / MD  │     │ Remove PII   │     │ with overlap  │     │ (HuggingFace)│     │ Chroma       │
└─────────────┘     └──────────────┘     └───────────────┘     └──────────────┘     └──────────────┘
       │                   │                     │                    │                    │
 DocumentInfo        DocumentBlock         DocumentChunk        float vectors        SearchResult
+ DocumentBlock        (cleaned)          (with metadata)      (1536 or 384d)        (with scores)
Every stage operates on well-defined Pydantic models. You can use the full pipeline or drop in at any stage.
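To make the hand-off between stages concrete, here is a minimal sketch of what the shared data shapes could look like. It uses stdlib dataclasses as a dependency-free stand-in for the library's Pydantic models. The attributes `document_id`, `block_index`, `text`, `chunk_id`, and `metadata["token_count"]` appear elsewhere in this README; every other field name (`name`, `file_type`, `size_bytes`, `created_at`) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentInfo:
    """File-level metadata (stand-in for the Pydantic model)."""
    document_id: str
    name: str          # hypothetical field name
    file_type: str     # hypothetical field name
    size_bytes: int    # hypothetical field name
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class DocumentBlock:
    """A block of raw or cleaned text produced by a loader."""
    document_id: str
    block_index: int
    text: str


@dataclass(frozen=True)
class DocumentChunk:
    """A chunk ready for embedding."""
    chunk_id: str
    document_id: str
    text: str
    metadata: dict = field(default_factory=dict)


# Each stage consumes the previous stage's shape and produces the next one.
info = DocumentInfo("doc-1", "report.pdf", "pdf", 1024)
block = DocumentBlock(info.document_id, 0, "Quarterly revenue grew.")
chunk = DocumentChunk("doc-1:0", block.document_id, block.text, {"token_count": 3})
print(chunk)
```

Because each shape carries `document_id`, a chunk can always be traced back to its source file, which is what makes dropping in at any stage workable.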
Folder Structure
ragstack1/
│
├── src/
│   └── ragstack_core/                  # The core SDK package
│       │
│       ├── models/                     # Shared data shapes (Pydantic)
│       │   ├── document_info.py        # File metadata: id, name, type, size, timestamps
│       │   ├── document_block.py       # Raw extracted text block from a loader
│       │   ├── document_chunk.py       # A chunk ready for embedding (has chunk_id, token_count)
│       │   ├── embedding_record.py     # A chunk paired with its vector
│       │   └── search_result.py        # A search hit with similarity score
│       │
│       ├── loaders/                    # File → DocumentInfo + DocumentBlocks
│       │   ├── base_loader.py          # Abstract base class (load_info, load_blocks)
│       │   ├── text_loader.py          # .txt files, N lines per block
│       │   ├── pdf_loader.py           # .pdf via pypdf, N pages per block
│       │   ├── csv_loader.py           # .csv rows serialised as "key: value | ..."
│       │   ├── excel_loader.py         # .xlsx multi-sheet via openpyxl
│       │   └── markdown_loader.py      # .md split by heading sections
│       │
│       ├── cleaners/                   # Text normalisation pipeline
│       │   ├── pipeline.py             # TextCleaningPipeline — ordered list of steps
│       │   ├── base_cleaner.py         # CleanerStep Protocol + CleanContext + CleaningResult
│       │   └── steps/                  # One file per cleaning concern
│       │       ├── whitespace_normalizer.py
│       │       ├── unicode_normalizer.py
│       │       ├── html_tag_stripper.py
│       │       ├── pdf_artifact_cleaner.py
│       │       ├── markdown_cleaner.py
│       │       ├── encoding_fixer.py
│       │       ├── control_char_cleaner.py
│       │       ├── typography_cleaner.py
│       │       ├── ligature_expander.py
│       │       └── pii_redactor.py
│       │
│       ├── chunkers/                   # DocumentBlock → DocumentChunks
│       │   ├── base_chunker.py         # Abstract base class
│       │   └── fixed_size_chunker.py   # Token-based chunking with overlap (tiktoken)
│       │
│       ├── embedders/                  # Text → float vectors
│       │   ├── base_embedder.py        # EmbedderProtocol definition
│       │   ├── factory.py              # create_embedder() — the public entry point
│       │   ├── openai_embedder.py      # OpenAI text-embedding-3-small (1536d)
│       │   └── local_embedder.py       # HuggingFace all-MiniLM-L6-v2 (384d), no API key
│       │
│       ├── stores/                     # Vector storage + similarity search
│       │   ├── base_store.py           # VectorStoreProtocol definition
│       │   ├── factory.py              # create_store() — the public entry point
│       │   ├── pgvector_store.py       # PostgreSQL + pgvector (production)
│       │   ├── qdrant_store.py         # Qdrant (supports :memory: for dev)
│       │   ├── chroma_store.py         # ChromaDB (supports :memory: for dev)
│       │   └── schema.sql              # Run once to set up pgvector table
│       │
│       └── exceptions.py               # EmbeddingError, StorageError, MissingDependencyError
│
├── src/tests/                          # Pytest test suite (mirrors src/ragstack_core/)
├── examples/                           # Runnable examples for every module
│   ├── loaders.py
│   ├── cleaning.py
│   ├── chunking.py
│   ├── embedding.py
│   ├── vector_store.py
│   └── full_pipeline.py                # End-to-end demo
│
├── main.py                             # Placeholder entry point
├── pyproject.toml                      # Package definition + optional dependencies
└── uv.lock                             # Locked dependency versions
Why this structure?
Each folder is a stage in the pipeline and a separate concern. You can:
- Use only the loaders (extract text from files, nothing else)
- Use loaders + cleaners (extract and normalise)
- Skip straight to chunking if you already have text
Nothing in loaders/ depends on stores/. Nothing in cleaners/ knows about embeddings. This separation lets you swap any stage without touching the others.
Integrated Packages
| Package | Purpose | When it's needed |
|---|---|---|
| pydantic | Data validation and type-safe models | Always (core models) |
| pypdf | PDF text extraction | Loading .pdf files |
| openpyxl | Excel file reading | Loading .xlsx files |
| tiktoken | Token counting (OpenAI's tokeniser) | All chunking |
| ftfy | Fix broken Unicode / encoding errors | Text cleaning |
| openai | Embedding API calls | EmbeddingProvider.OPENAI |
| sentence-transformers | Local HuggingFace embeddings | EmbeddingProvider.LOCAL |
| psycopg[pool] + pgvector | PostgreSQL vector storage | VectorStoreProvider.PGVECTOR |
| qdrant-client | Qdrant vector storage | VectorStoreProvider.QDRANT |
| chromadb | ChromaDB vector storage | VectorStoreProvider.CHROMA |
| pytest + pytest-asyncio | Testing | Development only |
Core packages (pydantic, pypdf, openpyxl, tiktoken, ftfy) are always installed.
Optional packages are installed only when you need them — see Installation below.
Installation
Prerequisites
- Python 3.12+
- uv (recommended) or pip
Step 1 — Clone the repo
git clone https://github.com/your-org/ragstack.git
cd ragstack
Step 2 — Install with uv (recommended)
# Core only (loaders, cleaners, chunkers)
uv sync
# Add OpenAI embeddings
uv add 'ragstack[openai]'
# Add local/offline embeddings (HuggingFace)
uv add 'ragstack[local]'
# Add a vector store
uv add 'ragstack[pgvector]' # PostgreSQL
uv add 'ragstack[qdrant]' # Qdrant
uv add 'ragstack[chroma]' # ChromaDB
# Install everything
uv add 'ragstack[all]'
Step 3 — Set environment variables
# Only needed if using OpenAI embeddings
export OPENAI_API_KEY="sk-..."
# Only needed if using pgvector
export TEST_POSTGRES_URL="postgresql://user:pass@localhost:5432/ragstack"
Step 4 — (pgvector only) Run the schema
psql "$TEST_POSTGRES_URL" -f src/ragstack_core/stores/schema.sql
Step 5 — Verify
uv run pytest src/tests/
Quick Start — Full Pipeline
from ragstack_core.loaders import PdfLoader
from ragstack_core.cleaners.pipeline import TextCleaningPipeline
from ragstack_core.chunkers.fixed_size_chunker import FixedSizeChunker, ModelType
from ragstack_core.embedders import create_embedder, EmbeddingProvider
from ragstack_core.stores import create_store, VectorStoreProvider
# 1. Load
loader = PdfLoader(pages_per_block=1)
info = loader.load_info("report.pdf")
blocks = list(loader.load_blocks("report.pdf", info))
# 2. Clean
pipeline = TextCleaningPipeline.for_pdf()
clean_blocks = pipeline.clean_blocks(blocks)
# 3. Chunk
chunker = FixedSizeChunker(ModelType.OPENAI_EMBEDDING) # 512 tokens, 50 overlap
chunks = [chunk for block in clean_blocks for chunk in chunker.chunk_block(block)]
# 4. Embed
embedder = create_embedder(EmbeddingProvider.OPENAI) # reads OPENAI_API_KEY
# 5. Store
store = create_store(VectorStoreProvider.CHROMA, connection_string=":memory:")
store.upsert(chunks, embedder)
# 6. Search
results = store.search_with_scores("What are the key findings?", embedder, top_k=5)
for chunk, score in results:
    print(f"[{score:.3f}] {chunk.text[:200]}")
# 7. Clean up
store.delete_by_document_id(info.document_id)
Using Each Module Independently
Loaders
from ragstack_core.loaders import TextLoader, PdfLoader, CsvLoader, MarkdownLoader
from ragstack_core.loaders.excel_loader import ExcelLoader
loader = TextLoader(lines_per_block=50)
info = loader.load_info("notes.txt")
for block in loader.load_blocks("notes.txt", info):
    print(block.block_index, block.text[:80])
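The folder structure describes CSV rows as being serialised into `"key: value | ..."` text. The sketch below shows one way that flattening could work using only the standard library. The function names `serialise_row` and `csv_to_lines` are hypothetical and are not part of the RAGSTACK API; the real CsvLoader may group rows into blocks or handle empty cells differently.

```python
import csv
import io


def serialise_row(row: dict[str, str]) -> str:
    """Join one CSV row into a single "key: value | key: value" line."""
    return " | ".join(f"{key}: {value}" for key, value in row.items())


def csv_to_lines(csv_text: str) -> list[str]:
    """Parse CSV text with a header row and serialise every data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [serialise_row(row) for row in reader]


sample = "name,role\nAda,engineer\nGrace,admiral\n"
for line in csv_to_lines(sample):
    print(line)
```

Keeping the column name next to each value means a chunk cut from the middle of a table still says what each value represents.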
Cleaners
from ragstack_core.cleaners.pipeline import TextCleaningPipeline
# Preset pipelines — pick the right one for your file type
pipeline = TextCleaningPipeline.default() # general purpose
pipeline = TextCleaningPipeline.for_pdf() # removes headers/footers, ligatures
pipeline = TextCleaningPipeline.for_markdown() # strips MD syntax
pipeline = TextCleaningPipeline.for_tabular() # CSV/Excel normalisation
pipeline = TextCleaningPipeline.with_pii_redaction(TextCleaningPipeline.default())
cleaned_block = pipeline.clean_block(block)
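Conceptually, a cleaning pipeline is an ordered list of steps where each step receives the previous step's output. The following is a minimal sketch of that idea with three stdlib-only steps. The `CleanStep` alias and `run_steps` helper are assumptions for illustration; the library's own CleanerStep protocol also involves CleanContext and CleaningResult, which are not modelled here.

```python
import re
import unicodedata
from typing import Callable

CleanStep = Callable[[str], str]  # hypothetical step signature for this sketch


def normalize_unicode(text: str) -> str:
    """Apply NFKC normalisation, which also expands ligatures such as U+FB01."""
    return unicodedata.normalize("NFKC", text)


def strip_html_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag (naive regex, not a parser)."""
    return re.sub(r"<[^>]+>", "", text)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def run_steps(text: str, steps: list[CleanStep]) -> str:
    """Apply each step in order; the output of one is the input of the next."""
    for step in steps:
        text = step(text)
    return text


steps = [normalize_unicode, strip_html_tags, normalize_whitespace]
print(run_steps("<p>The  \ufb01nal\n report</p>", steps))
```

Order matters: whitespace normalisation runs last here so that gaps left behind by removed tags are collapsed too.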
Chunkers
from ragstack_core.chunkers.fixed_size_chunker import FixedSizeChunker, ModelType
chunker = FixedSizeChunker(ModelType.CLAUDE) # 1024 tokens, 100 overlap
chunker = FixedSizeChunker(ModelType.OPENAI_EMBEDDING) # 512 tokens, 50 overlap
chunker = FixedSizeChunker(chunk_size=300, overlap=30) # manual
for chunk in chunker.chunk_block(block):
    print(chunk.chunk_id, chunk.metadata["token_count"])
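To illustrate what "fixed-size with overlap" means, here is a small sliding-window sketch. It operates on a pre-split token list (whitespace tokens stand in for tiktoken tokens) and is not the library's implementation; `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = "a b c d e f g h i j".split()
for window in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(window)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next. That repetition helps a sentence split at a boundary remain retrievable from either side.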
Embedders
from ragstack_core.embedders import create_embedder, EmbeddingProvider
# Cloud — requires OPENAI_API_KEY
embedder = create_embedder(EmbeddingProvider.OPENAI)
# Local — no API key, uses HuggingFace (runs on CPU or CUDA)
embedder = create_embedder(EmbeddingProvider.LOCAL, device="cpu")
vectors = embedder.embed(["sentence one", "sentence two"]) # list[list[float]]
print(embedder.model_name, embedder.dimensions)
Vector Stores
from ragstack_core.stores import create_store, VectorStoreProvider
# In-memory (dev/testing)
store = create_store(VectorStoreProvider.CHROMA, connection_string=":memory:")
store = create_store(VectorStoreProvider.QDRANT, connection_string=":memory:")
# Production
store = create_store(VectorStoreProvider.PGVECTOR, connection_string="postgresql://...")
store.upsert(chunks, embedder)
results = store.search("query text", embedder, top_k=5)
store.delete_by_document_id("doc-id")
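For intuition about what a similarity search does, the sketch below ranks stored vectors against a query vector by cosine similarity. This is an assumption-laden toy: the real backends (pgvector, Qdrant, Chroma) use their own indexes and may be configured with a different distance metric, and `top_k` here is a hypothetical helper, not the store API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], records: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k record ids most similar to the query, best first."""
    scored = [(rid, cosine_similarity(query, vec)) for rid, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


records = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
for chunk_id, score in top_k([1.0, 0.1], records, k=2):
    print(f"[{score:.3f}] {chunk_id}")
```

A brute-force scan like this is linear in the number of stored vectors, which is one reason to hand the job to a dedicated vector store once the corpus grows.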
Running the Examples
uv run python examples/loaders.py
uv run python examples/cleaning.py
uv run python examples/chunking.py
uv run python examples/full_pipeline.py
Running Tests
uv run pytest src/tests/ # all tests
uv run pytest src/tests/test_loader.py # one file
uv run pytest src/tests/test_loader.py::test_name # one test
The pgvector integration tests require the TEST_POSTGRES_URL environment variable. Without it, they are skipped automatically.
Key Design Decisions
Deterministic IDs. document_id is derived from source_path:file_size:mtime, so re-indexing an unchanged file produces the same ID and upserts stay idempotent; a file whose size or modification time changes gets a new ID. chunk_id is derived from document_id:chunk_index:text_hash.
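A minimal sketch of how such IDs could be derived is shown below. The key formats follow the description above, but the hash function (SHA-256), the full-length hex output, and the helper names `make_document_id` and `make_chunk_id` are all assumptions; the library may hash or truncate differently.

```python
import hashlib


def _digest(value: str) -> str:
    """Hex SHA-256 of a string (the choice of hash function is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def make_document_id(source_path: str, file_size: int, mtime: float) -> str:
    """Hash the "source_path:file_size:mtime" key into a stable ID."""
    return _digest(f"{source_path}:{file_size}:{mtime}")


def make_chunk_id(document_id: str, chunk_index: int, text: str) -> str:
    """Hash the "document_id:chunk_index:text_hash" key into a stable ID."""
    return _digest(f"{document_id}:{chunk_index}:{_digest(text)}")


doc_id = make_document_id("report.pdf", 43_800, 1_700_000_000.0)
# Same inputs give the same ID, so a repeated upsert overwrites instead of duplicating.
print(doc_id == make_document_id("report.pdf", 43_800, 1_700_000_000.0))
# A different file size gives a different ID.
print(doc_id == make_document_id("report.pdf", 43_801, 1_700_000_000.0))
```

Note that the document key uses file metadata rather than file contents, so touching a file without changing it alters `mtime` and therefore the ID.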
Protocol-based extensibility. EmbedderProtocol and VectorStoreProtocol are structural protocols. You can add a new embedder or store by implementing the protocol — no base class inheritance needed.
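The sketch below shows structural typing in practice. `EmbedderLike` is a hypothetical stand-in for EmbedderProtocol, shaped after the `embed`, `model_name`, and `dimensions` members used in the Embedders section; the real protocol may declare more members. `LengthEmbedder` is a toy class invented for this example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbedderLike(Protocol):
    """Hypothetical stand-in for EmbedderProtocol."""
    model_name: str
    dimensions: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class LengthEmbedder:
    """Toy embedder: no inheritance, it only matches the protocol's shape."""
    model_name = "toy-length"
    dimensions = 2

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Two features per text: character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def embed_all(embedder: EmbedderLike, texts: list[str]) -> list[list[float]]:
    """Accepts anything that structurally satisfies EmbedderLike."""
    return embedder.embed(texts)


embedder = LengthEmbedder()
print(isinstance(embedder, EmbedderLike))
print(embed_all(embedder, ["sentence one"]))
```

`LengthEmbedder` never imports or subclasses the protocol, yet a type checker accepts it wherever `EmbedderLike` is expected because the member names and signatures match.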
Optional dependencies. The core package is lightweight. Each optional integration (openai, pgvector, etc.) is a separate install group so you never pull in libraries you don't use.
Factory functions as the public API. Users call create_embedder() and create_store(), never the concrete classes. This hides implementation details and lets the internals change without breaking calling code.
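One common way to combine a factory with optional dependencies is to import the backend lazily and convert a missing module into a clear error. The sketch below illustrates that pattern with stdlib modules only. The `Provider` enum, `create_backend` function, and the local `MissingDependencyError` class are stand-ins written for this example; how create_embedder() and create_store() are actually implemented is not shown in this README.

```python
import importlib
from enum import Enum


class MissingDependencyError(ImportError):
    """Stand-in for the library's exception of the same name."""


class Provider(Enum):
    # Hypothetical enum; each value names the module the backend needs.
    JSON_BACKEND = "json"
    ABSENT_BACKEND = "a_module_that_is_not_installed"


def create_backend(provider: Provider):
    """Import the backend module lazily and fail with a clear message."""
    try:
        return importlib.import_module(provider.value)
    except ModuleNotFoundError as exc:
        raise MissingDependencyError(
            f"{provider.name} requires the '{provider.value}' package"
        ) from exc


backend = create_backend(Provider.JSON_BACKEND)
print(backend.dumps({"ok": True}))

try:
    create_backend(Provider.ABSENT_BACKEND)
except MissingDependencyError as err:
    print(err)
```

Because the import happens inside the factory, installing the core package never pulls in a backend the caller did not ask for, and a missing extra surfaces at the call site rather than at import time.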
Planned App Layer
app/
routes/ # Thin FastAPI handlers — no business logic
services/ # Orchestration and business logic
repositories/ # Database/storage access
models/ # Pydantic request/response schemas
config/ # Environment-based configuration
The app layer (FastAPI, REST API, MCP server) is not yet implemented. ragstack_core is intentionally decoupled from it.
License
MIT
File details
Details for the file ragstack_core-0.1.0.tar.gz.
File metadata
- Download URL: ragstack_core-0.1.0.tar.gz
- Upload date:
- Size: 43.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dec7948beca094022c842aba85c38414986b746d946bf1f411e132e163b38fb |
| MD5 | 90efa93feb3be40f36ec0833a5641f72 |
| BLAKE2b-256 | 7d90f7c5e1864b5d9db37860dcb8d7445db2c1ad465841e84dde031bc0697d48 |
File details
Details for the file ragstack_core-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ragstack_core-0.1.0-py3-none-any.whl
- Upload date:
- Size: 61.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 33889cc5c21d495af8d5253a6bccdb2cb6a1c1171cc0cdf89003220f05fc4507 |
| MD5 | cb26e4cbbd002b35376e23fe76051fc3 |
| BLAKE2b-256 | 586b408c7ed56a914510baafd31ad9a5da01d80d4e29d0654c284ca720343fd3 |