Distributed RAG QA-evaluation dataset generator for local files and Confluence.
Project description
🌊 Slurp
Cross-Document RAG Dataset Generator
A tool for generating RAG (Retrieval-Augmented Generation) evaluation datasets from local files or Confluence pages, using any OpenAI-compatible LLM.
This system uses a distributed architecture with Kafka for task queue management, separate scraper and worker processes, and supports both batch and streaming processing.
Architecture
- Scraper: Discovers and submits Confluence pages to a Kafka topic
- Worker: Processes pages from Kafka, generates QA pairs, and stores results in SQLite
- Kafka/Redpanda: Message queue for task distribution
- SQLite: Storage for processed documents and generated QA pairs
Features
- Pluggable Connectors: Ingest from local files or a Confluence space via
--connector - Any OpenAI-Compatible LLM: OpenRouter by default, or point
--generator-base-urlat any endpoint - Distributed Processing: Separate scraper and worker processes for scalability
- Batch Processing: Support for processing documents in batches for cross-document questions
- Date Filtering: Filter Confluence pages by last modification date (e.g., only pages updated in the last 6 months)
- Intelligent Question Generation: Creates questions of varying difficulty levels
- Multiple Languages: Support for German and English content
- HTML Processing: Robust HTML parsing for clean text extraction
- Fail-Fast Configuration: Validates env vars and CLI flags at startup with a clear, aggregated error
Installation
# Install core package (using uv)
uv sync
# Install with standalone script dependencies
uv sync --extra scripts
# Start Redpanda (Kafka-compatible broker); the compose file lives in infra/
docker compose -f infra/docker-compose.yaml up -d
Configuration
Copy .env.example to .env and fill in the values. A .env file in the working
directory is auto-loaded. Precedence is CLI flag > env var > .env file > default.
Legacy names (CONFLUENCE_*, KAFKA_*, SQLITE_*, OPENROUTER_API_KEY) still work
as aliases for the new SLURP_ vars.
Key variables:
SLURP_LLM_API_KEY="" # required when the generator is enabled
SLURP_CONNECTOR="local" # local | confluence
SLURP_CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
SLURP_CONFLUENCE_USERNAME="you@example.com"
SLURP_CONFLUENCE_API_KEY=""
SLURP_CONFLUENCE_SPACE="" # required when SLURP_CONNECTOR=confluence
SLURP_KAFKA_BOOTSTRAP_SERVERS="localhost:19092"
SLURP_KAFKA_TOPIC="tasks"
SLURP_SQLITE_DATABASE="./data.db"
See .env.example for the full list including generator, local connector, and
observability options.
LLM provider
The QA generator talks to any OpenAI-compatible endpoint, selected by
--generator-base-url and an API key. The key is read from SLURP_LLM_API_KEY
(legacy LLM_API_KEY and OPENROUTER_API_KEY still work as aliases).
# Default: OpenRouter
export SLURP_LLM_API_KEY="your-openrouter-key"
# Any other OpenAI-compatible endpoint
export SLURP_LLM_API_KEY="$(your-token-command)" # or a static key
python -m slurp worker \
--generator-base-url https://your-llm-endpoint.example/v1 \
--generator-model your-model
Usage
Connectors
Slurp ingests content through pluggable connectors, selected with
--connector. The default is local.
| Connector | Source | Requires |
|---|---|---|
local |
Files on disk (.md/.html/.txt) |
nothing (no Confluence creds) |
confluence |
A Confluence space | SLURP_CONFLUENCE_* credentials |
Both connectors still flow through Kafka and the LLM generator, so a broker
(infra/docker-compose.yaml) and SLURP_LLM_API_KEY are required either way.
Local files (default)
# Scrape a directory of documents into the queue
python -m slurp scraper --local-path ./docs
# Only markdown files
python -m slurp scraper --local-path ./docs --local-extensions .md
# A single file
python -m slurp scraper --local-path ./docs/intro.md
# Then run the worker (it dispatches on each task's connector automatically)
python -m slurp worker --generator-batch-size 1
Live dataset view
# Serve an auto-refreshing HTML view of the generated QA pairs (default :8077)
python -m slurp render --open --sqlite-database ./data.db
The page polls the SQLite generations table, so QA pairs appear as the worker
produces them.
Slurp skill (for Claude Code)
python -m slurp skill # print the bundled SKILL.md
python -m slurp skill --install # write it to ./.claude/skills/slurp/SKILL.md
Distributed System (Production Mode)
Running the Scraper
The scraper discovers Confluence pages and submits them to Kafka
(note the explicit --connector confluence, since local is the default):
# Scrape up to 50 pages from a Confluence space
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-max-pages 50
# Filter by recent pages (last 3 months)
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-months-back 3
# Skip the first 100 pages
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-skip 100
# Run multiple scraper workers
python -m slurp scraper --workers 2 --connector confluence --confluence-space RESEARCH
Running the Worker
The worker processes pages from Kafka and generates QA pairs:
# Process pages individually
python -m slurp worker --generator-batch-size 1
# Process pages in batches of 4 for cross-document questions
python -m slurp worker --generator-batch-size 4
# Specify a different model
python -m slurp worker --generator-model "anthropic/claude-3-sonnet"
# Run multiple worker processes
python -m slurp worker --workers 4 --generator-language de
Command Line Options
Scraper Options
--confluence-space: Confluence space key to scrape--confluence-max-pages: Maximum number of pages to fetch (default: 50)--confluence-months-back: Only process pages modified within last N months (0 = no filter, default: 0)--confluence-skip: Number of pages to skip (default: 0)--confluence-concurrency: Number of concurrent requests (default: 4)--confluence-page-batch-size: Number of pages to fetch per batch (default: 50)
Worker Options
--generator-batch-size: Number of documents to process together (default: 1)--generator-model: LLM model to use (default: "google/gemini-2.5-flash-preview-05-20")--generator-language: Language for generated questions (default: "de")--generator-difficulty-ratio: Question difficulty (easy/medium/hard/mixed/balanced)--generator-concurrency: Number of concurrent LLM requests (default: 5)
Data Storage
The system uses SQLite for storing processed documents and generated QA pairs:
task_results: Stores processed Confluence pagesgenerations: Stores generated QA pairs with references to source pages
Troubleshooting
Common Issues
- Kafka Connection Errors: Ensure Redpanda is running (
docker compose -f infra/docker-compose.yaml ps) - Invalid Configuration: Slurp validates config at startup and prints exactly what is missing or out of range — read the error and see
.env.example - Database Errors: Verify SQLite database permissions and path
- LLM API Errors: Check
SLURP_LLM_API_KEYand your provider's quota; for non-OpenRouter endpoints also set--generator-base-url - HTML Parsing Issues: The HTML parser has been optimized for Confluence pages
System Components
- Scraper: Discovers and submits Confluence pages to Kafka
- Worker: Processes pages from Kafka and generates QA pairs
- LLMGenerator: Generates questions and answers using LLMs
- HTMLParser: Cleans and processes HTML content
- SqlitePersistence: Stores results in SQLite database
- KafkaQueueSubmitter: Submits tasks to Kafka
- KafkaConsumer: Consumes tasks from Kafka
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file slurp_rag-0.1.0.tar.gz.
File metadata
- Download URL: slurp_rag-0.1.0.tar.gz
- Upload date:
- Size: 387.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f5aadfd28929d0960d748b14f33d57ef41139452b71da84e81a65121ae02becd
|
|
| MD5 |
a9f98384ac3f668aa2614a48b365fadc
|
|
| BLAKE2b-256 |
b0689babd3129ca70dbfece397e4cee19e868efdd2873e01b9da8ca9e66dff8c
|
Provenance
The following attestation bundles were made for slurp_rag-0.1.0.tar.gz:
Publisher:
release.yml on 4thel00z/slurp
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
slurp_rag-0.1.0.tar.gz -
Subject digest:
f5aadfd28929d0960d748b14f33d57ef41139452b71da84e81a65121ae02becd - Sigstore transparency entry: 1807713811
- Sigstore integration time:
-
Permalink:
4thel00z/slurp@d8b523aacbf2fc7c057823a54000a14ba1eac22b -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/4thel00z
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d8b523aacbf2fc7c057823a54000a14ba1eac22b -
Trigger Event:
push
-
Statement type:
File details
Details for the file slurp_rag-0.1.0-py3-none-any.whl.
File metadata
- Download URL: slurp_rag-0.1.0-py3-none-any.whl
- Upload date:
- Size: 50.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
753d466831d88f67970c91920971efad7a57efb3baef292527bc99d0bda95721
|
|
| MD5 |
7cec2094160b11dd418e28cdd317106f
|
|
| BLAKE2b-256 |
a23c050c013218d6a6a658ee454dbf0ab27a800fdc00bf176c6437294a8de4bd
|
Provenance
The following attestation bundles were made for slurp_rag-0.1.0-py3-none-any.whl:
Publisher:
release.yml on 4thel00z/slurp
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
slurp_rag-0.1.0-py3-none-any.whl -
Subject digest:
753d466831d88f67970c91920971efad7a57efb3baef292527bc99d0bda95721 - Sigstore transparency entry: 1807713856
- Sigstore integration time:
-
Permalink:
4thel00z/slurp@d8b523aacbf2fc7c057823a54000a14ba1eac22b -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/4thel00z
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d8b523aacbf2fc7c057823a54000a14ba1eac22b -
Trigger Event:
push
-
Statement type: