Distributed RAG QA-evaluation dataset generator for local files and Confluence.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ransomware

These details have not been verified by PyPI

Project description

🌊 Slurp

Cross-Document RAG Dataset Generator

A tool for generating RAG (Retrieval-Augmented Generation) evaluation datasets from local files or Confluence pages, using any OpenAI-compatible LLM.

This system uses a distributed architecture with Kafka for task queue management, separate scraper and worker processes, and supports both batch and streaming processing.

Architecture

Scraper: Discovers and submits Confluence pages to a Kafka topic
Worker: Processes pages from Kafka, generates QA pairs, and stores results in SQLite
Kafka/Redpanda: Message queue for task distribution
SQLite: Storage for processed documents and generated QA pairs

Features

Pluggable Connectors: Ingest from local files or a Confluence space via --connector
Any OpenAI-Compatible LLM: OpenRouter by default, or point --generator-base-url at any endpoint
Distributed Processing: Separate scraper and worker processes for scalability
Batch Processing: Support for processing documents in batches for cross-document questions
Date Filtering: Filter Confluence pages by last modification date (e.g., only pages updated in the last 6 months)
Intelligent Question Generation: Creates questions of varying difficulty levels
Multiple Languages: Support for German and English content
HTML Processing: Robust HTML parsing for clean text extraction
Fail-Fast Configuration: Validates env vars and CLI flags at startup with a clear, aggregated error

Installation

# Install core package (using uv)
uv sync

# Install with standalone script dependencies
uv sync --extra scripts

# Start Redpanda (Kafka-compatible broker); the compose file lives in infra/
docker compose -f infra/docker-compose.yaml up -d

Configuration

Copy .env.example to .env and fill in the values. A .env file in the working directory is auto-loaded. Precedence is CLI flag > env var > .env file > default. Legacy names (CONFLUENCE_*, KAFKA_*, SQLITE_*, OPENROUTER_API_KEY) still work as aliases for the new SLURP_ vars.

Key variables:

SLURP_LLM_API_KEY=""            # required when the generator is enabled
SLURP_CONNECTOR="local"         # local | confluence
SLURP_CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
SLURP_CONFLUENCE_USERNAME="you@example.com"
SLURP_CONFLUENCE_API_KEY=""
SLURP_CONFLUENCE_SPACE=""       # required when SLURP_CONNECTOR=confluence
SLURP_KAFKA_BOOTSTRAP_SERVERS="localhost:19092"
SLURP_KAFKA_TOPIC="tasks"
SLURP_SQLITE_DATABASE="./data.db"

See .env.example for the full list including generator, local connector, and observability options.

LLM provider

The QA generator talks to any OpenAI-compatible endpoint, selected by --generator-base-url and an API key. The key is read from SLURP_LLM_API_KEY (legacy LLM_API_KEY and OPENROUTER_API_KEY still work as aliases).

# Default: OpenRouter
export SLURP_LLM_API_KEY="your-openrouter-key"

# Any other OpenAI-compatible endpoint
export SLURP_LLM_API_KEY="$(your-token-command)"   # or a static key
python -m slurp worker \
  --generator-base-url https://your-llm-endpoint.example/v1 \
  --generator-model your-model

Usage

Connectors

Slurp ingests content through pluggable connectors, selected with --connector. The default is local.

Connector	Source	Requires
`local`	Files on disk (`.md/.html/.txt`)	nothing (no Confluence creds)
`confluence`	A Confluence space	`SLURP_CONFLUENCE_*` credentials

Both connectors still flow through Kafka and the LLM generator, so a broker (infra/docker-compose.yaml) and SLURP_LLM_API_KEY are required either way.

Local files (default)

# Scrape a directory of documents into the queue
python -m slurp scraper --local-path ./docs

# Only markdown files
python -m slurp scraper --local-path ./docs --local-extensions .md

# A single file
python -m slurp scraper --local-path ./docs/intro.md

# Then run the worker (it dispatches on each task's connector automatically)
python -m slurp worker --generator-batch-size 1

Live dataset view

# Serve an auto-refreshing HTML view of the generated QA pairs (default :8077)
python -m slurp render --open --sqlite-database ./data.db

The page polls the SQLite generations table, so QA pairs appear as the worker produces them.

Slurp skill (for Claude Code)

python -m slurp skill            # print the bundled SKILL.md
python -m slurp skill --install  # write it to ./.claude/skills/slurp/SKILL.md

Distributed System (Production Mode)

Running the Scraper

The scraper discovers Confluence pages and submits them to Kafka (note the explicit --connector confluence, since local is the default):

# Scrape up to 50 pages from a Confluence space
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-max-pages 50

# Filter by recent pages (last 3 months)
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-months-back 3

# Skip the first 100 pages
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-skip 100

# Run multiple scraper workers
python -m slurp scraper --workers 2 --connector confluence --confluence-space RESEARCH

Running the Worker

The worker processes pages from Kafka and generates QA pairs:

# Process pages individually
python -m slurp worker --generator-batch-size 1

# Process pages in batches of 4 for cross-document questions
python -m slurp worker --generator-batch-size 4

# Specify a different model
python -m slurp worker --generator-model "anthropic/claude-3-sonnet"

# Run multiple worker processes
python -m slurp worker --workers 4 --generator-language de

Command Line Options

Scraper Options

--confluence-space: Confluence space key to scrape
--confluence-max-pages: Maximum number of pages to fetch (default: 50)
--confluence-months-back: Only process pages modified within last N months (0 = no filter, default: 0)
--confluence-skip: Number of pages to skip (default: 0)
--confluence-concurrency: Number of concurrent requests (default: 4)
--confluence-page-batch-size: Number of pages to fetch per batch (default: 50)

Worker Options

--generator-batch-size: Number of documents to process together (default: 1)
--generator-model: LLM model to use (default: "google/gemini-2.5-flash-preview-05-20")
--generator-language: Language for generated questions (default: "de")
--generator-difficulty-ratio: Question difficulty (easy/medium/hard/mixed/balanced)
--generator-concurrency: Number of concurrent LLM requests (default: 5)

Data Storage

The system uses SQLite for storing processed documents and generated QA pairs:

task_results: Stores processed Confluence pages
generations: Stores generated QA pairs with references to source pages

Troubleshooting

Common Issues

Kafka Connection Errors: Ensure Redpanda is running (docker compose -f infra/docker-compose.yaml ps)
Invalid Configuration: Slurp validates config at startup and prints exactly what is missing or out of range — read the error and see .env.example
Database Errors: Verify SQLite database permissions and path
LLM API Errors: Check SLURP_LLM_API_KEY and your provider's quota; for non-OpenRouter endpoints also set --generator-base-url
HTML Parsing Issues: The HTML parser has been optimized for Confluence pages

System Components

Scraper: Discovers and submits Confluence pages to Kafka
Worker: Processes pages from Kafka and generates QA pairs
LLMGenerator: Generates questions and answers using LLMs
HTMLParser: Cleans and processes HTML content
SqlitePersistence: Stores results in SQLite database
KafkaQueueSubmitter: Submits tasks to Kafka
KafkaConsumer: Consumes tasks from Kafka

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ransomware

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

Jun 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

slurp_rag-0.1.0.tar.gz (387.3 kB view details)

Uploaded Jun 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

slurp_rag-0.1.0-py3-none-any.whl (50.1 kB view details)

Uploaded Jun 13, 2026 Python 3

File details

Details for the file slurp_rag-0.1.0.tar.gz.

File metadata

Download URL: slurp_rag-0.1.0.tar.gz
Upload date: Jun 13, 2026
Size: 387.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for slurp_rag-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f5aadfd28929d0960d748b14f33d57ef41139452b71da84e81a65121ae02becd`
MD5	`a9f98384ac3f668aa2614a48b365fadc`
BLAKE2b-256	`b0689babd3129ca70dbfece397e4cee19e868efdd2873e01b9da8ca9e66dff8c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for slurp_rag-0.1.0.tar.gz:

Publisher: release.yml on 4thel00z/slurp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurp_rag-0.1.0.tar.gz
- Subject digest: f5aadfd28929d0960d748b14f33d57ef41139452b71da84e81a65121ae02becd
- Sigstore transparency entry: 1807713811
- Sigstore integration time: Jun 13, 2026
Source repository:
- Permalink: 4thel00z/slurp@d8b523aacbf2fc7c057823a54000a14ba1eac22b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/4thel00z
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d8b523aacbf2fc7c057823a54000a14ba1eac22b
- Trigger Event: push

File details

Details for the file slurp_rag-0.1.0-py3-none-any.whl.

File metadata

Download URL: slurp_rag-0.1.0-py3-none-any.whl
Upload date: Jun 13, 2026
Size: 50.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for slurp_rag-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`753d466831d88f67970c91920971efad7a57efb3baef292527bc99d0bda95721`
MD5	`7cec2094160b11dd418e28cdd317106f`
BLAKE2b-256	`a23c050c013218d6a6a658ee454dbf0ab27a800fdc00bf176c6437294a8de4bd`

See more details on using hashes here.

Provenance

The following attestation bundles were made for slurp_rag-0.1.0-py3-none-any.whl:

Publisher: release.yml on 4thel00z/slurp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slurp_rag-0.1.0-py3-none-any.whl
- Subject digest: 753d466831d88f67970c91920971efad7a57efb3baef292527bc99d0bda95721
- Sigstore transparency entry: 1807713856
- Sigstore integration time: Jun 13, 2026
Source repository:
- Permalink: 4thel00z/slurp@d8b523aacbf2fc7c057823a54000a14ba1eac22b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/4thel00z
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d8b523aacbf2fc7c057823a54000a14ba1eac22b
- Trigger Event: push

slurp-rag 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

🌊 Slurp

Cross-Document RAG Dataset Generator

Architecture

Features

Installation

Configuration

LLM provider

Usage

Connectors

Local files (default)

Live dataset view

Slurp skill (for Claude Code)

Distributed System (Production Mode)

Running the Scraper

Running the Worker

Command Line Options

Scraper Options

Worker Options

Data Storage

Troubleshooting

Common Issues

System Components

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance