Unified data extraction and preprocessing toolkit for Retrieval-Augmented Generation (RAG) pipelines.

ragready

Unified text + metadata extractors for Retrieval-Augmented Generation (RAG) pipelines
Version 0.1.0 · MIT-licensed

✨ Why ragready?

A high-quality RAG knowledge base starts with clean, consistent documents—no matter where they live.
ragready streams Markdown-normalised content from:

Source type	Iterator	Notes
GitHub / GitLab repos	`git_repo_iter`	Auth tokens supported
Atlassian Confluence	`confluence_iter`	Cloud & Data Center
Public websites	`website_iter`	BFS crawl within domain
Local files & folders	`local_iter`	PDFs, DOCX, PPTX, XLSX, CSV, images (OCR), audio, ZIPs, EPUB…

Each iterator yields a single dataclass—DocumentRecord—so downstream code never worries about source-specific quirks.
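
The exact fields of DocumentRecord are not listed in this README. The snippets in later sections reference source, filename, title, author, url, and content, plus a to_dict() method. As a rough mental model only, a record of that shape could look like the sketch below. The class name DocumentRecordSketch, the field types, and the defaults are assumptions, not the library's definition.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DocumentRecordSketch:
    """Hypothetical stand-in for ragready's DocumentRecord (fields assumed)."""

    source: str                     # where the document came from (values assumed)
    filename: str
    content: str                    # Markdown-normalised text
    url: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None

    def to_dict(self) -> dict:
        # Flatten the dataclass into a plain dict, ready for pandas or JSON.
        return asdict(self)


record = DocumentRecordSketch(
    source="local",
    filename="notes.md",
    content="# Notes\n\nHello.",
)
row = record.to_dict()
print(sorted(row))
```

Because every source maps onto one flat shape like this, downstream code can build a DataFrame or write JSON lines without branching on the source type.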

🚀 Installation

pip install ragready

Requires Python ≥ 3.9 and a working git executable for repo extraction. The package bundles markitdown[all], so DOCX/PDF/PPTX/XLSX and OCR support work out-of-the-box.
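
The two prerequisites above can be checked before running an extraction. The helper below is a minimal sketch using only the standard library; check_prerequisites is a hypothetical name and not part of ragready.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    if shutil.which("git") is None:
        problems.append("git executable not found on PATH (needed for repo extraction)")
    return problems


problems = check_prerequisites()
for problem in problems:
    print("WARNING:", problem)
```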

⚡ Quick start

import pandas as pd
import ragready as rr

# Crawl python.org two links deep
records = rr.website_iter(["https://www.python.org"], crawl_depth=2)

# Collect into a DataFrame (optional)
df = pd.DataFrame(r.to_dict() for r in records)
print(df[["filename", "content"]].head())

🍱 Example snippets

1. Local files

import ragready as rr
import pandas as pd

# Optional LLM client (leave None for pure local parsing)
client = None
llm_model = None

# Run the iterator and capture records
docs = [
    rec.to_dict()
    for rec in rr.local_iter(
        ["./data"],
        llm_client=client,
        llm_model=llm_model,
    )
]

# Convert to a DataFrame (optional)
df = pd.DataFrame(docs)
print(df.head())               # quick peek

2. Git repo with private access

# Imports
import os
import pandas as pd
import ragready as rr

# Optional token for private repos (if unset this is None, which is fine for public repos)
token = os.getenv("GITHUB_TOKEN")   # note: the loop below passes this same token for every URL

# Pick the repos you want to scan
urls = [
    "https://github.com/pandas-dev/pandas.git",
    "https://gitlab.com/your-group/your-project.git",
]

# Run the iterator(s) and collect to dicts
git_records = [
    rec.to_dict()
    for url in urls
    for rec in rr.git_repo_iter(url, token=token)
]

# Build a DataFrame (optional)
git_df = pd.DataFrame(git_records)

# Inspect or save
print("\nGit repos preview:")
print(git_df[["source", "filename", "author", "url"]].head()) # quick peek
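
The snippet above passes one GitHub token for every URL, including the GitLab one. When the list mixes hosts, the token can be chosen per URL from the GITHUB_TOKEN and GITLAB_TOKEN variables. The sketch below shows one way to do that with the standard library; token_for and TOKEN_ENV_BY_HOST are hypothetical names, not part of ragready.

```python
import os
from urllib.parse import urlparse

# Map repo hosts to the environment variable holding their token.
TOKEN_ENV_BY_HOST = {
    "github.com": "GITHUB_TOKEN",
    "gitlab.com": "GITLAB_TOKEN",
}


def token_for(url, env=None):
    """Return the token for the URL's host, or None for unknown hosts or unset variables."""
    env = os.environ if env is None else env
    host = (urlparse(url).hostname or "").lower()
    var_name = TOKEN_ENV_BY_HOST.get(host)
    return env.get(var_name) if var_name else None


# A fake environment keeps the example deterministic.
fake_env = {"GITHUB_TOKEN": "gh-example", "GITLAB_TOKEN": "gl-example"}
print(token_for("https://github.com/pandas-dev/pandas.git", fake_env))
print(token_for("https://gitlab.com/your-group/your-project.git", fake_env))
```

The returned value can then be passed as token= in the git_repo_iter call shown above.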

3. Confluence (plain-text)

import os
import pandas as pd
import ragready as rr

# Stream the pages
conf_rows = [
    rec.to_dict()
    for rec in rr.confluence_iter(
        base_url=os.getenv("CONF_URL"),       # e.g. "https://your-domain.atlassian.net/wiki"
        username=os.getenv("CONF_USER"),      # your Atlassian email / user
        api_token=os.getenv("CONFLUENCE_TOKEN"),
        space_keys=["ENG", "DS"],             # any number of spaces
        plain_text=True,                      # strip HTML tags
        limit=500                             # max pages
    )
]

# Build a DataFrame
conf_df = pd.DataFrame(conf_rows)

# Preview key columns
print("\nConfluence preview:")
print(conf_df[["filename", "author", "url"]].head()) # quick peek
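
The plain_text=True flag is described above as stripping HTML tags. How ragready does this internally is not documented here. As an illustration of the general idea only, tag stripping can be sketched with the standard library's html.parser; strip_html and _TextExtractor are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(html):
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<h1>Runbook</h1><p>Restart the <b>API</b> service.</p>"))
# → Runbook Restart the API service.
```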

4. Website

import pandas as pd
import ragready as rr

# Website crawl → DataFrame preview
web_rows = [
    rec.to_dict()
    for rec in rr.website_iter(
        roots=[
            "https://www.python.org",      # add more starting URLs as needed
            # "https://docs.rust-lang.org",
        ],
        crawl_depth=1                      # how deep to follow links (None = unlimited)
    )
]

web_df = pd.DataFrame(web_rows)

print("\nWebsite preview:")
print(web_df[["source", "title", "url"]].head())  # quick peek
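
For readers unfamiliar with the crawl strategy, the sketch below shows a breadth-first traversal that stays on the starting host and stops expanding links at a depth limit. It runs on a toy in-memory link graph rather than real HTTP requests. The function name, the get_links callback, and the exact depth semantics are assumptions for illustration and may differ from how website_iter interprets crawl_depth.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_within_domain(root, get_links, max_depth=None):
    """Yield (url, depth) breadth-first, staying on the root's host.

    get_links is a caller-supplied function mapping a URL to the URLs it links to.
    max_depth=None means no depth limit.
    """
    root_host = urlparse(root).hostname
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen and urlparse(link).hostname == root_host:
                seen.add(link)
                queue.append((link, depth + 1))


# Toy link graph standing in for real page fetches.
LINKS = {
    "https://example.org/": ["https://example.org/a", "https://other.net/x"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/"],
    "https://example.org/b": [],
}


def toy_links(url):
    return LINKS.get(url, [])


crawled = list(bfs_within_domain("https://example.org/", toy_links, max_depth=1))
print(crawled)
# → [('https://example.org/', 0), ('https://example.org/a', 1)]
```

The seen set prevents revisiting pages that link back to each other, and the host comparison keeps the crawl from leaving the starting domain.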

🛠️ Public API

Symbol	Description
`DocumentRecord`	Normalised dataclass each iterator yields
`git_repo_iter`	Stream files from GitHub / GitLab repos
`confluence_iter`	Stream pages from Confluence spaces
`website_iter`	Breadth-first crawl within a domain
`local_iter`	Recursively convert local files via MarkItDown & OCR

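All iterators are lazy streams: records are produced one at a time, so large corpora can be processed without holding every document in memory, provided the consumer does not collect them all into a list first.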
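
The example snippets collect every record into a list for convenience, which gives up the memory benefit of lazy iteration. To keep memory bounded, a stream can be consumed in fixed-size batches instead. The sketch below uses a stand-in generator, fake_records, in place of a real ragready iterator; both helper names are hypothetical.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materialising the whole stream."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_records(n):
    """Stand-in generator; in practice this would be one of the ragready iterators."""
    for i in range(n):
        yield {"filename": f"doc_{i}.md", "content": f"text {i}"}


batch_sizes = [len(batch) for batch in batched(fake_records(10), 4)]
print(batch_sizes)
# → [4, 4, 2]
```

Each batch can be embedded or written to a store and then discarded, so at most one batch of records is held in memory at a time.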

🔑 Environment variables

Purpose	Variable(s)
GitHub	`GITHUB_TOKEN`
GitLab	`GITLAB_TOKEN`
Confluence	`CONF_USER`, `CONFLUENCE_TOKEN`, `CONF_URL`
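
A missing variable is easier to diagnose if it is caught before the extraction starts. The sketch below gathers the three Confluence variables from the table into the keyword arguments used in the Confluence snippet (base_url, username, api_token) and raises if any are unset. The helper names are hypothetical and not part of ragready.

```python
import os

CONFLUENCE_VARS = ("CONF_URL", "CONF_USER", "CONFLUENCE_TOKEN")


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def confluence_kwargs(env=None):
    """Build connection keyword arguments, failing fast on missing variables."""
    env = os.environ if env is None else env
    missing = missing_vars(CONFLUENCE_VARS, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["CONF_URL"],
        "username": env["CONF_USER"],
        "api_token": env["CONFLUENCE_TOKEN"],
    }


# A fake environment keeps the example deterministic.
fake_env = {
    "CONF_URL": "https://your-domain.atlassian.net/wiki",
    "CONF_USER": "you@example.com",
    "CONFLUENCE_TOKEN": "example-token",
}
print(sorted(confluence_kwargs(fake_env)))
# → ['api_token', 'base_url', 'username']
```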

📄 License

ragready is released under the MIT license.

🤝 Contributing

1. Fork & branch off main
2. pip install -e .[dev]
3. Run pytest + ruff check before PRs

All contributions welcome — new extractors, bug fixes, or docs!

🙏 Acknowledgements

Built on the shoulders of:

MarkItDown – universal document-to-Markdown converter
GitPython, BeautifulSoup 4, pdfplumber, python-pptx, and the wider open-source community.

Happy extracting — your RAG pipeline will thank you! 🦾

Release history

Version	Released
0.2.1	Jul 3, 2025
0.2.0	Jul 2, 2025
0.1.3	Jul 2, 2025
0.1.2	Jul 2, 2025
0.1.0 (this version)	Jul 1, 2025

File details

Details for the file ragready-0.1.0.tar.gz.

File metadata

Download URL: ragready-0.1.0.tar.gz
Upload date: Jul 1, 2025
Size: 10.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for ragready-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`52bfb58df98d8aca94370298c6c8d30c40390098b52818060ff2c7289119855a`
MD5	`1a47f7c957b0854900d5d006a57cf402`
BLAKE2b-256	`f1556e65b3f64d547ee23dd6c3713abe11246f587cccae2cd85e24643d6205b7`

File details

Details for the file ragready-0.1.0-py3-none-any.whl.

File metadata

Download URL: ragready-0.1.0-py3-none-any.whl
Upload date: Jul 1, 2025
Size: 10.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for ragready-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a2a6f1f3b30aa20efc0dc1f62c86ea97a6131717db52fff37fc1e39ffbd9d6f6`
MD5	`231d654f3a2d19d90f061bd653bd2c29`
BLAKE2b-256	`b8b7aeafd30128ff847f5d26b227a6aa282f0f971a36ed8ea09cd6fc83c5a752`
