
Unified data extraction and preprocessing toolkit for Retrieval-Augmented Generation (RAG) pipelines.


ragready

Unified text + metadata extractors for Retrieval-Augmented Generation (RAG) pipelines
Version 0.1.2 · MIT-licensed



✨ Why ragready?

A high-quality RAG knowledge base starts with clean, consistent documents—no matter where they live.
ragready streams Markdown-normalised content from:

| Source type | Iterator | Notes |
| --- | --- | --- |
| GitHub / GitLab repos | git_repo_iter | Auth tokens supported |
| Atlassian Confluence | confluence_iter | Cloud & Data Center |
| Public websites | website_iter | BFS crawl within domain |
| Local files & folders | local_iter | PDFs, DOCX, PPTX, XLSX, CSV, images (OCR), audio, ZIPs, EPUB… |

Every iterator yields records of the same dataclass type, DocumentRecord, so downstream code never has to handle source-specific quirks.
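
To make the shared record shape concrete, here is a minimal sketch of what such a dataclass could look like. It is not ragready's actual definition: the class name DocumentRecordSketch is hypothetical, and the field set is an assumption inferred from the columns the snippets in this README select (source, filename, content, title, author, url) plus the to_dict() method they call.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DocumentRecordSketch:
    """Hypothetical stand-in for DocumentRecord; the real field set may differ."""

    source: str                   # assumed: where the document came from
    filename: str                 # assumed: file or page name
    content: str                  # assumed: Markdown-normalised text
    title: Optional[str] = None   # assumed optional metadata
    author: Optional[str] = None
    url: Optional[str] = None

    def to_dict(self) -> dict:
        # Plain dict, convenient for pandas.DataFrame or JSON serialisation.
        return asdict(self)


record = DocumentRecordSketch(
    source="website",
    filename="index.md",
    content="# Welcome",
    url="https://www.python.org",
)
row = record.to_dict()
```

Because every source would produce the same keys, rows from different iterators can be concatenated into one table without per-source branching.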


🚀 Installation

pip install ragready

Requires Python ≥ 3.9 and a working git executable for repo extraction. The package bundles markitdown[all], so DOCX/PDF/PPTX/XLSX and OCR support work out-of-the-box.


⚡ Quick start

import pandas as pd
import ragready as rr

# Crawl python.org two links deep
records = rr.website_iter(["https://www.python.org"], crawl_depth=2)

# Collect into a DataFrame (optional)
df = pd.DataFrame(r.to_dict() for r in records)
print(df[["filename", "content"]].head())

🍱 Example snippets

1. Local files

import ragready as rr
import pandas as pd

# Optional LLM client (leave None for pure local parsing)
client = None
llm_model = None               

# Run the iterator and capture records
docs = [
    rec.to_dict()              
    for rec in rr.local_iter(
        ["./data"],           
        llm_client=client,
        llm_model=llm_model
    )
]

# Convert to a DataFrame (optional)
df = pd.DataFrame(docs)
print(df.head())               # quick peek

2. Git repo with private access

# Imports
import os
import pandas as pd
import ragready as rr

# Optional token for private repos
token = os.getenv("GITHUB_TOKEN")   # set in your shell, or leave None for public

# Pick the repos you want to scan
urls = [
    "https://github.com/pandas-dev/pandas.git",
    "https://gitlab.com/your-group/your-project.git",
]

# Run the iterator(s) and collect to dicts
git_records = [
    rec.to_dict()
    for url in urls
    for rec in rr.git_repo_iter(url, token=token)
]

# Build a DataFrame (optional)
git_df = pd.DataFrame(git_records)

# Inspect or save
print("\nGit repos preview:")
print(git_df[["source", "filename", "author", "url"]].head()) # quick peek
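
The snippet above passes the same GITHUB_TOKEN to every URL, including the GitLab one, while the environment variable table lists a separate GITLAB_TOKEN. One possible way to pick a token per host is sketched below. The helper token_for and the TOKEN_ENV_BY_HOST mapping are hypothetical names introduced for illustration and are not part of ragready.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Assumed mapping from repo host to the environment variable holding its token.
TOKEN_ENV_BY_HOST = {
    "github.com": "GITHUB_TOKEN",
    "gitlab.com": "GITLAB_TOKEN",
}


def token_for(url: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the token matching the URL's host, or None for unknown hosts."""
    env = os.environ if env is None else env
    host = urlparse(url).hostname or ""
    var = TOKEN_ENV_BY_HOST.get(host)
    return env.get(var) if var else None


# Demonstration with a fake environment so no real secrets are needed.
fake_env = {"GITHUB_TOKEN": "gh-secret", "GITLAB_TOKEN": "gl-secret"}
github_token = token_for("https://github.com/pandas-dev/pandas.git", fake_env)
gitlab_token = token_for("https://gitlab.com/your-group/your-project.git", fake_env)
other_token = token_for("https://example.com/repo.git", fake_env)
```

In the loop above, the call would then become rr.git_repo_iter(url, token=token_for(url)), so a GitHub credential is never sent to a GitLab host.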

3. Confluence (plain-text)

import os
import pandas as pd
import ragready as rr

# Stream the pages
conf_rows = [
    rec.to_dict()
    for rec in rr.confluence_iter(
        base_url=os.getenv("CONF_URL"),       # e.g. "https://your-domain.atlassian.net/wiki"
        username=os.getenv("CONF_USER"),      # your Atlassian email / user
        api_token=os.getenv("CONFLUENCE_TOKEN"),
        space_keys=["ENG", "DS"],             # any number of spaces
        plain_text=True,                      # strip HTML tags
        limit=500                             # max pages
    )
]

# Build a DataFrame
conf_df = pd.DataFrame(conf_rows)

# Preview key columns
print("\nConfluence preview:")
print(conf_df[["filename", "author", "url"]].head()) # quick peek

4. Website

import pandas as pd
import ragready as rr

# Website crawl → DataFrame preview
web_rows = [
    rec.to_dict()
    for rec in rr.website_iter(
        roots=[
            "https://www.python.org",      # add more starting URLs as needed
            # "https://docs.rust-lang.org",
        ],
        crawl_depth=1                      # how deep to follow links (None = unlimited)
    )
]

web_df = pd.DataFrame(web_rows)

print("\nWebsite preview:")
print(web_df[["source", "title", "url"]].head())  # quick peek
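
For readers unfamiliar with how a depth-limited, same-domain breadth-first crawl behaves, the sketch below walks an in-memory link graph instead of the network. It illustrates the general technique only and is not ragready's implementation: bfs_crawl_order, get_links, and the a.test URLs are hypothetical, and the reading of crawl_depth (0 = roots only, None = unlimited) is an assumption based on the comment in the snippet above.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl_order(roots, get_links, crawl_depth=None):
    """Yield (url, depth) in breadth-first order, staying on the roots' domains."""
    roots = list(dict.fromkeys(roots))            # de-duplicate, keep order
    allowed = {urlparse(r).netloc for r in roots}
    seen = set(roots)
    queue = deque((r, 0) for r in roots)
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if crawl_depth is not None and depth >= crawl_depth:
            continue                              # do not expand past the limit
        for link in get_links(url):
            if link not in seen and urlparse(link).netloc in allowed:
                seen.add(link)
                queue.append((link, depth + 1))


# Fake site: page -> outgoing links (one link leaves the domain).
LINKS = {
    "https://a.test/": ["https://a.test/docs", "https://b.test/", "https://a.test/about"],
    "https://a.test/docs": ["https://a.test/docs/api", "https://a.test/"],
    "https://a.test/about": [],
    "https://a.test/docs/api": [],
}


def fake_links(url):
    return LINKS.get(url, [])


shallow = list(bfs_crawl_order(["https://a.test/"], fake_links, crawl_depth=1))
full = list(bfs_crawl_order(["https://a.test/"], fake_links))
```

With crawl_depth=1 the walk stops after the pages linked directly from the root. With no limit it continues until every same-domain page has been seen once.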

🛠️ Public API

| Symbol | Description |
| --- | --- |
| DocumentRecord | Normalised dataclass each iterator yields |
| git_repo_iter | Stream files from GitHub / GitLab repos |
| confluence_iter | Stream pages from Confluence spaces |
| website_iter | Breadth-first crawl within a domain |
| local_iter | Recursively convert local files via MarkItDown & OCR |

All iterators are lazy streams—process millions of docs without filling memory.
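
Note that the example snippets collect every record into a list, which gives up the memory benefit. A minimal sketch of consuming a lazy stream in fixed-size batches follows. The batched helper and the fake_records generator are hypothetical stand-ins. In real use the generator would be replaced by one of the ragready iterators.

```python
from itertools import islice


def batched(records, size):
    """Yield lists of up to `size` items without materialising the whole stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_records(n):
    # Stand-in for a ragready iterator: produces items one at a time.
    for i in range(n):
        yield {"filename": f"doc-{i}.md", "content": "x" * i}


batch_sizes = [len(batch) for batch in batched(fake_records(7), 3)]
```

Only one batch is held in memory at a time, so each batch can be embedded or written to a store before the next one is pulled.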


🔑 Environment variables

| Purpose | Variable(s) |
| --- | --- |
| GitHub | GITHUB_TOKEN |
| GitLab | GITLAB_TOKEN |
| Confluence | CONF_USER, CONFLUENCE_TOKEN, CONF_URL |
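
Since os.getenv returns None for an unset variable, a missing credential may only surface later as an authentication error. A small pre-flight check along the following lines can catch that earlier. The missing_vars helper is a hypothetical name and is not provided by ragready. The variable names come from the table above.

```python
import os
from typing import Iterable, List, Mapping, Optional

REQUIRED_CONFLUENCE_VARS = ("CONF_URL", "CONF_USER", "CONFLUENCE_TOKEN")


def missing_vars(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstration with a fake environment where only the URL is configured.
fake_env = {"CONF_URL": "https://your-domain.atlassian.net/wiki"}
missing = missing_vars(REQUIRED_CONFLUENCE_VARS, fake_env)
```

Calling missing_vars(REQUIRED_CONFLUENCE_VARS) before the Confluence snippet and raising an error when the result is non-empty gives a clearer failure message than a rejected request.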

📄 License

MIT © 2025 Kwadwo Daddy Nyame Owusu-Boakye


🤝 Contributing

  1. Fork & branch off main
  2. pip install -e .[dev]
  3. Run pytest + ruff check before PRs

All contributions welcome — new extractors, bug fixes, or docs!


🙏 Acknowledgements

Built on the shoulders of:

  • MarkItDown – universal document-to-Markdown converter
  • GitPython, BeautifulSoup 4, pdfplumber, python-pptx, and the wider open-source community.

Happy extracting — your RAG pipeline will thank you! 🦾

