Unified data extraction and preprocessing toolkit for Retrieval-Augmented Generation (RAG) pipelines.
ragready
Unified text + metadata extractors for Retrieval-Augmented Generation (RAG) pipelines
Version 0.1.2 · MIT-licensed
✨ Why ragready?
A high-quality RAG knowledge base starts with clean, consistent documents—no matter where they live.
ragready streams Markdown-normalised content from:
| Source type | Iterator | Notes |
|---|---|---|
| GitHub / GitLab repos | git_repo_iter | Auth tokens supported |
| Atlassian Confluence | confluence_iter | Cloud & Data Center |
| Public websites | website_iter | BFS crawl within domain |
| Local files & folders | local_iter | PDFs, DOCX, PPTX, XLSX, CSV, images (OCR), audio, ZIPs, EPUB… |
Each iterator yields a single dataclass—DocumentRecord—so downstream code never worries about source-specific quirks.
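For orientation, here is a minimal sketch of what such a record could look like. The field names are inferred from the columns used in the examples on this page (source, filename, content, title, author, url). The real DocumentRecord may define more fields, other types, or other defaults, so treat this class as a hypothetical stand-in rather than the library's definition.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DocumentRecord:
    """Hypothetical stand-in; the real ragready class may differ."""

    source: str                    # where the document came from (assumed meaning)
    filename: str
    content: str                   # Markdown-normalised text
    url: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None

    def to_dict(self) -> dict:
        # Flatten to a plain dict so rows drop straight into a DataFrame
        return asdict(self)


record = DocumentRecord(source="local", filename="notes.md", content="# Notes")
row = record.to_dict()
```

Because every source maps onto the same set of keys, rows from different iterators can be concatenated into one table without per-source handling.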
🚀 Installation
pip install ragready
Requires Python ≥ 3.9 and a working git executable for repo extraction. The package bundles markitdown[all], so DOCX/PDF/PPTX/XLSX and OCR support work out of the box.
⚡ Quick start
import pandas as pd
import ragready as rr
# Crawl python.org two links deep
records = rr.website_iter(["https://www.python.org"], crawl_depth=2)
# Collect into a DataFrame (optional)
df = pd.DataFrame(r.to_dict() for r in records)
print(df[["filename", "content"]].head())
🍱 Example snippets
1. Local files
import ragready as rr
import pandas as pd
# Optional LLM client (leave None for pure local parsing)
client = None
llm_model = None
# Run the iterator and capture records
docs = [
    rec.to_dict()
    for rec in rr.local_iter(
        ["./data"],
        llm_client=client,
        llm_model=llm_model,
    )
]
# Convert to a DataFrame (optional)
df = pd.DataFrame(docs)
print(df.head()) # quick peek
2. Git repo with private access
# Imports
import os
import pandas as pd
import ragready as rr
# Optional token for private repos
token = os.getenv("GITHUB_TOKEN") # set in your shell, or leave None for public
# Pick the repos you want to scan
urls = [
    "https://github.com/pandas-dev/pandas.git",
    "https://gitlab.com/your-group/your-project.git",
]
# Run the iterator(s) and collect to dicts
git_records = [
    rec.to_dict()
    for url in urls
    for rec in rr.git_repo_iter(url, token=token)
]
# Build a DataFrame (optional)
git_df = pd.DataFrame(git_records)
# Inspect or save
print("\nGit repos preview:")
print(git_df[["source", "filename", "author", "url"]].head()) # quick peek
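The example above passes the same token to every URL, including the GitLab one. When repos live on more than one host, it may be safer to pick the token per host. The sketch below is one way to do that. The helper name token_for and the host-to-variable mapping are hypothetical and not part of ragready; the variable names come from the environment variables table.

```python
from typing import Mapping, Optional
from urllib.parse import urlparse

# Hypothetical mapping from repo host to the environment variable holding its token
TOKEN_VARS = {"github.com": "GITHUB_TOKEN", "gitlab.com": "GITLAB_TOKEN"}


def token_for(url: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the token matching the repo's host, or None for anonymous access."""
    host = (urlparse(url).hostname or "").lower()
    var = TOKEN_VARS.get(host)
    return env.get(var) if var else None
```

With such a helper, the loop could call rr.git_repo_iter(url, token=token_for(url, os.environ)), so a GitHub token is never sent to a GitLab host. That git_repo_iter takes one token per call is inferred from the example above.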
3. Confluence (plain-text)
import os
import pandas as pd
import ragready as rr
# Stream the pages
conf_rows = [
    rec.to_dict()
    for rec in rr.confluence_iter(
        base_url=os.getenv("CONF_URL"),           # e.g. "https://your-domain.atlassian.net/wiki"
        username=os.getenv("CONF_USER"),          # your Atlassian email / user
        api_token=os.getenv("CONFLUENCE_TOKEN"),
        space_keys=["ENG", "DS"],                 # any number of spaces
        plain_text=True,                          # strip HTML tags
        limit=500,                                # max pages
    )
]
# Build a DataFrame
conf_df = pd.DataFrame(conf_rows)
# Preview key columns
print("\nConfluence preview:")
print(conf_df[["filename", "author", "url"]].head()) # quick peek
4. Website
import pandas as pd
import ragready as rr
# Website crawl → DataFrame preview
web_rows = [
    rec.to_dict()
    for rec in rr.website_iter(
        roots=[
            "https://www.python.org",  # add more starting URLs as needed
            # "https://docs.rust-lang.org",
        ],
        crawl_depth=1,  # how deep to follow links (None = unlimited)
    )
]
web_df = pd.DataFrame(web_rows)
print("\nWebsite preview:")
print(web_df[["source", "title", "url"]].head()) # quick peek
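To make the crawl_depth parameter concrete, here is a small sketch of a breadth-first traversal that stays on the starting host and stops expanding links at a depth limit. It runs over an in-memory link graph with made-up URLs instead of fetching pages. It illustrates the general technique under those assumptions and is not ragready's actual crawler, whose depth counting and domain matching may differ.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional
from urllib.parse import urlparse


def bfs_within_domain(
    root: str,
    links: Dict[str, List[str]],
    max_depth: Optional[int] = None,
) -> Iterator[str]:
    """Yield URLs reachable from root, breadth-first, staying on root's host."""
    host = urlparse(root).hostname
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and urlparse(nxt).hostname == host:
                seen.add(nxt)
                queue.append((nxt, depth + 1))


# Toy link graph standing in for fetched pages (made-up URLs)
LINKS = {
    "https://example.org/": ["https://example.org/a", "https://other.example/x"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/"],
    "https://example.org/b": [],
}

order = list(bfs_within_domain("https://example.org/", LINKS, max_depth=1))
```

In this sketch a depth of 1 visits the root and the pages it links to directly, and the off-domain link is skipped.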
🛠️ Public API
| Symbol | Description |
|---|---|
| DocumentRecord | Normalised dataclass each iterator yields |
| git_repo_iter | Stream files from GitHub / GitLab repos |
| confluence_iter | Stream pages from Confluence spaces |
| website_iter | Breadth-first crawl within a domain |
| local_iter | Recursively convert local files via MarkItDown & OCR |
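All iterators are lazy streams: records are produced one at a time, so a large corpus can be processed without holding every document in memory. Collecting results into a list or DataFrame, as the examples above do, gives up that benefit.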
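One way to keep memory bounded is to consume a stream in fixed-size batches and hand each batch to the next stage (embedding, indexing, writing to disk). The sketch below assumes only that the source is an ordinary Python iterable. The fake_records generator is a stand-in for a ragready iterator, and the batched helper is illustrative, not part of the package.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, pulling from the source only as needed."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_records(n: int) -> Iterator[dict]:
    # Stand-in generator; in practice this would be one of the ragready iterators
    for i in range(n):
        yield {"filename": f"doc{i}.md", "content": f"text {i}"}


batch_sizes = [len(b) for b in batched(fake_records(5), size=2)]
```

Only one batch is held in memory at a time, so peak usage depends on the batch size rather than on the size of the corpus.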
🔑 Environment variables
| Purpose | Variable(s) |
|---|---|
| GitHub | GITHUB_TOKEN |
| GitLab | GITLAB_TOKEN |
| Confluence | CONF_USER, CONFLUENCE_TOKEN, CONF_URL |
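Since os.getenv returns None for an unset variable, a missing setting may only surface later as an authentication error. A small check up front can make the failure clearer. The sketch below uses the three Confluence variable names from the table. The helper confluence_settings is hypothetical and not provided by ragready.

```python
from typing import Dict, Mapping

CONFLUENCE_VARS = ("CONF_URL", "CONF_USER", "CONFLUENCE_TOKEN")


def confluence_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the three Confluence settings, or raise listing what is missing."""
    missing = [name for name in CONFLUENCE_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CONFLUENCE_VARS}
```

Calling confluence_settings(os.environ) before starting an extraction reports every missing variable in a single message.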
📄 License
MIT © 2025 Kwadwo Daddy Nyame Owusu-Boakye
🤝 Contributing
- Fork & branch off main
- pip install -e .[dev]
- Run pytest + ruff check before PRs
All contributions welcome — new extractors, bug fixes, or docs!
🙏 Acknowledgements
Built on the shoulders of:
- MarkItDown – universal document-to-Markdown converter
- GitPython, BeautifulSoup 4, pdfplumber, python-pptx, and the wider open-source community.
Happy extracting — your RAG pipeline will thank you! 🦾