Skip to main content

RAGScraper is a Python library designed for efficient and intelligent scraping of web documentation and content. Tailored for Retrieval-Augmented Generation systems, RAGScraper extracts and preprocesses text into structured, machine-learning-ready formats. It emphasizes precision, context preservation, and ease of integration with RAG models, making it an ideal tool for developers looking to enhance AI-driven applications with rich, web-sourced knowledge.

Project description

RAGScraper

RAGScraper is a simple Python package that scrapes webpages and converts them to markdown format for RAG usage.

Installation

To install RAGScraper, simply run:

pip install ragscraper

Usage

To use RAGScraper as a command-line tool:

rag-scraper <URL>

To use RAGScraper in a Python script:

from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter

# Fetch HTML content
url = "https://example.com"
html_content = Scraper.fetch_html(url)

# Convert to Markdown
markdown_content = Converter.html_to_markdown(html_content)
print(markdown_content)

Development

To run the tests for RAGScraper, navigate to the package directory and run:

python -m unittest discover tests

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragscraper-11.4.2023.tar.gz (3.5 kB view hashes)

Uploaded Source

Built Distribution

ragscraper-11.4.2023-py3-none-any.whl (4.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page