RAGScraper is a Python library designed for efficient and intelligent scraping of web documentation and content. Tailored for Retrieval-Augmented Generation systems, RAGScraper extracts and preprocesses text into structured, machine-learning-ready formats. It emphasizes precision, context preservation, and ease of integration with RAG models, making it an ideal tool for developers looking to enhance AI-driven applications with rich, web-sourced knowledge.

Project description

RAGScraper

RAGScraper is a simple Python package that scrapes webpages and converts them to markdown format for RAG usage.

Installation

To install RAGScraper, simply run:

pip install ragscraper

Usage

To use RAGScraper as a command-line tool:

rag-scraper <URL>

To use RAGScraper in a Python script:

from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter

# Fetch the raw HTML of the page
url = "https://example.com"
html_content = Scraper.fetch_html(url)

# Convert the HTML to Markdown
markdown_content = Converter.html_to_markdown(
    html=html_content,
    base_url=url,  # assumed: the page URL doubles as the base URL
    parser_features='html.parser',
    ignore_links=True
)
print(markdown_content)
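
To make the conversion step more concrete, here is a minimal standard-library sketch of what turning HTML into Markdown can look like. It is not RAGScraper's implementation: the names MarkdownSketch and html_to_markdown_sketch are hypothetical, only headings, paragraphs, and links are handled, and the reading of base_url (used to resolve relative links) and ignore_links (drop link targets, keep link text) is an assumption based on the parameter names above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter for headings, paragraphs, and links."""

    def __init__(self, base_url="", ignore_links=False):
        super().__init__()
        self.base_url = base_url
        self.ignore_links = ignore_links
        self.parts = []   # output fragments, joined at the end
        self.href = None  # resolved target of the <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            # <h2> becomes "## ", <h3> becomes "### ", and so on
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a" and not self.ignore_links:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL
                self.href = urljoin(self.base_url, href)
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser would when rendering
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown_sketch(html, base_url="", ignore_links=False):
    """Hypothetical helper: feed HTML through MarkdownSketch and join the output."""
    parser = MarkdownSketch(base_url=base_url, ignore_links=ignore_links)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<h1>Docs</h1><p>See the <a href="/guide">guide</a> first.</p>'
print(html_to_markdown_sketch(sample, base_url="https://example.com"))
print(html_to_markdown_sketch(sample, ignore_links=True))
```

With ignore_links=True the link text stays in the output but the URL is dropped, which tends to give cleaner text for embedding.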

Development

To run the tests for RAGScraper, navigate to the package directory and run:

python -m unittest discover tests
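
For contributors adding tests, the sketch below shows the shape of a test module that the discover command above would pick up. By default, unittest discovery collects files whose names match test*.py, so a file such as tests/test_cleanup.py would qualify. The helper collapse_blank_lines and the test class are hypothetical and are not part of RAGScraper; they only illustrate the layout.

```python
import unittest


def collapse_blank_lines(markdown: str) -> str:
    """Hypothetical helper: squeeze runs of blank lines down to a single one."""
    lines = []
    previous_blank = False
    for line in markdown.splitlines():
        blank = not line.strip()
        # Keep the line unless it is a blank line following another blank line
        if not (blank and previous_blank):
            lines.append("" if blank else line)
        previous_blank = blank
    return "\n".join(lines)


class CollapseBlankLinesTest(unittest.TestCase):
    def test_squeezes_repeated_blank_lines(self):
        self.assertEqual(
            collapse_blank_lines("# Title\n\n\n\nBody"),
            "# Title\n\nBody",
        )

    def test_leaves_single_blank_line_alone(self):
        self.assertEqual(collapse_blank_lines("a\n\nb"), "a\n\nb")
```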

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Source Distribution

ragscraper-11.5.2023.tar.gz (3.9 kB view details)

Uploaded Source

Built Distribution

ragscraper-11.5.2023-py3-none-any.whl (5.3 kB view details)

Uploaded Python 3

File details

Details for the file ragscraper-11.5.2023.tar.gz.

File metadata

  • Download URL: ragscraper-11.5.2023.tar.gz
  • Size: 3.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for ragscraper-11.5.2023.tar.gz
Algorithm    Hash digest
SHA256       aea8b6d9f9c8ce77691ec9d964ae5dd6dff94ac84af11482738909a4013a23e7
MD5          ea776f9719663b4ff86a3543c62c8557
BLAKE2b-256  80f6e9fc7786ac640671feb7b99c358fb3c4282e154d1718e12b09995f706376

File details

Details for the file ragscraper-11.5.2023-py3-none-any.whl.

File metadata

  • Download URL: ragscraper-11.5.2023-py3-none-any.whl
  • Size: 5.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for ragscraper-11.5.2023-py3-none-any.whl
Algorithm    Hash digest
SHA256       e1c4553ea6b80634015d7674aaeeb3aff50d440abe604e10973d267f19bd5466
MD5          6241fa19c3740baaa6e620eaa8b20bdf
BLAKE2b-256  5a0f86f9e2aa26551b6e86d28892170f5480c3fef36e107e855667afd15e639e
