Rufus: Intelligent Web Data Preparation for RAG Agents

Project description

Rufus

Rufus is an AI-powered tool designed to intelligently crawl websites and extract relevant data for use in Retrieval-Augmented Generation (RAG) pipelines.

API UPDATE (v0.1.2): Rufus now chunks the extracted data and returns the output as a downloadable JSON file with metadata, making it easier to plug into a RAG pipeline!
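
For illustration, a single chunk in the downloaded JSON might look roughly like the record below. The field names are hypothetical and may differ from Rufus's actual output schema:

{
    "content": "React hooks simplify state management in function components...",
    "metadata": {
        "title": "Understanding React Hooks",
        "url": "https://www.taniarascia.com/...",
        "last_updated": "2023-01-15"
    }
}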

Features

  • Intelligent web crawling based on user instructions.
  • Advanced Natural Language Processing (NLP) using spaCy for keyword extraction and content relevance.
  • Extracts metadata such as titles, headings, and last updated dates.
  • Structured output suitable for integration into RAG pipelines.

Installation

Prerequisites

  • Python 3.7 or higher
  • Google Chrome (for Selenium WebDriver)

Install Rufus

pip install rufus-ai

Install spaCy Language Model

python -m spacy download en_core_web_lg
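
To confirm the model installed correctly, you can load it directly. This quick check is not part of Rufus itself, just a sanity test of the spaCy setup:

import spacy

# Loading the model raises OSError if the download step was skipped
nlp = spacy.load("en_core_web_lg")
doc = nlp("Rufus extracts articles about JavaScript and React.")
print([chunk.text for chunk in doc.noun_chunks])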

Usage

Initializing the Rufus Client

from rufus import RufusClient
import os

# Get your API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'your_default_api_key')

# Initialize Rufus client
client = RufusClient(api_key=key)

Scraping a Website

from rufus import RufusClient
import os

# Get your API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'your_default_api_key')

# Initialize Rufus client
client = RufusClient(api_key=key)

url = 'https://www.taniarascia.com'
instructions = "extract articles about javascript, react, web-development"

# Scrape the website
documents = client.scrape(url, instructions)

# Save the results as JSON
import json

output_folder = 'outputs'
os.makedirs(output_folder, exist_ok=True)  # create the output folder if it doesn't exist
file_path = os.path.join(output_folder, 'testwebsite.json')
with open(file_path, 'w') as f:
    json.dump(documents, f, indent=4)

print(f"Data saved to {file_path}")

How Rufus Works

Rufus consists of two main components:

Crawler

  • Navigates through the provided website URL.
  • Uses Selenium WebDriver to handle dynamic content and JavaScript-rendered pages.
  • Collects HTML content from pages relevant to the user's instructions.
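
For context, fetching a JavaScript-rendered page with Selenium and webdriver-manager typically looks like the sketch below. This illustrates the general approach, not Rufus's internal crawler code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

# webdriver-manager downloads a ChromeDriver binary matching the installed Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.taniarascia.com")
html = driver.page_source  # fully rendered HTML, including JavaScript content
driver.quit()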

Parser

  • Processes the HTML content using BeautifulSoup.
  • Utilizes spaCy's NLP capabilities to extract keywords from user instructions.
  • Identifies and extracts relevant content based on the extracted keywords.
  • Extracts metadata such as titles, headings, and last updated dates.
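
A minimal sketch of this parsing step with BeautifulSoup (the dictionary layout is illustrative, not Rufus's internal structure):

from bs4 import BeautifulSoup

html = "<html><head><title>Example</title></head><body><h1>Intro</h1><p>Hello world.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

metadata = {
    "title": soup.title.string if soup.title else None,
    "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
}
text = soup.get_text(separator=" ", strip=True)  # flattened page text for keyword matching
print(metadata, text)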

Keyword Extraction and Content Relevance

  • Keyword Extraction:
    • Uses spaCy's en_core_web_lg model for advanced NER and NLP tasks.
    • Extracts noun chunks, named entities, and significant nouns/proper nouns from the instructions.
  • Content Matching:
    • Tokenizes and lemmatizes page content.
    • Matches content against extracted keywords to determine relevance.
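
A simplified version of this idea using en_core_web_lg (not Rufus's exact implementation):

import spacy

nlp = spacy.load("en_core_web_lg")
instructions = "extract articles about javascript, react, web-development"
doc = nlp(instructions)

# Collect keyword lemmas from noun chunks, named entities, and nouns/proper nouns
keywords = {chunk.lemma_.lower() for chunk in doc.noun_chunks}
keywords |= {ent.lemma_.lower() for ent in doc.ents}
keywords |= {t.lemma_.lower() for t in doc if t.pos_ in ("NOUN", "PROPN")}

# A page is considered relevant if its lemmas overlap with the keywords
page = nlp("A new article on React hooks and modern JavaScript.")
page_lemmas = {t.lemma_.lower() for t in page}
print(keywords & page_lemmas)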

Integrating Rufus into a RAG Pipeline

To integrate Rufus into a Retrieval-Augmented Generation (RAG) pipeline:

  1. Data Collection:
    • Use Rufus to scrape and parse relevant documents from target websites based on specific instructions.
  2. Data Preprocessing:
    • Rufus chunks the data before outputting it, which makes the documents easier for the RAG model to retrieve.
  3. Indexing:
    • Feed the processed data into a vector store or database (e.g., Elasticsearch, Pinecone) to enable efficient retrieval.
  4. Retrieval:
    • When a query is made, retrieve relevant documents from the vector store based on semantic similarity (see the sketch after this list).
    • If required, rank the retrieved documents by relevance.
  5. Generation:
    • Use a language model (e.g., GPT-3, GPT-4) to generate responses that are augmented with the retrieved documents.
  6. Feedback Loop:
    • Optionally, use user feedback to further refine the retrieval and generation process.
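
As a toy illustration of steps 3 and 4, the loop below ranks Rufus-style chunks against a query using spaCy's built-in vectors. A production pipeline would use a dedicated embedding model and a vector store such as Pinecone instead:

import spacy

nlp = spacy.load("en_core_web_lg")

# Hypothetical chunks, standing in for Rufus's JSON output
chunks = [
    "React hooks simplify state management in function components.",
    "A guide to setting up a Node.js web server.",
    "CSS grid layout basics for responsive design.",
]
docs = [nlp(c) for c in chunks]

query = nlp("How do I manage state in React?")
for d in sorted(docs, key=query.similarity, reverse=True):
    print(round(query.similarity(d), 3), d.text)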

Dependencies

  • BeautifulSoup4: HTML parsing
  • Requests: Handling HTTP requests
  • spaCy: Advanced NLP tasks
    • Requires en_core_web_lg language model
  • Selenium: Web browser automation
  • webdriver-manager: Manages WebDriver binaries
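
Installing rufus-ai with pip should pull these in automatically; if you are working from source, they can be installed manually:

pip install beautifulsoup4 requests spacy selenium webdriver-manager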

License

This project is licensed under the MIT License.

Download files

Download the file for your platform.

Source Distribution

rufus_ai-0.1.3.tar.gz (7.5 kB)

Uploaded Source

Built Distribution


rufus_ai-0.1.3-py3-none-any.whl (8.2 kB)

Uploaded Python 3

File details

Details for the file rufus_ai-0.1.3.tar.gz.

File metadata

  • Download URL: rufus_ai-0.1.3.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for rufus_ai-0.1.3.tar.gz

  • SHA256: 19ec182b898a5f8b50f695f534461b53928b6a6bf62a5b75d85715f7cd86b8d8
  • MD5: bad49a51b01338cebaa2c2538b08dea0
  • BLAKE2b-256: 03d1d9cfbe85c8e5d69377045fa912ca4484de64de9f85b0ae09e06ed22bb57a


File details

Details for the file rufus_ai-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: rufus_ai-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for rufus_ai-0.1.3-py3-none-any.whl

  • SHA256: d8a378e3387391013b8fa17e528e7599189106a89dcf023f15698dd84aab2f53
  • MD5: 915a11261512f7302ff47987c400947b
  • BLAKE2b-256: 8ca46de4a6fa010546450311c31965051c4c1de541ba1349cd63e8b2cf0510d0

