Rufus: Intelligent Web Data Preparation for RAG Agents
Rufus is an AI-powered tool designed to intelligently crawl websites and extract relevant data for use in Retrieval-Augmented Generation (RAG) pipelines.
Table of Contents
- Features
- Installation
- Usage
- How Rufus Works
- Integrating Rufus into a RAG Pipeline
- Dependencies
- Contributing
- License
API update (v0.1.2): Rufus now chunks data and returns output as downloadable JSON, with metadata to aid integration into RAG pipelines.
Features
- Intelligent web crawling based on user instructions.
- Advanced Natural Language Processing (NLP) using spaCy for keyword extraction and content relevance.
- Extracts metadata such as titles, headings, and last updated dates.
- Structured output suitable for integration into RAG pipelines.
Installation
Prerequisites
- Python 3.7 or higher
- Google Chrome (for Selenium WebDriver)
Install Rufus
pip install rufus-ai
Install spaCy Language Model
python -m spacy download en_core_web_lg
Usage
Initializing the Rufus Client
from rufus import RufusClient
import os
# Get your API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'your_default_api_key')
# Initialize Rufus client
client = RufusClient(api_key=key)
Scraping a Website
from rufus import RufusClient
import json
import os

# API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'default_key')

# Initialize Rufus client
client = RufusClient(api_key=key)

url = 'https://www.taniarascia.com'
instructions = "extract articles about javascript, react, web-development"

# Scrape the website
documents = client.scrape(url, instructions)

# Save the results
output_folder = 'outputs'
os.makedirs(output_folder, exist_ok=True)
file_path = os.path.join(output_folder, 'testwebsite.json')

with open(file_path, 'w') as f:
    json.dump(documents, f, indent=4)

print(f"Data saved to {file_path}")
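The exact output schema isn't documented here; assuming each scraped document is a record with `content` plus the metadata described below (title, headings, last-updated date), the saved JSON can be consumed like this. All field names are hypothetical and shown only to illustrate loading the file back:

```python
import json

# Hypothetical document shape: the README says Rufus returns chunked
# JSON with metadata, so we assume a list of records like this.
documents = [
    {
        "content": "React hooks let you use state in function components.",
        "metadata": {
            "title": "React Tips",
            "headings": ["Hooks"],
            "last_updated": "2024-01-15",
        },
    }
]

with open("testwebsite.json", "w") as f:
    json.dump(documents, f, indent=4)

# Load the saved chunks back, e.g. before indexing them.
with open("testwebsite.json") as f:
    loaded = json.load(f)

print(loaded[0]["metadata"]["title"])  # React Tips
```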
How Rufus Works
Rufus consists of two main components:
Crawler
- Navigates through the provided website URL.
- Uses Selenium WebDriver to handle dynamic content and JavaScript-rendered pages.
- Collects HTML content from pages relevant to the user's instructions.
Parser
- Processes the HTML content using BeautifulSoup.
- Utilizes spaCy's NLP capabilities to extract keywords from user instructions.
- Identifies and extracts relevant content based on the extracted keywords.
- Extracts metadata such as titles, headings, and last updated dates.
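The Parser's metadata extraction can be sketched with the standard library alone. Rufus itself uses BeautifulSoup; this stand-in only illustrates which elements (title and headings) get pulled out of a page:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects the title and headings from an HTML page,
    mimicking the metadata Rufus's Parser extracts."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.headings = []
        self._current = None  # tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "title":
            self.title = text
        else:
            self.headings.append(text)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

html = ("<html><head><title>Blog</title></head>"
        "<body><h1>React Tips</h1><h2>Hooks</h2></body></html>")
parser = MetadataExtractor()
parser.feed(html)
print(parser.title)     # Blog
print(parser.headings)  # ['React Tips', 'Hooks']
```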
Keyword Extraction and Content Relevance
- Keyword Extraction:
- Uses spaCy's en_core_web_lg model for advanced NER and NLP tasks.
- Extracts noun chunks, named entities, and significant nouns/proper nouns from the instructions.
- Content Matching:
- Tokenizes and lemmatizes page content.
- Matches content against extracted keywords to determine relevance.
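In spirit, the matching works like the toy version below. The real implementation relies on spaCy's en_core_web_lg for lemmatization and entity recognition; here plain lowercased tokens stand in for lemmas, and the stopword list is illustrative:

```python
import re

def extract_keywords(instructions: str) -> set:
    # Stand-in for spaCy noun-chunk/entity extraction:
    # keep lowercased alphabetic tokens, drop filler words.
    stopwords = {"about", "the", "and", "extract", "articles"}
    tokens = re.findall(r"[a-z-]+", instructions.lower())
    return {t for t in tokens if t not in stopwords}

def is_relevant(page_text: str, keywords: set) -> bool:
    # Stand-in for lemmatized content matching: a page counts as
    # relevant if any instruction keyword appears in its tokens.
    tokens = set(re.findall(r"[a-z-]+", page_text.lower()))
    return bool(tokens & keywords)

kw = extract_keywords("extract articles about javascript, react, web-development")
print(kw)  # e.g. {'javascript', 'react', 'web-development'}
print(is_relevant("An intro to React hooks", kw))     # True
print(is_relevant("My favourite pasta recipes", kw))  # False
```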
Integrating Rufus into a RAG Pipeline
To integrate Rufus into a Retrieval-Augmented Generation (RAG) pipeline:
- Data Collection:
- Use Rufus to scrape and parse relevant documents from target websites based on specific instructions.
- Data Preprocessing:
- Rufus chunks the data before outputting it, which aids retrieval by the RAG model.
- Indexing:
- Feed the processed data into a vector store or database (e.g., Elasticsearch, Pinecone) to enable efficient retrieval.
- Retrieval:
- When a query is made, retrieve relevant documents from the vector store based on semantic similarity.
- Rank the retrieved documents by relevance if required.
- Generation:
- Use a language model (e.g., GPT-3, GPT-4) to generate responses that are augmented with the retrieved documents.
- Feedback Loop:
- Optionally, use user feedback to further refine the retrieval and generation process.
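The Indexing and Retrieval steps above can be sketched with a minimal in-memory bag-of-words store. A real pipeline would use embeddings and a vector database such as Pinecone; everything below is an illustrative stand-in:

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Indexing": store each scraped chunk with its vector.
chunks = [
    "React hooks simplify state management in components",
    "CSS grid makes two-dimensional layouts easy",
    "JavaScript promises model asynchronous computation",
]
index = [(c, bow(c)) for c in chunks]

# "Retrieval": rank chunks by similarity to the query.
query = bow("how do react hooks manage state")
ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
print(ranked[0][0])  # React hooks simplify state management in components
```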
Dependencies
- BeautifulSoup4: HTML parsing
- Requests: Handling HTTP requests
- spaCy: Advanced NLP tasks
- Requires the en_core_web_lg language model
- Selenium: Web browser automation
- webdriver-manager: Manages WebDriver binaries
License
This project is licensed under the MIT License.
File details
Details for the file rufus_ai-0.1.3.tar.gz.
File metadata
- Download URL: rufus_ai-0.1.3.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19ec182b898a5f8b50f695f534461b53928b6a6bf62a5b75d85715f7cd86b8d8 |
| MD5 | bad49a51b01338cebaa2c2538b08dea0 |
| BLAKE2b-256 | 03d1d9cfbe85c8e5d69377045fa912ca4484de64de9f85b0ae09e06ed22bb57a |
File details
Details for the file rufus_ai-0.1.3-py3-none-any.whl.
File metadata
- Download URL: rufus_ai-0.1.3-py3-none-any.whl
- Upload date:
- Size: 8.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8a378e3387391013b8fa17e528e7599189106a89dcf023f15698dd84aab2f53 |
| MD5 | 915a11261512f7302ff47987c400947b |
| BLAKE2b-256 | 8ca46de4a6fa010546450311c31965051c4c1de541ba1349cd63e8b2cf0510d0 |