Rufus: Intelligent Web Data Preparation for RAG Agents
Rufus is an AI-powered tool designed to intelligently crawl websites and extract relevant data for use in Retrieval-Augmented Generation (RAG) pipelines.
Table of Contents
- Features
- Installation
- Usage
- How Rufus Works
- Integrating Rufus into a RAG Pipeline
- Dependencies
- Contributing
- License
API update (v0.1.2): Rufus now chunks data and returns output as downloadable JSON, with metadata to aid integration into RAG pipelines.
Features
- Intelligent web crawling based on user instructions.
- Advanced Natural Language Processing (NLP) using spaCy for keyword extraction and content relevance.
- Extracts metadata such as titles, headings, and last updated dates.
- Structured output suitable for integration into RAG pipelines.
Installation
Prerequisites
- Python 3.7 or higher
- Google Chrome (for Selenium WebDriver)
Install Rufus
pip install rufus-ai
Install spaCy Language Model
python -m spacy download en_core_web_lg
Usage
Initializing the Rufus Client
from rufus import RufusClient
import os
# Get your API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'your_default_api_key')
# Initialize Rufus client
client = RufusClient(api_key=key)
Scraping a Website
from rufus import RufusClient
import json
import os

# API key (currently any non-empty string)
key = os.getenv('RUFUS_API_KEY', 'default_key')

# Initialize Rufus client
client = RufusClient(api_key=key)

url = 'https://www.taniarascia.com'
instructions = "extract articles about javascript, react, web-development"

# Scrape the website
documents = client.scrape(url, instructions)

# Save the results
output_folder = 'outputs'
os.makedirs(output_folder, exist_ok=True)
file_path = os.path.join(output_folder, 'testwebsite.json')

with open(file_path, 'w') as f:
    json.dump(documents, f, indent=4)

print(f"Data saved to {file_path}")
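The exact output schema isn't documented here; assuming each scraped document is a record with `content` plus the metadata described below (title, headings, last-updated date), the saved JSON can be consumed like this. All field names are hypothetical and shown only to illustrate loading the file back:

```python
import json

# Hypothetical document shape: the README says Rufus returns chunked
# JSON with metadata, so we assume a list of records like this.
documents = [
    {
        "content": "React hooks let you use state in function components.",
        "metadata": {
            "title": "React Tips",
            "headings": ["Hooks"],
            "last_updated": "2024-01-15",
        },
    }
]

with open("testwebsite.json", "w") as f:
    json.dump(documents, f, indent=4)

# Load the saved chunks back, e.g. before indexing them.
with open("testwebsite.json") as f:
    loaded = json.load(f)

print(loaded[0]["metadata"]["title"])  # React Tips
```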
How Rufus Works
Rufus consists of two main components:
Crawler
- Navigates through the provided website URL.
- Uses Selenium WebDriver to handle dynamic content and JavaScript-rendered pages.
- Collects HTML content from pages relevant to the user's instructions.
Parser
- Processes the HTML content using BeautifulSoup.
- Utilizes spaCy's NLP capabilities to extract keywords from user instructions.
- Identifies and extracts relevant content based on the extracted keywords.
- Extracts metadata such as titles, headings, and last updated dates.
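The Parser's metadata extraction can be sketched with the standard library alone. Rufus itself uses BeautifulSoup; this stand-in only illustrates which elements (title and headings) get pulled out of a page:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects the title and headings from an HTML page,
    mimicking the metadata Rufus's Parser extracts."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.headings = []
        self._current = None  # tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "title":
            self.title = text
        else:
            self.headings.append(text)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

html = ("<html><head><title>Blog</title></head>"
        "<body><h1>React Tips</h1><h2>Hooks</h2></body></html>")
parser = MetadataExtractor()
parser.feed(html)
print(parser.title)     # Blog
print(parser.headings)  # ['React Tips', 'Hooks']
```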
Keyword Extraction and Content Relevance
- Keyword Extraction:
- Uses spaCy's en_core_web_lg model for advanced NER and NLP tasks.
- Extracts noun chunks, named entities, and significant nouns/proper nouns from the instructions.
- Content Matching:
- Tokenizes and lemmatizes page content.
- Matches content against extracted keywords to determine relevance.
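In spirit, the matching works like the toy version below. The real implementation relies on spaCy's en_core_web_lg for lemmatization and entity recognition; here plain lowercased tokens stand in for lemmas, and the stopword list is illustrative:

```python
import re

def extract_keywords(instructions: str) -> set:
    # Stand-in for spaCy noun-chunk/entity extraction:
    # keep lowercased alphabetic tokens, drop filler words.
    stopwords = {"about", "the", "and", "extract", "articles"}
    tokens = re.findall(r"[a-z-]+", instructions.lower())
    return {t for t in tokens if t not in stopwords}

def is_relevant(page_text: str, keywords: set) -> bool:
    # Stand-in for lemmatized content matching: a page counts as
    # relevant if any instruction keyword appears in its tokens.
    tokens = set(re.findall(r"[a-z-]+", page_text.lower()))
    return bool(tokens & keywords)

kw = extract_keywords("extract articles about javascript, react, web-development")
print(kw)  # e.g. {'javascript', 'react', 'web-development'}
print(is_relevant("An intro to React hooks", kw))     # True
print(is_relevant("My favourite pasta recipes", kw))  # False
```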
Integrating Rufus into a RAG Pipeline
To integrate Rufus into a Retrieval-Augmented Generation (RAG) pipeline:
- Data Collection:
- Use Rufus to scrape and parse relevant documents from target websites based on specific instructions.
- Data Preprocessing:
- Rufus chunks the data before outputting it, which aids retrieval by the RAG model.
- Indexing:
- Feed the processed data into a vector store or database (e.g., Elasticsearch, Pinecone) to enable efficient retrieval.
- Retrieval:
- When a query is made, retrieve relevant documents from the vector store based on semantic similarity.
- Rank the retrieved documents by relevance if required.
- Generation:
- Use a language model (e.g., GPT-3, GPT-4) to generate responses that are augmented with the retrieved documents.
- Feedback Loop:
- Optionally, use user feedback to further refine the retrieval and generation process.
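The Indexing and Retrieval steps above can be sketched with a minimal in-memory bag-of-words store. A real pipeline would use embeddings and a vector database such as Pinecone; everything below is an illustrative stand-in:

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Indexing": store each scraped chunk with its vector.
chunks = [
    "React hooks simplify state management in components",
    "CSS grid makes two-dimensional layouts easy",
    "JavaScript promises model asynchronous computation",
]
index = [(c, bow(c)) for c in chunks]

# "Retrieval": rank chunks by similarity to the query.
query = bow("how do react hooks manage state")
ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
print(ranked[0][0])  # React hooks simplify state management in components
```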
Dependencies
- BeautifulSoup4: HTML parsing
- Requests: Handling HTTP requests
- spaCy: Advanced NLP tasks
- Requires the en_core_web_lg language model
- Selenium: Web browser automation
- webdriver-manager: Manages WebDriver binaries
License
This project is licensed under the MIT License.
File details
Details for the file rufus_ai-0.1.3.tar.gz.
File metadata
- Download URL: rufus_ai-0.1.3.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19ec182b898a5f8b50f695f534461b53928b6a6bf62a5b75d85715f7cd86b8d8 |
| MD5 | bad49a51b01338cebaa2c2538b08dea0 |
| BLAKE2b-256 | 03d1d9cfbe85c8e5d69377045fa912ca4484de64de9f85b0ae09e06ed22bb57a |
File details
Details for the file rufus_ai-0.1.3-py3-none-any.whl.
File metadata
- Download URL: rufus_ai-0.1.3-py3-none-any.whl
- Upload date:
- Size: 8.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8a378e3387391013b8fa17e528e7599189106a89dcf023f15698dd84aab2f53 |
| MD5 | 915a11261512f7302ff47987c400947b |
| BLAKE2b-256 | 8ca46de4a6fa010546450311c31965051c4c1de541ba1349cd63e8b2cf0510d0 |