Generate high-quality datasets from web content for AI training

WebRover 🚀

WebRover is a powerful Python library for generating high-quality datasets from web content, designed specifically for training Large Language Models and AI applications.

🌟 Features

Smart Web Scraping: Automatically find and scrape relevant content based on topics
Multiple Input Formats: Support for JSON, YAML, TXT, and Markdown topic files
Async Processing: Fast, concurrent scraping with built-in rate limiting
Quality Control: Built-in content validation and cleaning
LLM-Ready Output: Structured JSONL format perfect for model training
Error Handling: Robust error tracking and recovery mechanisms

⚠️ Important Notes

Cloud Environment Compatibility

Notebook environments such as Google Colab and Kaggle Notebooks already run an asyncio event loop, so starting WebRover's own async scraping inside them can fail with a nested-loop error. This is a limitation of these environments, not of WebRover itself. To resolve it:

Install nest_asyncio:

pip install nest_asyncio

Add these lines at the start of your notebook:

import nest_asyncio
nest_asyncio.apply()

This setup is only required for:

Google Colab
Kaggle Notebooks
Similar cloud-based Jupyter environments

It's not needed for:

Local Python scripts
Command line usage
Standard server deployments
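
The underlying error can be reproduced without WebRover. The following is a minimal sketch using only the standard library; `fetch_stub` and `notebook_like_cell` are hypothetical names standing in for a scraping call and for a notebook cell that is already running inside an event loop.

```python
import asyncio


async def fetch_stub():
    # Hypothetical stand-in for an async scraping call.
    await asyncio.sleep(0)
    return "ok"


async def notebook_like_cell():
    # Notebook cells execute inside an event loop that is already running.
    # Starting a second loop from here is what triggers the error.
    coro = fetch_stub()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_like_cell())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is why it is only needed in notebook-style environments.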

🚀 Troubleshooting

Cloud Environment Issues

When using WebRover in cloud environments (Google Colab, Kaggle Notebooks), you may encounter asyncio-related errors. This is due to how these environments handle async operations. To fix:

# Install the required package
pip install nest_asyncio

# Add at the start of your notebook
import nest_asyncio
nest_asyncio.apply()

Common Issues and Solutions

Rate Limiting
- Symptom: Many HTTP 429 errors
- Solution: Decrease scraping speed by increasing sleep time between requests

Memory Issues with Large Datasets
- Symptom: Out of memory errors
- Solution: Use smaller batch sizes or enable disk caching

Blocked Access
- Symptom: HTTP 403 Forbidden errors
- Solution: Ensure your user agent is set correctly and respect robots.txt

SSL Certificate Errors
- Symptom: SSL verification failed
- Solution: Update your Python SSL certificates or check network settings
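
For the rate-limiting case, one common way to "increase sleep time between requests" is exponential backoff. The sketch below is a generic illustration and not WebRover's internal retry logic; `fetch_with_backoff`, `fake_fetch`, and all parameter names are assumptions made for this example.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each HTTP 429 response.

    `fetch` is any callable that returns a (status_code, body) tuple.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(delay)
            delay *= 2
    # Retries exhausted: return the last 429 response to the caller.
    return status, body


# Usage with a fake fetcher that is rate limited on its first two calls.
calls = []


def fake_fetch(url):
    calls.append(url)
    return (429, "") if len(calls) < 3 else (200, "page text")


waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_backoff(fake_fetch, "https://example.com", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` applies.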

🚀 Quick Start

Installation

pip install webrover

Basic Usage

from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    sites_per_topic=20  # Will get 20 sites for each topic
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")

Using Topic Files

# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)

📖 Documentation

Supported Topic File Formats

JSON

{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}

YAML

topics:
  - AI basics
  - machine learning
  - deep learning

Markdown

- AI basics
- machine learning
- deep learning
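
To make the formats above concrete, here is a rough sketch of how such files could be read into a list of topics. It is not WebRover's own parser: `load_topics` is a hypothetical helper, the one-topic-per-line layout for TXT files is an assumption, and YAML is left out because it needs a third-party parser.

```python
import json
import tempfile
from pathlib import Path


def load_topics(path):
    """Return a list of topic strings from a JSON, Markdown, or TXT file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix == ".json":
        # Expects the {"topics": [...]} layout shown above.
        return [str(topic).strip() for topic in json.loads(text)["topics"]]
    if suffix in {".md", ".markdown"}:
        # Keep only bullet lines such as "- AI basics".
        return [
            line.strip()[2:].strip()
            for line in text.splitlines()
            if line.strip().startswith(("- ", "* "))
        ]
    if suffix == ".txt":
        # Assumption: one topic per non-empty line.
        return [line.strip() for line in text.splitlines() if line.strip()]
    raise ValueError(f"Unsupported topic file format: {suffix}")


# Usage: the same topics written in two formats load to the same list.
with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "topics.json"
    json_file.write_text('{"topics": ["AI basics", "machine learning"]}', encoding="utf-8")
    md_file = Path(tmp) / "topics.md"
    md_file.write_text("- AI basics\n- machine learning\n", encoding="utf-8")
    from_json = load_topics(json_file)
    from_md = load_topics(md_file)

print(from_json == from_md)
```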

Output Structure

{
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "Article content...",
    "metadata": {
        "length": 1234,
        "has_title": true,
        "domain": "example.com"
    }
}
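
A dataset in this shape can be consumed with the standard library alone. The sketch below assumes each line of the JSONL file is one JSON object with the fields shown above; `iter_records` and `min_length` are hypothetical names, and the sample records are made up for illustration.

```python
import io
import json


def iter_records(lines, min_length=0):
    """Yield records whose metadata length is at least min_length."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["metadata"]["length"] >= min_length:
            yield record


# Two made-up records; io.StringIO stands in for an opened dataset file.
sample = io.StringIO(
    '{"url": "https://example.com/a", "title": "A", "content": "short", '
    '"metadata": {"length": 5, "has_title": true, "domain": "example.com"}}\n'
    '{"url": "https://example.com/b", "title": "B", "content": "long text", '
    '"metadata": {"length": 1234, "has_title": true, "domain": "example.com"}}\n'
)
kept = list(iter_records(sample, min_length=100))
print([record["url"] for record in kept])
```

With a real file, pass the handle from `open("my_dataset.jsonl", encoding="utf-8")` in place of `sample`.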

🛠️ Advanced Usage

# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")

# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")

# Access dataset programmatically
dataset = rover.get_dataset()

📊 Output Files

final_dataset/dataset.jsonl: Main dataset in JSONL format
websites_master.json: List of all discovered URLs
websites_completed.json: Successfully scraped URLs
websites_errors.json: Failed attempts with error details
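
After a run, a quick check can confirm which of these files were written. This sketch assumes the paths are relative to the output directory, which the list above does not state explicitly; `check_output_files` is a hypothetical helper and makes no assumption about the contents of each file.

```python
import tempfile
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "final_dataset/dataset.jsonl",
    "websites_master.json",
    "websites_completed.json",
    "websites_errors.json",
]


def check_output_files(output_dir):
    """Map each expected file name to whether it exists under output_dir."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}


# Usage against a temporary directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "websites_master.json").write_text("[]", encoding="utf-8")
    status = check_output_files(tmp)

print(status)
```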

🔄 Error Handling

WebRover automatically handles common issues:

Rate limiting
Network timeouts
Invalid URLs
Blocked requests
Malformed content

🚧 Limitations

Respects robots.txt and site rate limits
Some sites may block automated access
Large datasets require more processing time
Google search may throttle excessive requests
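
To see ahead of time whether a URL is likely to be skipped because of robots.txt, the rules can be evaluated with Python's standard `urllib.robotparser`. This is an independent sketch, not a description of how WebRover performs the check; `is_allowed`, the user agent string, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules that block everything under /private/ for all crawlers.
rules = """User-agent: *
Disallow: /private/
"""

allowed = is_allowed(rules, "my-crawler", "https://example.com/articles/1")
blocked = is_allowed(rules, "my-crawler", "https://example.com/private/page")
print(allowed, blocked)
```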

🗺️ Roadmap

See FUTURE.md for planned features and improvements.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ by Area-25. Special thanks to all contributors.

WebRover: Build better datasets, train better models. 🚀

🧪 Development & Testing

Setting Up Development Environment

Clone the repository:

git clone https://github.com/Area-25/webrover.git
cd webrover

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install development dependencies:

pip install -e ".[tests]"

Running Tests

Run the test suite:

python -m pytest tests/

For test coverage report:

python -m pytest tests/ --cov=webrover

Supported Python Versions

Python 3.10
Python 3.11

Release history

0.1.12 (this version): Dec 1, 2024
0.1.11: Nov 29, 2024
0.1.10: Nov 29, 2024
0.1.9: Nov 29, 2024
0.1.8: Nov 29, 2024
0.1.7: Nov 29, 2024
0.1.6: Nov 29, 2024
0.1.5: Nov 29, 2024
0.1.4: Nov 29, 2024
0.1.3: Nov 29, 2024
0.1.1: Nov 28, 2024

Download files

Source Distribution

webrover-0.1.12.tar.gz (15.3 kB)

Uploaded Dec 1, 2024 Source

Built Distribution

webrover-0.1.12-py3-none-any.whl (13.8 kB)

Uploaded Dec 1, 2024 Python 3

File details

Details for the file webrover-0.1.12.tar.gz.

File metadata

Download URL: webrover-0.1.12.tar.gz
Upload date: Dec 1, 2024
Size: 15.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for webrover-0.1.12.tar.gz
Algorithm	Hash digest
SHA256	`95a9fbb057de4eb269711483fd6d4246b2aec63fb38eb6d70c7cb87dfe12f14a`
MD5	`b788caa34eb0cffab552d2768969c779`
BLAKE2b-256	`91fa2dbc4a007712c680d74759974d0935ad25bc72a4b284fe1e5f552bdbf06a`

File details

Details for the file webrover-0.1.12-py3-none-any.whl.

File metadata

Download URL: webrover-0.1.12-py3-none-any.whl
Upload date: Dec 1, 2024
Size: 13.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for webrover-0.1.12-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0ce458cfaf44cb0d9bb042f69e1b874ddb5935da66aaff538d369a1efa55ffb`
MD5	`f042df09addbdfd63fca704d5f4ec14b`
BLAKE2b-256	`f6a132a791c5231aad8ab7b82dcf965658ad64aa9152ef7083bfd2c8b7c0dcf1`
