Generate high-quality datasets from web content for AI training

WebRover 🚀

Python 3.10+ | License: MIT

WebRover is a powerful Python library for generating high-quality datasets from web content, designed specifically for training Large Language Models and AI applications.

🌟 Features

  • Smart Web Scraping: Automatically find and scrape relevant content based on topics
  • Multiple Input Formats: Support for JSON, YAML, TXT, and Markdown topic files
  • Async Processing: Fast, concurrent scraping with built-in rate limiting
  • Quality Control: Built-in content validation and cleaning
  • LLM-Ready Output: Structured JSONL format perfect for model training
  • Error Handling: Robust error tracking and recovery mechanisms

⚠️ Important Notes

Cloud Environment Compatibility

When using WebRover in cloud environments like Google Colab or Kaggle Notebooks, you may need to handle nested asyncio loops. This is a limitation of these environments, not WebRover itself. To resolve this:

  1. Install nest_asyncio:
pip install nest_asyncio
  2. Add these lines at the start of your notebook:
import nest_asyncio
nest_asyncio.apply()

This setup is only required for:

  • Google Colab
  • Kaggle Notebooks
  • Similar cloud-based Jupyter environments

It's not needed for:

  • Local Python scripts
  • Command line usage
  • Standard server deployments
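
For code that has to run both locally and in notebooks, the patch can be applied only when an event loop is already running. This is a minimal sketch, assuming nest_asyncio is installed in the notebook environment. The helper name in_running_loop is illustrative and is not part of WebRover.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running: plain script, command line, or server start-up.
        return False
    return True


if in_running_loop():
    # Notebook kernels (Colab, Kaggle, Jupyter) already run a loop,
    # so patch asyncio to allow nested use of it.
    import nest_asyncio

    nest_asyncio.apply()
```

In a local script the check returns False, so nest_asyncio is never imported and does not need to be installed there.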

🚀 Troubleshooting

Cloud Environment Issues

In cloud notebooks (Google Colab, Kaggle Notebooks), WebRover may fail with asyncio-related errors because the notebook kernel already runs its own event loop. To fix:

# Install the required package (run in a shell, or prefix with ! in a notebook cell)
pip install nest_asyncio

# Add at the start of your notebook
import nest_asyncio
nest_asyncio.apply()

Common Issues and Solutions

  1. Rate Limiting

    • Symptom: Many HTTP 429 errors
    • Solution: Decrease scraping speed by increasing sleep time between requests
  2. Memory Issues with Large Datasets

    • Symptom: Out of memory errors
    • Solution: Use smaller batch sizes or enable disk caching
  3. Blocked Access

    • Symptom: HTTP 403 Forbidden errors
    • Solution: Ensure your user agent is set correctly and respect robots.txt
  4. SSL Certificate Errors

    • Symptom: SSL verification failed
    • Solution: Update your Python SSL certificates or check network settings
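
For the rate-limiting case above, "increasing sleep time" is often done with exponential backoff: wait longer after each consecutive HTTP 429. The sketch below only computes the delays. The function names and default values are hypothetical, and this page does not describe how WebRover schedules its own retries.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a random wait between 0 and the backoff delay.

    Randomizing the wait keeps parallel workers from retrying in lockstep.
    """
    return random.uniform(0.0, backoff_delay(attempt, base, cap))
```

With the defaults, the un-jittered delays for attempts 0 to 3 are 1, 2, 4, and 8 seconds. Pass the result to time.sleep (or asyncio.sleep in async code) before retrying the request.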

🚀 Quick Start

Installation

pip install webrover

Basic Usage

from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    sites_per_topic=20  # Will get 20 sites for each topic
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")

Using Topic Files

# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)

📖 Documentation

Supported Topic File Formats

JSON

{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}

YAML

topics:
  - AI basics
  - machine learning
  - deep learning

Markdown

- AI basics
- machine learning
- deep learning
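
To check a topic file before starting a long scrape, the JSON and Markdown layouts above can be parsed with the standard library alone. This is a rough sketch and not WebRover's own loader: parse_topics and load_topics are hypothetical names, TXT is assumed to hold one topic per line, and YAML is left out because it needs a third-party parser.

```python
import json
from pathlib import Path


def parse_topics(text: str, suffix: str) -> list[str]:
    """Extract topic strings from file content, based on the file suffix."""
    if suffix == ".json":
        # Expects the {"topics": [...]} layout shown above.
        return [str(topic) for topic in json.loads(text)["topics"]]
    if suffix in {".md", ".txt"}:
        topics = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and Markdown headings
            if line.startswith(("- ", "* ")):
                line = line[2:].strip()  # drop the list marker
            topics.append(line)
        return topics
    raise ValueError(f"Unsupported topic file type: {suffix}")


def load_topics(path: str) -> list[str]:
    """Read a topic file from disk and parse it."""
    file_path = Path(path)
    return parse_topics(file_path.read_text(encoding="utf-8"), file_path.suffix.lower())
```

Calling load_topics("topics.md") on the Markdown example above would give the three topic strings in file order.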

Output Structure

{
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "Article content...",
    "metadata": {
        "length": 1234,
        "has_title": true,
        "domain": "example.com"
    }
}
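
Because the dataset is JSONL, it can be streamed one record at a time. The sketch below assumes each non-empty line is one JSON object with the fields shown above. The helper names and the 500 threshold are illustrative only.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def long_enough(records: Iterable[dict], min_length: int = 500) -> list[dict]:
    """Keep records whose metadata.length is at least `min_length`."""
    return [
        record
        for record in records
        if record.get("metadata", {}).get("length", 0) >= min_length
    ]
```

Passing an open file handle for final_dataset/dataset.jsonl to iter_records reads the dataset line by line without loading the whole file into memory.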

🛠️ Advanced Usage

# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")

# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")

# Access dataset programmatically
dataset = rover.get_dataset()

📊 Output Files

  • final_dataset/dataset.jsonl: Main dataset in JSONL format
  • websites_master.json: List of all discovered URLs
  • websites_completed.json: Successfully scraped URLs
  • websites_errors.json: Failed attempts with error details

🔄 Error Handling

WebRover automatically handles common issues:

  • Rate limiting
  • Network timeouts
  • Invalid URLs
  • Blocked requests
  • Malformed content

🚧 Limitations

  • Respects robots.txt and site rate limits
  • Some sites may block automated access
  • Large datasets require more processing time
  • Google search may throttle excessive requests
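
To see ahead of time whether a site's robots.txt would exclude a URL, the rules can be evaluated with Python's standard urllib.robotparser module. This is an independent sketch: the function name is_allowed and the user agent string are assumptions, and how WebRover performs its own check is not described here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt content (already downloaded) for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "my-bot", "https://example.com/articles/1"))
print(is_allowed(rules, "my-bot", "https://example.com/private/page"))
```

The first call reports the article URL as allowed and the second reports the /private/ URL as disallowed.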

🗺️ Roadmap

See FUTURE.md for planned features and improvements.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ by Area-25. Special thanks to all contributors.


WebRover: Build better datasets, train better models. 🚀

🧪 Development & Testing

Setting Up Development Environment

  1. Clone the repository:
git clone https://github.com/Area-25/webrover.git
cd webrover
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -e ".[tests]"

Running Tests

Run the test suite:

python -m pytest tests/

For test coverage report:

python -m pytest tests/ --cov=webrover

Supported Python Versions

  • Python 3.10
  • Python 3.11
