Generate high-quality datasets from web content for AI training

WebRover 🚀

Python 3.10+ | License: MIT

WebRover is a Python library for generating high-quality datasets from web content, designed for training large language models (LLMs) and other AI applications.


🌟 Features

  • Smart Web Scraping: Automatically find and scrape relevant content based on topics
  • Multiple Input Formats: Support for JSON, YAML, TXT, and Markdown topic files
  • Async Processing: Fast, concurrent scraping with built-in rate limiting
  • Quality Control: Built-in content validation and cleaning
  • LLM-Ready Output: Structured JSONL format perfect for model training
  • Error Handling: Robust error tracking and recovery mechanisms

🚀 Quick Start

Installation

pip install webrover

Basic Usage

from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    num_websites=100
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")

Using Topic Files

# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)

📖 Documentation

Supported Topic File Formats

JSON

{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}

YAML

topics:
  - AI basics
  - machine learning
  - deep learning

Markdown

- AI basics
- machine learning
- deep learning
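Topic files in any of these formats can also be generated programmatically. A minimal sketch using only the standard library to write the JSON shape shown above (the file name `topics.json` matches the Quick Start example):

```python
import json

# Topics matching the JSON format shown above.
topics = {"topics": ["AI basics", "machine learning", "deep learning"]}

# Write the file that rover.scrape_topics(topics="topics.json") expects.
with open("topics.json", "w") as f:
    json.dump(topics, f, indent=4)
```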

Output Structure

{
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "Article content...",
    "metadata": {
        "length": 1234,
        "has_title": true,
        "domain": "example.com"
    }
}
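Since each record is a single JSON object, a saved dataset line can be parsed with the standard library. A sketch (the sample line mirrors the structure above; it is not real scraped output):

```python
import json

# One line of the JSONL dataset, mirroring the record structure above.
line = ('{"url": "https://example.com/article", "title": "Article Title", '
        '"content": "Article content...", "metadata": {"length": 1234, '
        '"has_title": true, "domain": "example.com"}}')

record = json.loads(line)
print(record["metadata"]["domain"])  # prints: example.com
```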

🛠️ Advanced Usage

# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")

# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")

# Access dataset programmatically
dataset = rover.get_dataset()
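A saved dataset can also be reloaded from its JSONL file later, for example to drop very short records before training. A sketch, assuming the default path from the Output Files section; the `min_length` threshold is illustrative, not a library setting:

```python
import json

def load_dataset(path="final_dataset/dataset.jsonl", min_length=200):
    """Reload a saved dataset, keeping records above a length threshold."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # 'metadata.length' follows the Output Structure documented above.
            if record["metadata"]["length"] >= min_length:
                records.append(record)
    return records
```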

📊 Output Files

  • final_dataset/dataset.jsonl: Main dataset in JSONL format
  • websites_master.json: List of all discovered URLs
  • websites_completed.json: Successfully scraped URLs
  • websites_errors.json: Failed attempts with error details
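The bookkeeping files above also allow a quick success-rate check without reloading the library. A hedged sketch: it assumes each file contains a JSON list of URL entries, which this page does not guarantee:

```python
import json

def success_rate(completed_path="websites_completed.json",
                 errors_path="websites_errors.json"):
    """Fraction of scraped URLs that succeeded, from the bookkeeping files.
    Assumes each file holds a JSON list (structure not documented here)."""
    with open(completed_path) as f:
        completed = json.load(f)
    with open(errors_path) as f:
        errors = json.load(f)
    total = len(completed) + len(errors)
    return len(completed) / total if total else 0.0
```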

🔄 Error Handling

WebRover automatically handles common issues:

  • Rate limiting
  • Network timeouts
  • Invalid URLs
  • Blocked requests
  • Malformed content

🚧 Limitations

  • Respects robots.txt and site rate limits
  • Some sites may block automated access
  • Large datasets require more processing time
  • Google search may throttle excessive requests

🗺️ Roadmap

See FUTURE.md for planned features and improvements.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ by Area-25. Special thanks to all contributors.


WebRover: Build better datasets, train better models. 🚀

Download files

Download the file for your platform.

Source Distribution

webrover-0.1.1.tar.gz (7.5 kB)

Uploaded Source

Built Distribution


webrover-0.1.1-py3-none-any.whl (7.1 kB)

Uploaded Python 3

File details

Details for the file webrover-0.1.1.tar.gz.

File metadata

  • Download URL: webrover-0.1.1.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for webrover-0.1.1.tar.gz:

  • SHA256: ca42a31fa6349a38ea8d2dd9c6136eda4a9fff2c7d52f03931c5b7fe11b9fd8a
  • MD5: 2ecd243e6c77c0e96eb7592f6d0180d4
  • BLAKE2b-256: c09ed7bb687d7e5a03e95e7baea1c2e1b0a0c988b5b3fffad60ff71a75d65ff5

File details

Details for the file webrover-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: webrover-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for webrover-0.1.1-py3-none-any.whl:

  • SHA256: ff759f416e9a6666587fd1d6d366fddbc26c009b1d0d1db147e498ce4b0d2933
  • MD5: 45716fc7262707bb72284a8b55001a43
  • BLAKE2b-256: e1bcbf2834b65c339d61c5b37c4445474abe9a53aa79a4b5f4daf15cd571ffcc
