Generate high-quality datasets from web content for AI training
WebRover 🚀
WebRover is a Python library for generating high-quality datasets from web content, designed for training Large Language Models and building AI applications.
🌟 Features
- Smart Web Scraping: Automatically find and scrape relevant content based on topics
- Multiple Input Formats: Support for JSON, YAML, TXT, and Markdown topic files
- Async Processing: Fast, concurrent scraping with built-in rate limiting
- Quality Control: Built-in content validation and cleaning
- LLM-Ready Output: Structured JSONL format perfect for model training
- Error Handling: Robust error tracking and recovery mechanisms
🚀 Quick Start
Installation
pip install webrover
Basic Usage
from webrover import WebRover
# Initialize WebRover
rover = WebRover()
# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    num_websites=100
)
# Save the dataset
rover.save_dataset("my_dataset.jsonl")
Using Topic Files
# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)
# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)
📖 Documentation
Supported Topic File Formats
JSON
{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}
YAML
topics:
  - AI basics
  - machine learning
  - deep learning
Markdown
- AI basics
- machine learning
- deep learning
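The formats above all carry the same list of topics. As a rough illustration of how such files map onto a plain Python list, here is a minimal standard-library sketch. The `load_topics` helper is hypothetical and is not part of WebRover's API. It assumes a `.txt` file holds one topic per line (the TXT layout is not shown above), and it leaves YAML out because parsing it needs a third-party package.

```python
import json
from pathlib import Path


def load_topics(path):
    """Hypothetical helper (not WebRover's API): read topics from a file.

    Assumes .json files hold {"topics": [...]}, .md files hold "- item"
    bullets, and any other file holds one topic per line.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()

    if suffix == ".json":
        return list(json.loads(text)["topics"])

    topics = []
    for line in text.splitlines():
        line = line.strip()
        if suffix == ".md":
            # Keep only bullet lines and drop the leading "- " marker.
            if not line.startswith("- "):
                continue
            line = line[2:].strip()
        if line:
            topics.append(line)
    return topics
```

The resulting list could then be passed as the `topics` argument of `scrape_topics`, which the Quick Start shows accepting a list of strings.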
Output Structure
{
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "Article content...",
    "metadata": {
        "length": 1234,
        "has_title": true,
        "domain": "example.com"
    }
}
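Because the dataset is saved as JSONL, each line of the file can be parsed on its own. The sketch below reads records with the standard `json` module and filters them on the `length` field shown above. The helper names `read_jsonl` and `filter_by_length` are illustrative and are not part of WebRover, and the threshold in the usage note is arbitrary.

```python
import json


def read_jsonl(path):
    """Yield one record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_by_length(records, min_length):
    """Keep records whose metadata 'length' is at least min_length."""
    return [r for r in records if r["metadata"]["length"] >= min_length]
```

For example, `filter_by_length(read_jsonl("my_dataset.jsonl"), 500)` would keep only the records whose `length` is 500 or more.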
🛠️ Advanced Usage
# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")
# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")
# Access dataset programmatically
dataset = rover.get_dataset()
📊 Output Files
- final_dataset/dataset.jsonl: Main dataset in JSONL format
- websites_master.json: List of all discovered URLs
- websites_completed.json: Successfully scraped URLs
- websites_errors.json: Failed attempts with error details
🔄 Error Handling
WebRover automatically handles common issues:
- Rate limiting
- Network timeouts
- Invalid URLs
- Blocked requests
- Malformed content
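How WebRover recovers from these issues internally is not described here. As a generic illustration of one common way to handle transient failures such as network timeouts, the sketch below retries a call with exponential backoff. The function `retry_with_backoff` and its parameters are hypothetical and should not be read as WebRover's implementation.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between
    attempts. The last exception is re-raised if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Double the wait after each failed attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.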
🚧 Limitations
- Respects robots.txt and site rate limits
- Some sites may block automated access
- Large datasets require more processing time
- Google search may throttle excessive requests
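To make the robots.txt point concrete, the sketch below checks whether a URL may be fetched under a given set of rules, using Python's standard `urllib.robotparser`. It parses rules from a string so that it runs without network access. The `is_allowed` helper and the sample rules are illustrative assumptions, not WebRover's own code.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules: everything is allowed except paths under /private/.
rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "https://example.com/article"))       # True
print(is_allowed(rules, "https://example.com/private/page"))  # False
```

A URL that fails a check like this would simply never be scraped, which is one reason the number of collected pages can be lower than the number requested.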
🗺️ Roadmap
See FUTURE.md for planned features and improvements.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
Built with ❤️ by Area-25. Special thanks to all contributors.
WebRover: Build better datasets, train better models. 🚀
Download files
File details
Details for the file webrover-0.1.3.tar.gz.
File metadata
- Download URL: webrover-0.1.3.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ced68e0f35bc9414b0f2078d9c111e916109e480694148b474c93ad992ffed8b |
| MD5 | 19cc9a7dc914135bbc4d4fc026956102 |
| BLAKE2b-256 | 3125367cfc3fe3ba3ee4fe69486d2f01a5cf079dcd639b3fb367bacd2408ed75 |
File details
Details for the file webrover-0.1.3-py3-none-any.whl.
File metadata
- Download URL: webrover-0.1.3-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e18abaf43ad08aa30d779dbfea063156e6991ecbae3778ba956ad48dfed9364 |
| MD5 | 38ebc38f80b68c95ccc01f582ae7ecfc |
| BLAKE2b-256 | b3f473b4232e1ee40c776600cf53298754372f2d35636aec750ad68993196cac |