Multi-Browser Crawler
A focused browser automation package for web scraping and content extraction.
Features
- Browser Pool Management: Auto-scaling browser pools with session management
- Proxy Support: Built-in proxy rotation and management
- Image Download: Automatic image capture and localization
- API Discovery: Network request capture and pattern matching
- Session Persistence: Stateful browsing with cookie/session support
Installation
pip install multi-browser-crawler
Quick Start
import asyncio

from multi_browser_crawler import BrowserPoolManager, BrowserConfig


async def main():
    # Simple configuration
    config = BrowserConfig(headless=True, timeout=30)
    pool = BrowserPoolManager(config.to_dict())
    try:
        await pool.initialize()
        # Fetch HTML
        result = await pool.fetch_html(
            url="https://example.com",
            session_id="my_session"
        )
        if result['status']['success']:
            print(f"✅ Success! Title: {result.get('title', 'N/A')}")
            print(f"HTML size: {len(result.get('html', ''))} characters")
        else:
            print(f"❌ Error: {result['status'].get('error')}")
    finally:
        # Always release the browsers, even if the fetch raised
        await pool.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
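The Quick Start fetches a single page. The sketch below shows one way to fetch several URLs through the same pool with bounded concurrency. It assumes that `fetch_html()` can safely be awaited from several tasks at once, which this README does not state explicitly. `FakePool` and `fetch_many` are hypothetical names used only for illustration; `FakePool` stands in for `BrowserPoolManager` so the example runs without a browser.

```python
import asyncio


class FakePool:
    """Hypothetical stand-in for BrowserPoolManager (no real browser involved)."""

    async def fetch_html(self, url, session_id=None):
        await asyncio.sleep(0)  # yield control, as a real network call would
        ok = not url.endswith("/broken")
        status = {"success": ok, "url": url}
        if not ok:
            status["error"] = "simulated failure"
        return {"status": status, "html": "<html></html>" if ok else ""}


async def fetch_many(pool, urls, max_concurrency=3):
    """Fetch several URLs through one pool, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            return await pool.fetch_html(url=url, session_id=None)

    # gather() preserves input order, so results line up with urls.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/broken"]
    for result in asyncio.run(fetch_many(FakePool(), urls)):
        print(result["status"]["url"], result["status"]["success"])
```

With a real pool, keeping `max_concurrency` at or below `max_browsers` is a reasonable starting point, though the right value depends on how the pool queues requests.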
Configuration Options
config = BrowserConfig(
    headless=True,                     # Run in headless mode
    timeout=30,                        # Page load timeout (seconds)
    min_browsers=1,                    # Minimum browsers in pool
    max_browsers=5,                    # Maximum browsers in pool
    proxy_url="http://proxy:8080",     # Optional proxy URL
    download_images_dir="/tmp/images"  # Image download directory
)
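As a minimal sketch, the options above could be collected from environment variables before being passed to `BrowserConfig`. The `CRAWLER_*` variable names and the helper `config_kwargs_from_env` are hypothetical, and the fallback values simply mirror the example above rather than the library's actual defaults.

```python
def config_kwargs_from_env(env):
    """Build BrowserConfig keyword arguments from a mapping such as os.environ.

    The CRAWLER_* names are illustrative assumptions, not part of the package.
    """
    kwargs = {
        "headless": env.get("CRAWLER_HEADLESS", "true").lower() != "false",
        "timeout": int(env.get("CRAWLER_TIMEOUT", "30")),
        "min_browsers": int(env.get("CRAWLER_MIN_BROWSERS", "1")),
        "max_browsers": int(env.get("CRAWLER_MAX_BROWSERS", "5")),
    }
    if kwargs["min_browsers"] > kwargs["max_browsers"]:
        raise ValueError("min_browsers must not exceed max_browsers")
    proxy = env.get("CRAWLER_PROXY_URL")
    if proxy:
        kwargs["proxy_url"] = proxy  # only set when a proxy is configured
    return kwargs


# Intended use (requires the package to be installed):
#   import os
#   config = BrowserConfig(**config_kwargs_from_env(os.environ))
```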
API Methods
fetch_html()
result = await pool.fetch_html(
    url="https://example.com",
    session_id="optional_session",        # For persistent sessions
    timeout=30,                           # Request timeout
    api_patterns=["*/api/*"],             # Capture API calls
    images_to_capture=["*.jpg", "*.png"]  # Download images
)
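The `api_patterns` and `images_to_capture` values look like shell-style wildcards. Assuming they are matched glob-style against full URLs (the README does not say how the package matches them), the sketch below uses the standard-library `fnmatch` module to preview which URLs a pattern list would select. `matching_urls` is a hypothetical helper, not part of the package.

```python
from fnmatch import fnmatchcase


def matching_urls(urls, patterns):
    """Return the URLs that match at least one glob-style pattern.

    fnmatchcase() is case-sensitive on every platform, and its '*'
    matches any run of characters, including '/'.
    """
    return [u for u in urls if any(fnmatchcase(u, p) for p in patterns)]


if __name__ == "__main__":
    seen = [
        "https://example.com/api/users",
        "https://example.com/static/logo.png",
        "https://example.com/index.html",
    ]
    print(matching_urls(seen, ["*/api/*"]))
    print(matching_urls(seen, ["*.jpg", "*.png"]))
```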
Response format:
{
    'status': {'success': True, 'url': '...', 'load_time': 1.23},
    'html': '<html>...</html>',
    'title': 'Page Title',
    'api_calls': [...],  # Captured API requests
    'images': [...]      # Downloaded images
}
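A small helper can turn a response of this shape into a one-line log message. This is a sketch that relies only on the keys shown above plus the `error` key used in the Quick Start; `summarize_result` is a hypothetical name, and every lookup uses `.get()` so that a missing key does not raise.

```python
def summarize_result(result):
    """Summarize a fetch_html()-style response dict as a single line."""
    status = result.get("status", {})
    url = status.get("url", "?")
    if not status.get("success"):
        return f"FAILED {url}: {status.get('error', 'unknown error')}"
    return (
        f"OK {url} in {status.get('load_time', 0):.2f}s: "
        f"{len(result.get('html', ''))} chars, "
        f"{len(result.get('api_calls', []))} API calls, "
        f"{len(result.get('images', []))} images"
    )


if __name__ == "__main__":
    sample = {
        "status": {"success": True, "url": "https://example.com", "load_time": 1.23},
        "html": "<html></html>",
        "title": "Page Title",
        "api_calls": [{}],
        "images": [],
    }
    print(summarize_result(sample))
```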
Session Management
# Persistent session - maintains cookies/state
result1 = await pool.fetch_html(url="https://site.com/login", session_id="user1")
result2 = await pool.fetch_html(url="https://site.com/profile", session_id="user1")
# Non-persistent - fresh browser each time
result3 = await pool.fetch_html(url="https://site.com", session_id=None)
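Because requests that share a `session_id` share cookies and state, multi-step flows such as login followed by a profile page usually need to run in order and stop when a step fails. The sketch below illustrates that pattern. `fetch_in_session` and `RecordingPool` are hypothetical names; `RecordingPool` is a stand-in that only records calls, so the example runs without the package.

```python
import asyncio


class RecordingPool:
    """Hypothetical stand-in for BrowserPoolManager that records each call."""

    def __init__(self):
        self.calls = []

    async def fetch_html(self, url, session_id=None):
        self.calls.append((url, session_id))
        # Simulate a failure for any URL containing "missing".
        return {"status": {"success": "missing" not in url, "url": url}}


async def fetch_in_session(pool, session_id, urls):
    """Fetch urls one after another in the same session; stop at the first failure."""
    results = []
    for url in urls:
        result = await pool.fetch_html(url=url, session_id=session_id)
        results.append(result)
        if not result["status"]["success"]:
            break  # later steps likely depend on this one
    return results


if __name__ == "__main__":
    pool = RecordingPool()
    steps = ["https://site.com/login", "https://site.com/profile"]
    asyncio.run(fetch_in_session(pool, "user1", steps))
    print(pool.calls)
```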
Proxy Support
# Single proxy
config = BrowserConfig(proxy_url="http://proxy:8080")
# The package integrates with rotating-mitmproxy for advanced proxy rotation
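A malformed proxy URL tends to surface as a confusing connection error later, so it can help to check it before building the configuration. The sketch below uses only the standard library. The set of accepted schemes is an assumption, since this README only shows an `http://` proxy, and `validate_proxy_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Assumed scheme list; the package may accept more or fewer.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def validate_proxy_url(proxy_url):
    """Return proxy_url unchanged if it has a known scheme, a host, and a port."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported or missing proxy scheme in {proxy_url!r}")
    # Accessing .port raises ValueError itself if the port is not a valid number.
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs both host and port: {proxy_url!r}")
    return proxy_url


if __name__ == "__main__":
    print(validate_proxy_url("http://proxy:8080"))
```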
Testing
# Run all tests
python -m pytest tests/ -v
# Run specific test categories
python -m pytest tests/ -m "not slow" -v
License
MIT License - see LICENSE file for details.
File details
Details for the file multi_browser_crawler-0.4.0.tar.gz.
File metadata
- Download URL: multi_browser_crawler-0.4.0.tar.gz
- Upload date:
- Size: 52.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 141e5d86d2ff069afd89438eff2d20215aef4d47487acdc200079554169a30e2 |
| MD5 | 0a01dfc34cd2f315e7f49824c1668adf |
| BLAKE2b-256 | 907e9201bcbdb78e168284fe206e1bf0ebbab3ef097eeafedfa7c989c964cd84 |
File details
Details for the file multi_browser_crawler-0.4.0-py3-none-any.whl.
File metadata
- Download URL: multi_browser_crawler-0.4.0-py3-none-any.whl
- Upload date:
- Size: 57.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9691170a11af5bd6591535fa4891389de1be9d09b50884cdd9292b579ad6835a |
| MD5 | 57c258ffe50d6dac6b457deb1618d550 |
| BLAKE2b-256 | aec9566d3e97747d6906020b6e5e1d932115ff807d34c0d2bbd6a398ba7194a9 |