# Multi-Browser Crawler

Enterprise-grade browser automation with advanced features.

A clean, focused browser automation package for web scraping and content extraction.
## 🎯 Ultra-Clean Architecture

This package provides four essential components for browser automation:
- `BrowserPoolManager`: browser pool management built on undetected-chromedriver
- `ProxyManager`: simple proxy management with Chrome-ready format
- `DebugPortManager`: thread-safe debug port allocation
- `BrowserConfig`: clean configuration management
## ✨ Key Features
- **Zero Redundancy**: every line serves a purpose
- **Built-in Features**: image download, API discovery, JavaScript execution
- **Direct Usage**: no unnecessary wrapper layers
- **Chrome Integration**: undetected-chromedriver for stealth browsing
- **Proxy Support**: single-regex parsing, Chrome-ready format
- **Session Management**: persistent and non-persistent browsers
## 📦 Installation

```bash
pip install multi-browser-crawler
```
## 🚀 Quick Start

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Create configuration
    config = BrowserConfig(
        headless=True,
        timeout=30,
        browser_data_dir="tmp/browser-data"
    )

    # Initialize the browser pool
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # Fetch a webpage
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None  # Non-persistent browser
        )

        print("✅ Success!")
        print(f"   Title: {result.get('title', 'N/A')}")
        print(f"   Load time: {result.get('load_time', 0):.2f}s")
        print(f"   HTML size: {len(result.get('html', ''))} characters")
    finally:
        await browser_pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
```
## ⚙️ Configuration

### Basic Configuration

```python
from multi_browser_crawler import BrowserConfig

config = BrowserConfig(
    headless=True,                    # Run in headless mode
    timeout=30,                       # Page load timeout in seconds
    browser_data_dir="tmp/browsers",  # Browser data directory
    proxy_file_path="proxies.txt",    # Optional proxy file
    min_browsers=1,                   # Minimum browsers in pool
    max_browsers=5,                   # Maximum browsers in pool
    idle_timeout=300,                 # Browser idle timeout (seconds)
    debug_port_start=9222,            # Debug port range start
    debug_port_end=9322,              # Debug port range end
)
```
### Environment Variables

```bash
export BROWSER_HEADLESS=true
export BROWSER_TIMEOUT=30
export BROWSER_DATA_DIR="/tmp/browsers"
export PROXY_FILE_PATH="/path/to/proxies.txt"
export MIN_BROWSERS=1
export MAX_BROWSERS=5
export DEBUG_PORT_START=9222
export DEBUG_PORT_END=9322
```
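One way these variables might be consumed on the Python side, sketched with plain `os.environ` (the function name `config_from_env` and the defaults are assumptions; the package's actual loading logic may differ):

```python
import os

def config_from_env() -> dict:
    """Sketch: build a config dict from the environment variables above."""
    env = os.environ
    return {
        "headless": env.get("BROWSER_HEADLESS", "true").lower() == "true",
        "timeout": int(env.get("BROWSER_TIMEOUT", "30")),
        "browser_data_dir": env.get("BROWSER_DATA_DIR", "tmp/browsers"),
        "proxy_file_path": env.get("PROXY_FILE_PATH"),   # None if unset
        "min_browsers": int(env.get("MIN_BROWSERS", "1")),
        "max_browsers": int(env.get("MAX_BROWSERS", "5")),
        "debug_port_start": int(env.get("DEBUG_PORT_START", "9222")),
        "debug_port_end": int(env.get("DEBUG_PORT_END", "9322")),
    }
```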
## 📝 Proxy File Format

Create a `proxies.txt` file with one proxy per line:

```text
# Basic proxies
127.0.0.1:8080
192.168.1.100:3128
proxy.example.com:8080

# Proxies with authentication
user:pass@192.168.1.1:3128
admin:secret@proxy.example.com:9999

# Complex passwords (supported)
user:complex@pass@host.com:8080
```
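The "complex passwords" case works because the host and port can be split off from the right-hand side, leaving any extra `@` characters inside the password. A minimal parsing sketch (illustrative; `parse_proxy` is a hypothetical helper, not the package's actual single-regex parser):

```python
def parse_proxy(line: str) -> dict:
    """Sketch: parse 'user:pass@host:port' or 'host:port' proxy lines.

    Splitting on the *last* '@' lets passwords contain '@' themselves.
    """
    auth = None
    if "@" in line:
        auth, line = line.rsplit("@", 1)   # '@' chars stay in the password
    host, port = line.rsplit(":", 1)       # last ':' separates host and port
    proxy = {"host": host, "port": int(port)}
    if auth:
        user, password = auth.split(":", 1)
        proxy.update(user=user, password=password)
    return proxy

parse_proxy("user:complex@pass@host.com:8080")
# → {'host': 'host.com', 'port': 8080, 'user': 'user', 'password': 'complex@pass'}
```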
## 📚 Usage Examples

### 1. Basic Web Scraping

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def basic_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data"
    )
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None
        )
        print(f"Status: {result['status']}")
        print(f"HTML: {result['html'][:100]}...")
    finally:
        await browser_pool.shutdown()

asyncio.run(basic_scraping())
```
### 2. Using Proxies

```python
async def proxy_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data",
        proxy_file_path="proxies.txt"  # Use proxy file
    )
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None,
            use_proxy=True  # Enable proxy usage
        )
        print(f"IP: {result['html']}")
    finally:
        await browser_pool.shutdown()
```
### 3. Persistent Sessions

```python
async def persistent_session():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        # First request - set a cookie
        result1 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies/set/test/value123",
            session_id="my_session"  # Persistent session
        )
        # Second request - check the cookie (same session)
        result2 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies",
            session_id="my_session"  # Same session
        )
        print("Cookie persisted between requests!")
    finally:
        await browser_pool.shutdown()
```
### 4. JavaScript Execution

```python
async def javascript_execution():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None,
            js_action="document.title = 'Modified by JS';"
        )
        print(f"Modified title: {result.get('title')}")
    finally:
        await browser_pool.shutdown()
```
### 5. Image Downloading

```python
async def download_images():
    config = BrowserConfig(
        browser_data_dir="tmp/browser-data",
        download_images_dir="tmp/images"
    )
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None,
            download_images=True  # Enable image downloading
        )
        print(f"Downloaded images: {result.get('downloaded_images', [])}")
    finally:
        await browser_pool.shutdown()
```
### 6. API Discovery

```python
async def api_discovery():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://spa-app.example.com",
            session_id=None,
            capture_api_calls=True  # Enable API discovery
        )
        print(f"API calls: {result.get('api_calls', [])}")
    finally:
        await browser_pool.shutdown()
```
## 🖥️ CLI Usage

```bash
# Fetch a single URL
python -m multi_browser_crawler.browser_cli fetch https://example.com

# Fetch with proxy
python -m multi_browser_crawler.browser_cli fetch https://example.com --proxy-file proxies.txt

# Test proxies
python -m multi_browser_crawler.browser_cli test-proxies proxies.txt
```
## 🧪 Testing

Run the test suite:

```bash
# Run all tests
python tests/test_browser.py
python tests/test_proxy_manager.py
python tests/test_debug_port_manager.py

# Run usage examples
python examples/01_basic_usage.py
python examples/02_advanced_features.py
```
## 📄 License

MIT License - see the LICENSE file for details.
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## 📞 Support

- **GitHub Issues**: report bugs and request features
- **Documentation**: coding principles and examples
- **Examples**: comprehensive usage patterns and principle demonstrations