Multi-Browser Crawler
Enterprise-grade browser automation with advanced features.
A clean, focused browser automation package for web scraping and content extraction.
🎯 Ultra-Clean Architecture
This package provides 4 essential components for browser automation:
- BrowserPoolManager: Browser pool management with undetected-chromedriver
- ProxyManager: Simple proxy management with Chrome-ready format
- DebugPortManager: Thread-safe debug port allocation (a conceptual sketch follows this list)
- BrowserConfig: Clean configuration management
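DebugPortManager's own API isn't shown in this README. Purely to illustrate what thread-safe debug port allocation means, here is a minimal conceptual sketch; the class and method names below are hypothetical, not the package's:

```python
import threading


class PortAllocator:
    """Hypothetical sketch of thread-safe port allocation.

    Not the package's DebugPortManager; it only illustrates handing out
    unique Chrome debug ports from a fixed range under a lock.
    """

    def __init__(self, start: int = 9222, end: int = 9322):
        self._free = set(range(start, end + 1))
        self._lock = threading.Lock()

    def acquire(self) -> int:
        # Pop an unused port under the lock so concurrent browsers
        # never receive the same debug port.
        with self._lock:
            if not self._free:
                raise RuntimeError("No free debug ports in range")
            return self._free.pop()

    def release(self, port: int) -> None:
        # Return the port to the pool once the browser has shut down.
        with self._lock:
            self._free.add(port)
```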
✨ Key Features
- Zero Redundancy: Every line serves a purpose
- Built-in Features: Image download, API discovery, JS execution
- Direct Usage: No unnecessary wrapper layers
- Chrome Integration: Undetected-chromedriver for stealth browsing
- Proxy Support: Single regex parsing, Chrome-ready format
- Session Management: Persistent and non-persistent browsers
- rotating-mitmproxy Integration: Advanced proxy rotation with SSL handling
- Google Services Pass-through: Automatic SSL noise reduction
📦 Installation
```bash
pip install multi-browser-crawler
```
🚀 Quick Start
```python
import asyncio

from multi_browser_crawler import BrowserPoolManager


async def main():
    # Create the configuration dict directly
    config = {
        'headless': True,
        'timeout': 30,
        'browser_data_dir': "tmp/browser-data"
    }

    # Initialize the browser pool
    browser_pool = BrowserPoolManager(config)

    try:
        # Fetch a webpage
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None  # Non-persistent browser
        )

        print("✅ Success!")
        print(f"   Title: {result.get('title', 'N/A')}")
        print(f"   Load time: {result.get('load_time', 0):.2f}s")
        print(f"   HTML size: {len(result.get('html', ''))} characters")
    finally:
        await browser_pool.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```
🔄 rotating-mitmproxy Integration
Multi-browser-crawler integrates with rotating-mitmproxy for advanced proxy rotation and SSL certificate handling.
Quick Setup
- Start the proxy server:

```bash
./restart_proxy_server.sh --verbose quiet --web-port 0
```

- Configure the browser to use the proxy:

```python
config = BrowserConfig(
    proxy_url="http://localhost:3129",  # rotating-mitmproxy default port
    visible=True,
    timeout=60
)
```
Key Benefits
- Automatic SSL certificate handling: No manual certificate installation
- Built-in Google services pass-through: traffic to Google APIs is passed through untouched, eliminating SSL noise
- Zero configuration: Google services are ignored automatically, out of the box
- Proxy rotation: Automatic switching across multiple proxy servers
- Health checking: Automatic proxy validation and failover
Certificate Management
The ~/.mitmproxy/ folder and certificates are automatically generated on first run:
- No manual setup required
- Fresh CA certificate created automatically
- Browser arguments applied automatically when using a proxy (illustrated below)
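These arguments are applied for you; the snippet below is only a hedged illustration of the kind of Chrome flags involved when pointing a browser at a local mitmproxy instance, and the exact set the package passes may differ:

```python
# Illustrative only: typical flags for routing Chrome through a local
# rotating-mitmproxy instance. multi-browser-crawler chooses its own
# arguments internally when proxy_url is configured.
chrome_proxy_args = [
    "--proxy-server=http://localhost:3129",  # route traffic through the proxy
    "--ignore-certificate-errors",           # trust the locally generated mitmproxy CA
]
```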
⚙️ Configuration
Basic Configuration
```python
from multi_browser_crawler import BrowserPoolManager

config = {
    'headless': True,                      # Run in headless mode
    'timeout': 30,                         # Page load timeout in seconds
    'browser_data_dir': "tmp/browsers",    # Browser data directory
    'proxy_url': "http://localhost:3129",  # Optional proxy relay URL
    'min_browsers': 1,                     # Minimum browsers in pool
    'max_browsers': 5,                     # Maximum browsers in pool
    'idle_timeout': 300,                   # Browser idle timeout (seconds)
    'debug_port_start': 9222,              # Debug port range start
    'debug_port_end': 9322,                # Debug port range end
}

browser_pool = BrowserPoolManager(config)
```
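The same settings can also be expressed through BrowserConfig and converted with to_dict(), as the session examples further down do. Treat the mapping below as a sketch: only keyword arguments that appear elsewhere in this README are used, and the full set BrowserConfig accepts isn't documented on this page.

```python
from multi_browser_crawler import BrowserConfig, BrowserPoolManager

# Sketch: build the pool from a BrowserConfig instead of a raw dict.
config = BrowserConfig(
    browser_data_dir="tmp/browsers",
    proxy_url="http://localhost:3129",
    timeout=30,
)
browser_pool = BrowserPoolManager(config.to_dict())
```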
Environment Variables
```bash
export BROWSER_HEADLESS=true
export BROWSER_TIMEOUT=30
export BROWSER_DATA_DIR="/tmp/browsers"
export PROXY_FILE_PATH="/path/to/proxies.txt"
export MIN_BROWSERS=1
export MAX_BROWSERS=5
export DEBUG_PORT_START=9222
export DEBUG_PORT_END=9322
```
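How these variables map onto configuration keys isn't spelled out above. Assuming a straightforward name-for-name mapping, reading them into a config dict might look like the sketch below (the package may already do this internally):

```python
import os

# Sketch: derive a configuration dict from the environment variables above.
config = {
    'headless': os.environ.get('BROWSER_HEADLESS', 'true').lower() == 'true',
    'timeout': int(os.environ.get('BROWSER_TIMEOUT', '30')),
    'browser_data_dir': os.environ.get('BROWSER_DATA_DIR', 'tmp/browsers'),
    'min_browsers': int(os.environ.get('MIN_BROWSERS', '1')),
    'max_browsers': int(os.environ.get('MAX_BROWSERS', '5')),
    'debug_port_start': int(os.environ.get('DEBUG_PORT_START', '9222')),
    'debug_port_end': int(os.environ.get('DEBUG_PORT_END', '9322')),
}
```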
📝 Proxy File Format
Create a proxies.txt file with one proxy per line:
```text
# Basic proxies
127.0.0.1:8080
192.168.1.100:3128
proxy.example.com:8080

# Proxies with authentication
user:pass@192.168.1.1:3128
admin:secret@proxy.example.com:9999

# Complex passwords (supported)
user:complex@pass@host.com:8080
```
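The package's own parser isn't shown here, but the idea behind "single regex parsing" can be sketched as follows; matching greedily means the final '@' separates passwords such as complex@pass from the host. The regex and field names are ours, not necessarily the package's:

```python
import re

# Sketch of parsing one proxies.txt line into its parts.
PROXY_RE = re.compile(r'^(?:(?P<user>[^:]+):(?P<password>.+)@)?(?P<host>[^@:]+):(?P<port>\d+)$')


def parse_proxy(line: str) -> dict:
    match = PROXY_RE.match(line.strip())
    if not match:
        raise ValueError(f"Unrecognized proxy line: {line!r}")
    return match.groupdict()


print(parse_proxy("user:complex@pass@host.com:8080"))
# {'user': 'user', 'password': 'complex@pass', 'host': 'host.com', 'port': '8080'}
```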
📚 Usage Examples
1. Basic Web Scraping
```python
import asyncio

from multi_browser_crawler import BrowserPoolManager, BrowserConfig


async def basic_scraping():
    config = {
        'headless': True,
        'browser_data_dir': "tmp/browser-data"
    }

    browser_pool = BrowserPoolManager(config)
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None
        )
        print(f"Status: {result['status']}")
        print(f"HTML: {result['html'][:100]}...")
    finally:
        await browser_pool.shutdown()


asyncio.run(basic_scraping())
```
2. Using Proxies
```python
async def proxy_scraping():
    config = {
        'headless': True,
        'browser_data_dir': "tmp/browser-data",
        'proxy_url': "http://localhost:3129"  # Use the proxy relay
    }

    browser_pool = BrowserPoolManager(config)
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None
        )
        print(f"IP: {result['html']}")
    finally:
        await browser_pool.shutdown()
```
3. Persistent Sessions
```python
async def persistent_session():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")

    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        # First request - set a cookie
        result1 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies/set/test/value123",
            session_id="my_session"  # Persistent session
        )

        # Second request - check the cookie (same session)
        result2 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies",
            session_id="my_session"  # Same session
        )
        print("Cookie persisted between requests!")
    finally:
        await browser_pool.shutdown()
```
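A quick way to confirm persistence is to look for the cookie value in the second response body; this assumes the 'html' result key used elsewhere in this README:

```python
def cookie_persisted(result: dict, expected: str = "value123") -> bool:
    # httpbin.org/cookies echoes the session's cookies as JSON in the page body,
    # so the value set by the first request should appear in the second fetch.
    return expected in result.get('html', '')
```

Calling cookie_persisted(result2) right after the second fetch should return True when the session was reused.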
4. JavaScript Execution
```python
async def javascript_execution():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")

    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None,
            js_action="document.title = 'Modified by JS';"
        )
        print(f"Modified title: {result.get('title')}")
    finally:
        await browser_pool.shutdown()
```
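js_action accepts an arbitrary JavaScript string that runs in the page before the HTML is captured. A hedged variation that scrolls to the bottom (for example, to trigger lazy-loaded content) uses exactly the same call pattern:

```python
async def scroll_then_fetch(browser_pool):
    # Sketch: a multi-statement js_action, passed the same way as above.
    scroll_script = (
        "window.scrollTo(0, document.body.scrollHeight);"
        "document.title = 'scrolled: ' + document.body.scrollHeight;"
    )
    return await browser_pool.fetch_html(
        url="https://httpbin.org/html",
        session_id=None,
        js_action=scroll_script,
    )
```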
5. Image Downloading
```python
async def download_images():
    config = BrowserConfig(
        browser_data_dir="tmp/browser-data",
        download_images_dir="tmp/images"
    )

    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None,
            download_images=True  # Enable image downloading
        )
        print(f"Downloaded images: {result.get('downloaded_images', [])}")
    finally:
        await browser_pool.shutdown()
```
6. API Discovery
```python
async def api_discovery():
    config = {'browser_data_dir': "tmp/browser-data"}

    browser_pool = BrowserPoolManager(config)
    try:
        result = await browser_pool.fetch_html(
            url="https://spa-app.example.com",
            session_id=None,
            capture_api_calls=True  # Enable API discovery
        )
        print(f"API calls: {result.get('api_calls', [])}")
    finally:
        await browser_pool.shutdown()
```
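The exact shape of each captured call isn't documented on this page. Assuming each entry is a dict that carries at least a URL (the 'url' key below is an assumption), filtering the results might look like:

```python
def api_like_calls(result: dict) -> list:
    # Assumption: each captured call is a dict with a 'url' field.
    calls = result.get('api_calls', [])
    return [call for call in calls
            if isinstance(call, dict) and '/api/' in call.get('url', '')]
```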
🖥️ CLI Usage
```bash
# Fetch a single URL
python -m multi_browser_crawler.browser_cli fetch https://example.com

# Fetch with a proxy
python -m multi_browser_crawler.browser_cli fetch https://example.com --proxy-file proxies.txt

# Test proxies
python -m multi_browser_crawler.browser_cli test-proxies proxies.txt
```
🧪 Testing
Quick Test Run
```bash
# Run all tests with the test runner
python run_tests.py

# Run only quick tests (skip slow real-world tests)
python run_tests.py --quick

# Run specific test categories
python run_tests.py --unit         # Unit tests only
python run_tests.py --integration  # Integration tests only
python run_tests.py --cleanup      # Cleanup and ad blocking tests
python run_tests.py --realworld    # Real-world site tests
```
Manual Testing
```bash
# Test page cleanup and ad blocking features
python tests/test_cleanup_and_adblock.py

# Test real-world sites (SlickDeals, WenXueCity, Creaders)
python tests/test_realworld_sites.py

# Test enhanced features (API discovery, image download)
python tests/test_enhanced_features_manual.py

# Test API pattern matching
python tests/test_api_pattern_matching.py
```
Pytest Testing
```bash
# Run all pytest tests
python -m pytest tests/ -v

# Run specific test files
python -m pytest tests/test_browser.py -v
python -m pytest tests/test_enhanced_features.py -v
python -m pytest tests/test_cleanup_adblock_realworld.py -v

# Run tests with markers
python -m pytest tests/ -v -m "not slow"  # Skip slow tests
```
Usage Examples
```bash
# Run usage examples
python examples/01_basic_usage.py
python examples/02_advanced_features.py
python examples/03_session_management.py
python examples/04_enhanced_features.py
```
📄 License
MIT License - see LICENSE file for details.
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
📞 Support
- GitHub Issues: Report bugs and request features
- Documentation: Coding Principles and Examples
- Examples: Comprehensive usage patterns and principle demonstrations