
Multi-Browser Crawler

A clean, focused browser automation package for web scraping and content extraction.

🎯 Ultra-Clean Architecture

This package provides 4 essential components for browser automation:

  • BrowserPoolManager: Browser pool management with undetected-chromedriver
  • ProxyManager: Simple proxy management with Chrome-ready format
  • DebugPortManager: Thread-safe debug port allocation (sketched just after this list)
  • BrowserConfig: Clean configuration management
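
Of these, DebugPortManager is the easiest to picture in isolation: the sketch below shows one way thread-safe port allocation can work. Names and behavior here are illustrative assumptions, not the package's actual internals.

import threading

class PortAllocator:
    """Illustrative thread-safe port allocator, in the spirit of
    DebugPortManager -- not the package's real implementation."""

    def __init__(self, start=9222, end=9322):
        self._lock = threading.Lock()
        self._free = set(range(start, end + 1))

    def acquire(self) -> int:
        # Hold the lock so two browsers can never claim the same port
        with self._lock:
            if not self._free:
                raise RuntimeError("No free debug ports in range")
            return self._free.pop()

    def release(self, port: int) -> None:
        # Return the port to the pool once its browser shuts down
        with self._lock:
            self._free.add(port)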

Key Features

  • Zero Redundancy: Every line serves a purpose
  • Built-in Features: Image download, API discovery, JS execution
  • Direct Usage: No unnecessary wrapper layers
  • Chrome Integration: Undetected-chromedriver for stealth browsing
  • Proxy Support: Single regex parsing, Chrome-ready format
  • Session Management: Persistent and non-persistent browsers

📦 Installation

pip install multi-browser-crawler

🚀 Quick Start

import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Create configuration
    config = BrowserConfig(
        headless=True,
        timeout=30,
        browser_data_dir="tmp/browser-data"
    )

    # Initialize browser pool
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # Fetch a webpage
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None  # Non-persistent browser
        )

        print(f"✅ Success!")
        print(f"   Title: {result.get('title', 'N/A')}")
        print(f"   Load time: {result.get('load_time', 0):.2f}s")
        print(f"   HTML size: {len(result.get('html', ''))} characters")

    finally:
        await browser_pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())

⚙️ Configuration

Basic Configuration

from multi_browser_crawler import BrowserConfig

config = BrowserConfig(
    headless=True,                    # Run in headless mode
    timeout=30,                       # Page load timeout in seconds
    browser_data_dir="tmp/browsers",  # Browser data directory
    proxy_file_path="proxies.txt",    # Optional proxy file
    min_browsers=1,                   # Minimum browsers in pool
    max_browsers=5,                   # Maximum browsers in pool
    idle_timeout=300,                 # Browser idle timeout (seconds)
    debug_port_start=9222,            # Debug port range start
    debug_port_end=9322,              # Debug port range end
)

Environment Variables

export BROWSER_HEADLESS=true
export BROWSER_TIMEOUT=30
export BROWSER_DATA_DIR="/tmp/browsers"
export PROXY_FILE_PATH="/path/to/proxies.txt"
export MIN_BROWSERS=1
export MAX_BROWSERS=5
export DEBUG_PORT_START=9222
export DEBUG_PORT_END=9322
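
Whether BrowserConfig reads these variables on its own is not shown here, so treat the loader below as a fallback sketch that maps each variable onto the constructor arguments from the previous section.

import os
from multi_browser_crawler import BrowserConfig

def config_from_env() -> BrowserConfig:
    """Fallback sketch: build a BrowserConfig from the variables above.
    The package may already read these itself -- verify before relying on this."""
    return BrowserConfig(
        headless=os.getenv("BROWSER_HEADLESS", "true").lower() == "true",
        timeout=int(os.getenv("BROWSER_TIMEOUT", "30")),
        browser_data_dir=os.getenv("BROWSER_DATA_DIR", "tmp/browsers"),
        proxy_file_path=os.getenv("PROXY_FILE_PATH"),  # None if unset
        min_browsers=int(os.getenv("MIN_BROWSERS", "1")),
        max_browsers=int(os.getenv("MAX_BROWSERS", "5")),
        debug_port_start=int(os.getenv("DEBUG_PORT_START", "9222")),
        debug_port_end=int(os.getenv("DEBUG_PORT_END", "9322")),
    )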

📝 Proxy File Format

Create a proxies.txt file with one proxy per line:

# Basic proxies
127.0.0.1:8080
192.168.1.100:3128
proxy.example.com:8080

# Proxies with authentication
user:pass@192.168.1.1:3128
admin:secret@proxy.example.com:9999

# Complex passwords (supported)
user:complex@pass@host.com:8080
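
The "single regex parsing" mentioned under Key Features is straightforward to demonstrate: matching the password greedily up to the last @ is what lets embedded @ characters survive. The parser below is an illustrative reimplementation, not necessarily ProxyManager's own code.

import re

# One pattern covers all three forms above; the greedy password
# match backtracks to the final '@', so '@' inside passwords works.
# Illustrative only -- ProxyManager's actual parsing may differ.
PROXY_RE = re.compile(
    r"^(?:(?P<user>[^:]+):(?P<password>.+)@)?"
    r"(?P<host>[^@:\s]+):(?P<port>\d+)$"
)

def parse_proxy(line: str):
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and comments
    match = PROXY_RE.match(line)
    if not match:
        raise ValueError(f"Unparseable proxy line: {line!r}")
    return match.groupdict()

print(parse_proxy("user:complex@pass@host.com:8080"))
# {'user': 'user', 'password': 'complex@pass', 'host': 'host.com', 'port': '8080'}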

📚 Usage Examples

1. Basic Web Scraping

import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def basic_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data"
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None
        )

        print(f"Status: {result['status']}")
        print(f"HTML: {result['html'][:100]}...")

    finally:
        await browser_pool.shutdown()

asyncio.run(basic_scraping())

2. Using Proxies

async def proxy_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data",
        proxy_file_path="proxies.txt"  # Use proxy file
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None,
            use_proxy=True  # Enable proxy usage
        )

        print(f"IP: {result['html']}")

    finally:
        await browser_pool.shutdown()

3. Persistent Sessions

async def persistent_session():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # First request - set cookie
        result1 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies/set/test/value123",
            session_id="my_session"  # Persistent session
        )

        # Second request - check cookie (same session)
        result2 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies",
            session_id="my_session"  # Same session
        )

        # Verify the cookie actually survived rather than assuming it did
        if "value123" in result2.get("html", ""):
            print("Cookie persisted between requests!")

    finally:
        await browser_pool.shutdown()

4. JavaScript Execution

async def javascript_execution():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None,
            js_action="document.title = 'Modified by JS';"
        )

        print(f"Modified title: {result.get('title')}")

    finally:
        await browser_pool.shutdown()

5. Image Downloading

async def download_images():
    config = BrowserConfig(
        browser_data_dir="tmp/browser-data",
        download_images_dir="tmp/images"
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None,
            download_images=True  # Enable image downloading
        )

        print(f"Downloaded images: {result.get('downloaded_images', [])}")

    finally:
        await browser_pool.shutdown()

6. API Discovery

async def api_discovery():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://spa-app.example.com",
            session_id=None,
            capture_api_calls=True  # Enable API discovery
        )

        print(f"API calls: {result.get('api_calls', [])}")

    finally:
        await browser_pool.shutdown()
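
The structure of each captured call is not documented above, so treat the helper below as a sketch: it assumes every entry is a dict with at least a "url" key, which you should verify against the real return value of fetch_html.

def json_api_endpoints(api_calls):
    """Reduce captured calls to a sorted set of likely API endpoints.
    Assumes dict entries with a 'url' key -- an assumption, not a
    documented contract of capture_api_calls."""
    return sorted({
        call["url"]
        for call in api_calls
        if isinstance(call, dict) and "/api/" in call.get("url", "")
    })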

🖥️ CLI Usage

# Fetch a single URL
python -m multi_browser_crawler.browser_cli fetch https://example.com

# Fetch with proxy
python -m multi_browser_crawler.browser_cli fetch https://example.com --proxy-file proxies.txt

# Test proxies
python -m multi_browser_crawler.browser_cli test-proxies proxies.txt

🧪 Testing

Run the test suite:

# Run all tests
python tests/test_browser.py
python tests/test_proxy_manager.py
python tests/test_debug_port_manager.py

# Run usage examples
python examples/01_basic_usage.py
python examples/02_advanced_features.py

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📞 Support

  • GitHub Issues: Report bugs and request features
  • Documentation: Coding Principles and Examples
  • Examples: Comprehensive usage patterns and principle demonstrations
