
Enterprise-grade browser automation with advanced features


Multi-Browser Crawler

A clean, focused browser automation package for web scraping and content extraction.

🎯 Ultra-Clean Architecture

This package provides 4 essential components for browser automation:

  • BrowserPoolManager: Browser pool management with undetected-chromedriver
  • ProxyManager: Simple proxy management with Chrome-ready format
  • DebugPortManager: Thread-safe debug port allocation
  • BrowserConfig: Clean configuration management
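
A minimal wiring sketch of how these pieces fit together (an assumption beyond the examples below, which only use BrowserPoolManager and BrowserConfig directly):

from multi_browser_crawler import BrowserPoolManager, BrowserConfig

# BrowserConfig holds the settings; BrowserPoolManager drives the
# undetected-chromedriver pool built from them.
config = BrowserConfig(browser_data_dir="tmp/browser-data", timeout=30)
browser_pool = BrowserPoolManager(config.to_dict())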

Key Features

  • Zero Redundancy: Every line serves a purpose
  • Built-in Features: Image download, API discovery, JS execution
  • Direct Usage: No unnecessary wrapper layers
  • Chrome Integration: Undetected-chromedriver for stealth browsing
  • Proxy Support: Single regex parsing, Chrome-ready format
  • Session Management: Persistent and non-persistent browsers
  • rotating-mitmproxy Integration: Advanced proxy rotation with SSL handling
  • Google Services Pass-through: Automatic SSL noise reduction

📦 Installation

pip install multi-browser-crawler

🚀 Quick Start

import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Create configuration dict directly
    config = {
        'headless': True,
        'timeout': 30,
        'browser_data_dir': "tmp/browser-data"
    }

    # Initialize browser pool
    browser_pool = BrowserPoolManager(config)

    try:
        # Fetch a webpage
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None  # Non-persistent browser
        )

        print(f"✅ Success!")
        print(f"   Title: {result.get('title', 'N/A')}")
        print(f"   Load time: {result.get('load_time', 0):.2f}s")
        print(f"   HTML size: {len(result.get('html', ''))} characters")

    finally:
        await browser_pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())

🔄 rotating-mitmproxy Integration

Multi-browser-crawler integrates with rotating-mitmproxy for advanced proxy rotation and SSL certificate handling.

Quick Setup

  1. Start the proxy server:
./restart_proxy_server.sh --verbose quiet --web-port 0
  2. Configure browser to use proxy:
config = BrowserConfig(
    proxy_url="http://localhost:3129",  # rotating-mitmproxy default port
    visible=True,
    timeout=60
)
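
End to end, the proxy-enabled config is handed to the pool the same way as in the session examples below. This is a sketch; any config keys beyond those shown in this README are assumptions:

import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def fetch_via_rotating_proxy():
    # Route the browser through the locally running rotating-mitmproxy relay
    config = BrowserConfig(proxy_url="http://localhost:3129", timeout=60)
    browser_pool = BrowserPoolManager(config.to_dict())
    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None
        )
        print(result.get('html', '')[:200])  # should reflect the proxy's exit IP
    finally:
        await browser_pool.shutdown()

asyncio.run(fetch_via_rotating_proxy())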

Key Benefits

  • Automatic SSL certificate handling: No manual certificate installation
  • Built-in Google services pass-through: traffic to Google APIs bypasses interception, cutting SSL noise
  • Zero configuration: the Google pass-through is active out of the box with nothing to set up
  • Proxy rotation: Automatic switching across multiple proxy servers
  • Health checking: Automatic proxy validation and failover

Certificate Management

The ~/.mitmproxy/ folder and certificates are generated automatically on the first run (a quick existence check is sketched after this list):

  • No manual setup required
  • Fresh CA certificate created automatically
  • Browser arguments applied automatically when using proxy
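
As an optional sanity check after the first proxied run (the certificate file name below is mitmproxy's usual default, assumed here):

from pathlib import Path

# mitmproxy normally writes its CA material to ~/.mitmproxy/ on first start;
# "mitmproxy-ca-cert.pem" is its conventional file name, assumed here.
ca_cert = Path.home() / ".mitmproxy" / "mitmproxy-ca-cert.pem"
print("CA certificate present:", ca_cert.exists())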

📖 Complete Integration Guide

⚙️ Configuration

Basic Configuration

from multi_browser_crawler import BrowserPoolManager

config = {
    'headless': True,                    # Run in headless mode
    'timeout': 30,                       # Page load timeout in seconds
    'browser_data_dir': "tmp/browsers",  # Browser data directory
    'proxy_url': "http://localhost:3129", # Optional proxy relay URL
    'min_browsers': 1,                   # Minimum browsers in pool
    'max_browsers': 5,                   # Maximum browsers in pool
    'idle_timeout': 300,                 # Browser idle timeout (seconds)
    'debug_port_start': 9222,            # Debug port range start
    'debug_port_end': 9322,              # Debug port range end
}

Environment Variables

export BROWSER_HEADLESS=true
export BROWSER_TIMEOUT=30
export BROWSER_DATA_DIR="/tmp/browsers"
export PROXY_FILE_PATH="/path/to/proxies.txt"
export MIN_BROWSERS=1
export MAX_BROWSERS=5
export DEBUG_PORT_START=9222
export DEBUG_PORT_END=9322
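
Whether the package picks these variables up automatically is not covered here; as a sketch, they can be mapped onto the plain config dict by hand:

import os

config = {
    'headless': os.getenv('BROWSER_HEADLESS', 'true').lower() == 'true',
    'timeout': int(os.getenv('BROWSER_TIMEOUT', '30')),
    'browser_data_dir': os.getenv('BROWSER_DATA_DIR', 'tmp/browsers'),
    'min_browsers': int(os.getenv('MIN_BROWSERS', '1')),
    'max_browsers': int(os.getenv('MAX_BROWSERS', '5')),
    'debug_port_start': int(os.getenv('DEBUG_PORT_START', '9222')),
    'debug_port_end': int(os.getenv('DEBUG_PORT_END', '9322')),
    # PROXY_FILE_PATH is omitted: the matching config key is not shown in this README.
}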

📝 Proxy File Format

Create a proxies.txt file with one proxy per line:

# Basic proxies
127.0.0.1:8080
192.168.1.100:3128
proxy.example.com:8080

# Proxies with authentication
user:pass@192.168.1.1:3128
admin:secret@proxy.example.com:9999

# Complex passwords (supported)
user:complex@pass@host.com:8080
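
For illustration only (the single regex that ProxyManager actually uses is not shown here), lines like these can be parsed so that the password may itself contain '@':

import re

# Greedy password group lets '@' appear inside the password, as in the last example above.
PROXY_RE = re.compile(r'^(?:(?P<user>[^:@]+):(?P<password>.+)@)?(?P<host>[^:@]+):(?P<port>\d+)$')

for line in ["127.0.0.1:8080", "admin:secret@proxy.example.com:9999", "user:complex@pass@host.com:8080"]:
    print(PROXY_RE.match(line).groupdict())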

📚 Usage Examples

1. Basic Web Scraping

import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def basic_scraping():
    config = {
        'headless': True,
        'browser_data_dir': "tmp/browser-data"
    }

    browser_pool = BrowserPoolManager(config)

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None
        )

        print(f"Status: {result['status']}")
        print(f"HTML: {result['html'][:100]}...")

    finally:
        await browser_pool.shutdown()

asyncio.run(basic_scraping())

2. Using Proxies

async def proxy_scraping():
    config = {
        'headless': True,
        'browser_data_dir': "tmp/browser-data",
        'proxy_url': "http://localhost:3129"  # Use proxy relay
    }

    browser_pool = BrowserPoolManager(config)

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None
        )

        print(f"IP: {result['html']}")

    finally:
        await browser_pool.shutdown()

3. Persistent Sessions

async def persistent_session():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # First request - set cookie
        result1 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies/set/test/value123",
            session_id="my_session"  # Persistent session
        )

        # Second request - check cookie (same session)
        result2 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies",
            session_id="my_session"  # Same session
        )

        print("Cookie persisted between requests!")

    finally:
        await browser_pool.shutdown()

4. JavaScript Execution

async def javascript_execution():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None,
            js_action="document.title = 'Modified by JS';"
        )

        print(f"Modified title: {result.get('title')}")

    finally:
        await browser_pool.shutdown()

5. Image Downloading

async def download_images():
    config = BrowserConfig(
        browser_data_dir="tmp/browser-data",
        download_images_dir="tmp/images"
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None,
            download_images=True  # Enable image downloading
        )

        print(f"Downloaded images: {result.get('downloaded_images', [])}")

    finally:
        await browser_pool.shutdown()

6. API Discovery

async def api_discovery():
    config = {'browser_data_dir': "tmp/browser-data"}
    browser_pool = BrowserPoolManager(config)

    try:
        result = await browser_pool.fetch_html(
            url="https://spa-app.example.com",
            session_id=None,
            capture_api_calls=True  # Enable API discovery
        )

        print(f"API calls: {result.get('api_calls', [])}")

    finally:
        await browser_pool.shutdown()

🖥️ CLI Usage

# Fetch a single URL
python -m multi_browser_crawler.browser_cli fetch https://example.com

# Fetch with proxy
python -m multi_browser_crawler.browser_cli fetch https://example.com --proxy-file proxies.txt

# Test proxies
python -m multi_browser_crawler.browser_cli test-proxies proxies.txt

🧪 Testing

Quick Test Run

# Run all tests with the test runner
python run_tests.py

# Run only quick tests (skip slow real-world tests)
python run_tests.py --quick

# Run specific test categories
python run_tests.py --unit          # Unit tests only
python run_tests.py --integration   # Integration tests only
python run_tests.py --cleanup       # Cleanup and ad blocking tests
python run_tests.py --realworld     # Real-world site tests

Manual Testing

# Test page cleanup and ad blocking features
python tests/test_cleanup_and_adblock.py

# Test real-world sites (SlickDeals, WenXueCity, Creaders)
python tests/test_realworld_sites.py

# Test enhanced features (API discovery, image download)
python tests/test_enhanced_features_manual.py

# Test API pattern matching
python tests/test_api_pattern_matching.py

Pytest Testing

# Run all pytest tests
python -m pytest tests/ -v

# Run specific test files
python -m pytest tests/test_browser.py -v
python -m pytest tests/test_enhanced_features.py -v
python -m pytest tests/test_cleanup_adblock_realworld.py -v

# Run tests with markers
python -m pytest tests/ -v -m "not slow"  # Skip slow tests

Usage Examples

# Run usage examples
python examples/01_basic_usage.py
python examples/02_advanced_features.py
python examples/03_session_management.py
python examples/04_enhanced_features.py

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📞 Support

  • GitHub Issues: Report bugs and request features
  • Documentation: Coding Principles and Examples
  • Examples: Comprehensive usage patterns and principle demonstrations

