Smart web scraper that abstracts away complexity - from simple sites to highly protected ones.


IntelliScraper

A powerful async web scraping library with anti-bot detection evasion, built on Playwright. Designed for scraping protected sites (job platforms, social networks, e-commerce dashboards) that require authentication and sophisticated anti-detection measures.



📖 Documentation

For detailed guides, tutorials, and full API reference, please visit our official documentation.


✨ Features

| Feature | Description |
| --- | --- |
| 🔐 Session Management | Capture and reuse authentication sessions (cookies, localStorage, fingerprints) |
| 🖥️ Local Browser Mode | Connect to your running Chrome via CDP; all existing logins are available instantly |
| 🤖 Managed Browser Mode | Launch headless Chromium with fingerprint spoofing and anti-detection |
| ⏱️ Rate Limiting | Token-bucket rate limiter shared across all concurrent pages |
| 📦 Batch Scraping | batch_scrape() for processing hundreds of URLs with concurrency + rate control |
| 🛡️ Anti-Detection | WebDriver flag removal, plugin spoofing, WebGL masking, human-like scrolling |
| 🌐 Proxy Support | Bright Data integration and custom proxy providers |
| 📝 Extensible Parsers | HTML → text, links, Markdown; extend for site-specific parsing |
| ⚡ Fully Async | Built with async/await for maximum concurrency |

🚀 Quick Start

Installation

# Install the package
pip install intelliscraper-core

# Install Playwright browser (Chromium)
playwright install chromium

> [!NOTE]
> Playwright requires browser binaries installed separately. The command above installs Chromium.


⚡ Basic Scraping

import asyncio
from intelliscraper import AsyncScraper, ScrapStatus

async def main():
    async with AsyncScraper() as scraper:
        response = await scraper.scrape("https://example.com")

        if response.status == ScrapStatus.SUCCESS:
            print(f"HTTP {response.http_status_code}")
            print(f"Time: {response.elapsed_time:.2f}s")
            print(response.scrap_html_content[:500])

asyncio.run(main())

📦 Batch Scraping with Rate Limiting

Scrape many URLs with automatic rate limiting and concurrency control:

import asyncio
from intelliscraper import AsyncScraper, ScrapStatus

async def main():
    async with AsyncScraper(
        max_concurrent_pages=4,
        max_requests_per_minute=900,  # 15 requests/sec across all pages
    ) as scraper:
        urls = [f"https://example.com/page/{i}" for i in range(100)]
        results = await scraper.batch_scrape(urls)

        for result in results:
            print(
                f"{result.scrape_request.url} → "
                f"{result.status.value} "
                f"(HTTP {result.http_status_code}, "
                f"{result.elapsed_time:.2f}s)"
            )

asyncio.run(main())

> [!IMPORTANT]
> The rate limit is shared across all concurrent pages. With max_concurrent_pages=4 and max_requests_per_minute=900, the 4 pages share a combined budget of 15 requests/second, not 15/second each.
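The shared budget behaves like a token bucket: tokens refill at a constant rate and each request spends one. The standalone sketch below illustrates the idea only; it is not the library's rate_limiter.py, and the class name is made up for this example.

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second, up to `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=15, capacity=15)  # roughly 900 requests/minute
granted = sum(bucket.try_acquire() for _ in range(100))
print(granted)  # only the initial burst (about the capacity) is granted immediately
```

With rate=15 and capacity=15, a burst drains the bucket at once and subsequent requests are paced at 15/second, matching the 900/minute budget above.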


🖥️ Local Browser Mode (CDP)

Connect to your running Chrome instance to reuse existing logins (LinkedIn, Gmail, etc.).

Setup (one-time)

# 1. Create the debug profile
make chrome-debug-profile

# 2. Open Chrome with the debug profile and log into your target sites
make chrome-debug-login URL=https://www.linkedin.com

# 3. Log in to the site in the browser that opens
# 4. Close Chrome when done

> [!WARNING]
> The debug profile (~/.config/google-chrome-debug) is separate from your default Chrome profile. You must log into target sites in this profile before scraping.

Usage

import asyncio
from intelliscraper import AsyncScraper, ScrapStatus

async def main():
    async with AsyncScraper(
        use_local_browser=True,
        headless=False,
    ) as scraper:
        response = await scraper.scrape(
            "https://www.linkedin.com/jobs/collections/recommended/"
        )

        if response.status == ScrapStatus.SUCCESS:
            print(f"HTTP {response.http_status_code}")
            print(f"Session: {response.session_id}")
            print(f"Mode: {response.browser_mode}")

asyncio.run(main())

How It Works

  1. IntelliScraper checks if Chrome is running with --remote-debugging-port=9222.
  2. If not, it auto-launches Chrome using the debug profile.
  3. Connects via CDP and reuses the existing browser context (all cookies and logins preserved).
  4. Only the pages opened by IntelliScraper are closed on exit; your Chrome session stays running.
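Step 1 can be verified by hand: the DevTools protocol serves /json/version on the debug port (a standard CDP endpoint, independent of IntelliScraper). A minimal stdlib probe:

```python
import json
import urllib.request
from urllib.error import URLError

def cdp_available(port: int = 9222) -> bool:
    """Return True if a browser is listening on the CDP debug port."""
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/json/version", timeout=2
        ) as resp:
            info = json.load(resp)
            print("Browser:", info.get("Browser"))  # e.g. a Chrome version string
            return True
    except (URLError, OSError, ValueError):
        return False

print(cdp_available())
```

If this prints False, start Chrome with --remote-debugging-port=9222 (or let IntelliScraper auto-launch it as described above).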

🔐 Session-Based Scraping (Managed Browser)

For sites that require authentication without using your local Chrome:

1. Capture a Session

intelliscraper-session \
    --url "https://example.com" \
    --site "example" \
    --output "./example_session.json"

This opens a browser; log in, then press Enter. Session data (cookies, localStorage, fingerprint) is saved to JSON.
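The exact schema is defined by the library's Session model; the shape below is purely illustrative (field names are hypothetical, so inspect your own captured file):

```json
{
  "site": "example",
  "cookies": [
    {"name": "sessionid", "value": "...", "domain": ".example.com", "path": "/"}
  ],
  "local_storage": {"theme": "dark"},
  "fingerprint": {"user_agent": "Mozilla/5.0 ..."}
}
```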

2. Use the Session

import asyncio
import json
from intelliscraper import AsyncScraper, Session, ScrapStatus

async def main():
    with open("example_session.json") as f:
        session = Session(**json.load(f))

    async with AsyncScraper(session_data=session) as scraper:
        response = await scraper.scrape("https://example.com/dashboard")

        if response.status == ScrapStatus.SUCCESS:
            print(f"Session: {response.session_id}")
            print(response.scrap_html_content[:500])

asyncio.run(main())

📝 HTML Parsing

Default Parser

from intelliscraper.parsers import HTMLParser

parser = HTMLParser(url="https://example.com", html=html_content)
print(parser.text)               # Plain text
print(parser.links)              # List of absolute URLs
print(parser.navigable_links)    # Classified internal/external links
print(parser.markdown)           # Full Markdown
print(parser.markdown_for_llm)   # Cleaned Markdown (for LLM input)
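navigable_links classifies links relative to the page's host. The underlying idea (a sketch with urllib.parse, not IntelliScraper's implementation) looks like this:

```python
from urllib.parse import urljoin, urlparse

def classify_links(base_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Resolve hrefs against the page URL and split them into internal/external."""
    base_host = urlparse(base_url).netloc
    out: dict[str, list[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative paths become absolute URLs
        kind = "internal" if urlparse(absolute).netloc == base_host else "external"
        out[kind].append(absolute)
    return out

links = classify_links("https://example.com/a", ["/b", "https://other.com/c"])
print(links)
# {'internal': ['https://example.com/b'], 'external': ['https://other.com/c']}
```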

Custom Parsers

Extend HTMLParser for site-specific extraction:

from functools import cached_property
from intelliscraper.parsers import HTMLParser

class MyJobParser(HTMLParser):
    """Custom parser for a job listing site."""

    @cached_property
    def job_title(self) -> str | None:
        tag = self.soup.select_one("h1.job-title")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def company(self) -> str | None:
        tag = self.soup.select_one("span.company-name")
        return tag.get_text(strip=True) if tag else None

🌐 Proxy Support

Proxies are used in managed browser mode only (not with the local browser / CDP).

Bright Data Proxy

import asyncio
from intelliscraper import AsyncScraper, BrightDataProxy, ScrapStatus

async def main():
    proxy = BrightDataProxy(
        host="brd.superproxy.io",
        port=22225,
        username="your-username",
        password="your-password",
    )

    async with AsyncScraper(proxy=proxy) as scraper:
        response = await scraper.scrape("https://example.com")
        print(f"Status: {response.status.value}")

asyncio.run(main())

Custom Proxy Provider

from intelliscraper import ProxyProvider, Proxy

class MyProxy(ProxyProvider):
    def get_proxy(self) -> Proxy:
        return Proxy(
            server="http://my-proxy.com:8080",
            username="user",
            password="pass",
        )

> [!NOTE]
> All pages within a single AsyncScraper instance share the same proxy. For different proxies, create separate AsyncScraper instances.
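Proxy rotation is still on the roadmap, but a custom provider following the pattern above could delegate to a simple round-robin pool. The rotation itself is plain stdlib; the class below is a standalone sketch, not part of the library's API:

```python
from itertools import cycle

class RoundRobinProxyPool:
    """Cycle through a fixed list of proxy server URLs, one per call."""

    def __init__(self, servers: list[str]) -> None:
        self._pool = cycle(servers)

    def next_server(self) -> str:
        return next(self._pool)

pool = RoundRobinProxyPool(["http://proxy-a:8080", "http://proxy-b:8080"])
print([pool.next_server() for _ in range(3)])
# ['http://proxy-a:8080', 'http://proxy-b:8080', 'http://proxy-a:8080']
```

A get_proxy() implementation could call next_server() on each invocation; remember that each AsyncScraper instance still uses a single proxy for all of its pages.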


📊 Response Model

Every scrape() and batch_scrape() call returns a ScrapeResponse with:

| Field | Type | Description |
| --- | --- | --- |
| scrape_request | ScrapeRequest | Original request parameters |
| status | ScrapStatus | Outcome: SUCCESS, PARTIAL_SUCCESS, FAILED, RATE_LIMITED, BLOCKED, TIMEOUT |
| http_status_code | int \| None | Actual HTTP status from the server (200, 403, 429, etc.) |
| elapsed_time | float \| None | Total scrape duration in seconds |
| scrap_html_content | str \| None | Raw HTML from the page |
| error_msg | str \| None | Error message on failure |
| session_id | str \| None | Site identifier of the session used |
| browser_mode | str \| None | "local_browser" or "managed_browser" |
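Since status is an enum with a string value, tallying batch outcomes is a one-liner with collections.Counter. The literals below stand in for `[r.status.value for r in results]` from a real batch; the exact value strings are defined by ScrapStatus:

```python
from collections import Counter

# Stand-ins for [r.status.value for r in results] after batch_scrape()
statuses = ["success", "success", "timeout", "blocked", "success"]

summary = Counter(statuses)
print(summary.most_common())
# [('success', 3), ('timeout', 1), ('blocked', 1)]
```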

๐Ÿ—๏ธ Architecture

intelliscraper/
├── scraper.py              # AsyncScraper main orchestrator
├── rate_limiter.py         # Token-bucket rate limiter
├── enums.py                # ScrapStatus, BrowsingMode, HTMLParserType
├── exception.py            # Custom exceptions
├── utils.py                # URL normalisation utilities
│
├── browser/                # Browser backend strategy pattern
│   ├── backend.py          # BrowserBackend ABC
│   ├── local.py            # LocalBrowserBackend (CDP)
│   └── managed.py          # ManagedBrowserBackend (Playwright)
│
├── parsers/                # Content parsers
│   ├── base_parser.py      # BaseParser ABC
│   └── html_parser.py      # HTMLParser (general purpose)
│
├── common/
│   ├── constants.py        # Browser fingerprints, launch options
│   └── models.py           # Pydantic models (Proxy, Session, etc.)
│
├── proxy/
│   ├── base.py             # ProxyProvider ABC
│   └── brightdata.py       # BrightDataProxy
│
└── scripts/
    └── get_session_data.py # CLI session capture tool

📋 Requirements

  • Python 3.12+
  • Playwright + Chromium
  • Compatible with Linux, macOS, and Windows

🛠️ Development

# Install dependencies
make install

# Install Playwright Chromium
make playwright-chromium

# Run tests
make test

# Format code
make format

Chrome Debug Profile Commands

make chrome-debug-profile                        # Create debug profile
make chrome-debug-login URL=https://linkedin.com  # Log in to a site
make chrome-debug-stop                            # Stop Chrome debug

🗺️ Roadmap

  • ✅ Async scraping with concurrent pages
  • ✅ Local browser mode (CDP)
  • ✅ Session management CLI
  • ✅ Proxy integration (Bright Data)
  • ✅ HTML parsing and Markdown generation
  • ✅ Anti-detection mechanisms
  • ✅ Rate limiting (token bucket)
  • ✅ Batch scraping API
  • ✅ Extensible parser architecture
  • 🔄 Proxy rotation
  • 🔄 Distributed crawler mode
  • 🔄 AI-based content extraction

📄 License

Licensed under the MIT License.


📧 Support

For help, issues, or contributions, visit the GitHub Issues page.

Download files

Download the file for your platform.

Source Distribution

intelliscraper_core-0.2.0.tar.gz (87.8 kB)


Built Distribution


intelliscraper_core-0.2.0-py3-none-any.whl (42.1 kB)


File details

Details for the file intelliscraper_core-0.2.0.tar.gz.

File metadata

  • Download URL: intelliscraper_core-0.2.0.tar.gz
  • Upload date:
  • Size: 87.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.12

File hashes

Hashes for intelliscraper_core-0.2.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3858f831e5ac53741d025581169f3493f8c193dbfaa50c16bab2333239975617 |
| MD5 | a4c99cd38d34ff0a9f0ef47f9e98d7c7 |
| BLAKE2b-256 | a06e8949d839bd9c1db31d4a656f8aa0ca6335fd490d463f46f474060a5fb91c |


File details

Details for the file intelliscraper_core-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for intelliscraper_core-0.2.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | dfd7fe22d0a695d9ec3fb866e603bdcc5300d4f20bc8c79b1ae389631f0b64b7 |
| MD5 | 3de8154effe66797741fb99dc9dc147e |
| BLAKE2b-256 | 3896be64e27ddb37d60d068c85bb01c3e54fea3f2fad1150de53caa825e84a6a |

