Smart web scraper that abstracts away complexity - from simple sites to highly protected ones.
# IntelliScraper
A powerful async web scraping library with anti-bot detection evasion, built on Playwright. Designed for scraping protected sites (job platforms, social networks, e-commerce dashboards) that require authentication and sophisticated anti-detection measures.
## 📚 Documentation
For detailed guides, tutorials, and full API reference, please visit our official documentation.
## ✨ Features

| Feature | Description |
|---|---|
| 🔐 Session Management | Capture and reuse authentication sessions (cookies, localStorage, fingerprints) |
| 🖥️ Local Browser Mode | Connect to your running Chrome via CDP; all existing logins are available instantly |
| 🤖 Managed Browser Mode | Launch headless Chromium with fingerprint spoofing and anti-detection |
| ⏱️ Rate Limiting | Token-bucket rate limiter shared across all concurrent pages |
| 📦 Batch Scraping | `batch_scrape()` for processing hundreds of URLs with concurrency + rate control |
| 🛡️ Anti-Detection | WebDriver flag removal, plugin spoofing, WebGL masking, human-like scrolling |
| 🌐 Proxy Support | Bright Data integration and custom proxy providers |
| 🔌 Extensible Parsers | HTML → text, links, Markdown; extend for site-specific parsing |
| ⚡ Fully Async | Built with async/await for maximum concurrency |
## 🚀 Quick Start

### Installation

```bash
# Install the package
pip install intelliscraper-core

# Install Playwright browser (Chromium)
playwright install chromium
```

> [!NOTE]
> Playwright requires browser binaries installed separately. The command above installs Chromium.
### ⚡ Basic Scraping

```python
import asyncio

from intelliscraper import AsyncScraper, ScrapStatus


async def main():
    async with AsyncScraper() as scraper:
        response = await scraper.scrape("https://example.com")
        if response.status == ScrapStatus.SUCCESS:
            print(f"HTTP {response.http_status_code}")
            print(f"Time: {response.elapsed_time:.2f}s")
            print(response.scrap_html_content[:500])


asyncio.run(main())
```
### 📦 Batch Scraping with Rate Limiting
Scrape many URLs with automatic rate limiting and concurrency control:
```python
import asyncio

from intelliscraper import AsyncScraper


async def main():
    async with AsyncScraper(
        max_concurrent_pages=4,
        max_requests_per_minute=900,  # 15 requests/sec across all pages
    ) as scraper:
        urls = [f"https://example.com/page/{i}" for i in range(100)]
        results = await scraper.batch_scrape(urls)
        for result in results:
            print(
                f"{result.scrape_request.url} → "
                f"{result.status.value} "
                f"(HTTP {result.http_status_code}, "
                f"{result.elapsed_time:.2f}s)"
            )


asyncio.run(main())
```
> [!IMPORTANT]
> The rate limit is shared across all concurrent pages. With `max_concurrent_pages=4` and `max_requests_per_minute=900`, the 4 pages share a combined budget of 15 requests/second, not 15/sec each.
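The shared limiter described above is a token bucket. The library's internal implementation is not shown here, but the idea can be sketched in a few lines of plain Python (`TokenBucket` and `try_acquire` are illustrative names, not part of the IntelliScraper API):

```python
import time


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 900 requests/minute = 15 tokens/sec. A rapid burst drains the bucket,
# after which requests are admitted only at the refill rate.
bucket = TokenBucket(rate=900 / 60, capacity=15)
admitted = sum(bucket.try_acquire() for _ in range(100))
print(admitted)  # roughly the bucket capacity: the burst beyond it is rejected
```

Because every page draws from the same bucket, adding more concurrent pages never raises the aggregate request rate.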
## 🖥️ Local Browser Mode (CDP)
Connect to your running Chrome instance to reuse existing logins (LinkedIn, Gmail, etc.).
### Setup (one-time)

```bash
# 1. Create the debug profile
make chrome-debug-profile

# 2. Open Chrome with the debug profile and log into your target sites
make chrome-debug-login URL=https://www.linkedin.com

# 3. Log in to the site in the browser that opens
# 4. Close Chrome when done
```
> [!WARNING]
> The debug profile (`~/.config/google-chrome-debug`) is separate from your default Chrome profile. You must log into target sites in this profile before scraping.
### Usage

```python
import asyncio

from intelliscraper import AsyncScraper, ScrapStatus


async def main():
    async with AsyncScraper(
        use_local_browser=True,
        headless=False,
    ) as scraper:
        response = await scraper.scrape(
            "https://www.linkedin.com/jobs/collections/recommended/"
        )
        if response.status == ScrapStatus.SUCCESS:
            print(f"HTTP {response.http_status_code}")
            print(f"Session: {response.session_id}")
            print(f"Mode: {response.browser_mode}")


asyncio.run(main())
```
### How It Works

- IntelliScraper checks whether Chrome is running with `--remote-debugging-port=9222`.
- If not, it auto-launches Chrome using the debug profile.
- It connects via CDP and reuses the existing browser context (all cookies and logins preserved).
- Only the pages opened by IntelliScraper are closed on exit; your Chrome session stays running.
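The first step, detecting whether Chrome is already listening on the debug port, can be approximated with a plain socket probe. This is a sketch of the idea, not the library's actual check; the function name is made up for illustration:

```python
import socket


def chrome_debug_running(host: str = "127.0.0.1", port: int = 9222) -> bool:
    """Return True if something is listening on Chrome's remote-debugging port."""
    try:
        # A successful TCP connect means a debug-enabled Chrome (or something
        # else) is bound to the port; connection refused means it is free.
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except OSError:
        return False


print(chrome_debug_running())
```

A real check would additionally query CDP's `http://127.0.0.1:9222/json/version` endpoint to confirm the listener is actually Chrome.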
## 🔐 Session-Based Scraping (Managed Browser)
For sites that require authentication without using your local Chrome:
### 1. Capture a Session

```bash
intelliscraper-session \
  --url "https://example.com" \
  --site "example" \
  --output "./example_session.json"
```
This opens a browser; log in, then press Enter. Session data (cookies, localStorage, fingerprint) is saved to JSON.
### 2. Use the Session
```python
import asyncio
import json

from intelliscraper import AsyncScraper, ScrapStatus, Session


async def main():
    with open("example_session.json") as f:
        session = Session(**json.load(f))

    async with AsyncScraper(session_data=session) as scraper:
        response = await scraper.scrape("https://example.com/dashboard")
        if response.status == ScrapStatus.SUCCESS:
            print(f"Session: {response.session_id}")
            print(response.scrap_html_content[:500])


asyncio.run(main())
```
## 📄 HTML Parsing

### Default Parser

```python
from intelliscraper.parsers import HTMLParser

parser = HTMLParser(url="https://example.com", html=html_content)

print(parser.text)              # Plain text
print(parser.links)             # List of absolute URLs
print(parser.navigable_links)   # Classified internal/external links
print(parser.markdown)          # Full Markdown
print(parser.markdown_for_llm)  # Cleaned Markdown (for LLM input)
```
### Custom Parsers

Extend `HTMLParser` for site-specific extraction:

```python
from functools import cached_property

from intelliscraper.parsers import HTMLParser


class MyJobParser(HTMLParser):
    """Custom parser for a job listing site."""

    @cached_property
    def job_title(self) -> str | None:
        tag = self.soup.select_one("h1.job-title")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def company(self) -> str | None:
        tag = self.soup.select_one("span.company-name")
        return tag.get_text(strip=True) if tag else None
```
## 🌐 Proxy Support

Proxies are used in managed browser mode only (not with local browser / CDP).
### Bright Data Proxy

```python
import asyncio

from intelliscraper import AsyncScraper, BrightDataProxy


async def main():
    proxy = BrightDataProxy(
        host="brd.superproxy.io",
        port=22225,
        username="your-username",
        password="your-password",
    )
    async with AsyncScraper(proxy=proxy) as scraper:
        response = await scraper.scrape("https://example.com")
        print(f"Status: {response.status.value}")


asyncio.run(main())
```
### Custom Proxy Provider

```python
from intelliscraper import Proxy, ProxyProvider


class MyProxy(ProxyProvider):
    def get_proxy(self) -> Proxy:
        return Proxy(
            server="http://my-proxy.com:8080",
            username="user",
            password="pass",
        )
```
> [!NOTE]
> All pages within a single `AsyncScraper` instance share the same proxy. For different proxies, create separate `AsyncScraper` instances.
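Until built-in proxy rotation lands (see Roadmap), one workable pattern is to assign proxies round-robin, one per scraper instance. A minimal sketch with a stand-in `ProxyConfig` dataclass (not the library's `Proxy` model):

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class ProxyConfig:
    """Stand-in for the library's Proxy model (server/username/password)."""
    server: str
    username: str
    password: str


# Round-robin over a pool: each AsyncScraper instance takes the next proxy.
pool = cycle([
    ProxyConfig("http://proxy-a.example:8080", "user", "pass"),
    ProxyConfig("http://proxy-b.example:8080", "user", "pass"),
])

assigned = [next(pool).server for _ in range(3)]
print(assigned)  # a, b, then back to a
```

In practice you would split your URL list into chunks and run one `AsyncScraper(proxy=...)` per chunk, each configured with one entry from the pool.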
## 📊 Response Model

Every `scrape()` and `batch_scrape()` call returns a `ScrapeResponse` with:

| Field | Type | Description |
|---|---|---|
| `scrape_request` | `ScrapeRequest` | Original request parameters |
| `status` | `ScrapStatus` | Outcome: `SUCCESS`, `PARTIAL_SUCCESS`, `FAILED`, `RATE_LIMITED`, `BLOCKED`, `TIMEOUT` |
| `http_status_code` | `int \| None` | Actual HTTP status from the server (200, 403, 429, etc.) |
| `elapsed_time` | `float \| None` | Total scrape duration in seconds |
| `scrap_html_content` | `str \| None` | Raw HTML from the page |
| `error_msg` | `str \| None` | Error message on failure |
| `session_id` | `str \| None` | Session site identifier used |
| `browser_mode` | `str \| None` | `"local_browser"` or `"managed_browser"` |
## 🏗️ Architecture

```
intelliscraper/
├── scraper.py            # AsyncScraper main orchestrator
├── rate_limiter.py       # Token-bucket rate limiter
├── enums.py              # ScrapStatus, BrowsingMode, HTMLParserType
├── exception.py          # Custom exceptions
├── utils.py              # URL normalisation utilities
│
├── browser/              # Browser backend strategy pattern
│   ├── backend.py        # BrowserBackend ABC
│   ├── local.py          # LocalBrowserBackend (CDP)
│   └── managed.py        # ManagedBrowserBackend (Playwright)
│
├── parsers/              # Content parsers
│   ├── base_parser.py    # BaseParser ABC
│   └── html_parser.py    # HTMLParser (general purpose)
│
├── common/
│   ├── constants.py      # Browser fingerprints, launch options
│   └── models.py         # Pydantic models (Proxy, Session, etc.)
│
├── proxy/
│   ├── base.py           # ProxyProvider ABC
│   └── brightdata.py     # BrightDataProxy
│
└── scripts/
    └── get_session_data.py  # CLI session capture tool
```
## 📋 Requirements
- Python 3.12+
- Playwright + Chromium
- Compatible with Linux, macOS, and Windows
## 🛠️ Development

```bash
# Install dependencies
make install

# Install Playwright Chromium
make playwright-chromium

# Run tests
make test

# Format code
make format
```
### Chrome Debug Profile Commands

```bash
make chrome-debug-profile                         # Create debug profile
make chrome-debug-login URL=https://linkedin.com  # Log in to a site
make chrome-debug-stop                            # Stop Chrome debug
```
## 🗺️ Roadmap

- ✅ Async scraping with concurrent pages
- ✅ Local browser mode (CDP)
- ✅ Session management CLI
- ✅ Proxy integration (Bright Data)
- ✅ HTML parsing and Markdown generation
- ✅ Anti-detection mechanisms
- ✅ Rate limiting (token bucket)
- ✅ Batch scraping API
- ✅ Extensible parser architecture
- 🔜 Proxy rotation
- 🔜 Distributed crawler mode
- 🔜 AI-based content extraction
## 📜 License
Licensed under the MIT License.
## 📧 Support

For help, issues, or contributions, visit the GitHub Issues page.
## File details

**intelliscraper_core-0.2.0.tar.gz** (source distribution, 87.8 kB)

- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3858f831e5ac53741d025581169f3493f8c193dbfaa50c16bab2333239975617` |
| MD5 | `a4c99cd38d34ff0a9f0ef47f9e98d7c7` |
| BLAKE2b-256 | `a06e8949d839bd9c1db31d4a656f8aa0ca6335fd490d463f46f474060a5fb91c` |
## File details

**intelliscraper_core-0.2.0-py3-none-any.whl** (Python 3 wheel, 42.1 kB)

- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `dfd7fe22d0a695d9ec3fb866e603bdcc5300d4f20bc8c79b1ae389631f0b64b7` |
| MD5 | `3de8154effe66797741fb99dc9dc147e` |
| BLAKE2b-256 | `3896be64e27ddb37d60d068c85bb01c3e54fea3f2fad1150de53caa825e84a6a` |