A stealth scraping library with advanced proxy rotation and automatic response parsing.

Project description

Scrawlee

A stealth scraping library built on top of curl_cffi, with advanced proxy rotation, automatic parsing for JSON and HTML responses, and built-in rate-limit retries. Fully supports both synchronous and highly concurrent asynchronous scraping.

Key Features

  • Stealth: Rotates through real-world TLS/JA3 fingerprints (Chrome, Edge, Safari).
  • Asynchronous Engine: Ships AsyncScrawleeClient for highly concurrent scraping with asyncio.
  • Auto-Parsing Responses: The .auto property returns a parsed Python dictionary for JSON responses or a high-speed selectolax tree for HTML.
  • Dual-Parser Support: Run fast CSS queries via .html (selectolax) or robust XPath queries via .lxml (lxml).
  • Cookie Persistence: Save and load authenticated sessions to disk so you never have to log in or solve a Cloudflare challenge twice.
  • Smart Retries: Built-in exponential backoff for common HTTP error codes (429, 5xx); a sketch of the pattern follows this list.
  • Advanced Proxy Management: Supports Random, Round-Robin, and Sticky session rotation with out-of-band automated health checks.
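
The retries happen inside the client automatically. As a rough illustration of the backoff pattern, here is a minimal hand-rolled equivalent; the retry count and delay schedule are illustrative assumptions, not Scrawlee's documented defaults:

import time

from scrawlee import ScrawleeClient

def get_with_backoff(client, url, retries=4, base_delay=1.0):
    # Retry on 429 and 5xx with exponentially growing delays
    # (1s, 2s, 4s, 8s for the defaults above).
    for attempt in range(retries):
        res = client.get(url)
        if res.status_code != 429 and res.status_code < 500:
            return res
        time.sleep(base_delay * (2 ** attempt))
    return res

with ScrawleeClient() as client:
    res = get_with_backoff(client, "https://httpbin.org/status/200")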

Installation

pip install scrawlee

(Requires Python 3.8+)

Usage Guide

1. Basic Synchronous Scraping

from scrawlee import ScrawleeClient

with ScrawleeClient(impersonate="chrome120") as client:
    res = client.get("https://httpbin.org/get")
    
    # .auto returns a parsed dictionary for JSON API responses
    print(res.auto['headers']['User-Agent'])
    
    res_html = client.get("https://httpbin.org/html")
    
    # Lightning fast CSS queries via selectolax
    print(res_html.html.css_first("h1").text(strip=True))
    
    # Powerful XPath queries via lxml
    print(res_html.lxml.xpath("//h1/text()")[0])
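
The .auto property also covers the HTML case: per the feature list it returns the selectolax tree for HTML responses, so (assuming it exposes the same selectolax API as .html) the query above can be written as:

from scrawlee import ScrawleeClient

with ScrawleeClient() as client:
    res = client.get("https://httpbin.org/html")
    # For an HTML content type, .auto returns the selectolax tree,
    # making this equivalent to res.html.css_first(...).
    print(res.auto.css_first("h1").text(strip=True))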

2. Deep Dive: Extracting Data from HTML

Scrawlee removes the need for external parsing libraries such as BeautifulSoup: it ships with two fast, C-based parsing engines.

Extracting with CSS Selectors (via .html)

The .html property exposes the selectolax engine. It is the fastest way to parse data using standard CSS selectors.

with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    
    # 1. Extract text from a single element
    title = res.html.css_first("h1.product-title").text(strip=True)
    
    # 2. Extract HTML attributes (e.g. data-id, href, src)
    product_id = res.html.css_first("div.product").attributes.get("data-product-id")
    
    # 3. Loop through lists of elements
    for feature_li in res.html.css("ul.features li"):
        print("Feature:", feature_li.text(strip=True))

Extracting with XPath Queries (via .lxml)

If you need complex DOM traversal (e.g., finding a parent element based on its child's value), CSS selectors fall short. The .lxml property provides industry-standard XPath extraction; a parent-by-child sketch follows the example below.

with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    
    # Fetch an element exactly using an XPath query
    price = res.lxml.xpath('//div[@class="product-card" and @data-status="in-stock"]//span[@class="price"]/text()')[0]
    print(f"Price is: {price}")

3. High-Speed Asynchronous Scraping

If you need to scrape 1,000 pages concurrently, use AsyncScrawleeClient.

import asyncio
from scrawlee import AsyncScrawleeClient

async def run():
    async with AsyncScrawleeClient() as client:
        # Fire concurrent requests
        res1, res2 = await asyncio.gather(
            client.get("https://httpbin.org/get"),
            client.get("https://httpbin.org/html")
        )
        print("Async HTTPBin Status:", res1.status_code)

asyncio.run(run())
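
Launching all 1,000 requests at once will hammer the target, so it is common to cap concurrency with an asyncio.Semaphore. A minimal sketch (the limit of 20 is an arbitrary choice):

import asyncio

from scrawlee import AsyncScrawleeClient

async def scrape_all(urls, limit=20):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async with AsyncScrawleeClient() as client:
        async def fetch(url):
            async with sem:
                return await client.get(url)

        return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://httpbin.org/get?page={i}" for i in range(1000)]
results = asyncio.run(scrape_all(urls))
print(len(results), "responses fetched")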

4. Persistent Sessions (Save/Load Cookies)

If you bypass a DataDome/Cloudflare wall or log into a website, save your cookies to disk so you can instantly resume the session tomorrow!

from scrawlee import ScrawleeClient

# Script 1: Save the session
with ScrawleeClient() as client:
    # ... Login logic or bypass challenge ...
    client.save_cookies("twitter_session.json")

# Script 2: Load the session instantly
with ScrawleeClient() as client:
    client.load_cookies("twitter_session.json")
    res = client.get("https://api.twitter.com/protected_route")

5. Advanced Proxy Management

Automatically rotates proxies and quarantines failing ones.

from scrawlee import ScrawleeClient, ProxyManager

pm = ProxyManager(rotation_strategy="round_robin")
# Accepts raw proxy data
pm.add_proxy(ip="12.34.56.78", port="8080", username="user", password="pwd")

with ScrawleeClient(proxy_manager=pm) as client:
    res = client.get("https://api.myip.com")
    print("Masked IP:", res.auto['ip'])

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrawlee-0.1.0.tar.gz (8.0 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scrawlee-0.1.0-py3-none-any.whl (9.6 kB)


File details

Details for the file scrawlee-0.1.0.tar.gz.

File metadata

  • Download URL: scrawlee-0.1.0.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for scrawlee-0.1.0.tar.gz

  • SHA256: 6f75173cc7f3899502f16d88a960930e8d03b58db42fc62fe9f80986249635bd
  • MD5: 47060608bc989351786fddc28d11d129
  • BLAKE2b-256: ff05bba1b3f93df253f78dd360357ccf5d94a3c1e31354e9fda7d0c1f0cbe086

See more details on using hashes here.
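
To verify a downloaded archive against the SHA256 digest above, the standard library is enough. A minimal sketch (the local file path assumes you downloaded the sdist to the working directory):

import hashlib

EXPECTED = "6f75173cc7f3899502f16d88a960930e8d03b58db42fc62fe9f80986249635bd"

# Hash the archive in chunks and compare to the published digest.
digest = hashlib.sha256()
with open("scrawlee-0.1.0.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        digest.update(chunk)

print("OK" if digest.hexdigest() == EXPECTED else "MISMATCH")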

File details

Details for the file scrawlee-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scrawlee-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for scrawlee-0.1.0-py3-none-any.whl

  • SHA256: 0133a6b875e6d87ba59b4d544d20e69deb48776248a14553e87044ca0e05ded0
  • MD5: 702545a7b3e9d22764ac8f847012b87a
  • BLAKE2b-256: 7229a54ee446056c558af63c7da28d3ef16dff279cba8e360f08fc3dcb7fcafd

See more details on using hashes here.
