Scrawlee
The ultimate stealth scraping library built on top of `curl_cffi`, with advanced proxy rotation, auto-parsing for JSON and HTML, and built-in retries for rate-limited requests. Fully supports both synchronous and highly concurrent asynchronous scraping.
Key Features
- Ultimate Stealth: Rotates through real-world TLS/JA3 fingerprints (Chrome, Edge, Safari).
- Asynchronous Engine: Comes with `AsyncScrawleeClient` for blazing-fast, highly concurrent scraping using `asyncio`.
- Auto-Parsing Response: The `.auto` property automatically returns a parsed Python dictionary or a high-speed `selectolax` object, depending on whether the response is JSON or HTML.
- Dual-Parser Support: Run lightning-fast CSS queries via `.html` (Selectolax) or robust XPath queries via `.lxml` (lxml).
- Cookie Persistence: Instantly save and load authenticated sessions to disk so you never have to log in or solve Cloudflare challenges twice.
- Smart Retries: Built-in exponential backoff for common HTTP error codes (429, 50x); see the sketch after this list.
- Advanced Proxy Management: Supports Random, Round-Robin, and Sticky-session rotation with out-of-band automated health checks.
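To make the retry behavior concrete, here is a minimal sketch of the exponential-backoff pattern, using only the `client.get` call shown in the usage guide below. The retryable status set, delay base, and attempt count are illustrative assumptions, not Scrawlee's actual defaults.

```python
import time

# Status codes treated as retryable in this sketch (an assumption, not Scrawlee's list)
RETRYABLE = {429, 500, 502, 503, 504}

def get_with_backoff(client, url, attempts=4, base_delay=1.0):
    # Wait 1s, 2s, 4s, ... between failed attempts, doubling each time
    for attempt in range(attempts):
        res = client.get(url)
        if res.status_code not in RETRYABLE:
            return res
        time.sleep(base_delay * (2 ** attempt))
    return res  # give up and return the last response
```

Scrawlee applies this backoff automatically; the sketch only shows the shape of the delays.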
Installation
```
pip install scrawlee
```
(Requires Python 3.8+)
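A quick sanity check that the install worked is to import the client used throughout this guide:

```python
# If this import succeeds, Scrawlee is installed and importable
from scrawlee import ScrawleeClient
```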
Usage Guide
1. Basic Synchronous Scraping
```python
from scrawlee import ScrawleeClient

with ScrawleeClient(impersonate="chrome120") as client:
    res = client.get("https://httpbin.org/get")
    # .auto magically returns a dictionary for JSON API responses!
    print(res.auto['headers']['User-Agent'])

    res_html = client.get("https://httpbin.org/html")
    # Lightning fast CSS queries via selectolax
    print(res_html.html.css_first("h1").text(strip=True))
    # Powerful XPath queries via lxml
    print(res_html.lxml.xpath("//h1/text()")[0])
```
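For intuition, `.auto` only needs to inspect the response to decide which parser to hand back. The sketch below shows one plausible way such a property could dispatch on the Content-Type header; it illustrates the idea and is not Scrawlee's actual implementation.

```python
import json
from selectolax.parser import HTMLParser

class AutoResponseSketch:
    """Illustrative only: how an .auto-style property might pick a parser."""

    def __init__(self, content_type: str, text: str):
        self.content_type = content_type
        self.text = text

    @property
    def auto(self):
        # JSON bodies parse to a plain dict; everything else is treated as HTML
        if "application/json" in self.content_type:
            return json.loads(self.text)
        return HTMLParser(self.text)
```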
2. Deep Dive: Extracting Data from HTML
Scrawlee eliminates the need for external parsing libraries like BeautifulSoup: it ships with two blazing-fast, C-based parsing engines.
Extracting with CSS Selectors (via .html)
The `.html` property exposes the `selectolax` engine. It is the fastest way to parse data using standard CSS selectors.
```python
with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")

    # 1. Extract text from a single element
    title = res.html.css_first("h1.product-title").text(strip=True)

    # 2. Extract HTML attributes (e.g. data-id, href, src)
    product_id = res.html.css_first("div.product").attributes.get("data-product-id")

    # 3. Loop through lists of elements
    for feature_li in res.html.css("ul.features li"):
        print("Feature:", feature_li.text(strip=True))
```
Extracting with XPath Queries (via .lxml)
If you need complex DOM traversal (e.g., finding a parent element based on its child's value), CSS selectors fall short. The `.lxml` property provides industry-standard XPath extraction.
```python
with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")

    # Fetch an element exactly using an XPath query
    price = res.lxml.xpath('//div[@class="product-card" and @data-status="in-stock"]//span[@class="price"]/text()')[0]
    print(f"Price is: {price}")
```
3. High-Speed Asynchronous Scraping
If you need to scrape 1,000 pages concurrently, use `AsyncScrawleeClient`.
```python
import asyncio
from scrawlee import AsyncScrawleeClient

async def run():
    async with AsyncScrawleeClient() as client:
        # Fire concurrent requests
        res1, res2 = await asyncio.gather(
            client.get("https://httpbin.org/get"),
            client.get("https://httpbin.org/html"),
        )
        print("Async HTTPBin Status:", res1.status_code)

asyncio.run(run())
```
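At a thousand pages you will usually want to cap how many requests are in flight at once. Here is a minimal sketch using `asyncio.Semaphore`; the URL list and the limit of 20 are illustrative:

```python
import asyncio
from scrawlee import AsyncScrawleeClient

async def scrape_all(urls, limit=20):
    # Allow at most `limit` requests in flight at any moment
    sem = asyncio.Semaphore(limit)

    async def fetch(client, url):
        async with sem:
            res = await client.get(url)
            return res.status_code

    async with AsyncScrawleeClient() as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))

statuses = asyncio.run(scrape_all([f"https://httpbin.org/get?i={i}" for i in range(100)]))
```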
4. Persistent Sessions (Save/Load Cookies)
If you bypass a DataDome/Cloudflare wall or log into a website, save your cookies to disk so you can instantly resume the session tomorrow!
```python
from scrawlee import ScrawleeClient

# Script 1: Save the session
with ScrawleeClient() as client:
    # ... Login logic or bypass challenge ...
    client.save_cookies("twitter_session.json")

# Script 2: Load the session instantly
with ScrawleeClient() as client:
    client.load_cookies("twitter_session.json")
    res = client.get("https://api.twitter.com/protected_route")
```
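A convenient pattern on top of this (plain Python, not a dedicated Scrawlee API) is to log in only when no saved session exists yet:

```python
from pathlib import Path
from scrawlee import ScrawleeClient

COOKIE_FILE = "twitter_session.json"

with ScrawleeClient() as client:
    if Path(COOKIE_FILE).exists():
        # Reuse the saved session from a previous run
        client.load_cookies(COOKIE_FILE)
    else:
        # ... Login logic or bypass challenge ...
        client.save_cookies(COOKIE_FILE)
    res = client.get("https://api.twitter.com/protected_route")
```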
5. Advanced Proxy Management
Automatically rotates proxies and quarantines failing ones.
```python
from scrawlee import ScrawleeClient, ProxyManager

pm = ProxyManager(rotation_strategy="round_robin")
# Accepts raw proxy data
pm.add_proxy(ip="12.34.56.78", port="8080", username="user", password="pwd")

with ScrawleeClient(proxy_manager=pm) as client:
    res = client.get("https://api.myip.com")
    print("Masked IP:", res.auto['ip'])
```