# SwiftCrawl

A powerful, flexible web scraping abstraction layer that seamlessly handles both lightweight HTTP requests and full browser automation with anti-detection capabilities.
## Highlights

- **Dual-mode sessions** – SwiftCrawl seamlessly switches between BrowserForge-powered HTTP requests and Camoufox browser automation.
- **Async-first architecture** – every client, crawler component, and CLI workflow is asyncio-friendly for massive concurrency.
- **Crawler engine** – Scrapy-inspired scheduler, downloader, and CLI (`swiftcrawl crawl`) with retries, priorities, and Playwright warmup support.
- **Items & Fields** – define strongly-typed `Item` objects with `Field(serializer=...)` hooks for clean output serialization.
- **Project bootstrapper** – `swiftcrawl init <project>` scaffolds spiders, settings, and sample items in seconds.
- **Unified response parsing** – `Response.json()` / `.soup()` / `.tree()` keep parsing ergonomic across HTTP and browser modes.
## Installation

```bash
# Initialize with UV (recommended)
uv init --name myproject
cd myproject

# Add SwiftCrawl
uv add swiftcrawl

# Install Camoufox browser
camoufox fetch
```
## Quick Start

### HTTP Mode (Fast & Stealthy)

```python
import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='http') as session:
        response = await session.get('https://api.example.com/data')
        data = response.json()
        print(data)

asyncio.run(main())
```
### Browser Mode (Full JS Support)

```python
import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='browser', headless=True) as session:
        response = await session.get('https://spa-website.com')

        # Parse with BeautifulSoup
        soup = response.soup()
        title = soup.find('title').string

        # Or use XPath
        tree = response.tree()
        links = tree.xpath('//a/@href')

        print(f"Title: {title}")
        print(f"Links: {links}")

asyncio.run(main())
```
## Usage Examples

### HTTP GET with Custom Headers

```python
async with SwiftCrawl(method='http') as session:
    response = await session.get(
        'https://api.example.com',
        headers={'Authorization': 'Bearer token123'}
    )
    print(response.json())
```
### HTTP POST

```python
async with SwiftCrawl(method='http') as session:
    response = await session.post(
        'https://api.example.com/submit',
        json={'key': 'value'}
    )
    print(response.status_code)
```
### Browser GET with Initial URL (Cookie Gathering)

```python
# Visit initial_url first to gather session cookies
async with SwiftCrawl(
    method='browser',
    initial_url='https://example.com/login',
    headless=True
) as session:
    # Subsequent requests will have cookies from initial_url
    response = await session.get('https://example.com/protected')
    print(response.text)
```
### Browser POST via fetch()

```python
# Uses page.evaluate() with fetch() for fast POST requests
async with SwiftCrawl(method='browser', headless=True) as session:
    response = await session.post(
        'https://api.example.com/endpoint',
        data={'username': 'test', 'password': 'secret'}
    )
    print(response.json())
```
### With Proxy

```python
# HTTP mode
async with SwiftCrawl(
    method='http',
    proxy='http://proxy.example.com:8080'
) as session:
    response = await session.get('https://example.com')

# Browser mode
async with SwiftCrawl(
    method='browser',
    proxy={'server': 'http://proxy.example.com:8080',
           'username': 'user',
           'password': 'pass'},
    geoip=True  # Auto-detect location from proxy
) as session:
    response = await session.get('https://example.com')
```
### Browser Warmup Function

The `warmup` parameter allows you to run a function after the browser initializes but before your main requests. This is perfect for login flows, gathering tokens, or setting up sessions.
```python
async def my_warmup(page):
    """
    Warmup function receives the Playwright page object.
    Use it to login, set cookies, gather tokens, etc.
    """
    await page.goto('https://example.com/login')

    # Set authentication cookies
    await page.evaluate('''() => {
        document.cookie = "auth_token=xyz123; path=/";
        document.cookie = "session_id=abc789; path=/";
    }''')

    print("Logged in and ready!")

# Use warmup with browser mode
async with SwiftCrawl(
    method='browser',
    warmup=my_warmup  # Executes before main requests
) as session:
    # Warmup already executed - we have auth cookies now
    response = await session.get('https://example.com/protected')
    print(response.text)
```
Key Benefits:
- Automatic login before scraping
- Gather CSRF tokens or API keys
- Set cookies and session data
- Execute complex multi-step setup
- Access full Playwright page object
## Scrapy-like Crawler & CLI

SwiftCrawl now ships with a Scrapy-inspired crawler stack and command-line interface.

### Defining a Spider
```python
from urllib.parse import urljoin

from swiftcrawl import Request, Spider


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]
    method = "http"  # default, but you can override per domain/URL

    async def parse(self, response):
        soup = response.soup()
        for quote in soup.select(".quote"):
            yield {
                "text": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }

        next_link = soup.select_one(".next a")
        if next_link:
            yield Request(
                url=urljoin(response.url, next_link["href"]),
                callback=self.parse,
            )
```
### Running from Python

```python
from swiftcrawl import run_spider

items = run_spider(QuotesSpider)
print(items)
```
### Running from the CLI

Create `spiders/quotes_spider.py` containing your spider, then run:

```bash
# Print stats only
swiftcrawl crawl quotes

# Persist results
swiftcrawl crawl quotes -o output.jsonl

# Enable verbose logging / stack traces
swiftcrawl crawl quotes -o output.jsonl -v
```
The CLI automatically loads `settings.py` (if present), discovers spiders from the `spiders/` package, prints crawl statistics, and, when `-o`/`--output` is provided, writes scraped items to the specified `.json` or `.jsonl` file. `.json` outputs are standard JSON arrays (each item on a single line for easy diffs), while `.jsonl` outputs remain newline-delimited for streaming. Use `-v`/`--verbose` to see detailed request processing, item writes, and full error traces.
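Since `.jsonl` output stores one JSON-encoded item per line, it is easy to read back with only the standard library. A minimal sketch (the file path is a placeholder for whatever you passed to `-o`):

```python
import json

def load_items(path):
    """Read newline-delimited JSON items, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each returned element is a plain dict, ready for post-processing or loading into a dataframe.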
### Bootstrapping a Project

Need a fresh workspace? Use the built-in initializer:

```bash
swiftcrawl init my_scraper
cd my_scraper
swiftcrawl crawl example
```

`init` creates `spiders/`, a sample spider with Items, and a starter `settings.py` so you can begin crawling immediately.
### Item & Field API

Scraped data can be represented as structured Items with optional serialization hooks.

```python
from swiftcrawl import Item, Field


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field(default_factory=list, serializer=lambda values: ",".join(values))


class QuotesSpider(Spider):
    ...

    async def parse(self, response):
        for quote in response.soup().select('.quote'):
            yield QuoteItem(
                text=quote.select_one('.text').text,
                author=quote.select_one('.author').text,
                tags=[t.text for t in quote.select('.tag')],
            )
```
Items automatically convert to dictionaries (using field serializers) before the crawler writes them to disk.
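To make the serialization behavior concrete, here is a minimal sketch of the idea, with the same `QuoteItem` shape as above. This is an illustration of the concept only, not SwiftCrawl's actual `Item`/`Field` implementation:

```python
class Field:
    """Field descriptor holding an optional default factory and serializer."""
    def __init__(self, default_factory=None, serializer=None):
        self.default_factory = default_factory
        self.serializer = serializer


class Item:
    def __init__(self, **values):
        self._values = {}
        for name, field in self._fields().items():
            if name in values:
                self._values[name] = values[name]
            elif field.default_factory is not None:
                self._values[name] = field.default_factory()

    @classmethod
    def _fields(cls):
        # Collect Field instances declared on the class body
        return {k: v for k, v in vars(cls).items() if isinstance(v, Field)}

    def to_dict(self):
        # Apply each field's serializer (if any) while converting to a dict
        out = {}
        for name, field in self._fields().items():
            value = self._values.get(name)
            out[name] = field.serializer(value) if field.serializer else value
        return out


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field(default_factory=list, serializer=lambda values: ",".join(values))


item = QuoteItem(text="Hello", author="Ada", tags=["wit", "science"])
print(item.to_dict())  # tags serialized to "wit,science"
```

The key point is that serializers run once, at conversion time, so spiders can yield rich Python values and still produce flat, diff-friendly output.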
## Response Parsing Methods

```python
async with SwiftCrawl(method='http') as session:
    response = await session.get('https://example.com')

    # Raw text
    html = response.text

    # JSON parsing
    data = response.json()

    # BeautifulSoup
    soup = response.soup()
    title = soup.find('title').string

    # lxml XPath
    tree = response.tree()
    paragraphs = tree.xpath('//p/text()')

    # Metadata
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.url)
```
## Configuration Options

### SwiftCrawl Constructor

```python
SwiftCrawl(
    method='http',            # 'http', 'browser', or 'auto' (future)
    proxy=None,               # Proxy URL or config dict
    headless=True,            # Browser headless mode
    block_images=True,        # Block images in browser
    humanize=None,            # Human-like behavior (0.0-2.0)
    initial_url=None,         # URL to visit first (browser only)
    warmup=None,              # Async function(page) for browser setup
    locale='en-US',           # Browser locale
    os=['windows', 'macos'],  # OS fingerprint options
    geoip=False,              # Auto-geolocate from proxy
    timeout=30.0,             # Request timeout (HTTP)
    max_concurrent=10,        # Queue concurrency limit
)
```
### HTTP Mode Options (BrowserForge + httpx)

- Generates realistic browser headers automatically
- Rotates fingerprints between requests
- Supports all standard httpx parameters

### Browser Mode Options (Camoufox)

- `headless`: Run in headless mode (default: `True`)
- `block_images`: Block image loading for speed (default: `True`)
- `humanize`: Enable human-like cursor movement (0.0-2.0)
- `initial_url`: Navigate here first to collect cookies/session
- `warmup`: Async function that receives the page object for setup (login, cookies, etc.)
- `geoip`: Auto-detect geolocation from proxy IP
- `locale`: Browser locale (default: `'en-US'`)
- `os`: List of OSes to randomly choose from
## Parameter Validation

SwiftCrawl validates parameter compatibility and warns you about configuration mistakes:

### Errors (ValueError)

Raised when parameters are fundamentally incompatible:

```python
# ERROR: warmup requires browser mode
session = SwiftCrawl(method='http', warmup=my_warmup)
# ValueError: warmup parameter is only supported for 'browser' and 'auto' methods.
```

### Warnings (UserWarning)

Issued when parameters will be ignored:

```python
# WARNING: browser params with HTTP mode
session = SwiftCrawl(
    method='http',
    headless=False,  # Ignored in HTTP mode
    humanize=1.5     # Ignored in HTTP mode
)
# UserWarning: Browser-only parameters ['headless', 'humanize'] are ignored in HTTP mode.

# WARNING: timeout with browser mode
session = SwiftCrawl(method='browser', timeout=10.0)
# UserWarning: HTTP timeout parameter is ignored in browser mode.
```
This helps catch configuration mistakes early and ensures you understand which parameters are being used.
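The error-versus-warning split described above follows a common validation pattern: raise for fundamentally incompatible inputs, warn for merely ignored ones. A standalone sketch of that pattern (a hypothetical helper, not SwiftCrawl's internal code):

```python
import warnings

# Parameters that only make sense in browser mode (per the options above)
BROWSER_ONLY = {"headless", "humanize", "initial_url", "warmup", "geoip"}

def validate_params(method, **params):
    """Raise ValueError for incompatible params, warn for ignored ones."""
    if method == "http":
        if params.get("warmup") is not None:
            raise ValueError(
                "warmup parameter is only supported for 'browser' and 'auto' methods."
            )
        ignored = sorted(
            k for k in params if k in BROWSER_ONLY and params[k] is not None
        )
        if ignored:
            warnings.warn(
                f"Browser-only parameters {ignored} are ignored in HTTP mode.",
                UserWarning,
            )
    elif method == "browser" and params.get("timeout") is not None:
        warnings.warn("HTTP timeout parameter is ignored in browser mode.", UserWarning)
```

Raising early keeps silent misconfiguration out of long-running crawls, while warnings let forgiving cases proceed.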
## Architecture

```text
SwiftCrawl
  method='http'    -> AsyncHTTPClient (httpx + BrowserForge)
  method='browser' -> AsyncBrowserClient (Camoufox + Playwright)
  method='auto'    -> Smart detection (coming soon)

Response Object
  .text / .html                    -> Raw content
  .json()                          -> JSON parsing
  .soup()                          -> BeautifulSoup (html.parser)
  .tree()                          -> lxml tree (XPath)
  .headers, .cookies, .status_code -> Metadata
```
## Roadmap
- HTTP mode with BrowserForge headers
- Browser mode with Camoufox
- Browser POST via page.evaluate(fetch())
- Session/cookie management with initial_url
- Warmup function for browser initialization
- HTML-wrapped JSON parsing fix
- Parameter validation and warnings
- Unified Response object
- Auto mode (intelligent method selection)
- AsyncIO request queue for bulk processing
- Rate limiting and retry logic
- Middleware system
- Built-in proxy rotation
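Until built-in proxy rotation lands, a simple round-robin rotation can be done on the caller's side. A minimal sketch using the standard library (the proxy URLs are placeholders):

```python
from itertools import cycle

# Rotate through a fixed pool of proxies in round-robin order
proxies = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def next_proxy():
    """Return the next proxy URL from the pool."""
    return next(proxies)
```

Each new session could then be constructed with `proxy=next_proxy()` so successive crawls spread across the pool.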
## Dependencies

- `httpx` - Async HTTP client
- `browserforge` - Browser fingerprint generation
- `camoufox` - Anti-detection browser
- `playwright` - Browser automation (via camoufox)
- `beautifulsoup4` - HTML parsing
- `lxml` - XPath support
## Testing

```bash
# Run the fast, offline-safe suite
uv run pytest

# Include network + browser integration tests (needs internet & Playwright)
EASYSCRAPER_RUN_NETWORK_TESTS=1 uv run pytest
```
## License
MIT
## Third-Party Licenses
SwiftCrawl depends on the following open-source libraries. We are grateful to their maintainers and contributors:
| Library | License | Repository |
|---|---|---|
| beautifulsoup4 | MIT | https://www.crummy.com/software/BeautifulSoup/ |
| browserforge | Apache-2.0 | https://github.com/daijro/browserforge |
| camoufox | MPL-2.0 | https://github.com/daijro/camoufox |
| httpx | BSD-3-Clause | https://github.com/encode/httpx |
| lxml | BSD-3-Clause | https://github.com/lxml/lxml |
| playwright | Apache-2.0 | https://github.com/microsoft/playwright-python |
All licenses require attribution. Please review each library's license for specific terms.
## Contributing
Contributions are welcome! This is an early-stage project designed for flexible web scraping with anti-detection capabilities.