SwiftCrawl

A powerful, flexible web scraping abstraction layer that seamlessly handles both lightweight HTTP requests and full browser automation with anti-detection capabilities.

Highlights

  • Dual-mode sessions – SwiftCrawl seamlessly switches between BrowserForge-powered HTTP requests and Camoufox browser automation.
  • Async-first architecture – every client, crawler component, and CLI workflow is asyncio friendly for massive concurrency.
  • Crawler engine – Scrapy-inspired scheduler, downloader, and CLI (swiftcrawl crawl) with retries, priorities, and Playwright warmup support.
  • Items & Fields – define strongly-typed Item objects with .Field(serializer=...) hooks for clean output serialization.
  • Project bootstrapper – swiftcrawl init <project> scaffolds spiders, settings, and sample items in seconds.
  • Unified response parsing – Response.json()/soup()/tree() keep parsing ergonomic across HTTP and browser modes.
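Because every client is asyncio friendly, bulk fetches fan out with plain asyncio primitives. The sketch below shows the pattern with a stub coroutine standing in for session.get so it runs offline; with SwiftCrawl you would issue the same gather call inside an async with SwiftCrawl(...) block.

```python
import asyncio

# Stand-in for session.get(url) so this sketch runs offline;
# with SwiftCrawl the same gather pattern applies inside the session.
async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # simulate an awaitable I/O call
    return f"response from {url}"

async def main() -> list[str]:
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    # Fan out all requests concurrently; results come back in order
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 5
```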

Installation

# Initialize with UV (recommended)
uv init --name myproject
cd myproject

# Add SwiftCrawl
uv add swiftcrawl

# Install Camoufox browser
camoufox fetch

Quick Start

HTTP Mode (Fast & Stealthy)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='http') as session:
        response = await session.get('https://api.example.com/data')
        data = response.json()
        print(data)

asyncio.run(main())

Browser Mode (Full JS Support)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='browser', headless=True) as session:
        response = await session.get('https://spa-website.com')

        # Parse with BeautifulSoup
        soup = response.soup()
        title = soup.find('title').string

        # Or use XPath
        tree = response.tree()
        links = tree.xpath('//a/@href')

        print(f"Title: {title}")
        print(f"Links: {links}")

asyncio.run(main())

Usage Examples

HTTP GET with Custom Headers

async with SwiftCrawl(method='http') as session:
    response = await session.get(
        'https://api.example.com',
        headers={'Authorization': 'Bearer token123'}
    )
    print(response.json())

HTTP POST

async with SwiftCrawl(method='http') as session:
    response = await session.post(
        'https://api.example.com/submit',
        json={'key': 'value'}
    )
    print(response.status_code)

Browser GET with Initial URL (Cookie Gathering)

# Visit initial_url first to gather session cookies
async with SwiftCrawl(
    method='browser',
    initial_url='https://example.com/login',
    headless=True
) as session:
    # Subsequent requests will have cookies from initial_url
    response = await session.get('https://example.com/protected')
    print(response.text)

Browser POST via fetch()

# Uses page.evaluate() with fetch() for fast POST requests
async with SwiftCrawl(method='browser', headless=True) as session:
    response = await session.post(
        'https://api.example.com/endpoint',
        data={'username': 'test', 'password': 'secret'}
    )
    print(response.json())

With Proxy

# HTTP mode
async with SwiftCrawl(
    method='http',
    proxy='http://proxy.example.com:8080'
) as session:
    response = await session.get('https://example.com')

# Browser mode
async with SwiftCrawl(
    method='browser',
    proxy={'server': 'http://proxy.example.com:8080',
           'username': 'user',
           'password': 'pass'},
    geoip=True  # Auto-detect location from proxy
) as session:
    response = await session.get('https://example.com')

Browser Warmup Function

The warmup parameter allows you to run a function after the browser initializes but before your main requests. This is perfect for login flows, gathering tokens, or setting up sessions.

async def my_warmup(page):
    """
    Warmup function receives the Playwright page object.
    Use it to login, set cookies, gather tokens, etc.
    """
    await page.goto('https://example.com/login')

    # Set authentication cookies
    await page.evaluate('''() => {
        document.cookie = "auth_token=xyz123; path=/";
        document.cookie = "session_id=abc789; path=/";
    }''')

    print("Logged in and ready!")

# Use warmup with browser mode
async with SwiftCrawl(
    method='browser',
    warmup=my_warmup  # Executes before main requests
) as session:
    # Warmup already executed - we have auth cookies now
    response = await session.get('https://example.com/protected')
    print(response.text)

Key Benefits:

  • Automatic login before scraping
  • Gather CSRF tokens or API keys
  • Set cookies and session data
  • Execute complex multi-step setup
  • Access full Playwright page object
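As a concrete sketch, a warmup that logs in and captures a CSRF token might look like the following. The URL, form selectors, and credentials are hypothetical; the function relies only on the documented contract that warmup receives the Playwright page object.

```python
# Hypothetical warmup: log in and capture a CSRF token.
# The URL and selectors below are placeholders, not a real site.
async def login_warmup(page):
    await page.goto('https://example.com/login')

    # Fill the (hypothetical) login form and submit
    await page.fill('#username', 'test')
    await page.fill('#password', 'secret')
    await page.click('button[type=submit]')

    # Grab a CSRF token exposed in a meta tag, if present
    token = await page.get_attribute('meta[name=csrf-token]', 'content')
    print(f"CSRF token: {token}")
```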

Scrapy-like Crawler & CLI

SwiftCrawl now ships with a Scrapy-inspired crawler stack and command-line interface.

Defining a Spider

from urllib.parse import urljoin

from swiftcrawl import Request, Spider


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]
    method = "http"  # default, but you can override per domain/URL

    async def parse(self, response):
        soup = response.soup()
        for quote in soup.select(".quote"):
            yield {
                "text": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }

        next_link = soup.select_one(".next a")
        if next_link:
            yield Request(
                url=urljoin(response.url, next_link["href"]),
                callback=self.parse,
            )

Running from Python

import asyncio
from swiftcrawl import run_spider


items = run_spider(QuotesSpider)
print(items)

Running from the CLI

Create spiders/quotes_spider.py containing your spider, then run:

# Print stats only
swiftcrawl crawl quotes

# Persist results
swiftcrawl crawl quotes -o output.jsonl

# Enable verbose logging / stack traces
swiftcrawl crawl quotes -o output.jsonl -v

The CLI automatically loads settings.py (if present), discovers spiders from the spiders/ package, and prints crawl statistics. When -o/--output is provided, it writes scraped items to the specified .json or .jsonl file: .json outputs are standard JSON arrays (each item on a single line for easy diffs), while .jsonl outputs are newline-delimited for streaming. Use -v/--verbose to see detailed request processing, item writes, and full error traces.
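The difference between the two output formats can be pictured in plain Python. This illustrates only the file-format contract described above, not SwiftCrawl internals:

```python
import json

items = [{"text": "quote one", "author": "A"},
         {"text": "quote two", "author": "B"}]

# .json: a standard JSON array, one item per line for easy diffs
as_json = "[\n" + ",\n".join(json.dumps(i) for i in items) + "\n]"

# .jsonl: newline-delimited, one JSON object per line (streamable)
as_jsonl = "\n".join(json.dumps(i) for i in items)

print(json.loads(as_json) == items)  # True
```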

Bootstrapping a Project

Need a fresh workspace? Use the built-in initializer:

swiftcrawl init my_scraper
cd my_scraper
swiftcrawl crawl example

init creates spiders/, a sample spider with Items, and a starter settings.py so you can begin crawling immediately.

Item & Field API

Scraped data can be represented as structured Items with optional serialization hooks.

from swiftcrawl import Item, Field


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field(default_factory=list, serializer=lambda values: ",".join(values))


class QuotesSpider(Spider):
    ...

    async def parse(self, response):
        for quote in response.soup().select('.quote'):
            yield QuoteItem(
                text=quote.select_one('.text').text,
                author=quote.select_one('.author').text,
                tags=[t.text for t in quote.select('.tag')],
            )

Items automatically convert to dictionaries (using field serializers) before the crawler writes them to disk.
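The serializer hook is easy to picture: on export, each field's serializer (if any) is applied to its value. The snippet below mimics that contract with a plain function; it is an illustration, not SwiftCrawl's actual implementation.

```python
# Illustration of how Field serializers apply on export
# (mimics the documented behavior; not SwiftCrawl's internals).
def serialize_item(values, serializers):
    return {
        key: serializers[key](val) if serializers.get(key) else val
        for key, val in values.items()
    }

item = {"text": "To be", "author": "W.S.", "tags": ["life", "choice"]}
serializers = {"tags": lambda values: ",".join(values)}

print(serialize_item(item, serializers))
# {'text': 'To be', 'author': 'W.S.', 'tags': 'life,choice'}
```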

Response Parsing Methods

async with SwiftCrawl(method='http') as session:
    response = await session.get('https://example.com')

    # Raw text
    html = response.text

    # JSON parsing
    data = response.json()

    # BeautifulSoup
    soup = response.soup()
    title = soup.find('title').string

    # lxml XPath
    tree = response.tree()
    paragraphs = tree.xpath('//p/text()')

    # Metadata
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.url)

Configuration Options

SwiftCrawl Constructor

SwiftCrawl(
    method='http',           # 'http', 'browser', or 'auto' (future)
    proxy=None,              # Proxy URL or config dict
    headless=True,           # Browser headless mode
    block_images=True,       # Block images in browser
    humanize=None,           # Human-like behavior (0.0-2.0)
    initial_url=None,        # URL to visit first (browser only)
    warmup=None,             # Async function(page) for browser setup
    locale='en-US',          # Browser locale
    os=['windows', 'macos'], # OS fingerprint options
    geoip=False,             # Auto-geolocate from proxy
    timeout=30.0,            # Request timeout (HTTP)
    max_concurrent=10,       # Queue concurrency limit
)

HTTP Mode Options (BrowserForge + httpx)

  • Generates realistic browser headers automatically
  • Rotates fingerprints between requests
  • Supports all standard httpx parameters

Browser Mode Options (Camoufox)

  • headless: Run in headless mode (default: True)
  • block_images: Block image loading for speed (default: True)
  • humanize: Enable human-like cursor movement (0.0-2.0)
  • initial_url: Navigate here first to collect cookies/session
  • warmup: Async function that receives the page object for setup (login, cookies, etc.)
  • geoip: Auto-detect geolocation from proxy IP
  • locale: Browser locale (default: 'en-US')
  • os: List of OS to randomly choose from

Parameter Validation

SwiftCrawl validates parameter compatibility and warns you about configuration mistakes:

Errors (ValueError)

Raised when parameters are fundamentally incompatible:

# ERROR: warmup requires browser mode
session = SwiftCrawl(method='http', warmup=my_warmup)
# ValueError: warmup parameter is only supported for 'browser' and 'auto' methods.

Warnings (UserWarning)

Issued when parameters will be ignored:

# WARNING: browser params with HTTP mode
session = SwiftCrawl(
    method='http',
    headless=False,  # Ignored in HTTP mode
    humanize=1.5     # Ignored in HTTP mode
)
# UserWarning: Browser-only parameters ['headless', 'humanize'] are ignored in HTTP mode.

# WARNING: timeout with browser mode
session = SwiftCrawl(method='browser', timeout=10.0)
# UserWarning: HTTP timeout parameter is ignored in browser mode.

This helps catch configuration mistakes early and ensures you understand which parameters are being used.
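A minimal sketch of this validation style (ValueError for incompatible parameters, UserWarning for ignored ones), assuming nothing beyond the behavior documented above:

```python
import warnings

# Sketch of the documented validation behavior; parameter names
# match the docs, but this is not SwiftCrawl's actual code.
def validate(method, warmup=None, headless=None, humanize=None, timeout=None):
    if warmup is not None and method not in ('browser', 'auto'):
        raise ValueError(
            "warmup parameter is only supported for 'browser' and 'auto' methods."
        )
    if method == 'http':
        ignored = [name for name, value in
                   (('headless', headless), ('humanize', humanize))
                   if value is not None]
        if ignored:
            warnings.warn(f"Browser-only parameters {ignored} are ignored in HTTP mode.")
    elif method == 'browser' and timeout is not None:
        warnings.warn("HTTP timeout parameter is ignored in browser mode.")

try:
    validate('http', warmup=lambda page: None)
except ValueError as e:
    print(e)
```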

Architecture

SwiftCrawl
   method='http' -> AsyncHTTPClient (httpx + BrowserForge)
   method='browser' -> AsyncBrowserClient (Camoufox + Playwright)
   method='auto' -> Smart detection (coming soon)

Response Object
   .text / .html -> Raw content
   .json() -> JSON parsing
   .soup() -> BeautifulSoup (html.parser)
   .tree() -> lxml tree (XPath)
   .headers, .cookies, .status_code -> Metadata

Roadmap

  • HTTP mode with BrowserForge headers
  • Browser mode with Camoufox
  • Browser POST via page.evaluate(fetch())
  • Session/cookie management with initial_url
  • Warmup function for browser initialization
  • HTML-wrapped JSON parsing fix
  • Parameter validation and warnings
  • Unified Response object
  • Auto mode (intelligent method selection)
  • AsyncIO request queue for bulk processing
  • Rate limiting and retry logic
  • Middleware system
  • Built-in proxy rotation

Dependencies

  • httpx - Async HTTP client
  • browserforge - Browser fingerprint generation
  • camoufox - Anti-detection browser
  • playwright - Browser automation (via camoufox)
  • beautifulsoup4 - HTML parsing
  • lxml - XPath support

Testing

# Run fast, offline-safe suite
uv run pytest

# Include network + browser integration tests (needs internet & Playwright)
EASYSCRAPER_RUN_NETWORK_TESTS=1 uv run pytest

License

MIT

Contributing

Contributions are welcome! This is an early-stage project designed for flexible web scraping with anti-detection capabilities.
