SwiftCrawl

A powerful, flexible web scraping abstraction layer that seamlessly handles both lightweight HTTP requests and full browser automation with anti-detection capabilities.

Highlights

  • Dual-mode sessions – SwiftCrawl seamlessly switches between BrowserForge-powered HTTP requests and Camoufox browser automation.
  • Async-first architecture – every client, crawler component, and CLI workflow is asyncio friendly for massive concurrency.
  • Crawler engine – Scrapy-inspired scheduler, downloader, and CLI (swiftcrawl crawl) with retries, priorities, and Playwright warmup support.
  • Items & Fields – define strongly-typed Item objects with .Field(serializer=...) hooks for clean output serialization.
  • Project bootstrapper – swiftcrawl init <project> scaffolds spiders, settings, and sample items in seconds.
  • Unified response parsing – Response.json()/soup()/tree() keep parsing ergonomic across HTTP and browser modes.

Installation

# Initialize with UV (recommended)
uv init --name myproject
cd myproject

# Add SwiftCrawl
uv add swiftcrawl

# Install Camoufox browser
camoufox fetch

Quick Start

HTTP Mode (Fast & Stealthy)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='http') as session:
        response = await session.get('https://api.example.com/data')
        data = response.json()
        print(data)

asyncio.run(main())

Browser Mode (Full JS Support)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='browser', headless=True) as session:
        response = await session.get('https://spa-website.com')

        # Parse with BeautifulSoup
        soup = response.soup()
        title = soup.find('title').string

        # Or use XPath
        tree = response.tree()
        links = tree.xpath('//a/@href')

        print(f"Title: {title}")
        print(f"Links: {links}")

asyncio.run(main())

Usage Examples

HTTP GET with Custom Headers

async with SwiftCrawl(method='http') as session:
    response = await session.get(
        'https://api.example.com',
        headers={'Authorization': 'Bearer token123'}
    )
    print(response.json())

HTTP POST

async with SwiftCrawl(method='http') as session:
    response = await session.post(
        'https://api.example.com/submit',
        json={'key': 'value'}
    )
    print(response.status_code)

Browser GET with Initial URL (Cookie Gathering)

# Visit initial_url first to gather session cookies
async with SwiftCrawl(
    method='browser',
    initial_url='https://example.com/login',
    headless=True
) as session:
    # Subsequent requests will have cookies from initial_url
    response = await session.get('https://example.com/protected')
    print(response.text)

Browser POST via fetch()

# Uses page.evaluate() with fetch() for fast POST requests
async with SwiftCrawl(method='browser', headless=True) as session:
    response = await session.post(
        'https://api.example.com/endpoint',
        data={'username': 'test', 'password': 'secret'}
    )
    print(response.json())

With Proxy

# HTTP mode
async with SwiftCrawl(
    method='http',
    proxy='http://proxy.example.com:8080'
) as session:
    response = await session.get('https://example.com')

# Browser mode
async with SwiftCrawl(
    method='browser',
    proxy={'server': 'http://proxy.example.com:8080',
           'username': 'user',
           'password': 'pass'},
    geoip=True  # Auto-detect location from proxy
) as session:
    response = await session.get('https://example.com')

Browser Warmup Function

The warmup parameter allows you to run a function after the browser initializes but before your main requests. This is perfect for login flows, gathering tokens, or setting up sessions.

async def my_warmup(page):
    """
    Warmup function receives the Playwright page object.
    Use it to login, set cookies, gather tokens, etc.
    """
    await page.goto('https://example.com/login')

    # Set authentication cookies
    await page.evaluate('''() => {
        document.cookie = "auth_token=xyz123; path=/";
        document.cookie = "session_id=abc789; path=/";
    }''')

    print("Logged in and ready!")

# Use warmup with browser mode
async with SwiftCrawl(
    method='browser',
    warmup=my_warmup  # Executes before main requests
) as session:
    # Warmup already executed - we have auth cookies now
    response = await session.get('https://example.com/protected')
    print(response.text)

Key Benefits:

  • Automatic login before scraping
  • Gather CSRF tokens or API keys (see the sketch after this list)
  • Set cookies and session data
  • Execute complex multi-step setup
  • Access full Playwright page object
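
For example, a warmup can pull a CSRF token off a page before the main requests run. A minimal sketch, assuming the token lives in a standard meta tag (the URL and selector here are illustrative):

async def gather_csrf(page):
    # Hypothetical warmup: navigate to the form page and read the CSRF
    # token from a <meta name="csrf-token"> tag via the Playwright page.
    await page.goto('https://example.com/form')
    token = await page.evaluate(
        '''() => document.querySelector('meta[name="csrf-token"]')?.content'''
    )
    print(f"CSRF token: {token}")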

Scrapy-like Crawler & CLI

SwiftCrawl now ships with a Scrapy-inspired crawler stack and command-line interface.

Defining a Spider

from urllib.parse import urljoin

from swiftcrawl import Request, Spider


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]
    method = "http"  # default, but you can override per domain/URL

    async def parse(self, response):
        soup = response.soup()
        for quote in soup.select(".quote"):
            yield {
                "text": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }

        next_link = soup.select_one(".next a")
        if next_link:
            yield Request(
                url=urljoin(response.url, next_link["href"]),
                callback=self.parse,
            )

Running from Python

import asyncio
from swiftcrawl import run_spider


items = run_spider(QuotesSpider)
print(items)

Running from the CLI

Create spiders/quotes_spider.py containing your spider, then run:

# Print stats only
swiftcrawl crawl quotes

# Persist results
swiftcrawl crawl quotes -o output.jsonl

# Enable verbose logging / stack traces
swiftcrawl crawl quotes -o output.jsonl -v

The CLI automatically loads settings.py (if present), discovers spiders from the spiders/ package, and prints crawl statistics. When -o/--output is provided, it writes scraped items to the specified .json or .jsonl file: .json outputs are standard JSON arrays (each item on a single line for easy diffs), while .jsonl outputs remain newline-delimited for streaming. Use -v/--verbose to see detailed request processing, item writes, and full error traces.
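
Since .jsonl output is newline-delimited, it can be consumed one record at a time with nothing but the standard library; a minimal sketch (output.jsonl matches the CLI example above):

import json

# Each line of the .jsonl file is one scraped item.
with open('output.jsonl') as f:
    items = [json.loads(line) for line in f]

print(len(items), 'items scraped')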

Bootstrapping a Project

Need a fresh workspace? Use the built-in initializer:

swiftcrawl init my_scraper
cd my_scraper
swiftcrawl crawl example

init creates spiders/, a sample spider with Items, and a starter settings.py so you can begin crawling immediately.
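
Assuming default names, the generated layout looks roughly like this (the sample spider is registered as example, matching the crawl command above):

my_scraper/
├── settings.py
└── spiders/
    └── example_spider.py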

Item & Field API

Scraped data can be represented as structured Items with optional serialization hooks.

from swiftcrawl import Item, Field


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field(default_factory=list, serializer=lambda values: ",".join(values))


class QuotesSpider(Spider):
    ...

    async def parse(self, response):
        for quote in response.soup().select('.quote'):
            yield QuoteItem(
                text=quote.select_one('.text').text,
                author=quote.select_one('.author').text,
                tags=[t.text for t in quote.select('.tag')],
            )

Items automatically convert to dictionaries (using field serializers) before the crawler writes them to disk.
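
To see what the tags serializer above does at write time, here is the same transformation in plain Python, independent of the Item machinery:

def serialize_tags(values):
    # Same join as the Field(serializer=...) hook on QuoteItem.tags.
    return ",".join(values)

print(serialize_tags(["life", "humor"]))  # -> "life,humor"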

Response Parsing Methods

async with SwiftCrawl(method='http') as session:
    response = await session.get('https://example.com')

    # Raw text
    html = response.text

    # JSON parsing
    data = response.json()

    # BeautifulSoup
    soup = response.soup()
    title = soup.find('title').string

    # lxml XPath
    tree = response.tree()
    paragraphs = tree.xpath('//p/text()')

    # Metadata
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.url)

Configuration Options

SwiftCrawl Constructor

SwiftCrawl(
    method='http',           # 'http', 'browser', or 'auto' (future)
    proxy=None,              # Proxy URL or config dict
    headless=True,           # Browser headless mode
    block_images=True,       # Block images in browser
    humanize=None,           # Human-like behavior (0.0-2.0)
    initial_url=None,        # URL to visit first (browser only)
    warmup=None,             # Async function(page) for browser setup
    locale='en-US',          # Browser locale
    os=['windows', 'macos'], # OS fingerprint options
    geoip=False,             # Auto-geolocate from proxy
    timeout=30.0,            # Request timeout (HTTP)
    max_concurrent=10,       # Queue concurrency limit
)
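
Because every client is asyncio friendly and the constructor caps in-flight work with max_concurrent, bulk fetching with asyncio.gather is a natural pattern. A sketch, assuming a single session's get() can be awaited concurrently:

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    urls = [f'https://example.com/page/{i}' for i in range(1, 6)]
    async with SwiftCrawl(method='http', max_concurrent=5) as session:
        # gather() fires all requests at once; the session's queue limits
        # how many actually run in parallel (per the max_concurrent option).
        responses = await asyncio.gather(*(session.get(url) for url in urls))
    for response in responses:
        print(response.status_code, response.url)

asyncio.run(main())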

HTTP Mode Options (BrowserForge + httpx)

  • Generates realistic browser headers automatically
  • Rotates fingerprints between requests
  • Supports all standard httpx parameters (see the sketch below)
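
Assuming keyword arguments are forwarded to httpx as the last bullet suggests, standard httpx options can ride along on each request; an illustrative sketch:

async with SwiftCrawl(method='http') as session:
    # params and follow_redirects are ordinary httpx request options.
    response = await session.get(
        'https://api.example.com/search',
        params={'q': 'swiftcrawl', 'page': 1},
        follow_redirects=True,
    )
    print(response.url)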

Browser Mode Options (Camoufox)

  • headless: Run in headless mode (default: True)
  • block_images: Block image loading for speed (default: True)
  • humanize: Enable human-like cursor movement (0.0-2.0)
  • initial_url: Navigate here first to collect cookies/session
  • warmup: Async function that receives the page object for setup (login, cookies, etc.)
  • geoip: Auto-detect geolocation from proxy IP
  • locale: Browser locale (default: 'en-US')
  • os: List of OS to randomly choose from
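
Several of these options combine naturally; for example, a visible, humanized browser pinned to a Windows fingerprint with a German locale (the values here are illustrative):

async with SwiftCrawl(
    method='browser',
    headless=False,        # watch the browser work
    humanize=1.0,          # human-like cursor movement
    block_images=False,    # load images too
    locale='de-DE',
    os=['windows'],
) as session:
    response = await session.get('https://example.com')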

Parameter Validation

SwiftCrawl validates parameter compatibility and warns you about configuration mistakes:

Errors (ValueError)

Raised when parameters are fundamentally incompatible:

# ERROR: warmup requires browser mode
session = SwiftCrawl(method='http', warmup=my_warmup)
# ValueError: warmup parameter is only supported for 'browser' and 'auto' methods.

Warnings (UserWarning)

Issued when parameters will be ignored:

# WARNING: browser params with HTTP mode
session = SwiftCrawl(
    method='http',
    headless=False,  # Ignored in HTTP mode
    humanize=1.5     # Ignored in HTTP mode
)
# UserWarning: Browser-only parameters ['headless', 'humanize'] are ignored in HTTP mode.

# WARNING: timeout with browser mode
session = SwiftCrawl(method='browser', timeout=10.0)
# UserWarning: HTTP timeout parameter is ignored in browser mode.

This helps catch configuration mistakes early and ensures you understand which parameters are being used.
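
During development it can help to escalate these warnings into hard failures; the standard library's warnings filter handles that without any SwiftCrawl-specific API:

import warnings

# Turn any UserWarning (including SwiftCrawl's config warnings) into an error.
warnings.simplefilter('error', UserWarning)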

Architecture

SwiftCrawl
   method='http' -> AsyncHTTPClient (httpx + BrowserForge)
   method='browser' -> AsyncBrowserClient (Camoufox + Playwright)
   method='auto' -> Smart detection (coming soon)

Response Object
   .text / .html -> Raw content
   .json() -> JSON parsing
   .soup() -> BeautifulSoup (html.parser)
   .tree() -> lxml tree (XPath)
   .headers, .cookies, .status_code -> Metadata

Roadmap

  • HTTP mode with BrowserForge headers
  • Browser mode with Camoufox
  • Browser POST via page.evaluate(fetch())
  • Session/cookie management with initial_url
  • Warmup function for browser initialization
  • HTML-wrapped JSON parsing fix
  • Parameter validation and warnings
  • Unified Response object
  • Auto mode (intelligent method selection)
  • AsyncIO request queue for bulk processing
  • Rate limiting and retry logic
  • Middleware system
  • Built-in proxy rotation

Dependencies

  • httpx - Async HTTP client
  • browserforge - Browser fingerprint generation
  • camoufox - Anti-detection browser
  • playwright - Browser automation (via camoufox)
  • beautifulsoup4 - HTML parsing
  • lxml - XPath support

Testing

# Run fast, offline-safe suite
uv run pytest

# Include network + browser integration tests (needs internet & Playwright)
EASYSCRAPER_RUN_NETWORK_TESTS=1 uv run pytest

License

MIT

Third-Party Licenses

SwiftCrawl depends on the following open-source libraries. We are grateful to their maintainers and contributors:

  • beautifulsoup4 – MIT – https://www.crummy.com/software/BeautifulSoup/
  • browserforge – Apache-2.0 – https://github.com/daijro/browserforge
  • camoufox – MPL-2.0 – https://github.com/daijro/camoufox
  • httpx – BSD-3-Clause – https://github.com/encode/httpx
  • lxml – BSD-3-Clause – https://github.com/lxml/lxml
  • playwright – Apache-2.0 – https://github.com/microsoft/playwright-python

All licenses require attribution. Please review each library's license for specific terms.

Contributing

Contributions are welcome! This is an early-stage project designed for flexible web scraping with anti-detection capabilities.
