SwiftCrawl

A powerful, flexible web scraping abstraction layer that seamlessly handles both lightweight HTTP requests and full browser automation with anti-detection capabilities.

Highlights

  • Dual-mode sessions – SwiftCrawl seamlessly switches between BrowserForge-powered HTTP requests and Camoufox browser automation.
  • Async-first architecture – every client, crawler component, and CLI workflow is asyncio friendly for massive concurrency.
  • Crawler engine – Scrapy-inspired scheduler, downloader, and CLI (swiftcrawl crawl) with retries, priorities, and Playwright warmup support.
  • Items & Fields – define strongly-typed Item objects with .Field(serializer=...) hooks for clean output serialization.
  • Project bootstrapper – swiftcrawl init <project> scaffolds spiders, settings, and sample items in seconds.
  • Unified response parsing – Response.json()/soup()/tree() keep parsing ergonomic across HTTP and browser modes.

Installation

# Initialize with UV (recommended)
uv init --name myproject
cd myproject

# Add SwiftCrawl
uv add swiftcrawl

# Install Camoufox browser
camoufox fetch

Quick Start

HTTP Mode (Fast & Stealthy)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='http') as session:
        response = await session.get('https://api.example.com/data')
        data = response.json()
        print(data)

asyncio.run(main())

Browser Mode (Full JS Support)

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    async with SwiftCrawl(method='browser', headless=True) as session:
        response = await session.get('https://spa-website.com')

        # Parse with BeautifulSoup
        soup = response.soup()
        title = soup.find('title').string

        # Or use XPath
        tree = response.tree()
        links = tree.xpath('//a/@href')

        print(f"Title: {title}")
        print(f"Links: {links}")

asyncio.run(main())

Usage Examples

HTTP GET with Custom Headers

async with SwiftCrawl(method='http') as session:
    response = await session.get(
        'https://api.example.com',
        headers={'Authorization': 'Bearer token123'}
    )
    print(response.json())

HTTP POST

async with SwiftCrawl(method='http') as session:
    response = await session.post(
        'https://api.example.com/submit',
        json={'key': 'value'}
    )
    print(response.status_code)

Browser GET with Initial URL (Cookie Gathering)

# Visit initial_url first to gather session cookies
async with SwiftCrawl(
    method='browser',
    initial_url='https://example.com/login',
    headless=True
) as session:
    # Subsequent requests will have cookies from initial_url
    response = await session.get('https://example.com/protected')
    print(response.text)

Browser POST via fetch()

# Uses page.evaluate() with fetch() for fast POST requests
async with SwiftCrawl(method='browser', headless=True) as session:
    response = await session.post(
        'https://api.example.com/endpoint',
        data={'username': 'test', 'password': 'secret'}
    )
    print(response.json())

With Proxy

# HTTP mode
async with SwiftCrawl(
    method='http',
    proxy='http://proxy.example.com:8080'
) as session:
    response = await session.get('https://example.com')

# Browser mode
async with SwiftCrawl(
    method='browser',
    proxy={'server': 'http://proxy.example.com:8080',
           'username': 'user',
           'password': 'pass'},
    geoip=True  # Auto-detect location from proxy
) as session:
    response = await session.get('https://example.com')

Browser Warmup Function

The warmup parameter allows you to run a function after the browser initializes but before your main requests. This is perfect for login flows, gathering tokens, or setting up sessions.

async def my_warmup(page):
    """
    Warmup function receives the Playwright page object.
    Use it to login, set cookies, gather tokens, etc.
    """
    await page.goto('https://example.com/login')

    # Set authentication cookies
    await page.evaluate('''() => {
        document.cookie = "auth_token=xyz123; path=/";
        document.cookie = "session_id=abc789; path=/";
    }''')

    print("Logged in and ready!")

# Use warmup with browser mode
async with SwiftCrawl(
    method='browser',
    warmup=my_warmup  # Executes before main requests
) as session:
    # Warmup already executed - we have auth cookies now
    response = await session.get('https://example.com/protected')
    print(response.text)

Key Benefits:

  • Automatic login before scraping
  • Gather CSRF tokens or API keys (see the sketch after this list)
  • Set cookies and session data
  • Execute complex multi-step setup
  • Access full Playwright page object
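
For example, a warmup can pull a CSRF token off a page before the main requests run. A minimal sketch, assuming the token lives in a standard meta tag (the URL and selector here are illustrative):

async def gather_csrf(page):
    # Hypothetical warmup: navigate to the form page and read the CSRF
    # token from a <meta name="csrf-token"> tag via the Playwright page.
    await page.goto('https://example.com/form')
    token = await page.evaluate(
        '''() => document.querySelector('meta[name="csrf-token"]')?.content'''
    )
    print(f"CSRF token: {token}")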

Scrapy-like Crawler & CLI

SwiftCrawl now ships with a Scrapy-inspired crawler stack and command-line interface.

Defining a Spider

from urllib.parse import urljoin

from swiftcrawl import Request, Spider


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]
    method = "http"  # default, but you can override per domain/URL

    async def parse(self, response):
        soup = response.soup()
        for quote in soup.select(".quote"):
            yield {
                "text": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }

        next_link = soup.select_one(".next a")
        if next_link:
            yield Request(
                url=urljoin(response.url, next_link["href"]),
                callback=self.parse,
            )

Running from Python

import asyncio
from swiftcrawl import run_spider


items = run_spider(QuotesSpider)
print(items)

Running from the CLI

Create spiders/quotes_spider.py containing your spider, then run:

# Print stats only
swiftcrawl crawl quotes

# Persist results
swiftcrawl crawl quotes -o output.jsonl

# Enable verbose logging / stack traces
swiftcrawl crawl quotes -o output.jsonl -v

The CLI automatically loads settings.py (if present), discovers spiders from the spiders/ package, and prints crawl statistics. When -o/--output is provided, it writes scraped items to the specified .json or .jsonl file: .json outputs are standard JSON arrays (each item on a single line for easy diffs), while .jsonl outputs remain newline-delimited for streaming. Use -v/--verbose to see detailed request processing, item writes, and full error traces.
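
Since .jsonl output is newline-delimited, it can be consumed one record at a time with nothing but the standard library; a minimal sketch (output.jsonl matches the CLI example above):

import json

# Each line of the .jsonl file is one scraped item.
with open('output.jsonl') as f:
    items = [json.loads(line) for line in f]

print(len(items), 'items scraped')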

Bootstrapping a Project

Need a fresh workspace? Use the built-in initializer:

swiftcrawl init my_scraper
cd my_scraper
swiftcrawl crawl example

init creates spiders/, a sample spider with Items, and a starter settings.py so you can begin crawling immediately.
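
Assuming default names, the generated layout looks roughly like this (the sample spider is registered as example, matching the crawl command above):

my_scraper/
├── settings.py
└── spiders/
    └── example_spider.py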

Item & Field API

Scraped data can be represented as structured Items with optional serialization hooks.

from swiftcrawl import Item, Field


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field(default_factory=list, serializer=lambda values: ",".join(values))


class QuotesSpider(Spider):
    ...

    async def parse(self, response):
        for quote in response.soup().select('.quote'):
            yield QuoteItem(
                text=quote.select_one('.text').text,
                author=quote.select_one('.author').text,
                tags=[t.text for t in quote.select('.tag')],
            )

Items automatically convert to dictionaries (using field serializers) before the crawler writes them to disk.
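
To see what the tags serializer above does at write time, here is the same transformation in plain Python, independent of the Item machinery:

def serialize_tags(values):
    # Same join as the Field(serializer=...) hook on QuoteItem.tags.
    return ",".join(values)

print(serialize_tags(["life", "humor"]))  # -> "life,humor"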

Response Parsing Methods

async with SwiftCrawl(method='http') as session:
    response = await session.get('https://example.com')

    # Raw text
    html = response.text

    # JSON parsing
    data = response.json()

    # BeautifulSoup
    soup = response.soup()
    title = soup.find('title').string

    # lxml XPath
    tree = response.tree()
    paragraphs = tree.xpath('//p/text()')

    # Metadata
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.url)

Configuration Options

SwiftCrawl Constructor

SwiftCrawl(
    method='http',           # 'http', 'browser', or 'auto' (future)
    proxy=None,              # Proxy URL or config dict
    headless=True,           # Browser headless mode
    block_images=True,       # Block images in browser
    humanize=None,           # Human-like behavior (0.0-2.0)
    initial_url=None,        # URL to visit first (browser only)
    warmup=None,             # Async function(page) for browser setup
    locale='en-US',          # Browser locale
    os=['windows', 'macos'], # OS fingerprint options
    geoip=False,             # Auto-geolocate from proxy
    timeout=30.0,            # Request timeout (HTTP)
    max_concurrent=10,       # Queue concurrency limit
)
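
Because every client is asyncio friendly and the constructor caps in-flight work with max_concurrent, bulk fetching with asyncio.gather is a natural pattern. A sketch, assuming a single session's get() can be awaited concurrently:

import asyncio
from swiftcrawl import SwiftCrawl

async def main():
    urls = [f'https://example.com/page/{i}' for i in range(1, 6)]
    async with SwiftCrawl(method='http', max_concurrent=5) as session:
        # gather() fires all requests at once; the session's queue limits
        # how many actually run in parallel (per the max_concurrent option).
        responses = await asyncio.gather(*(session.get(url) for url in urls))
    for response in responses:
        print(response.status_code, response.url)

asyncio.run(main())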

HTTP Mode Options (BrowserForge + httpx)

  • Generates realistic browser headers automatically
  • Rotates fingerprints between requests
  • Supports all standard httpx parameters (see the sketch below)
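
Assuming keyword arguments are forwarded to httpx as the last bullet suggests, standard httpx options can ride along on each request; an illustrative sketch:

async with SwiftCrawl(method='http') as session:
    # params and follow_redirects are ordinary httpx request options.
    response = await session.get(
        'https://api.example.com/search',
        params={'q': 'swiftcrawl', 'page': 1},
        follow_redirects=True,
    )
    print(response.url)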

Browser Mode Options (Camoufox)

  • headless: Run in headless mode (default: True)
  • block_images: Block image loading for speed (default: True)
  • humanize: Enable human-like cursor movement (0.0-2.0)
  • initial_url: Navigate here first to collect cookies/session
  • warmup: Async function that receives the page object for setup (login, cookies, etc.)
  • geoip: Auto-detect geolocation from proxy IP
  • locale: Browser locale (default: 'en-US')
  • os: List of OS to randomly choose from
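
Several of these options combine naturally; for example, a visible, humanized browser pinned to a Windows fingerprint with a German locale (the values here are illustrative):

async with SwiftCrawl(
    method='browser',
    headless=False,        # watch the browser work
    humanize=1.0,          # human-like cursor movement
    block_images=False,    # load images too
    locale='de-DE',
    os=['windows'],
) as session:
    response = await session.get('https://example.com')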

Parameter Validation

SwiftCrawl validates parameter compatibility and warns you about configuration mistakes:

Errors (ValueError)

Raised when parameters are fundamentally incompatible:

# ERROR: warmup requires browser mode
session = SwiftCrawl(method='http', warmup=my_warmup)
# ValueError: warmup parameter is only supported for 'browser' and 'auto' methods.

Warnings (UserWarning)

Issued when parameters will be ignored:

# WARNING: browser params with HTTP mode
session = SwiftCrawl(
    method='http',
    headless=False,  # Ignored in HTTP mode
    humanize=1.5     # Ignored in HTTP mode
)
# UserWarning: Browser-only parameters ['headless', 'humanize'] are ignored in HTTP mode.

# WARNING: timeout with browser mode
session = SwiftCrawl(method='browser', timeout=10.0)
# UserWarning: HTTP timeout parameter is ignored in browser mode.

This helps catch configuration mistakes early and ensures you understand which parameters are being used.
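
During development it can help to escalate these warnings into hard failures; the standard library's warnings filter handles that without any SwiftCrawl-specific API:

import warnings

# Turn any UserWarning (including SwiftCrawl's config warnings) into an error.
warnings.simplefilter('error', UserWarning)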

Architecture

SwiftCrawl
   method='http' -> AsyncHTTPClient (httpx + BrowserForge)
   method='browser' -> AsyncBrowserClient (Camoufox + Playwright)
   method='auto' -> Smart detection (coming soon)

Response Object
   .text / .html -> Raw content
   .json() -> JSON parsing
   .soup() -> BeautifulSoup (html.parser)
   .tree() -> lxml tree (XPath)
   .headers, .cookies, .status_code -> Metadata

Roadmap

  • HTTP mode with BrowserForge headers
  • Browser mode with Camoufox
  • Browser POST via page.evaluate(fetch())
  • Session/cookie management with initial_url
  • Warmup function for browser initialization
  • HTML-wrapped JSON parsing fix
  • Parameter validation and warnings
  • Unified Response object
  • Auto mode (intelligent method selection)
  • AsyncIO request queue for bulk processing
  • Rate limiting and retry logic
  • Middleware system
  • Built-in proxy rotation

Dependencies

  • httpx - Async HTTP client
  • browserforge - Browser fingerprint generation
  • camoufox - Anti-detection browser
  • playwright - Browser automation (via camoufox)
  • beautifulsoup4 - HTML parsing
  • lxml - XPath support

Testing

# Run fast, offline-safe suite
uv run pytest

# Include network + browser integration tests (needs internet & Playwright)
EASYSCRAPER_RUN_NETWORK_TESTS=1 uv run pytest

License

MIT

Third-Party Licenses

SwiftCrawl depends on the following open-source libraries. We are grateful to their maintainers and contributors:

  • beautifulsoup4 – MIT – https://www.crummy.com/software/BeautifulSoup/
  • browserforge – Apache-2.0 – https://github.com/daijro/browserforge
  • camoufox – MPL-2.0 – https://github.com/daijro/camoufox
  • httpx – BSD-3-Clause – https://github.com/encode/httpx
  • lxml – BSD-3-Clause – https://github.com/lxml/lxml
  • playwright – Apache-2.0 – https://github.com/microsoft/playwright-python

All licenses require attribution. Please review each library's license for specific terms.

Contributing

Contributions are welcome! This is an early-stage project designed for flexible web scraping with anti-detection capabilities.
