Crawlsmith helps you craft reliable web crawlers in Python, combining page fetching, HTML parsing, link discovery, and content extraction into a simple and extensible toolkit.

Crawlsmith

Crawlsmith is a Python scraping toolkit for fetching web pages with curl_cffi, extracting readable content, detecting common anti-bot interstitials, and returning structured metadata in a single result object.

It is designed for Python developers who want a small, pragmatic interface for:

  • fetching HTML or XML content
  • converting HTML to Markdown via domdown, which extracts the article body from article-like pages and emits clean, structured Markdown with frontmatter while preserving images, tables, and code blocks
  • rotating browser impersonation profiles
  • trying multiple proxies
  • classifying HTTP and network failures
  • extracting document, Open Graph, Twitter, and HTTP metadata

Features

  • Async-first Python API built around CurlCffiScraper
  • Structured FetchResult object with success state, content, Markdown, and metadata
  • Automatic browser fingerprint headers and curl_cffi impersonation support
  • Proxy rotation with early success and retry limits
  • Detection of common anti-bot challenge pages such as Cloudflare-style interstitials
  • Gzip payload handling for compressed responses and feeds
  • Built-in CLI for quick fetch, inspection, and debugging
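
The gzip handling mentioned above can be pictured with a small standard-library sketch. This is not Crawlsmith's internal code: maybe_gunzip is a hypothetical helper that checks for the gzip magic bytes and decompresses only when they are present, which is one common way to cope with compressed feeds and sitemaps.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(payload: bytes) -> bytes:
    """Return the decompressed payload if it looks gzipped, else return it unchanged."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload


if __name__ == "__main__":
    feed = b"<rss><channel><title>Example</title></channel></rss>"
    # A compressed feed round-trips, and a plain one passes through untouched.
    print(maybe_gunzip(gzip.compress(feed)) == feed)
    print(maybe_gunzip(feed) == feed)
```

Checking the magic bytes rather than trusting a Content-Encoding header is a deliberate choice in this sketch, since files such as .xml.gz sitemaps are often served without that header.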

Installation

Install from PyPI:

pip install crawlsmith

Requirements:

  • Python 3.10+

Quick Start

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        print(result.status)
        print(result.content[:200])
        print(result.markdown[:200])
    else:
        print(result.error_type, result.error)


asyncio.run(main())

Python Usage

Basic Fetch

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if not result.ok:
        raise RuntimeError(f"{result.error_type}: {result.error}")

    print("Status:", result.status)
    print("URL:", result.url)
    print("Content length:", result.content_length)


asyncio.run(main())
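
Because the API is async, several URLs can be fetched concurrently. The sketch below is an assumption-laden illustration, not part of Crawlsmith: fetch_many is a hypothetical helper that accepts any coroutine function (for example scraper.fetch) and caps the number of requests in flight with a semaphore. Whether a single scraper instance is safe to share across concurrent calls is not stated in this README, so verify that before relying on it.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def fetch_many(
    fetch: Callable[[str], Awaitable[Any]],
    urls: list[str],
    limit: int = 5,
) -> list[Any]:
    """Run `fetch` over `urls` with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Any:
        async with semaphore:
            return await fetch(url)

    # asyncio.gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def _demo() -> None:
    # Stand-in for scraper.fetch so the sketch runs without network access.
    async def fake_fetch(url: str) -> str:
        await asyncio.sleep(0)
        return f"fetched {url}"

    urls = ["https://example.com/a", "https://example.com/b"]
    print(await fetch_many(fake_fetch, urls, limit=1))


if __name__ == "__main__":
    asyncio.run(_demo())
```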

Read HTML and Markdown

When a request succeeds with HTTP 200, Crawlsmith returns both the raw response body and a Markdown representation.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        html = result.content
        markdown = result.markdown
        print(html[:300])
        print(markdown[:300])


asyncio.run(main())

Access Structured Metadata

Each result includes metadata extracted from the response body and headers.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    metadata = result.metadata or {}
    document = metadata.get("document", {})
    open_graph = metadata.get("open_graph", {})
    twitter = metadata.get("twitter", {})
    http = metadata.get("http", {})

    print("Title:", document.get("title"))
    print("Description:", document.get("description"))
    print("Canonical URL:", document.get("canonical_url"))
    print("OG Title:", open_graph.get("title"))
    print("Twitter Card:", twitter.get("card"))
    print("Final URL:", http.get("final_url"))


asyncio.run(main())
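
Since any metadata section may be missing or empty, a fallback chain is often handy. The helper below is a hypothetical sketch that works on a plain dictionary shaped like result.metadata. The document and open_graph title keys appear in the example above, while a title key under twitter is an assumption.

```python
from typing import Any, Optional


def best_title(metadata: Optional[dict[str, Any]]) -> Optional[str]:
    """Return the first non-empty title from document, Open Graph, then Twitter metadata."""
    metadata = metadata or {}
    for section in ("document", "open_graph", "twitter"):
        # A section may be absent or None, so fall back to an empty dict.
        value = (metadata.get(section) or {}).get("title")
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


if __name__ == "__main__":
    sample = {"document": {"title": ""}, "open_graph": {"title": "Example Domain"}}
    print(best_title(sample))
```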

Use Proxies

Pass a list of proxies. Crawlsmith will shuffle them, try up to three unique entries, and return as soon as one succeeds with enough content.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        proxies=[
            "http://user:pass@proxy-1.example:8080",
            "http://user:pass@proxy-2.example:8080",
            "proxy-3.example:8080",
        ],
        min_content_length=2000,
    )

    result = await scraper.fetch("https://example.com")
    print(result.ok, result.via_proxy, result.proxy_url)


asyncio.run(main())
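
The rotation strategy described above (shuffle, try up to three unique proxies, stop at the first one that returns enough content) can be sketched roughly as follows. This is an illustration of the idea, not Crawlsmith's implementation: first_good_proxy and the attempt callable are hypothetical, and measuring the threshold in UTF-8 bytes is an assumption.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional

# Hypothetical signature: given a proxy URL, return the page body or None on failure.
Attempt = Callable[[str], Awaitable[Optional[str]]]


async def first_good_proxy(
    proxies: list[str],
    attempt: Attempt,
    min_content_length: int = 2000,
    max_tries: int = 3,
) -> Optional[tuple[str, str]]:
    """Return (proxy, body) for the first proxy whose body is long enough, else None."""
    unique = list(dict.fromkeys(proxies))  # drop duplicates while keeping order
    random.shuffle(unique)
    for proxy in unique[:max_tries]:
        body = await attempt(proxy)
        if body is not None and len(body.encode("utf-8")) >= min_content_length:
            return proxy, body  # early success: skip the remaining proxies
    return None
```

Capping the number of attempts bounds the worst-case latency of a single fetch, at the cost of possibly never reaching a working proxy further down a long list.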

Control Browser Impersonation

You can force a specific curl_cffi impersonation profile instead of using the default randomized behavior.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(impersonate="chrome120")
    result = await scraper.fetch("https://example.com")
    print(result.status, result.error_type)


asyncio.run(main())

Configure TLS and Timeouts

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        verify=True,
        connect_timeout=5,
        read_timeout=20,
    )
    result = await scraper.fetch("https://example.com")
    print(result.to_dict())


asyncio.run(main())

If you need to disable TLS certificate verification for a controlled internal environment, set verify=False.

Handle Errors Explicitly

Failures are returned as structured results instead of raising request errors in normal operation.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        print("Fetched successfully")
        return

    print("Error type:", result.error_type)
    print("Error message:", result.error)
    print("HTTP status:", result.status)
    print("Blocked:", result.is_blocked)


asyncio.run(main())

Common error types include:

  • TIMEOUT
  • CONNECTION
  • SSL
  • INVALID_URL
  • BLOCKED
  • HTTP_403
  • HTTP_429
  • HTTP_4XX
  • HTTP_5XX
  • UNKNOWN
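
One way to act on these error types is a small policy function. The grouping below is only a suggested starting point under stated assumptions, not behavior documented by Crawlsmith: next_action is a hypothetical helper, and which errors count as transient depends on the target site.

```python
from typing import Optional

# Assumed groupings for illustration; tune them for your own targets.
TRANSIENT = {"TIMEOUT", "CONNECTION", "HTTP_429", "HTTP_5XX"}
BLOCK_LIKE = {"BLOCKED", "HTTP_403"}


def next_action(error_type: Optional[str]) -> str:
    """Map an error_type string to 'done', 'retry', 'switch_proxy', or 'give_up'."""
    if error_type is None:
        return "done"  # no error recorded
    if error_type in TRANSIENT:
        return "retry"  # likely to succeed later, ideally after a backoff
    if error_type in BLOCK_LIKE:
        return "switch_proxy"  # a different proxy or profile may help
    return "give_up"  # SSL, INVALID_URL, HTTP_4XX, UNKNOWN, and anything else


if __name__ == "__main__":
    for name in ("TIMEOUT", "BLOCKED", "INVALID_URL", None):
        print(name, "->", next_action(name))
```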

Serialize Results

FetchResult can be converted directly into a plain dictionary for logging, storage, or JSON serialization.

import asyncio
import json

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    print(json.dumps(result.to_dict(), indent=2))


asyncio.run(main())

CLI Usage

The package installs a crawlsmith command for quick fetches from the terminal.

Basic CLI Request

crawlsmith fetch --url https://example.com

The CLI prints a JSON-serialized FetchResult to stdout.

Print the Response Body

crawlsmith fetch --url https://example.com --print-content

Print Markdown Version

crawlsmith fetch --url https://example.com --print-markdown

The Markdown output includes YAML frontmatter with metadata (title, author, tags, etc.) followed by clean, readable content.
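
If you need the frontmatter and the body separately, a split along the frontmatter delimiters is usually enough. The sketch below assumes the conventional layout, where the block opens and closes with a line containing only ---. That layout is an assumption about the output rather than something this README specifies, and split_frontmatter is a hypothetical helper.

```python
def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' when no complete block is found."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return frontmatter, body
    # An opening delimiter without a closing one is treated as ordinary content.
    return "", markdown


if __name__ == "__main__":
    sample = "---\ntitle: Example\n---\n\n# Heading\n\nText"
    frontmatter, body = split_frontmatter(sample)
    print(frontmatter)
    print(body)
```

The frontmatter string can then be handed to a YAML parser of your choice.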

Use One or More Proxies

crawlsmith fetch --url https://example.com \
  --proxy http://user:pass@proxy-1.example:8080 \
  --proxy http://user:pass@proxy-2.example:8080 \
  --min-content-length 2000

Force an Impersonation Profile

crawlsmith fetch --url https://example.com --impersonate chrome120

Change Timeout or Disable TLS Verification

crawlsmith fetch --url https://example.com --timeout 20
crawlsmith fetch --url https://example.com --insecure

CLI Exit Codes

  • 0 when the request succeeds
  • 1 when the request fails
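
The exit codes make the CLI easy to drive from another program. The following sketch builds a command from the flags shown in this section and interprets the exit code. build_fetch_command and run_fetch are hypothetical helpers, and run_fetch requires the crawlsmith command to be installed and on PATH.

```python
import subprocess
from typing import Optional, Sequence


def build_fetch_command(
    url: str,
    proxies: Sequence[str] = (),
    impersonate: Optional[str] = None,
    timeout: Optional[int] = None,
    insecure: bool = False,
) -> list[str]:
    """Assemble a `crawlsmith fetch` argument list from the flags documented above."""
    command = ["crawlsmith", "fetch", "--url", url]
    for proxy in proxies:
        command += ["--proxy", proxy]
    if impersonate:
        command += ["--impersonate", impersonate]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if insecure:
        command.append("--insecure")
    return command


def run_fetch(command: Sequence[str]) -> tuple[bool, str]:
    """Run the CLI and return (succeeded, stdout); exit code 0 means success."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stdout


if __name__ == "__main__":
    print(build_fetch_command("https://example.com", timeout=20))
```

Passing the arguments as a list avoids shell quoting problems with proxy URLs that contain credentials.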

CLI Help

crawlsmith --help
crawlsmith fetch --help

Result Model

FetchResult exposes the following fields:

  • ok: whether the request was considered successful
  • url: requested URL
  • status: HTTP status code when available
  • content: raw response text when successful
  • markdown: Markdown conversion of the response body when successful
  • metadata: extracted document, Open Graph, Twitter, and HTTP metadata
  • error_type: normalized error classification
  • error: human-readable error summary
  • via_proxy: whether the final attempt, successful or not, went through a proxy
  • proxy_url: proxy used for the final attempt, if any
  • content_length: UTF-8 byte length of the extracted text
  • is_blocked: whether the response looks like an anti-bot interstitial
  • markdown_length: length of the Markdown output (added in 0.2.0)
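
Note that a UTF-8 byte length, as described for content_length, is not the same as len() on a Python string whenever the text contains non-ASCII characters. A minimal illustration, where utf8_length is a hypothetical helper and not a Crawlsmith function:

```python
def utf8_length(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    word = "caf\u00e9"  # "café": four characters, the last one takes two bytes
    print(len(word), utf8_length(word))
```

Keep this in mind when comparing content_length against a threshold such as min_content_length on pages with a lot of non-Latin text.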

History

0.2.0 (2026-06-04)

  • Switch from markdownify to domdown for HTML-to-Markdown conversion
    • YAML frontmatter with extracted metadata (title, author, tags, etc.)
    • Article body extraction and image/table/code block preservation
  • Add --print-markdown CLI flag for printing Markdown output
  • Restructure CLI: @click.group() with fetch subcommand
  • Add markdown_length field to FetchResult
  • Update README examples to use crawlsmith fetch --url ...
  • Update tests for CLI and domdown changes

0.1.0 (2026-04-07)

  • First release.
