Crawlsmith helps you craft reliable web crawlers in Python, combining page fetching, HTML parsing, link discovery, and content extraction into a simple and extensible toolkit.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

jmcristobal

These details have not been verified by PyPI

Intended Audience
- Developers
Natural Language
- English
Programming Language

Project description

CrawlSmith

Crawlsmith is a Python scraping toolkit for fetching web pages with curl_cffi, extracting readable content, detecting common anti-bot interstitials, and returning structured metadata in a single result object.

It is designed for Python developers who want a small, pragmatic interface for:

fetching HTML or XML content
converting HTML to Markdown via domdown, which turns article-like web pages into clean, structured Markdown with frontmatter, article body extraction, and preserved images, tables, and code blocks
rotating browser impersonation profiles
trying multiple proxies
classifying HTTP and network failures
extracting document, Open Graph, Twitter, and HTTP metadata

Features

Async-first Python API built around CurlCffiScraper
Structured FetchResult object with success state, content, Markdown, and metadata
Automatic browser fingerprint headers and curl_cffi impersonation support
Proxy rotation that returns on the first successful attempt and caps the number of proxies tried
Detection of common anti-bot challenge pages such as Cloudflare-style interstitials
Gzip payload handling for compressed responses and feeds
Built-in CLI for quick fetch, inspection, and debugging

Installation

Install from PyPI:

pip install crawlsmith

Requirements:

Python 3.10+

Quick Start

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        print(result.status)
        print(result.content[:200])
        print(result.markdown[:200])
    else:
        print(result.error_type, result.error)


asyncio.run(main())

Python Usage

Basic Fetch

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if not result.ok:
        raise RuntimeError(f"{result.error_type}: {result.error}")

    print("Status:", result.status)
    print("URL:", result.url)
    print("Content length:", result.content_length)


asyncio.run(main())
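
Because fetch is a coroutine, several URLs can be requested concurrently. The sketch below is one way to do that with a concurrency cap. The helper name fetch_all and its max_concurrency parameter are hypothetical and not part of Crawlsmith, and whether a single CurlCffiScraper instance is intended to be shared across concurrent calls is an assumption worth checking against the source.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def fetch_all(
    fetch: Callable[[str], Awaitable[Any]],
    urls: list[str],
    max_concurrency: int = 5,
) -> list[Any]:
    """Fetch every URL with at most max_concurrency requests in flight.

    `fetch` is any coroutine function that takes a URL, for example the
    fetch method of a CurlCffiScraper instance.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Any:
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

Inside an async function, this could be called as results = await fetch_all(scraper.fetch, urls). Since failures come back as results rather than exceptions, each entry can then be checked with result.ok.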

Read HTML and Markdown

When a request succeeds with HTTP 200, Crawlsmith returns both the raw response body and a Markdown representation.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        html = result.content
        markdown = result.markdown
        print(html[:300])
        print(markdown[:300])


asyncio.run(main())

Access Structured Metadata

Each result includes metadata extracted from the response body and headers.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    metadata = result.metadata or {}
    document = metadata.get("document", {})
    open_graph = metadata.get("open_graph", {})
    twitter = metadata.get("twitter", {})
    http = metadata.get("http", {})

    print("Title:", document.get("title"))
    print("Description:", document.get("description"))
    print("Canonical URL:", document.get("canonical_url"))
    print("OG Title:", open_graph.get("title"))
    print("Twitter Card:", twitter.get("card"))
    print("Final URL:", http.get("final_url"))


asyncio.run(main())
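
Pages often fill in only some of these metadata sections, so a fallback chain can be handy. The helper below is a minimal sketch that works on the plain metadata dictionary. The name best_title, the order of preference, and the presence of a "title" key under "twitter" are assumptions made for illustration.

```python
from typing import Any, Mapping, Optional


def best_title(metadata: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return the first non-empty title found across metadata sections."""
    metadata = metadata or {}
    # Assumed order of preference: document, then Open Graph, then Twitter.
    for section in ("document", "open_graph", "twitter"):
        values = metadata.get(section) or {}
        title = values.get("title")
        if isinstance(title, str) and title.strip():
            return title.strip()
    return None
```

With a fetched result, this would be called as best_title(result.metadata).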

Use Proxies

Pass a list of proxies. Crawlsmith shuffles them, tries up to three unique entries, and returns as soon as one attempt succeeds with enough content to satisfy min_content_length.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        proxies=[
            "http://user:pass@proxy-1.example:8080",
            "http://user:pass@proxy-2.example:8080",
            "proxy-3.example:8080",
        ],
        min_content_length=2000,
    )

    result = await scraper.fetch("https://example.com")
    print(result.ok, result.via_proxy, result.proxy_url)


asyncio.run(main())
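
The selection step described above (deduplicate, shuffle, cap at three) can be sketched in a few lines. This illustrates the described behavior and is not Crawlsmith's actual implementation; the function name pick_proxy_candidates and its parameters are hypothetical.

```python
import random
from typing import Optional, Sequence


def pick_proxy_candidates(
    proxies: Sequence[str],
    max_attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Return up to max_attempts unique proxies in random order."""
    rng = rng or random.Random()
    # dict.fromkeys() drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(proxies))
    rng.shuffle(unique)
    return unique[:max_attempts]
```

Passing a seeded random.Random makes the order reproducible, which helps when testing code that depends on it.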

Control Browser Impersonation

You can force a specific curl_cffi impersonation profile instead of using the default randomized behavior.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(impersonate="chrome120")
    result = await scraper.fetch("https://example.com")
    print(result.status, result.error_type)


asyncio.run(main())

Configure TLS and Timeouts

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        verify=True,
        connect_timeout=5,
        read_timeout=20,
    )
    result = await scraper.fetch("https://example.com")
    print(result.to_dict())


asyncio.run(main())

If you need to disable TLS certificate verification for a controlled internal environment, set verify=False.

Handle Errors Explicitly

Failures are returned as structured results instead of raising request errors in normal operation.

import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")

    if result.ok:
        print("Fetched successfully")
        return

    print("Error type:", result.error_type)
    print("Error message:", result.error)
    print("HTTP status:", result.status)
    print("Blocked:", result.is_blocked)


asyncio.run(main())

Common error types include:

TIMEOUT
CONNECTION
SSL
INVALID_URL
BLOCKED
HTTP_403
HTTP_429
HTTP_4XX
HTTP_5XX
UNKNOWN
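
These labels lend themselves to a small policy table in calling code. The sketch below maps an error_type string to a follow-up action. The grouping into retryable and identity-related errors is an assumption made for illustration, not something Crawlsmith defines, and next_action is a hypothetical helper.

```python
from typing import Optional

# Assumed grouping for illustration; Crawlsmith does not define these sets.
RETRYABLE = {"TIMEOUT", "CONNECTION", "HTTP_429", "HTTP_5XX"}
NEEDS_NEW_IDENTITY = {"BLOCKED", "HTTP_403"}


def next_action(error_type: Optional[str]) -> str:
    """Map an error_type string to 'retry', 'rotate', or 'give_up'."""
    if error_type in RETRYABLE:
        # Transient failures: the same request may work after a delay.
        return "retry"
    if error_type in NEEDS_NEW_IDENTITY:
        # Likely refused on identity: try another proxy or impersonation profile.
        return "rotate"
    # INVALID_URL, SSL, HTTP_4XX, UNKNOWN, and anything unrecognized.
    return "give_up"
```

Calling next_action(result.error_type) after a failed fetch keeps the retry decision in one place instead of scattering string comparisons through the crawler.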

Serialize Results

FetchResult can be converted directly into a plain dictionary for logging, storage, or JSON serialization.

import asyncio
import json

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    print(json.dumps(result.to_dict(), indent=2))


asyncio.run(main())

CLI Usage

The package installs a crawlsmith command for quick fetches from the terminal.

Basic CLI Request

crawlsmith fetch --url https://example.com

The CLI prints a JSON-serialized FetchResult to stdout.

Print the Response Body

crawlsmith fetch --url https://example.com --print-content

Print Markdown Version

crawlsmith fetch --url https://example.com --print-markdown

The Markdown output includes YAML frontmatter with metadata (title, author, tags, etc.) followed by clean, readable content.
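
If the frontmatter and the body need to be handled separately, they can be split on the delimiter lines. The sketch below assumes the frontmatter is fenced by lines containing only ---, which is the usual YAML frontmatter convention but is not confirmed here for domdown's output. split_frontmatter is a hypothetical helper and leaves the YAML unparsed.

```python
def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Split Markdown into (frontmatter, body); frontmatter is '' if absent."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    # Look for the closing '---' delimiter after the opening one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return frontmatter, body
    # No closing delimiter: treat the whole text as body.
    return "", markdown
```

The returned frontmatter string can be handed to a YAML parser of your choice if the individual fields are needed.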

Use One or More Proxies

crawlsmith fetch --url https://example.com \
  --proxy http://user:pass@proxy-1.example:8080 \
  --proxy http://user:pass@proxy-2.example:8080 \
  --min-content-length 2000

Force an Impersonation Profile

crawlsmith fetch --url https://example.com --impersonate chrome120

Change Timeout or Disable TLS Verification

crawlsmith fetch --url https://example.com --timeout 20

crawlsmith fetch --url https://example.com --insecure

CLI Exit Codes

0 when the request succeeds
1 when the request fails
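
The exit code makes the CLI easy to drive from another program. The following sketch runs a command, treats exit code 0 as success, and parses stdout as JSON when possible. The helper run_cli is hypothetical, and whether the CLI also prints JSON on failure is not stated above, so the code tolerates empty or non-JSON output.

```python
import json
import subprocess
from typing import Any, Optional


def run_cli(command: list[str], timeout: float = 60.0) -> tuple[bool, Optional[Any]]:
    """Run a command and return (succeeded, parsed_stdout_or_None)."""
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=False
    )
    try:
        payload = json.loads(completed.stdout)
    except json.JSONDecodeError:
        # stdout was empty or did not contain valid JSON.
        payload = None
    # Exit code 0 means success; any other code is treated as failure.
    return completed.returncode == 0, payload
```

With Crawlsmith installed, a call would look like run_cli(["crawlsmith", "fetch", "--url", "https://example.com"]).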

CLI Help

crawlsmith --help
crawlsmith fetch --help

Result Model

FetchResult exposes the following fields:

ok: whether the request was considered successful
url: requested URL
status: HTTP status code when available
content: raw response text when successful
markdown: Markdown conversion of the response body when successful
markdown_length: length of the Markdown output (added in 0.2.0)
metadata: extracted document, Open Graph, Twitter, and HTTP metadata
error_type: normalized error classification
error: human-readable error summary
via_proxy: whether the final attempt, successful or not, used a proxy
proxy_url: proxy used for the final attempt, if any
content_length: UTF-8 byte length of the extracted text
is_blocked: whether the response looks like an anti-bot interstitial
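
For logging, these fields can be condensed into a single line. The sketch below works on a plain dictionary and assumes that to_dict() uses the field names listed above as keys, which is not confirmed here. summarize is a hypothetical helper.

```python
from typing import Any, Mapping


def summarize(result: Mapping[str, Any]) -> str:
    """Build a one-line log summary from a result dictionary."""
    route = "proxy" if result.get("via_proxy") else "direct"
    if result.get("ok"):
        return (
            f"OK {result.get('status')} {result.get('url')} "
            f"({result.get('content_length')} bytes, {route})"
        )
    # Mark failures that look like an anti-bot interstitial.
    flag = " [blocked]" if result.get("is_blocked") else ""
    return (
        f"FAIL {result.get('error_type')} {result.get('url')} "
        f"({route}): {result.get('error')}{flag}"
    )
```

With a fetched result, this would be called as summarize(result.to_dict()).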

Support & Connect

⭐ Star the repo if you found it useful
☕ Support me: Say thanks by buying me a coffee! https://buymeacoffee.com/juanmcristobal
💼 Open to work: https://www.linkedin.com/in/jmcristobal/

History

0.2.0 (2026-06-04)

Switch from markdownify to domdown for HTML-to-Markdown conversion
- YAML frontmatter with extracted metadata (title, author, tags, etc.)
- Article body extraction and image/table/code block preservation
Add --print-markdown CLI flag for printing Markdown output
Restructure CLI: @click.group() with fetch subcommand
Add markdown_length field to FetchResult
Update README examples to use crawlsmith fetch --url ...
Update tests for CLI and domdown changes

0.1.0 (2026-04-07)

First release.

Project details

Release history

0.2.0 (this version): Jun 4, 2026
0.1.0: Apr 8, 2026

Download files

Source Distribution

crawlsmith-0.2.0.tar.gz (19.5 kB)

Uploaded Jun 4, 2026 Source

Built Distribution

crawlsmith-0.2.0-py2.py3-none-any.whl (13.1 kB)

Uploaded Jun 4, 2026 Python 2, Python 3

File details

Details for the file crawlsmith-0.2.0.tar.gz.

File metadata

Download URL: crawlsmith-0.2.0.tar.gz
Upload date: Jun 4, 2026
Size: 19.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for crawlsmith-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`5604755913f5f49afe2427bea34db8b63ceae41c270802d70114f703aa9a7984`
MD5	`e6c53bcddea6e4b4af8efbb3aca4f230`
BLAKE2b-256	`8fe6a2982523b24b080d33892b835c019e492a1050edfec75b20711d31d5bc85`

Provenance

The following attestation bundles were made for crawlsmith-0.2.0.tar.gz:

Publisher: publish.yml on juanmcristobal/crawlsmith

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: crawlsmith-0.2.0.tar.gz
- Subject digest: 5604755913f5f49afe2427bea34db8b63ceae41c270802d70114f703aa9a7984
- Sigstore transparency entry: 1715616741
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: juanmcristobal/crawlsmith@735becdf6cbbc1ff969590566df9e74f71ad4fd3
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/juanmcristobal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@735becdf6cbbc1ff969590566df9e74f71ad4fd3
- Trigger Event: push

File details

Details for the file crawlsmith-0.2.0-py2.py3-none-any.whl.

File metadata

Download URL: crawlsmith-0.2.0-py2.py3-none-any.whl
Upload date: Jun 4, 2026
Size: 13.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for crawlsmith-0.2.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`46e6cf65c1f761158cc803b2626030a0bcb2da8b5351754676de2aa7814e30c1`
MD5	`846d5c64046fd447e0e6eca94a185c72`
BLAKE2b-256	`ca2050913f25b2aa9d6a0ccf23107fab6aef28e8f3588edba2ac0d9e058bab8d`

Provenance

The following attestation bundles were made for crawlsmith-0.2.0-py2.py3-none-any.whl:

Publisher: publish.yml on juanmcristobal/crawlsmith

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: crawlsmith-0.2.0-py2.py3-none-any.whl
- Subject digest: 46e6cf65c1f761158cc803b2626030a0bcb2da8b5351754676de2aa7814e30c1
- Sigstore transparency entry: 1715616814
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: juanmcristobal/crawlsmith@735becdf6cbbc1ff969590566df9e74f71ad4fd3
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/juanmcristobal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@735becdf6cbbc1ff969590566df9e74f71ad4fd3
- Trigger Event: push
