Crawlsmith helps you craft reliable web crawlers in Python, combining page fetching, HTML parsing, link discovery, and content extraction into a simple and extensible toolkit.
CrawlSmith
Crawlsmith is a Python scraping toolkit for fetching web pages with
curl_cffi, extracting readable content, detecting common anti-bot
interstitials, and returning structured metadata in a single result object.
It is designed for Python developers who want a small, pragmatic interface for:
- fetching HTML or XML content
- converting HTML to Markdown
- rotating browser impersonation profiles
- trying multiple proxies
- classifying HTTP and network failures
- extracting document, Open Graph, Twitter, and HTTP metadata
Features
- Async-first Python API built around CurlCffiScraper
- Structured FetchResult object with success state, content, Markdown, and metadata
- Automatic browser fingerprint headers and curl_cffi impersonation support
- Proxy rotation with early success and retry limits
- Detection of common anti-bot challenge pages such as Cloudflare-style interstitials
- Gzip payload handling for compressed responses and feeds
- Built-in CLI for quick fetch, inspection, and debugging
Installation
Install from PyPI:
pip install crawlsmith
Requirements:
- Python 3.10+
Quick Start
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    if result.ok:
        print(result.status)
        print(result.content[:200])
        print(result.markdown[:200])
    else:
        print(result.error_type, result.error)


asyncio.run(main())
```
Python Usage
Basic Fetch
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    if not result.ok:
        raise RuntimeError(f"{result.error_type}: {result.error}")
    print("Status:", result.status)
    print("URL:", result.url)
    print("Content length:", result.content_length)


asyncio.run(main())
```
Read HTML and Markdown
When a request succeeds with HTTP 200, Crawlsmith returns both the raw response
body and a Markdown representation.
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    if result.ok:
        html = result.content
        markdown = result.markdown
        print(html[:300])
        print(markdown[:300])


asyncio.run(main())
```
Access Structured Metadata
Each result includes metadata extracted from the response body and headers.
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    metadata = result.metadata or {}
    document = metadata.get("document", {})
    open_graph = metadata.get("open_graph", {})
    twitter = metadata.get("twitter", {})
    http = metadata.get("http", {})
    print("Title:", document.get("title"))
    print("Description:", document.get("description"))
    print("Canonical URL:", document.get("canonical_url"))
    print("OG Title:", open_graph.get("title"))
    print("Twitter Card:", twitter.get("card"))
    print("Final URL:", http.get("final_url"))


asyncio.run(main())
```
Use Proxies
Pass a list of proxies. Crawlsmith will shuffle them, try up to three unique entries, and return as soon as one succeeds with enough content.
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        proxies=[
            "http://user:pass@proxy-1.example:8080",
            "http://user:pass@proxy-2.example:8080",
            "proxy-3.example:8080",
        ],
        min_content_length=2000,
    )
    result = await scraper.fetch("https://example.com")
    print(result.ok, result.via_proxy, result.proxy_url)


asyncio.run(main())
```
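The rotation policy described above (shuffle the pool, try at most three unique entries, stop at the first good response) can be sketched in plain Python. This is an illustration of the documented behavior, not crawlsmith's actual implementation:

```python
import random


def pick_proxy_order(proxies: list[str], max_attempts: int = 3) -> list[str]:
    """Return a shuffled selection of at most max_attempts unique proxies."""
    pool = list(dict.fromkeys(proxies))  # drop duplicates, keep first occurrence
    random.shuffle(pool)
    return pool[:max_attempts]
```

Crawlsmith would then fetch through each candidate in order and keep the first result whose body meets min_content_length.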
Control Browser Impersonation
You can force a specific curl_cffi impersonation profile instead of using the
default randomized behavior.
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(impersonate="chrome120")
    result = await scraper.fetch("https://example.com")
    print(result.status, result.error_type)


asyncio.run(main())
```
Configure TLS and Timeouts
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper(
        verify=True,
        connect_timeout=5,
        read_timeout=20,
    )
    result = await scraper.fetch("https://example.com")
    print(result.to_dict())


asyncio.run(main())
```
If you need to disable TLS certificate verification for a controlled internal
environment, set `verify=False`.
Handle Errors Explicitly
In normal operation, failures are returned as structured results rather than raised as exceptions.
```python
import asyncio

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    if result.ok:
        print("Fetched successfully")
        return
    print("Error type:", result.error_type)
    print("Error message:", result.error)
    print("HTTP status:", result.status)
    print("Blocked:", result.is_blocked)


asyncio.run(main())
```
Common error types include:
- TIMEOUT
- CONNECTION
- SSL
- INVALID_URL
- BLOCKED
- HTTP_403
- HTTP_429
- HTTP_4XX
- HTTP_5XX
- UNKNOWN
Serialize Results
FetchResult can be converted directly into a plain dictionary for logging,
storage, or JSON serialization.
```python
import asyncio
import json

from crawlsmith import CurlCffiScraper


async def main() -> None:
    scraper = CurlCffiScraper()
    result = await scraper.fetch("https://example.com")
    print(json.dumps(result.to_dict(), indent=2))


asyncio.run(main())
```
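A common follow-on pattern is appending each serialized result to a JSON Lines log. The record below is a made-up sample in the FetchResult shape, and append_jsonl is a hypothetical helper, not part of crawlsmith:

```python
import json


def append_jsonl(path: str, record: dict) -> None:
    """Append one JSON object per line to a log file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Illustrative record only; a real one would come from result.to_dict().
append_jsonl("results.jsonl", {"ok": True, "url": "https://example.com", "status": 200})
```

Each line is independently parseable, which keeps the log usable even if a run is interrupted mid-write.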
CLI Usage
The package installs a `crawlsmith` command for quick fetches from the terminal.
Basic CLI Request
```shell
crawlsmith https://example.com
```
The CLI prints a JSON-serialized FetchResult to stdout.
Print the Response Body
```shell
crawlsmith https://example.com --print-content
```
Use One or More Proxies
```shell
crawlsmith https://example.com \
  --proxy http://user:pass@proxy-1.example:8080 \
  --proxy http://user:pass@proxy-2.example:8080 \
  --min-content-length 2000
```
Force an Impersonation Profile
```shell
crawlsmith https://example.com --impersonate chrome120
```
Change Timeout or Disable TLS Verification
```shell
crawlsmith https://example.com --timeout 20
crawlsmith https://example.com --insecure
```
CLI Exit Codes
- 0 when the request succeeds
- 1 when the request fails
CLI Help
```shell
crawlsmith --help
```
Result Model
FetchResult exposes the following fields:
- ok: whether the request was considered successful
- url: requested URL
- status: HTTP status code when available
- content: raw response text when successful
- markdown: Markdown conversion of the response body when successful
- metadata: extracted document and HTTP metadata
- error_type: normalized error classification
- error: human-readable error summary
- via_proxy: whether the successful or failed attempt used a proxy
- proxy_url: proxy used for the final attempt, if any
- content_length: UTF-8 byte length of the extracted text
- is_blocked: whether the response looks like an anti-bot interstitial
Support & Connect
- ⭐ Star the repo if you found it useful
- ☕ Support me: Say thanks by buying me a coffee! https://buymeacoffee.com/juanmcristobal
- 💼 Open to work: https://www.linkedin.com/in/jmcristobal/
History
0.1.0 (2026-04-07)
- First release.