curl-reap

Reap the web: browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine in one library.

pip install curl_reap

Why

Modern scraping needs three things, and today you reach for three different tools:

Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.

curl_reap takes the best idea from each and puts them behind one friendly API.

Feature	curl_cffi	Scrapy	Scrapling	curl_reap
Real browser TLS / JA3	yes	no	partial	yes
Parser built in	no	yes	yes	yes
Self-healing selectors	no	no	yes	yes
Concurrent crawl engine	no	yes	no	yes
AutoThrottle, retries, pipelines	no	yes	no	yes
One small dependency set	yes	no	no	yes
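
For readers unfamiliar with the "TLS / JA3" row: a JA3 fingerprint is the MD5 of five comma-separated ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats), with the values inside each field joined by dashes. The sketch below is independent of curl_reap and only illustrates why two clients offering the same ciphers in a different order look different to a server. The helper names and the ClientHello values are made up for illustration and are not real browser values.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five fields separated by commas,
    with the values inside each field joined by dashes."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(*fields):
    """MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


# Hypothetical ClientHello values, for illustration only.
client_a = (771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
# Same cipher suites, offered in a different order.
client_b = (771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])

print(ja3_string(*client_a))  # 771,4865-4866-4867,0-23-65281,29-23-24,0
print(ja3_hash(*client_a) == ja3_hash(*client_b))  # False: order matters
```

Because the order of ciphers and extensions feeds into the hash, copying a browser's headers alone does not change what the server sees during the handshake.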

Install

pip install curl_reap

Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.

Quick start

A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Self-healing selectors

Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:

import curl_reap as reap

page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")     # remember its shape

# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))                            # found anyway

Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
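
How curl_reap builds its structural signature is not described here, so the following is only a rough sketch of the general idea, written against the standard library's html.parser rather than curl_reap itself. It assumes a signature made of tag, parent tag, text, and attribute names, and picks the best-scoring element in the new page. ElementCollector, similarity, and relocate are hypothetical names, and the scoring weights are arbitrary.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementCollector(HTMLParser):
    """Collect a small signature for every element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, innermost last
        self.elements = []  # signatures of all elements seen so far

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1]["tag"] if self.stack else None
        sig = {"tag": tag, "attrs": dict(attrs), "parent": parent, "text": ""}
        self.elements.append(sig)
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self.stack.append(sig)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, ignoring stray end tags.
        if any(el["tag"] == tag for el in self.stack):
            while self.stack.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def collect(html):
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def similarity(saved, candidate):
    """Score a candidate: one point each for tag, parent, and text,
    plus the overlap (Jaccard) of attribute names."""
    score = 0.0
    score += saved["tag"] == candidate["tag"]
    score += saved["parent"] == candidate["parent"]
    score += saved["text"] == candidate["text"]
    a, b = set(saved["attrs"]), set(candidate["attrs"])
    if a | b:
        score += len(a & b) / len(a | b)
    return score


def relocate(saved, html):
    """Return the element in `html` that best matches the saved signature."""
    return max(collect(html), key=lambda el: similarity(saved, el))


old_html = ('<div class="card"><a class="buy-btn" href="/cart/42">Buy now</a>'
            '<a class="more" href="/info">Details</a></div>')
new_html = ('<div class="card"><a class="details" href="/info/99">Details</a>'
            '<a class="purchase-cta" href="/cart/99">Buy now</a></div>')

# "Save" the element while the old class name still exists.
saved = next(el for el in collect(old_html)
             if el["attrs"].get("class") == "buy-btn")
best = relocate(saved, new_html)
print(best["attrs"]["href"])  # /cart/99
```

The class value is deliberately left out of the score, since a renamed class is exactly the change the match has to survive.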

Crawl at scale

A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:

from urllib.parse import urljoin

import curl_reap as reap
from curl_reap import JsonLinesPipeline

BASE_URL = "https://quotes.toscrape.com"

class Quotes(reap.Spider):
    start_urls = [BASE_URL]

    def parse(self, page):
        # Each quote block becomes one item (a plain dict).
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        # Follow the "next" link until the last page.
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            # urljoin handles both relative and absolute hrefs.
            yield reap.Request(urljoin(BASE_URL, nxt), self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,                       # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
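
To give a feel for what "adapts to server latency" can mean, here is a minimal sketch loosely modelled on the rule Scrapy documents for its AutoThrottle extension: aim for a delay of latency divided by the target concurrency, move halfway toward it on each response, never let a non-200 response lower the delay, and clamp the result. Whether curl_reap follows the same rule is an assumption; next_delay and its parameters are hypothetical names, not part of the library.

```python
def next_delay(current, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the delay (in seconds) to wait before the next request."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target = latency / target_concurrency
    # Move halfway toward the target so one odd sample cannot swing the delay.
    proposed = (current + target) / 2.0
    # A fast error response must not speed the crawl up.
    if status != 200 and proposed < current:
        proposed = current
    # Keep the delay inside the configured bounds.
    return min(max(proposed, min_delay), max_delay)


delay = 1.0  # hypothetical starting delay
history = []
for latency, status in [(0.2, 200), (0.2, 200), (3.0, 200), (0.1, 503)]:
    delay = next_delay(delay, latency, status)
    history.append(round(delay, 2))

print(history)  # [0.6, 0.4, 1.7, 1.7]
```

The delay shrinks while the server answers quickly, jumps after the slow 3.0 s response, and stays put on the quick 503 instead of dropping again.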

API at a glance

reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response you can .css() / .xpath() directly.
reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
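
The Pipeline base class interface is not shown above, so the following does not subclass it. It is a standalone sketch of what item deduplication typically involves, assuming items are JSON-serialisable dicts and that two items with the same content count as duplicates. fingerprint and dedup are hypothetical helper names.

```python
import hashlib
import json


def fingerprint(item):
    """Stable hash of an item's content, independent of key order."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def dedup(items):
    """Yield each distinct item once, preserving first-seen order."""
    seen = set()
    for item in items:
        fp = fingerprint(item)
        if fp not in seen:
            seen.add(fp)
            yield item


scraped = [
    {"text": "Quote A", "author": "Ann"},
    {"author": "Ann", "text": "Quote A"},  # same content, different key order
    {"text": "Quote B", "author": "Bob"},
]
unique = list(dedup(scraped))
print(len(unique))  # 2
```

Storing fingerprints instead of whole items keeps the memory cost of the seen-set small on long crawls.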

Responsible use

curl_reap only makes its TLS handshake look like a real browser's, which is the same handshake every ordinary browser already sends. It does not ship a challenge solver and will not break CAPTCHAs or anti-bot walls (Cloudflare challenges, DataDome, PerimeterX, and similar). If a site has deliberately put up an access-control wall, treat that as a signal to stop. Respect robots.txt and each site's terms, throttle your crawls, and collect only data you are allowed to collect.
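
One way to honour robots.txt before queuing a URL is the standard library's urllib.robotparser, sketched below. It does not use curl_reap, and the rules and the "example-bot" user agent are invented for illustration. A real crawl would load the site's actual file with set_url() and read() rather than parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Inline rules so the example runs offline; a live crawl would call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """True if the parsed robots.txt permits `user_agent` to fetch `url`."""
    return rp.can_fetch(user_agent, url)


print(allowed("https://example.com/public/page"))   # True
print(allowed("https://example.com/private/data"))  # False
print(rp.crawl_delay("example-bot"))                # 2
```

The Crawl-delay value, when a site provides one, is a sensible lower bound for the delay between requests.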

License

MIT. See LICENSE.

Release history

0.1.2: Jun 30, 2026
0.1.1: Jun 30, 2026
0.1.0: Jun 30, 2026 (this version)

Download files

Source Distribution

curl_reap-0.1.0.tar.gz (10.8 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

curl_reap-0.1.0-py3-none-any.whl (13.1 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file curl_reap-0.1.0.tar.gz.

File metadata

Download URL: curl_reap-0.1.0.tar.gz
Upload date: Jun 30, 2026
Size: 10.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for curl_reap-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ccd66ff7705a13e9d3b93b2df0dd7ae45620c4ffb60e3a2f33666b282068a0b2`
MD5	`db8339dc4cd1b978c7f7f7f55b327fbe`
BLAKE2b-256	`e3d63448fd5b8d823df7fe55f712c79641522bb02ec2a26afe85eb633356eb80`

File details

Details for the file curl_reap-0.1.0-py3-none-any.whl.

File metadata

Download URL: curl_reap-0.1.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 13.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for curl_reap-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b44c327b3c361afcdb542131510dc4bdc798149387f1044da2b958b9781d789`
MD5	`80f2fe64b4bd00a7014bf83efbdd58dc`
BLAKE2b-256	`d55fab93e936caf32b7bafdf856502502e11b5df65aabedf889640f6eef291e9`
