curl-reap

Reap the web: browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine in one library.

pip install curl_reap

Full documentation with deep API reference and examples: https://anishfyi.github.io/curl_reap/

Why

Modern scraping needs three things, and today you reach for three different tools:

Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.

curl_reap takes the best idea from each and puts them behind one friendly API.

Feature	curl_cffi	Scrapy	Scrapling	curl_reap
Real browser TLS / JA3	yes	no	partial	yes
Parser built in	no	yes	yes	yes
Self-healing selectors	no	no	yes	yes
Concurrent crawl engine	no	yes	no	yes
AutoThrottle, retries, pipelines	no	yes	no	yes
One small dependency set	yes	no	no	yes
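
For readers unfamiliar with the "JA3" row: a JA3 fingerprint is the MD5 digest of five ClientHello fields joined into one string. The sketch below builds that string with the standard library only. The field values are illustrative placeholders, not a real browser's fingerprint, and `ja3_string` / `ja3_hash` are hypothetical helper names that are not part of curl_reap or curl_cffi.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields: commas between fields, dashes within."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(tls_version)] + ["-".join(str(v) for v in g) for g in groups]
    return ",".join(fields)


def ja3_hash(ja3):
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii"), usedforsecurity=False).hexdigest()


# Illustrative values only (assumption): not taken from any real client.
example = ja3_string(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
print(example)
print(ja3_hash(example))
```

Because the digest depends on the order of ciphers and extensions, a client that offers the same algorithms in a different order produces a different fingerprint, which is why stock Python clients are easy to tell apart from browsers.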

Install

pip install curl_reap

Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.
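
If an install misbehaves, a quick environment check can help. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not something curl_reap ships, and the module names are simply the three dependencies listed above.

```python
import sys
from importlib.util import find_spec


def check_environment(min_version=(3, 9),
                      modules=("curl_cffi", "lxml", "cssselect")):
    """Report whether the interpreter is new enough and which modules are missing."""
    return {
        "python_ok": sys.version_info >= min_version,
        "missing": [name for name in modules if find_spec(name) is None],
    }


print(check_environment())
```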

Quick start

A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:

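import curl_reap as reap

# Fetch the page while presenting a Chrome 124 TLS fingerprint.
page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())    # text of every matching node
print(page.css_first("small.author::text"))    # first match only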

Self-healing selectors

Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:

page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")     # remember its shape

# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))                            # found anyway

Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
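
This README does not spell out how the structural signature is computed. As a rough illustration of the idea, the sketch below remembers an element's tag, attribute names, ancestor path, and text, then picks the most similar element in a changed page. It uses only the standard library; `collect`, `similarity`, `relocate`, and the equal weighting are assumptions made for illustration and are not curl_reap's actual algorithm.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Record every element as a dict: tag, attrs, ancestor path, text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # elements that are currently open
        self.elements = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        path = "/".join(e["tag"] for e in self.stack)
        element = {"tag": tag, "attrs": dict(attrs), "path": path, "text": ""}
        self.stack.append(element)
        self.elements.append(element)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1]["text"] += data.strip()


def collect(html):
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


def similarity(saved, candidate):
    """Score 0..1 from text, ancestor path, and attribute names (equal weights)."""
    if saved["tag"] != candidate["tag"]:
        return 0.0
    text = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    path = SequenceMatcher(None, saved["path"], candidate["path"]).ratio()
    union = set(saved["attrs"]) | set(candidate["attrs"])
    shared = set(saved["attrs"]) & set(candidate["attrs"])
    attrs = len(shared) / len(union) if union else 1.0
    return (text + path + attrs) / 3


def relocate(saved, html, threshold=0.6):
    """Return the element most similar to `saved`, or None if nothing is close."""
    best = max(collect(html), key=lambda e: similarity(saved, e), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


old_html = '<div><p>Blue mug</p><a class="buy-btn" href="/cart/42">Buy now</a></div>'
new_html = ('<div><p>Red mug</p><a class="purchase-cta" href="/cart/99">Buy now</a>'
            '<a class="help" href="/help">Need help?</a></div>')

saved = next(e for e in collect(old_html) if e["attrs"].get("class") == "buy-btn")
print(relocate(saved, new_html)["attrs"]["href"])
```

The class value is deliberately left out of the score, so renaming `buy-btn` to `purchase-cta` does not hurt the match; the link text and position carry it.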

Crawl at scale

A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:

import curl_reap as reap
from curl_reap import JsonLinesPipeline

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        # One item per quote block on the page.
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        # Follow the "next" pagination link, if the page has one.
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,                       # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
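
To give a feel for what "adapts to server latency" can mean, here is a minimal sketch loosely modeled on the averaging rule that Scrapy documents for its AutoThrottle extension. curl_reap's own rule is not described in this README and may differ; `next_delay` and its parameters are hypothetical names.

```python
def next_delay(current_delay, latency, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0):
    """Move the per-request delay halfway toward latency / target_concurrency."""
    target = latency / target_concurrency
    averaged = (current_delay + target) / 2.0
    # Clamp so one slow or fast response cannot push the delay out of bounds.
    return min(max(averaged, min_delay), max_delay)


delay = 1.0
for observed_latency in (3.0, 3.0, 0.5, 0.5):   # seconds, made-up samples
    delay = next_delay(delay, observed_latency)
    print(round(delay, 3))
```

Averaging with the previous delay smooths out single outliers: the delay rises when the server slows down and drifts back once responses speed up again.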

API at a glance

reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response that you can query directly with .css() / .xpath().
reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
reap.Geocoder().geocode(name, area, city, country): turn a name or address into coordinates with a precision label (name, district, or city), cached and rate limited.
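
The Pipeline base class interface is not shown here, so the following is only a sketch of the general pattern behind a dedup stage followed by a JSON Lines writer, written with the standard library. The `process_item` method, the stage classes, and `run_pipelines` are hypothetical and may not match curl_reap's actual Pipeline API.

```python
import io
import json


class DedupStage:
    """Drop any item whose `key` value has already been seen."""

    def __init__(self, key):
        self.key = key
        self.seen = set()

    def process_item(self, item):
        marker = item.get(self.key)
        if marker in self.seen:
            return None          # signal: discard this item
        self.seen.add(marker)
        return item


class JsonLinesStage:
    """Write each surviving item as one JSON object per line."""

    def __init__(self, stream):
        self.stream = stream

    def process_item(self, item):
        self.stream.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item


def run_pipelines(items, stages):
    """Pass every item through the stages in order; stop early on None."""
    kept = []
    for item in items:
        for stage in stages:
            item = stage.process_item(item)
            if item is None:
                break
        else:
            kept.append(item)
    return kept


out = io.StringIO()
scraped = [
    {"text": "A", "author": "X"},
    {"text": "A", "author": "X"},   # duplicate, dropped by DedupStage
    {"text": "B", "author": "Y"},
]
kept = run_pipelines(scraped, [DedupStage("text"), JsonLinesStage(out)])
print(len(kept))
print(out.getvalue(), end="")
```

Putting the dedup stage first keeps duplicates out of every later stage, including the file writer.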

Legal and acceptable use

curl_reap impersonates a real browser at the TLS level, so its handshake looks like the one a normal browser sends. It does not solve CAPTCHAs, bypass logins or paywalls, or defeat anti-bot services (Cloudflare, DataDome, PerimeterX, Akamai). If a site is actively blocking you, that block is the line to respect. You are responsible for checking robots.txt and each site's terms, for not circumventing technical access controls, for handling personal data lawfully (GDPR / CCPA), and for respecting copyright. Provided under MIT, "as is", with no warranty.

Full notice and your responsibilities as a user: LEGAL.md.
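
One way to act on the robots.txt responsibility above is to check each URL before requesting it. The sketch below uses Python's standard `urllib.robotparser` on an inline example file so that it runs without network access. The robots.txt content, the user agent string, and the `allowed` helper are made up for illustration; whether curl_reap performs any such check itself is not stated in this README.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules (assumption): a real crawler would download the site's own file.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed(ROBOTS, "my-crawler", "https://example.com/quotes"))
print(allowed(ROBOTS, "my-crawler", "https://example.com/private/report"))
```

A robots.txt check covers only one of the responsibilities listed above; a site's terms and applicable law still apply.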

License

MIT. See LICENSE.

Release history

0.1.2 (this version): Jun 30, 2026
0.1.1: Jun 30, 2026
0.1.0: Jun 30, 2026
