Reap the web: browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine in one library.
pip install curl_reap
Why
Modern scraping needs three things, and today you reach for three different tools:
- Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
- Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
- Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.
curl_reap takes the best idea from each and puts them behind one friendly API.
| Feature | curl_cffi | Scrapy | Scrapling | curl_reap |
|---|---|---|---|---|
| Real browser TLS / JA3 | yes | no | partial | yes |
| Parser built in | no | yes | yes | yes |
| Self-healing selectors | no | no | yes | yes |
| Concurrent crawl engine | no | yes | no | yes |
| AutoThrottle, retries, pipelines | no | yes | no | yes |
| One small dependency set | yes | no | no | yes |
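For readers unfamiliar with the "TLS / JA3" row: a JA3 fingerprint is an MD5 digest over five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats), each written as a dash-joined list of decimal values. The sketch below only illustrates that arithmetic. The numeric values are made up, and none of this is code from curl_reap or curl_cffi.

```python
import hashlib


def ja3_digest(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for the given ClientHello fields."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical values, chosen only to show the mechanics.
client_a = ja3_digest(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
# Same ciphers in a different order: a different string, so a different digest.
client_b = ja3_digest(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])

print(client_a[0])
print(client_a[1] != client_b[1])
```

Because the order of the offered ciphers is part of the string, a stock Python client and a browser can be told apart even when they support the same algorithms.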
Install
pip install curl_reap
Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.
Quick start
A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:
import curl_reap as reap
page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))
Self-healing selectors
Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:
page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button") # remember its shape
# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href")) # found anyway
Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
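To make "relocates it by structural signature" concrete, here is a standard-library sketch of the general idea: record an element's tag, ancestor path, text, and attributes, then pick the most similar element on a later page. This is an assumption about how such matching can work, not curl_reap's actual algorithm. The helper names (`elements`, `similarity`, `relocate`) and the 0.5 threshold are hypothetical, and void tags such as `<br>` are not handled.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Record every element with its tag, attributes, ancestor path, and text."""

    def __init__(self):
        super().__init__()
        self.stack = []     # elements that are currently open
        self.elements = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        el = {"tag": tag, "attrs": dict(attrs), "text": "",
              "path": "/".join(e["tag"] for e in self.stack)}
        self.elements.append(el)
        self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        if any(e["tag"] == tag for e in self.stack):
            while self.stack.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def elements(html):
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.elements


def similarity(a, b):
    """Score two elements in [0, 1]; elements with different tags score 0."""
    if a["tag"] != b["tag"]:
        return 0.0

    def ratio(x, y):
        return SequenceMatcher(None, x, y).ratio()

    def attr_string(el):
        return " ".join(f"{k}={v}" for k, v in sorted(el["attrs"].items()))

    parts = (ratio(a["path"], b["path"]),
             ratio(a["text"], b["text"]),
             ratio(attr_string(a), attr_string(b)))
    return sum(parts) / len(parts)


def relocate(saved, html, threshold=0.5):
    """Return the element most similar to `saved`, or None if nothing is close."""
    best = max(elements(html), key=lambda el: similarity(saved, el), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


old_html = ('<html><body><div id="product">'
            '<a class="buy-btn" href="/cart/add/42">Buy now</a>'
            '<a class="nav" href="/help">Help</a></div></body></html>')
new_html = ('<html><body><div id="product">'
            '<a class="nav" href="/help">Help</a>'
            '<a class="purchase-cta" href="/cart/add/99">Buy now</a>'
            '</div></body></html>')

saved = next(el for el in elements(old_html) if el["attrs"].get("class") == "buy-btn")
found = relocate(saved, new_html)
print(found["attrs"]["class"], found["attrs"]["href"])
```

The renamed class lowers the attribute score, but the unchanged text and ancestor path still make the purchase link the clear best match.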
Crawl at scale
A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:
import curl_reap as reap
from curl_reap import JsonLinesPipeline
class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,  # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
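The `throttle=True` comment says the delay adapts to server latency. One plausible rule, modelled loosely on what Scrapy documents for its AutoThrottle extension, is to move the delay halfway toward `latency / target_concurrency` after each response. Whether curl_reap uses this exact formula is an assumption; `next_delay` and its parameters are hypothetical names for illustration.

```python
def next_delay(current_delay, latency, target_concurrency=8.0,
               min_delay=0.0, max_delay=60.0, ok=True):
    """Return the delay in seconds to wait before the next request to a host."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target = latency / target_concurrency
    # Move halfway toward the target so one slow response does not dominate.
    new_delay = (current_delay + target) / 2.0
    # A failed response must never speed the crawl up.
    if not ok:
        new_delay = max(new_delay, current_delay)
    # Keep the result inside the configured bounds.
    return min(max(new_delay, min_delay), max_delay)


delay = 1.0
for latency in (0.4, 0.4, 4.0, 8.0):
    delay = next_delay(delay, latency, target_concurrency=2.0)
    print(f"latency={latency:.1f}s -> delay={delay:.3f}s")
```

Averaging with the previous delay makes the crawl back off gradually when the server slows down and recover gradually when it speeds up again.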
API at a glance
- reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response you can .css() / .xpath() directly.
- reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
- Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
- reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
- Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
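The Pipeline interface itself is not shown here, so the following sketch does not subclass anything from curl_reap. It defines stand-in classes with a hypothetical `process_item(item)` method (a name borrowed from Scrapy's convention, not confirmed for this library) to show how a dedup stage and a JSON Lines writer can be chained.

```python
import hashlib
import io
import json


class DedupStage:
    """Drop items whose content has already been seen."""

    def __init__(self):
        self._seen = set()

    def process_item(self, item):
        # sort_keys makes the fingerprint independent of dict key order.
        payload = json.dumps(item, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in self._seen:
            return None  # signal "drop this item"
        self._seen.add(key)
        return item


class JsonLinesStage:
    """Write each item as one JSON object per line."""

    def __init__(self, stream):
        self._stream = stream

    def process_item(self, item):
        self._stream.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item


def run_stages(items, stages):
    """Pass every item through the stages in order; None drops the item."""
    kept = []
    for item in items:
        for stage in stages:
            item = stage.process_item(item)
            if item is None:
                break
        else:
            kept.append(item)
    return kept


buffer = io.StringIO()
scraped = [
    {"text": "A", "author": "X"},
    {"author": "X", "text": "A"},  # same content, different key order
    {"text": "B", "author": "Y"},
]
kept = run_stages(scraped, [DedupStage(), JsonLinesStage(buffer)])
print(len(kept), "unique items")
print(buffer.getvalue(), end="")
```

Placing the dedup stage first means the writer never sees a duplicate, which is the usual reason pipelines are ordered.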
Responsible use
curl_reap presents the same TLS handshake a real browser does, and nothing more. It does not ship a challenge solver and it will not break CAPTCHAs or anti-bot walls (Cloudflare challenges, DataDome, PerimeterX, and similar). If a site has deliberately put up an access-control wall, that is a signal to stop. Respect robots.txt and each site's terms, throttle your crawls, and only collect data you are allowed to collect.
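Checking robots.txt needs nothing beyond the standard library. The sketch below parses an inline, made-up robots.txt with `urllib.robotparser` so it runs offline. The `allowed` helper and the `my-crawler` user agent are hypothetical, and nothing here claims curl_reap performs this check for you.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawl would fetch the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/catalog/1"))
print(allowed("https://example.com/private/report"))
print(parser.crawl_delay("my-crawler"))
```

A spider could call a check like this before yielding each request and use the crawl delay as a lower bound for its throttle.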
License
MIT. See LICENSE.
Download files
Source Distribution
Built Distribution
File details
Details for the file curl_reap-0.1.1.tar.gz.
File metadata
- Download URL: curl_reap-0.1.1.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b5630d1de9cc12190e376fe6fa86da1c31876b1d9fbaf23ad5621879bb3d531 |
| MD5 | c9d45f136a7fc4a1d0af76cfeadbd653 |
| BLAKE2b-256 | 432d8ec48bda96b505dbf3a6c1453a5e9384e7b11ada7b11ca7617cd1f982c68 |
File details
Details for the file curl_reap-0.1.1-py3-none-any.whl.
File metadata
- Download URL: curl_reap-0.1.1-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ea29ec1cffd4b0f98d06590a7dfb48a7cdac80c09aa18d5911269cdadaa5152d |
| MD5 | 532f8bb479f8bfc2064a41fd33a0b045 |
| BLAKE2b-256 | fa7ecbd04bc47002f57aa18bbde1f5ceaae6907e6c72be4e3806d148c83944bf |
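The SHA256 digests listed for each file can be checked locally after download. The sketch below is one way to do that with `hashlib`; the `sha256_of` and `verify` helpers are illustrative names, and the demo hashes a temporary file so that it runs without downloading anything.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; for a real check, pass the downloaded archive
# and the SHA256 value published for that file.
payload = b"example archive bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

expected = hashlib.sha256(payload).hexdigest()
matches = verify(demo_path, expected)
mismatch = verify(demo_path, "0" * 64)
os.remove(demo_path)

print(matches, mismatch)
```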