curl_reap

Reap the web. Browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine, in one small library.

pip install curl_reap


Why

Modern scraping needs three things, and today you reach for three different tools:

  1. Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
  2. Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
  3. Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.

curl_reap takes the best idea from each and puts them behind one friendly API.

Feature                            curl_cffi  Scrapy  Scrapling  curl_reap
Real browser TLS / JA3             yes        no      partial    yes
Parser built in                    no         yes     yes        yes
Self-healing selectors             no         no      yes        yes
Concurrent crawl engine            no         yes     no         yes
AutoThrottle, retries, pipelines   no         yes     no         yes
One small dependency set           yes        no      no         yes
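
To make the "TLS / JA3" row concrete, here is a minimal sketch of how a JA3 fingerprint is derived from a ClientHello: five fields joined by commas, each list joined by dashes, then hashed with MD5. The numeric values are illustrative and were not captured from any real client, and the helper names ja3_string and ja3_hash are made up for this example. curl_reap does not expose these functions.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 text: five comma-separated fields, lists dash-joined."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(*hello):
    """MD5 hex digest of the JA3 text, which is what servers compare."""
    return hashlib.md5(ja3_string(*hello).encode("ascii")).hexdigest()


# Illustrative ClientHello values (assumptions, not real captures).
client_a = (771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
# Same cipher suites offered in a different order.
client_b = (771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))  # False: order matters
```

Because the hash covers the order of ciphers and extensions, a client whose TLS stack orders them differently from a browser is easy to tell apart, whatever User-Agent header it sends.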

Install

pip install curl_reap

Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.

Quick start

A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Self-healing selectors

Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:

page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")     # remember its shape

# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))                            # found anyway

Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
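
For intuition, relocating an element by structural signature can be sketched with the standard library alone. This is not curl_reap's actual algorithm. The signature fields (tag, ancestor path, attribute names, text), the equal weighting, and the 0.75 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementCollector(HTMLParser):
    """Record tag, attributes, ancestor path, and direct text per element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of elements not yet closed
        self.elements = []       # every element, in document order

    def handle_starttag(self, tag, attrs):
        element = {
            "tag": tag,
            "attrs": dict(attrs),
            "path": [e["tag"] for e in self.open_elements],
            "text": "",
        }
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self.open_elements.append(element)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i]["tag"] == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        if self.open_elements:
            self.open_elements[-1]["text"] += data.strip()


def collect(html):
    parser = ElementCollector()
    parser.feed(html)
    return parser.elements


def similarity(saved, candidate):
    """Score from 0 to 1; class names are ignored on purpose."""
    if saved["tag"] != candidate["tag"]:
        return 0.0
    path = SequenceMatcher(None, saved["path"], candidate["path"]).ratio()
    text = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    keys = set(saved["attrs"]) | set(candidate["attrs"])
    shared = set(saved["attrs"]) & set(candidate["attrs"])
    attrs = len(shared) / len(keys) if keys else 1.0
    return (path + text + attrs) / 3


def relocate(saved, html, threshold=0.75):
    """Return the best-scoring element, or None if nothing is close enough."""
    best = max(collect(html), key=lambda el: similarity(saved, el), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


OLD = '<div><h2>Widget</h2><a class="buy-btn" href="/cart/42">Buy now</a></div>'
NEW = ('<div><h2>Gadget</h2><a class="purchase-cta" href="/cart/99">Buy now</a>'
       '<a class="more" href="/help">Help</a></div>')

saved = next(el for el in collect(OLD) if el["attrs"].get("class") == "buy-btn")
found = relocate(saved, NEW)
print(found["attrs"]["href"])  # /cart/99
```

The renamed class does not matter here because the score looks at attribute names, position in the tree, and text rather than attribute values. A real implementation would need a richer signature to avoid false matches on pages with many similar links.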

Crawl at scale

A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:

import curl_reap as reap
from curl_reap import JsonLinesPipeline

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,                       # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
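
How a throttle "adapts to server latency" can be sketched with the rule Scrapy documents for its AutoThrottle extension: aim for a delay of latency divided by target concurrency, move halfway toward it on each response, and never let an error response shorten the delay. Whether curl_reap follows exactly this rule is an assumption. ThrottleSketch and its parameters are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class ThrottleSketch:
    delay: float = 5.0               # current delay between requests (seconds)
    min_delay: float = 0.0           # lower bound for the delay
    max_delay: float = 60.0          # upper bound for the delay
    target_concurrency: float = 1.0  # desired parallel requests per server

    def update(self, latency, status=200):
        """Adjust the delay after one response and return the new value."""
        target = latency / self.target_concurrency
        new_delay = (self.delay + target) / 2  # move halfway to the target
        # An error response may slow the crawl down but never speed it up.
        if status != 200 and new_delay < self.delay:
            return self.delay
        self.delay = min(max(new_delay, self.min_delay), self.max_delay)
        return self.delay


throttle = ThrottleSketch(delay=1.0, target_concurrency=2.0)
for latency, status in [(0.4, 200), (0.4, 200), (3.0, 200), (0.1, 503)]:
    print(latency, status, round(throttle.update(latency, status), 3))
```

In this run the delay shrinks while the server answers quickly, grows after the slow 3.0 s response, and stays put on the fast 503, since a quick error page says nothing good about server health.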

API at a glance

  • reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response you can .css() / .xpath() directly.
  • reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
  • Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
  • reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
  • Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
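
The idea behind a dedup stage can be sketched without the library. The process_item method name and the "return None to drop" convention are assumptions made for this example. Check the Pipeline base class for the real hook before subclassing it.

```python
import hashlib
import json


class DedupSketch:
    """Drop items whose content was already seen (hypothetical interface)."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, item):
        # Sorting keys makes the fingerprint independent of dict ordering.
        payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def process_item(self, item):
        """Return the item the first time it appears, None for repeats."""
        key = self.fingerprint(item)
        if key in self.seen:
            return None
        self.seen.add(key)
        return item


pipeline = DedupSketch()
items = [
    {"text": "To be", "author": "Shakespeare"},
    {"author": "Shakespeare", "text": "To be"},  # same content, other key order
    {"text": "Not to be", "author": "Shakespeare"},
]
kept = [item for item in items if pipeline.process_item(item) is not None]
print(len(kept))  # 2
```

Storing a fixed-size hash instead of the item itself keeps memory use predictable on long crawls.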

Responsible use

curl_reap impersonates a real browser at the TLS level, so its handshake looks like the one a normal browser sends, and that is all it does. It does not ship a challenge solver and it will not break CAPTCHAs or anti-bot walls (Cloudflare challenges, DataDome, PerimeterX, and similar). If a site has deliberately put up an access-control wall, that is a signal to stop. Respect robots.txt and each site's terms, throttle your crawls, and only collect data you are allowed to collect.
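
One way to honor robots.txt before queuing a URL is Python's standard urllib.robotparser. The sketch below parses an in-memory robots.txt so that it runs offline. The rules, the "my-crawler" user agent, and the example.com URLs are placeholders. In a real crawl you would point set_url() at the site's robots.txt and call read() instead.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules; a real crawl would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = build_parser(ROBOTS_TXT)
print(robots.can_fetch("my-crawler", "https://example.com/quotes/"))       # True
print(robots.can_fetch("my-crawler", "https://example.com/private/data"))  # False
print(robots.crawl_delay("my-crawler"))                                     # 2
```

A spider could run this check before yielding each new request and use the crawl delay, when one is given, as a floor for its throttle.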

License

MIT. See LICENSE.
