curl_reap

Reap the web. Browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine, in one small library.

pip install curl_reap


Why

Modern scraping needs three things, and today you reach for three different tools:

  1. Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
  2. Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
  3. Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.

curl_reap takes the best idea from each and puts them behind one friendly API.

Feature                             curl_cffi  Scrapy  Scrapling  curl_reap
Real browser TLS / JA3              yes        no      partial    yes
Parser built in                     no         yes     yes        yes
Self-healing selectors              no         no      yes        yes
Concurrent crawl engine             no         yes     no         yes
AutoThrottle, retries, pipelines    no         yes     no         yes
One small dependency set            yes        no      no         yes
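
For readers unfamiliar with the "TLS / JA3" row: a JA3 fingerprint is an MD5 hash over five fields of the TLS ClientHello (version, cipher suites, extensions, supported groups, point formats), written as decimal values. The sketch below only shows how such a string is assembled and hashed. The helper names and the numeric values are illustrative assumptions, not taken from curl_reap or from any real browser.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields: '-' inside a field, ',' between fields."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in field) for field in fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Made-up values for illustration only.
hello_a = (771, [4865, 4866], [0, 10, 11], [29, 23], [0])
# Same cipher suites offered in a different order.
hello_b = (771, [4866, 4865], [0, 10, 11], [29, 23], [0])

print(ja3_string(*hello_a))
print(ja3_hash(*hello_a) == ja3_hash(*hello_b))
```

Because the order of ciphers and extensions is part of the string, two clients offering the same features in a different order hash differently, which is why a stock Python client is easy to tell apart from a browser.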

Install

pip install curl_reap

Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.

Quick start

A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))

Self-healing selectors

Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:

page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")     # remember its shape

# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))                            # found anyway

Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
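
To make "structural signature" concrete, here is a minimal standard-library sketch of the general idea: record an element's tag, ancestor path, attribute names, and text, then score every element on a later page against that record. It is not curl_reap's actual algorithm; the function names, weights, and threshold are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Record a small signature for every element in a document."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.open = []       # signatures of currently open elements
        self.elements = []   # every signature, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        sig = {
            "tag": tag,
            "path": tuple(e["tag"] for e in self.open),
            "classes": frozenset((attrs.get("class") or "").split()),
            "attr_names": frozenset(attrs) - {"class"},
            "attrs": attrs,
            "text": "",
        }
        self.elements.append(sig)
        if tag not in self.VOID:
            self.open.append(sig)

    def handle_endtag(self, tag):
        if any(e["tag"] == tag for e in self.open):
            while self.open.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if self.open and data.strip():
            self.open[-1]["text"] += " " + data.strip()


def signatures(html):
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.elements


def jaccard(a, b):
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def similarity(saved, cand):
    """Weighted score in [0, 1]; class names are deliberately ignored."""
    words = lambda s: set(s["text"].lower().split())
    return (0.3 * (saved["tag"] == cand["tag"])
            + 0.3 * (saved["path"] == cand["path"])
            + 0.2 * jaccard(saved["attr_names"], cand["attr_names"])
            + 0.2 * jaccard(words(saved), words(cand)))


def relocate(saved, html, threshold=0.6):
    """Return the best-scoring element on the new page, or None."""
    best = max(signatures(html), key=lambda e: similarity(saved, e), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


OLD = ('<html><body><div class="product"><h1>Widget</h1>'
       '<a class="buy-btn" href="/cart/42">Buy now</a>'
       '<a class="help" href="/help">Help</a></div></body></html>')
NEW = ('<html><body><div class="item"><h1>Gadget</h1>'
       '<a class="purchase-cta" href="/cart/99">Buy now</a>'
       '<a class="help" href="/help">Help</a></div></body></html>')

saved = next(e for e in signatures(OLD) if "buy-btn" in e["classes"])
found = relocate(saved, NEW)
print(found["attrs"]["href"])
```

The renamed link still wins because its tag, position, attributes, and text all match, while the neighbouring "Help" link loses on text. A real matcher would need more signals (sibling order, attribute values) to separate near-identical elements.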

Crawl at scale

A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:

import curl_reap as reap
from curl_reap import JsonLinesPipeline

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,                       # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
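
What "adapts to server latency" can mean in practice is sketched below. The update rule is loosely modelled on the one Scrapy documents for its AutoThrottle extension: move the delay halfway toward latency divided by target concurrency, and never let an error response shorten it. Whether curl_reap uses the same rule is an assumption, and the `AdaptiveDelay` class is a hypothetical helper, not part of the library.

```python
class AdaptiveDelay:
    """Per-host delay that follows observed latency (hypothetical helper)."""

    def __init__(self, start=1.0, min_delay=0.25, max_delay=30.0,
                 target_concurrency=1.0):
        self.delay = start
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency

    def update(self, latency, ok=True):
        """Fold one response latency (seconds) into the delay and return it."""
        target = latency / self.target_concurrency
        # Move halfway toward the target, but never undershoot it.
        new_delay = max(target, (self.delay + target) / 2.0)
        new_delay = min(max(new_delay, self.min_delay), self.max_delay)
        # Error responses are often fast; do not let them speed the crawl up.
        if not ok and new_delay <= self.delay:
            return self.delay
        self.delay = new_delay
        return self.delay


throttle = AdaptiveDelay(start=1.0, target_concurrency=2.0)
history = [
    throttle.update(latency=4.0),            # slow server: back off
    throttle.update(latency=0.4),            # recovered: ease toward a shorter delay
    throttle.update(latency=0.1, ok=False),  # fast error page: delay is kept
]
print(history)
```

Averaging with the previous delay keeps one unusually fast or slow response from swinging the crawl rate abruptly.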

API at a glance

  • reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response you can .css() / .xpath() directly.
  • reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
  • Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
  • reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
  • Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
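
As a rough picture of what a deduplicating pipeline does, the standalone sketch below fingerprints each item and drops repeats. It does not subclass curl_reap's Pipeline, because the base-class interface is not shown here; the `SeenItems` class and its `process_item` method are hypothetical names used only for illustration.

```python
import hashlib
import json


class SeenItems:
    """Drop items whose fingerprint has been seen before (hypothetical helper)."""

    def __init__(self, keys=None):
        self.keys = keys        # fields that define identity; None means all fields
        self._seen = set()

    def fingerprint(self, item):
        subset = item if self.keys is None else {k: item.get(k) for k in self.keys}
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(subset, sort_keys=True, ensure_ascii=False, default=str)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()

    def process_item(self, item):
        """Return the item the first time it is seen, otherwise None."""
        fp = self.fingerprint(item)
        if fp in self._seen:
            return None
        self._seen.add(fp)
        return item


dedup = SeenItems(keys=["text", "author"])
scraped = [
    {"text": "Be yourself.", "author": "A", "page": 1},
    {"author": "A", "text": "Be yourself.", "page": 2},   # same quote, later page
    {"text": "Stay curious.", "author": "B", "page": 2},
]
kept = [item for item in scraped if dedup.process_item(item) is not None]
print(len(kept))
```

Restricting the fingerprint to identity fields keeps incidental fields, such as the page an item was found on, from hiding duplicates.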

Responsible use

curl_reap impersonates a real browser at the TLS level, so its handshake looks exactly like the one a normal browser sends. It does not ship a challenge solver and it will not break CAPTCHAs or anti-bot walls (Cloudflare challenges, DataDome, PerimeterX, and similar). If a site has deliberately put up an access-control wall, that is a signal to stop. Respect robots.txt and each site's terms, throttle your crawls, and only collect data you are allowed to collect.
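
One way to honour robots.txt before queueing a URL is Python's standard `urllib.robotparser`. The sketch below parses an inline example file so it runs offline; in a real crawl you would load the site's own file with `set_url()` and `read()`. The user-agent string and the rules are made up, and nothing here is a curl_reap feature.

```python
from urllib.robotparser import RobotFileParser

# Example rules, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a reusable rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)
agent = "my-crawler"  # hypothetical user-agent name

allowed = rules.can_fetch(agent, "https://example.com/catalog/")
blocked = rules.can_fetch(agent, "https://example.com/private/report")
delay = rules.crawl_delay(agent)  # seconds requested by the site, or None

print(allowed, blocked, delay)
```

Checking `can_fetch` before yielding a request, and treating `crawl_delay` as a lower bound on your throttle, keeps a spider within what the site has asked for.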

License

MIT. See LICENSE.
