curl_reap

Reap the web. Browser-grade TLS impersonation, self-healing selectors, and a concurrent crawl engine, in one small library.

pip install curl_reap

Full documentation with deep API reference and examples: https://anishfyi.github.io/curl_reap/

Why

Modern scraping needs three things, and today you reach for three different tools:

  1. Get past the door. Sites fingerprint your TLS handshake and block stock Python clients. curl_cffi solves this with real Chrome/Safari fingerprints.
  2. Survive markup changes. Plain CSS and XPath break the moment a site renames a class. Scrapling pioneered self-healing selectors that re-find the element anyway.
  3. Crawl at scale. Concurrency, throttling, retries, dedup, and pipelines. That is Scrapy.

curl_reap takes the best idea from each and puts them behind one friendly API.

Feature                             curl_cffi   Scrapy   Scrapling   curl_reap
Real browser TLS / JA3              yes         no       partial     yes
Parser built in                     no          yes      yes         yes
Self-healing selectors              no          no       yes         yes
Concurrent crawl engine             no          yes      no          yes
AutoThrottle, retries, pipelines    no          yes      no          yes
One small dependency set            yes         no       no          yes

Install

pip install curl_reap

Requires Python 3.9+. Pulls in curl_cffi, lxml, and cssselect.

Quick start

A one-shot fetch parses like parsel, but the request carries a genuine browser fingerprint:

import curl_reap as reap

page = reap.get("https://quotes.toscrape.com", impersonate="chrome124")
print(page.css("span.text::text").getall())
print(page.css_first("small.author::text"))
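
To make "browser fingerprint" concrete, here is a minimal sketch of JA3, one widely used TLS fingerprint: it hashes five fields of the ClientHello in the order the client sent them. This is an illustration of the general technique, not curl_reap or curl_cffi code, and the handshake values below are made up for the example rather than taken from a real Chrome or Python client.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders that browsers inject;
# JA3 skips them so the fingerprint stays stable between connections.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values dash-joined."""
    def field(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), field(ciphers), field(extensions), field(curves), field(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string, the compact form usually compared."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only, not a real Chrome or Python ClientHello.
client_a = (771, [0x1A1A, 4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = (771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))
```

The two clients offer the same ciphers in a different order, which is enough to produce different hashes. That sensitivity is why a stock HTTP client stands out, and why the impersonate argument has to reproduce the whole handshake rather than just a User-Agent header.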

Self-healing selectors

Save an element once. Later, even if the site renames the class or moves the node, auto_match relocates it by structural signature:

page = reap.get("https://shop.example.com/item/42")
page.css_first("a.buy-btn").save("buy_button")     # remember its shape

# weeks later, the class is now "purchase-cta" and the old selector misses:
later = reap.get("https://shop.example.com/item/99")
btn = later.css_first("a.buy-btn", auto_match=True, identifier="buy_button")
print(btn.attr("href"))                            # found anyway

Other finders: page.find_by_text("Sign in") and page.find_similar(some_element).
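
The README does not describe how auto_match scores candidates, so the following is only a rough sketch of the general idea behind relocating an element by structural signature, written with the standard library. The signature fields (tag, attribute names, ancestor path, text), the equal weighting, and the 0.5 threshold are all assumptions made for illustration; the names parse, similarity, and relocate are hypothetical and are not part of curl_reap.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class _Collector(HTMLParser):
    """Record every element with its tag, attributes, ancestor path, and text."""

    def __init__(self):
        super().__init__()
        self.stack = []     # elements that are currently open
        self.elements = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        el = {"tag": tag, "attrs": dict(attrs),
              "path": [e["tag"] for e in self.stack], "text": ""}
        self.elements.append(el)
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def parse(html):
    collector = _Collector()
    collector.feed(html)
    return collector.elements


def similarity(saved, candidate):
    """Score 0..1 from text, ancestor path, and attribute names (not values)."""
    if saved["tag"] != candidate["tag"]:
        return 0.0
    text = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    path = SequenceMatcher(None, saved["path"], candidate["path"]).ratio()
    a, b = set(saved["attrs"]), set(candidate["attrs"])
    attrs = len(a & b) / len(a | b) if a | b else 1.0
    return (text + path + attrs) / 3


def relocate(saved, html, threshold=0.5):
    """Return the element most similar to the saved signature, or None."""
    best = max(parse(html), key=lambda el: similarity(saved, el), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


OLD = '<div class="card"><h2>Widget</h2><a class="buy-btn" href="/cart/42">Buy now</a></div>'
NEW = ('<div class="panel"><h2>Widget</h2><a class="help" href="/help">Help</a>'
       '<a class="purchase-cta" href="/cart/99">Buy now</a></div>')

saved = next(el for el in parse(OLD) if el["attrs"].get("class") == "buy-btn")
found = relocate(saved, NEW)
print(found["attrs"]["href"])  # → /cart/99
```

Attribute values are deliberately left out of the score, since renamed classes are exactly the change the match has to survive. A production matcher would likely weigh more signals, such as sibling position and attribute values that stay stable.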

Crawl at scale

A Spider yields items (dicts) and more Request objects. The engine handles concurrency, AutoThrottle, retries, dedup, and pipelines:

import curl_reap as reap
from curl_reap import JsonLinesPipeline

class Quotes(reap.Spider):
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, page):
        for q in page.css("div.quote"):
            yield {
                "text": q.css_first("span.text::text"),
                "author": q.css_first("small.author::text"),
            }
        nxt = page.css_first("li.next a::attr(href)")
        if nxt:
            yield reap.Request("https://quotes.toscrape.com" + nxt, self.parse)

items = reap.run(
    Quotes,
    concurrency=8,
    throttle=True,                       # AutoThrottle adapts to server latency
    pipelines=[JsonLinesPipeline("quotes.jsonl")],
)
print(len(items), "items reaped")
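
The comment on throttle=True says the delay adapts to server latency. As a sketch of what such adaptation can look like, the class below follows the update rule that Scrapy documents for its AutoThrottle extension; whether curl_reap uses the same rule is an assumption, and the class name, parameter names, and defaults here are hypothetical.

```python
class AutoThrottleSketch:
    """Adjust the delay between requests from observed response latency."""

    def __init__(self, start_delay=1.0, min_delay=0.1, max_delay=30.0,
                 target_concurrency=1.0):
        self.delay = start_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency

    def update(self, latency, ok=True):
        """Feed one response's latency in seconds; return the new delay."""
        # Delay that would keep roughly `target_concurrency` requests in flight.
        target = latency / self.target_concurrency
        # Move halfway toward the target so a single outlier does not dominate,
        # but never settle below the target itself.
        new_delay = max(target, (self.delay + target) / 2)
        new_delay = min(max(new_delay, self.min_delay), self.max_delay)
        # Error responses tend to return quickly; do not let them speed us up.
        if not ok and new_delay <= self.delay:
            return self.delay
        self.delay = new_delay
        return self.delay


throttle = AutoThrottleSketch(start_delay=1.0, max_delay=10.0, target_concurrency=2.0)
for latency, ok in [(0.4, True), (4.0, True), (0.2, False), (100.0, True)]:
    print(round(throttle.update(latency, ok), 3))
```

A fast response eases the delay down gradually, a slow one raises it immediately, and a fast error response leaves it unchanged.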

API at a glance

  • reap.get(url, impersonate="chrome124", **kw) and reap.post(...) return a Response you can .css() / .xpath() directly.
  • reap.Session(impersonate=..., headers=..., retries=...) for a reusable client.
  • Selector / SelectorList: .css, .css_first, .xpath, .find_by_text, .find_similar, .save, .re, .text, .attr.
  • reap.Spider, reap.Request, reap.run(spider, ...), reap.Reaper(...).
  • Pipelines: DedupPipeline, JsonLinesPipeline, CsvPipeline, or subclass Pipeline.
  • reap.Geocoder().geocode(name, area, city, country): turn a name or address into coordinates with a precision label (name, district, or city), cached and rate limited.
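
The README names DedupPipeline without showing its interface, so the following standard-library sketch only illustrates what a dedup stage typically does: fingerprint each item and drop repeats. The functions fingerprint and dedup are hypothetical helpers, not curl_reap's Pipeline API.

```python
import hashlib
import json


def fingerprint(item):
    """Stable hash of a dict item, independent of key order."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def dedup(items):
    """Yield each distinct item once, in first-seen order."""
    seen = set()
    for item in items:
        fp = fingerprint(item)
        if fp in seen:
            continue
        seen.add(fp)
        yield item


scraped = [
    {"text": "To be", "author": "Shakespeare"},
    {"author": "Shakespeare", "text": "To be"},  # same item, different key order
    {"text": "Or not", "author": "Shakespeare"},
]
print(len(list(dedup(scraped))))  # → 2
```

Storing fingerprints rather than whole items keeps memory use flat per item, which matters once a crawl yields many large records.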

Legal and acceptable use

curl_reap presents a real browser's fingerprint at the TLS level, so its handshake looks like the one a normal browser sends. It does not solve CAPTCHAs, bypass logins or paywalls, or defeat anti-bot services (Cloudflare, DataDome, PerimeterX, Akamai). If a site is actively blocking you, that block is the line to respect. You are responsible for checking robots.txt and each site's terms, for not circumventing technical access controls, for handling personal data lawfully (GDPR / CCPA), and for respecting copyright. Provided under the MIT license, "as is", with no warranty.

Full notice and your responsibilities as a user: LEGAL.md.

License

MIT. See LICENSE.
