LNCrawl Scraper

HTTP scraper with Cloudflare bypass, browser fingerprint impersonation, stealth mode, proxy support, and a null-safe BeautifulSoup wrapper.

Features

  • Cloudflare bypass — handles CF challenges v1, v2, v3, and Turnstile transparently
  • Browser fingerprint impersonation — optional curl_cffi transport that reproduces a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2 fingerprint
  • Browser-assisted clearance — reuse a cf_clearance cookie solved by a real browser for managed-challenge / Turnstile sites
  • Accurate Client Hints — sec-ch-ua / sec-fetch-* derived from the chosen UA
  • Stealth mode — human-like delays, randomized headers, browser quirks
  • Proxy support — round-robin proxy rotation with Tor integration and direct fallback
  • Rate limiting — configurable per-request intervals and concurrency cap
  • PageSoup — null-safe BeautifulSoup wrapper; selection methods never return None
  • HTTP helpers — get_soup, get_json, get_image, get_file, and more

Installation

pip install lncrawl-scraper

# optional extras:
pip install "lncrawl-scraper[impersonate]"   # browser TLS/HTTP-2 impersonation (curl_cffi)
pip install "lncrawl-scraper[image]"         # get_image() support (Pillow)

Quick start

from scraper import Scraper

s = Scraper(origin="https://example.com")

# HTML
soup = s.get_soup("https://example.com/page")
title = soup.select_one("h1.title").text          # "" if not found, never raises
links = [a["href"] for a in soup.select("a")]

# JSON
data = s.get_json("https://example.com/api/data")

# File download
s.get_file("https://example.com/file.zip", output_file="file.zip")

# Image (returns PIL.Image)
img = s.get_image("https://example.com/cover.jpg")

Examples

Runnable examples live in examples/ — run any with uv run python examples/<file>.py.

Example                      Shows
01_basic_html.py             Fetch a page and extract data with get_soup / PageSoup
02_pagesoup_parsing.py       PageSoup tour: CSS select, attrs, navigation, XPath
03_json_api.py               get_json / post_json and raw Response access
04_files_and_images.py       get_file (streamed, atomic) and get_image (Pillow)
05_forms_cookies_headers.py  submit_form, set_header, set_cookie, reset
06_configuration.py          ScraperConfig, default_config(), stealth, browser identity
07_impersonation.py          Real browser TLS/HTTP-2 fingerprint via impersonate
08_browser_clearance.py      Reuse a cf_clearance solved by a real browser
09_proxies_and_tor.py        Proxy rotation and Tor identity refresh
10_concurrency_and_abort.py  Threaded fetches and cooperative abort()
11_error_handling.py         HTTP, Cloudflare, and abort error handling

Configuration

Pass a ScraperConfig for full control:

from scraper import Scraper
from scraper.config import ScraperConfig, ProxyConfig, StealthConfig, BrowserConfig

config = ScraperConfig(
    min_request_interval=2.0,
    max_concurrent_requests=1,
    rotate_tls_ciphers=True,
    stealth=StealthConfig(
        enabled=True,
        min_delay=1.0,
        max_delay=3.0,
        human_like_delays=True,
        randomize_headers=True,
        browser_quirks=True,
    ),
    proxy=ProxyConfig(
        proxy_urls=["http://proxy1:8080", "http://proxy2:8080"],
        fallback_to_direct=True,
    ),
    browser=BrowserConfig(browser="firefox", platform="windows", desktop=True),
)

s = Scraper(origin="https://example.com", config=config)
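
As a rough illustration of what round-robin rotation with a direct fallback (proxy_urls plus fallback_to_direct) could look like, here is a minimal standard-library sketch. ProxyRotator, fetch_with_rotation, and the send callback are hypothetical names invented for this example; they are not part of the library, and the real rotation logic (including Tor integration) may differ.

```python
from itertools import cycle
from typing import Callable, Iterator, Optional, Sequence


class ProxyRotator:
    """Hypothetical helper: cycles through proxies, then optionally goes direct."""

    def __init__(self, proxy_urls: Sequence[str], fallback_to_direct: bool = True) -> None:
        self._count = len(proxy_urls)
        self._cycle: Iterator[str] = cycle(proxy_urls) if proxy_urls else iter(())
        self._fallback = fallback_to_direct

    def candidates(self) -> Iterator[Optional[str]]:
        """Yield each proxy at most once, resuming where the previous call stopped.

        A final None stands for "no proxy", i.e. a direct connection.
        """
        for _ in range(self._count):
            yield next(self._cycle)
        if self._fallback:
            yield None


def fetch_with_rotation(
    rotator: ProxyRotator,
    url: str,
    send: Callable[[str, Optional[str]], str],
) -> str:
    """Try each route in turn; `send` is an assumed callback that performs the request."""
    last_error: Optional[BaseException] = None
    for proxy in rotator.candidates():
        try:
            return send(url, proxy)
        except OSError as exc:  # connection-level failure: move on to the next route
            last_error = exc
    raise ConnectionError(f"all routes failed for {url}") from last_error
```

Because the cycle is shared across calls, consecutive requests start from different proxies, and a failing proxy only costs one extra attempt before the next route is tried.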

Alternatively, start from the library's tuned defaults and adjust individual fields:

from scraper import Scraper, default_config

config = default_config()
config.max_concurrent_requests = 4
s = Scraper(origin="https://example.com", config=config)
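
To picture how min_request_interval and max_concurrent_requests interact, the following sketch combines a semaphore (the concurrency cap) with a reserved "next start time" (the spacing between requests). RequestGate and its slot() method are hypothetical names for this example only; this is one plausible way to implement such limits, not a description of the library's internals.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class RequestGate:
    """Hypothetical gate: caps in-flight requests and spaces out their start times."""

    def __init__(
        self,
        min_interval: float,
        max_concurrent: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._next_start = 0.0  # earliest time the next request may begin
        self._clock = clock
        self._sleep = sleep

    @contextmanager
    def slot(self) -> Iterator[None]:
        with self._slots:  # blocks while max_concurrent requests are in flight
            with self._lock:
                now = self._clock()
                wait = max(0.0, self._next_start - now)
                # Reserve this request's start time so later callers queue behind it.
                self._next_start = max(now, self._next_start) + self._min_interval
            if wait > 0:
                self._sleep(wait)
            yield
```

A caller would wrap each request in `with gate.slot(): ...`. The clock and sleep parameters exist only so the behaviour can be exercised without real waiting.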

Browser fingerprint impersonation

A plain requests stack has a fixed OpenSSL TLS fingerprint and only speaks HTTP/1.1 — both of which modern Cloudflare detects. Set impersonate (requires the impersonate extra) to route requests through curl_cffi, reproducing a real browser's TLS (JA3/JA4) and HTTP/2 fingerprint:

from scraper import Scraper, default_config

config = default_config()
config.impersonate = "chrome"   # or "firefox", "chrome124", "safari", …
s = Scraper(origin="https://example.com", config=config)

The spoofed User-Agent family and Client Hints are aligned with the impersonation target automatically.
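
To make "aligned" concrete, the sketch below derives a few sec-ch-ua headers from a Chrome-style User-Agent string. client_hints_for is a hypothetical function written for this example, and the "Not-A.Brand" entry is a placeholder: real Chrome builds vary that filler brand between versions. It is a simplification under those assumptions, not the library's implementation, and it does not cover sec-fetch-* headers.

```python
import re
from typing import Dict

# Order matters: Android user agents also contain "Linux".
_PLATFORMS = (
    ("Windows", "Windows"),
    ("Android", "Android"),
    ("Macintosh", "macOS"),
    ("Linux", "Linux"),
)


def client_hints_for(user_agent: str) -> Dict[str, str]:
    """Return sec-ch-ua headers consistent with a Chrome-style UA string."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        # Firefox and Safari do not send sec-ch-ua headers, so emit none.
        return {}
    major = match.group(1)
    platform = next(
        (name for token, name in _PLATFORMS if token in user_agent), "Unknown"
    )
    mobile = "?1" if "Mobile" in user_agent else "?0"
    brands = (
        f'"Chromium";v="{major}", '
        f'"Google Chrome";v="{major}", '
        '"Not-A.Brand";v="99"'  # placeholder filler brand (assumption)
    )
    return {
        "sec-ch-ua": brands,
        "sec-ch-ua-mobile": mobile,
        "sec-ch-ua-platform": f'"{platform}"',
    }
```

The point of the exercise is consistency: a Chrome 124 User-Agent paired with hints that claim another version or platform is an easy mismatch for a server to spot.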

Browser-assisted clearance

For managed challenges / Turnstile that can't be solved headlessly, solve the challenge once in a real browser (e.g. nodriver/Playwright), then hand the cf_clearance cookie and the browser's exact User-Agent to the session:

s.apply_browser_clearance(
    "https://protected.example.com",
    cf_clearance="<value from the browser>",
    user_agent="<the browser's exact UA>",
    cookies={"__cf_bm": "<optional>"},
)
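
Conceptually, reusing a clearance means that later requests carry the same cookie and the same User-Agent the browser used when it solved the challenge. The sketch below shows only that idea as plain request headers; clearance_headers is a hypothetical helper for illustration and says nothing about how apply_browser_clearance is actually implemented.

```python
from typing import Dict, Mapping, Optional


def clearance_headers(
    cf_clearance: str,
    user_agent: str,
    cookies: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build headers that replay a browser-solved clearance (illustrative only)."""
    jar = {"cf_clearance": cf_clearance}
    jar.update(cookies or {})  # optional extras such as __cf_bm
    cookie_header = "; ".join(f"{name}={value}" for name, value in jar.items())
    # The User-Agent must match the browser that obtained the cookie.
    return {"User-Agent": user_agent, "Cookie": cookie_header}
```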

Scraper API

Method                           Description
get(url, **kwargs)               GET request, returns Response
post(url, **kwargs)              POST request, returns Response
ping(url, timeout=5)             HEAD request for reachability check
submit_form(url, data, ...)      POST with form encoding or multipart
get_json(url, headers, ...)      GET and parse response as JSON
post_json(url, data, ...)        POST and parse response as JSON
get_soup(url, headers, ...)      GET and return a PageSoup
post_soup(url, data, ...)        POST and return a PageSoup
get_image(url, ...)              GET and return a PIL.Image
get_file(url, output_file, ...)  Stream download to file (abort-safe)
make_soup(data, encoding, ...)   Parse Response, bytes, or str into PageSoup
set_header(key, value)           Set a default session header
set_cookie(name, value)          Set a session cookie
reset()                          Clear cookies, headers, and state
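
get_file is described as a streamed, abort-safe download. A common way to get that property is to write chunks to a temporary file in the target directory and rename it into place only once the stream completes. The sketch below shows that pattern with the standard library; save_stream and should_abort are hypothetical names, and whether the library works exactly this way is an assumption.

```python
import os
import tempfile
from typing import Callable, Iterable


def save_stream(
    chunks: Iterable[bytes],
    output_file: str,
    should_abort: Callable[[], bool] = lambda: False,
) -> bool:
    """Write chunks to output_file atomically; return False if aborted."""
    directory = os.path.dirname(os.path.abspath(output_file))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                if should_abort():
                    return False  # partial data is discarded in `finally`
                tmp.write(chunk)
        os.replace(tmp_path, output_file)  # atomic swap into place
        return True
    finally:
        # After a successful replace the temp path no longer exists.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this approach an interrupted download never leaves a truncated file at output_file: readers see either the previous file or the complete new one.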

PageSoup API

PageSoup wraps a BeautifulSoup Tag. Every selection method returns a PageSoup (never None); an empty PageSoup is falsy and returns safe defaults for all operations.

soup = s.get_soup("https://example.com")

# Selection
soup.select("ul li")                 # → List[PageSoup]
soup.select_one(".title")            # → PageSoup (empty if not found)
soup.find("div", class_="content")  # → PageSoup
soup.find_all("a")                   # → List[PageSoup]
soup.xpath("//div[@class='body']")  # → List[PageSoup]
soup.closest(".container")          # → nearest matching ancestor
soup.parents(".wrapper")            # → generator of matching ancestors

# Attribute access (el is any PageSoup element)
el = soup.select_one("a")
el["href"]                           # get_attr shorthand, returns "" if missing
el.get_attr("src", default="/")
el.has_attr("data-id")

# Text / HTML
el.text                              # stripped text, always str
el.get_text(separator="\n")
el.inner_html
el.outer_html

# Navigation
el.parent
el.children                          # List[PageSoup], excludes text nodes
el.next_sibling
el.previous_sibling

# Mutation
soup.decompose(".ads")               # remove elements matching selector
el.replace_with(new_el)
el.append(child)
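
The "never None" behaviour is an instance of the null-object pattern. The sketch below shows the idea on top of the standard library's xml.etree.ElementTree rather than BeautifulSoup; SafeNode is a hypothetical class written for this example and is not PageSoup itself.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class SafeNode:
    """Hypothetical null-object wrapper: lookups never return None."""

    def __init__(self, element: Optional[ET.Element] = None) -> None:
        self._el = element

    def __bool__(self) -> bool:
        # An empty wrapper is falsy, so `if node:` still works as a presence check.
        return self._el is not None

    def find(self, path: str) -> "SafeNode":
        found = self._el.find(path) if self._el is not None else None
        return SafeNode(found)

    def find_all(self, path: str) -> List["SafeNode"]:
        if self._el is None:
            return []
        return [SafeNode(el) for el in self._el.findall(path)]

    @property
    def text(self) -> str:
        if self._el is None:
            return ""
        return "".join(self._el.itertext()).strip()

    def __getitem__(self, name: str) -> str:
        # Missing attribute or empty wrapper both yield "".
        if self._el is None:
            return ""
        return self._el.get(name, "")
```

Because every lookup returns another wrapper, a chain such as node.find("span").find("b").text degrades to an empty string when an element is missing instead of raising AttributeError.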

Development

uv is required. Clone the repo and install all dependencies including dev extras:

git clone https://github.com/lncrawl/scraper.git
cd scraper
uv sync --all-groups --all-extras

Tasks are managed with poethepoet:

Command              Description
uv run poe lint      Run ruff + pyright
uv run poe lint-fix  Auto-fix ruff violations and reformat
uv run poe test      Run the test suite
uv run poe build     Lint → test → build wheel
uv run poe publish   Build → publish to PyPI

Testing

Tests live in tests/ and run with pytest:

uv run poe test

# or directly
uv run pytest
uv run pytest -v                   # verbose
uv run pytest tests/test_dummy.py  # a single file

Mock HTTP with responses (a dev dependency) so tests make no real network calls.

License

Apache-2.0
