HTTP scraper with Cloudflare bypass, browser fingerprint impersonation, stealth mode, proxy support, and a null-safe BeautifulSoup wrapper.

Maintainers

dipu-bd

LNCrawl Scraper

Features

Cloudflare bypass — handles CF challenges v1, v2, v3, and Turnstile transparently
Browser fingerprint impersonation — optional curl_cffi transport that reproduces a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2 fingerprint
Browser-assisted clearance — reuse a cf_clearance cookie solved by a real browser for managed-challenge / Turnstile sites
Accurate Client Hints — sec-ch-ua / sec-fetch-* derived from the chosen UA
Stealth mode — human-like delays, randomized headers, browser quirks
Proxy support — round-robin proxy rotation with Tor integration and direct fallback
Rate limiting — configurable per-request intervals and concurrency cap
PageSoup — null-safe BeautifulSoup wrapper; selection methods never return None
HTTP helpers — get_soup, get_json, get_image, get_file, and more
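
The proxy feature above (round-robin rotation with a direct fallback) can be pictured with a small stdlib-only sketch. This is not the library's implementation: `ProxyRotator`, `fetch_with_rotation`, and the `fetch` callable are hypothetical names used only to illustrate the idea.

```python
from typing import Callable, List, Optional, Sequence


class ProxyRotator:
    """Hypothetical helper: hands out proxies in round-robin order."""

    def __init__(self, proxy_urls: Sequence[str], fallback_to_direct: bool = True) -> None:
        self._proxies = list(proxy_urls)
        self._fallback = fallback_to_direct
        self._next = 0  # index of the proxy that the next request starts with

    def candidates(self) -> List[Optional[str]]:
        """Routes to try for one request; None means a direct connection."""
        n = len(self._proxies)
        order: List[Optional[str]] = [self._proxies[(self._next + i) % n] for i in range(n)]
        if n:
            self._next = (self._next + 1) % n  # rotate the starting point
        if self._fallback:
            order.append(None)
        return order


def fetch_with_rotation(rotator: ProxyRotator, fetch: Callable[[Optional[str]], str]) -> str:
    """Try each candidate route until one succeeds.

    `fetch` is an assumed callable that takes a proxy URL (or None) and
    raises OSError when that route is unreachable.
    """
    last_error: Optional[Exception] = None
    for proxy in rotator.candidates():
        try:
            return fetch(proxy)
        except OSError as exc:
            last_error = exc
    raise ConnectionError("all proxy routes failed") from last_error
```

Each call to `candidates()` starts one position further along the list, so load spreads across proxies while every request can still fall through to a direct connection.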

Installation

pip install lncrawl-scraper

# optional extras:
pip install "lncrawl-scraper[impersonate]"   # browser TLS/HTTP-2 impersonation (curl_cffi)
pip install "lncrawl-scraper[image]"         # get_image() support (Pillow)

Quick start

from scraper import Scraper

s = Scraper(origin="https://example.com")

# HTML
soup = s.get_soup("https://example.com/page")
title = soup.select_one("h1.title").text          # "" if not found, never raises
links = [a["href"] for a in soup.select("a")]

# JSON
data = s.get_json("https://example.com/api/data")

# File download
s.get_file("https://example.com/file.zip", output_file="file.zip")

# Image (returns PIL.Image)
img = s.get_image("https://example.com/cover.jpg")

Examples

Runnable examples live in examples/ — run any with uv run python examples/<file>.py.

| Example | Shows |
| --- | --- |
| 01_basic_html.py | Fetch a page and extract data with `get_soup` / `PageSoup` |
| 02_pagesoup_parsing.py | PageSoup tour: CSS select, attrs, navigation, XPath |
| 03_json_api.py | `get_json` / `post_json` and raw `Response` access |
| 04_files_and_images.py | `get_file` (streamed, atomic) and `get_image` (Pillow) |
| 05_forms_cookies_headers.py | `submit_form`, `set_header`, `set_cookie`, `reset` |
| 06_configuration.py | `ScraperConfig`, `default_config()`, stealth, browser identity |
| 07_impersonation.py | Real browser TLS/HTTP-2 fingerprint via `impersonate` |
| 08_browser_clearance.py | Reuse a `cf_clearance` solved by a real browser |
| 09_proxies.py | Proxy rotation (HTTP/SOCKS/round-robin) |
| 10_tor_proxy.py | Tor integration and identity refresh |
| 11_error_handling.py | HTTP, Cloudflare, and abort error handling |
| 12_concurrency_and_abort.py | Threaded fetches and cooperative `abort()` |

Configuration

Pass a ScraperConfig for full control:

from scraper import Scraper
from scraper.config import ScraperConfig, ProxyConfig, StealthConfig, BrowserConfig

config = ScraperConfig(
    min_request_interval=2.0,
    max_concurrent_requests=1,
    rotate_tls_ciphers=True,
    stealth=StealthConfig(
        enabled=True,
        min_delay=1.0,
        max_delay=3.0,
        human_like_delays=True,
        randomize_headers=True,
        browser_quirks=True,
    ),
    proxy=ProxyConfig(
        proxy_urls=["http://proxy1:8080", "http://proxy2:8080"],
        fallback_to_direct=True,
    ),
    browser=BrowserConfig(browser="firefox", platform="windows", desktop=True),
)

s = Scraper(origin="https://example.com", config=config)

Or start from the library's tuned defaults and tweak:

from scraper import Scraper, default_config

config = default_config()
config.max_concurrent_requests = 4
s = Scraper(origin="https://example.com", config=config)
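
To make the two throttling knobs concrete, here is a minimal stdlib sketch of what a minimum request interval combined with a concurrency cap could look like. It only illustrates the concept; `RateLimiter` is a hypothetical class, and the library's own limiter may work differently.

```python
import threading
import time


class RateLimiter:
    """Hypothetical limiter: spaces out request starts and caps concurrency."""

    def __init__(self, min_request_interval: float, max_concurrent_requests: int) -> None:
        self._interval = min_request_interval
        self._slots = threading.BoundedSemaphore(max_concurrent_requests)
        self._lock = threading.Lock()
        self._next_start = 0.0  # earliest monotonic time the next request may start

    def __enter__(self) -> "RateLimiter":
        self._slots.acquire()  # blocks while the concurrency cap is reached
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            # Reserve the following start slot before sleeping.
            self._next_start = max(now, self._next_start) + self._interval
        if wait:
            time.sleep(wait)
        return self

    def __exit__(self, *exc_info) -> bool:
        self._slots.release()
        return False  # never swallow exceptions


limiter = RateLimiter(min_request_interval=0.05, max_concurrent_requests=1)
with limiter:
    pass  # a request would be issued here
```

Reserving the next slot under the lock, then sleeping outside it, keeps waiting threads from serializing on the lock itself.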

Browser fingerprint impersonation

A plain requests stack has a fixed OpenSSL TLS fingerprint and only speaks HTTP/1.1 — both of which modern Cloudflare detects. Set impersonate (requires the impersonate extra) to route requests through curl_cffi, reproducing a real browser's TLS (JA3/JA4) and HTTP/2 fingerprint:

from scraper import Scraper, default_config

config = default_config()
config.impersonate = "chrome"   # or "firefox", "chrome124", "safari", …
s = Scraper(origin="https://example.com", config=config)

The spoofed User-Agent family and Client Hints are aligned with the impersonation target automatically.
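
As a rough illustration of what aligning Client Hints with a User-Agent involves, the sketch below derives three `sec-ch-ua*` headers from a Chrome UA string. It is an assumption-laden sketch, not the library's logic: `client_hints_for` is a hypothetical function, and the GREASE brand (`Not-A.Brand`) is a placeholder, since real Chrome varies that entry between versions.

```python
import re
from typing import Dict


def client_hints_for(user_agent: str) -> Dict[str, str]:
    """Derive a minimal set of Client Hints headers from a UA string.

    Firefox and Safari do not send sec-ch-ua headers, so any UA without
    a Chrome/<major> token yields an empty dict.
    """
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        return {}
    major = match.group(1)

    # Check Android before Linux: Android UAs also contain "Linux".
    if "Android" in user_agent:
        platform = "Android"
    elif "Windows NT" in user_agent:
        platform = "Windows"
    elif "Macintosh" in user_agent:
        platform = "macOS"
    else:
        platform = "Linux"

    return {
        # The GREASE brand below is illustrative only.
        "sec-ch-ua": f'"Chromium";v="{major}", "Google Chrome";v="{major}", "Not-A.Brand";v="99"',
        "sec-ch-ua-mobile": "?1" if "Mobile" in user_agent else "?0",
        "sec-ch-ua-platform": f'"{platform}"',
    }
```

A mismatch between these headers and the User-Agent (for example a Firefox UA that also sends `sec-ch-ua`) is an easy signal for bot detection, which is why the two are kept in sync.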

Browser-assisted clearance

For managed challenges / Turnstile that can't be solved headlessly, solve the challenge once in a real browser (e.g. nodriver/Playwright), then hand the cf_clearance cookie and the browser's exact User-Agent to the session:

s.apply_browser_clearance(
    "https://protected.example.com",
    cf_clearance="<value from the browser>",
    user_agent="<the browser's exact UA>",
    cookies={"__cf_bm": "<optional>"},
)
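
If the browser side exports its cookies as a list of dicts with `name` and `value` keys (the shape Playwright's `BrowserContext.cookies()` returns), the arguments for the call above could be assembled roughly as follows. `clearance_kwargs` is a hypothetical helper, and forwarding every `__cf`-prefixed cookie as an extra is an assumption made for illustration.

```python
from typing import Any, Dict, Iterable, Mapping


def clearance_kwargs(cookies: Iterable[Mapping[str, Any]], user_agent: str) -> Dict[str, Any]:
    """Build keyword arguments for apply_browser_clearance from exported cookies."""
    by_name = {cookie["name"]: cookie["value"] for cookie in cookies}
    if "cf_clearance" not in by_name:
        raise LookupError("no cf_clearance cookie found; was the challenge solved?")
    # Assumption: other Cloudflare cookies (e.g. __cf_bm) are passed along as extras.
    extras = {name: value for name, value in by_name.items() if name.startswith("__cf")}
    return {
        "cf_clearance": by_name["cf_clearance"],
        "user_agent": user_agent,
        "cookies": extras,
    }


# Sample data standing in for a real browser export.
exported = [
    {"name": "cf_clearance", "value": "abc123", "domain": ".example.com"},
    {"name": "__cf_bm", "value": "xyz", "domain": ".example.com"},
    {"name": "session", "value": "ignored", "domain": ".example.com"},
]
kwargs = clearance_kwargs(exported, user_agent="Mozilla/5.0 (placeholder UA)")
# s.apply_browser_clearance("https://protected.example.com", **kwargs)
```

The User-Agent must be the one the browser actually used when it solved the challenge, since the clearance cookie is tied to it.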

`Scraper` API

| Method | Description |
| --- | --- |
| `get(url, **kwargs)` | GET request, returns `Response` |
| `post(url, **kwargs)` | POST request, returns `Response` |
| `ping(url, timeout=5)` | HEAD request for reachability check |
| `submit_form(url, data, ...)` | POST with form encoding or multipart |
| `get_json(url, headers, ...)` | GET and parse response as JSON |
| `post_json(url, data, ...)` | POST and parse response as JSON |
| `get_soup(url, headers, ...)` | GET and return a `PageSoup` |
| `post_soup(url, data, ...)` | POST and return a `PageSoup` |
| `get_image(url, ...)` | GET and return a `PIL.Image` |
| `get_file(url, output_file, ...)` | Stream download to file (abort-safe) |
| `make_soup(data, encoding, ...)` | Parse `Response`, `bytes`, or `str` into `PageSoup` |
| `set_header(key, value)` | Set a default session header |
| `set_cookie(name, value)` | Set a session cookie |
| `reset()` | Clear cookies, headers, and state |
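
The "streamed, atomic, abort-safe" behaviour described for `get_file` can be approximated with a temp-file-then-rename pattern. The sketch below is a generic stdlib illustration under that assumption, not the library's code; `write_atomically` and its `abort` event are hypothetical names.

```python
import os
import tempfile
import threading
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], output_file: str, abort: threading.Event) -> bool:
    """Stream chunks to a temp file, then rename it into place.

    Returns True when the file was written and False when aborted.
    A partial download never appears at `output_file`.
    """
    directory = os.path.dirname(os.path.abspath(output_file))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                if abort.is_set():  # cooperative abort, checked between chunks
                    return False
                tmp.write(chunk)
        os.replace(tmp_path, output_file)  # atomic rename
        return True
    finally:
        # Clean up the temp file on abort or error; after a successful
        # rename it no longer exists.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

In a real download, `chunks` would be the response body iterated in blocks, and another thread would set `abort` to cancel.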

`PageSoup` API

PageSoup wraps a BeautifulSoup Tag. Every selection method returns a PageSoup (never None); an empty PageSoup is falsy and returns safe defaults for all operations.

soup = s.get_soup("https://example.com")

# Selection
soup.select("ul li")                 # → List[PageSoup]
soup.select_one(".title")            # → PageSoup (empty if not found)
soup.find("div", class_="content")  # → PageSoup
soup.find_all("a")                   # → List[PageSoup]
soup.xpath("//div[@class='body']")  # → List[PageSoup]
soup.closest(".container")          # → nearest matching ancestor
soup.parents(".wrapper")            # → generator of matching ancestors

# Attribute access
el = soup.select_one("a")            # any PageSoup element
el["href"]                           # get_attr shorthand, returns "" if missing
el.get_attr("src", default="/")
el.has_attr("data-id")

# Text / HTML
el.text                              # stripped text, always str
el.get_text(separator="\n")
el.inner_html
el.outer_html

# Navigation
el.parent
el.children                          # List[PageSoup], excludes text nodes
el.next_sibling
el.previous_sibling

# Mutation
soup.decompose(".ads")               # remove elements matching selector
el.replace_with(new_el)
el.append(child)
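
The "never None" contract is an instance of the null-object pattern. The following sketch shows the idea with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup; `SafeNode` is a hypothetical class written for illustration and does not mirror PageSoup's internals.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class SafeNode:
    """Wraps an Element or nothing; lookups never return None."""

    def __init__(self, element: Optional[ET.Element] = None) -> None:
        self._el = element

    def __bool__(self) -> bool:
        return self._el is not None  # an empty wrapper is falsy

    def find(self, path: str) -> "SafeNode":
        return SafeNode(self._el.find(path) if self._el is not None else None)

    def find_all(self, path: str) -> List["SafeNode"]:
        if self._el is None:
            return []
        return [SafeNode(child) for child in self._el.findall(path)]

    def __getitem__(self, name: str) -> str:
        # Missing attributes and empty wrappers both yield "".
        return self._el.get(name, "") if self._el is not None else ""

    @property
    def text(self) -> str:
        if self._el is None:
            return ""
        return "".join(self._el.itertext()).strip()


root = SafeNode(ET.fromstring('<div><h1 class="title"> Hello </h1><a href="/next">Next</a></div>'))
title = root.find("h1").text              # text of the h1, stripped
missing = root.find("h2").find("span")    # chained lookups on a miss stay safe
```

Because a miss returns an empty wrapper instead of None, chained lookups need no intermediate checks; a single truthiness test at the end is enough.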

Development

uv is required. Clone the repo and install all dependencies including dev extras:

git clone https://github.com/lncrawl/scraper.git
cd scraper
uv sync --all-groups --all-extras

Tasks are managed with poethepoet:

| Command | Description |
| --- | --- |
| `uv run poe lint` | Run ruff + pyright |
| `uv run poe lint-fix` | Auto-fix ruff violations and reformat |
| `uv run poe test` | Run the test suite |
| `uv run poe build` | Lint → test → build wheel |
| `uv run poe publish` | Build → publish to PyPI |

Testing

Tests live in tests/ and run with pytest:

uv run poe test

# or directly
uv run pytest
uv run pytest -v                   # verbose
uv run pytest tests/test_dummy.py  # a single file

Mock HTTP with responses (a dev dependency) so tests make no real network calls.
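
The same no-network principle can be shown with the standard library alone. The sketch below swaps in `unittest.mock` rather than `responses`, and `fetch_json` is a hypothetical function under test built on `urllib`; it is meant to illustrate the testing approach, not this project's test suite.

```python
import json
import urllib.request
from unittest import mock


def fetch_json(url: str) -> dict:
    """Hypothetical code under test: GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def test_fetch_json_makes_no_network_call() -> None:
    # Fake the context manager that urlopen would return.
    fake = mock.MagicMock()
    fake.__enter__.return_value.read.return_value = b'{"ok": true}'
    with mock.patch("urllib.request.urlopen", return_value=fake) as urlopen:
        assert fetch_json("https://example.com/api/data") == {"ok": True}
    urlopen.assert_called_once_with("https://example.com/api/data")
```

With `responses`, the equivalent step is registering a canned reply for the URL before calling the code under test.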

License

Apache-2.0

Release history

| Version | Released |
| --- | --- |
| 0.2.4 (this version) | Jun 28, 2026 |
| 0.2.3 | Jun 16, 2026 |
| 0.2.2 | Jun 16, 2026 |
| 0.2.1 | Jun 16, 2026 |
| 0.1.2 | Jun 12, 2026 |
| 0.1.1 | Jun 12, 2026 |
| 0.1.0 | Jun 4, 2026 |

Download files

Source Distribution

lncrawl_scraper-0.2.4.tar.gz (177.0 kB), uploaded Jun 28, 2026, Source

Built Distribution

lncrawl_scraper-0.2.4-py3-none-any.whl (52.8 kB), uploaded Jun 28, 2026, Python 3

File details

Details for the file lncrawl_scraper-0.2.4.tar.gz.

File metadata

Download URL: lncrawl_scraper-0.2.4.tar.gz
Upload date: Jun 28, 2026
Size: 177.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lncrawl_scraper-0.2.4.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `cb65334bc87f264a8ca28608995a258ccf381aa349b8042768f56e44ce358568` |
| MD5 | `4d996e10a2f0df741bda34718b00661c` |
| BLAKE2b-256 | `8a84e27e2a884d9559df20aa1c39a5ec5b9f31c50597512b1594f15a7a21d083` |
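
A downloaded archive can be checked against the published SHA256 digest with `hashlib`. This is a generic sketch; `sha256_of` and `verify` are hypothetical helper names, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest published above for lncrawl_scraper-0.2.4.tar.gz.
EXPECTED_SHA256 = "cb65334bc87f264a8ca28608995a258ccf381aa349b8042768f56e44ce358568"


def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Hash a file in fixed-size blocks so large files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


# verify("lncrawl_scraper-0.2.4.tar.gz")
```

pip can enforce the same check at install time through hash-pinned requirements files.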


Provenance

The following attestation bundles were made for lncrawl_scraper-0.2.4.tar.gz:

Publisher: publish.yml on lncrawl/scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lncrawl_scraper-0.2.4.tar.gz
- Subject digest: cb65334bc87f264a8ca28608995a258ccf381aa349b8042768f56e44ce358568
- Sigstore transparency entry: 1994264728
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: lncrawl/scraper@a5c0fdbdad7c6178d1fe5cf22d9b4ee18158c226
- Branch / Tag: refs/heads/main
- Owner: https://github.com/lncrawl
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a5c0fdbdad7c6178d1fe5cf22d9b4ee18158c226
- Trigger Event: workflow_dispatch

File details

Details for the file lncrawl_scraper-0.2.4-py3-none-any.whl.

File metadata

Download URL: lncrawl_scraper-0.2.4-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 52.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lncrawl_scraper-0.2.4-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `013eaed909f0a80089042f74c48fa2346597f572c537dede49ddf9e8133e07fe` |
| MD5 | `e3fdbcc79a0f64343bd22f03bfd6d598` |
| BLAKE2b-256 | `6cb91aceaf0cbdd47ddbaa2aa621847097141e94f73ac5502cffeb9c3e0b4bab` |


Provenance

The following attestation bundles were made for lncrawl_scraper-0.2.4-py3-none-any.whl:

Publisher: publish.yml on lncrawl/scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lncrawl_scraper-0.2.4-py3-none-any.whl
- Subject digest: 013eaed909f0a80089042f74c48fa2346597f572c537dede49ddf9e8133e07fe
- Sigstore transparency entry: 1994264834
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: lncrawl/scraper@a5c0fdbdad7c6178d1fe5cf22d9b4ee18158c226
- Branch / Tag: refs/heads/main
- Owner: https://github.com/lncrawl
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a5c0fdbdad7c6178d1fe5cf22d9b4ee18158c226
- Trigger Event: workflow_dispatch
