LNCrawl Scraper
HTTP scraper with Cloudflare bypass, browser fingerprint impersonation, stealth mode, proxy support, and a null-safe BeautifulSoup wrapper.
Features
- Cloudflare bypass: handles CF challenges v1, v2, v3, and Turnstile transparently
- Browser fingerprint impersonation: optional curl_cffi transport that reproduces a real Chrome/Firefox TLS (JA3/JA4) and HTTP/2 fingerprint
- Browser-assisted clearance: reuse a cf_clearance cookie solved by a real browser for managed-challenge / Turnstile sites
- Accurate Client Hints: sec-ch-ua / sec-fetch-* derived from the chosen UA
- Stealth mode: human-like delays, randomized headers, browser quirks
- Proxy support: round-robin proxy rotation with Tor integration and direct fallback
- Rate limiting: configurable per-request intervals and concurrency cap
- PageSoup: null-safe BeautifulSoup wrapper; selection methods never return None
- HTTP helpers: get_soup, get_json, get_image, get_file, and more
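To make "round-robin proxy rotation with direct fallback" concrete, here is a minimal standard-library sketch of the idea. It is not the library's implementation: the ProxyRotator class and the fetch_with_rotation helper are hypothetical names invented for illustration, and the fetch callable stands in for whatever actually performs the request.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


class ProxyRotator:
    """Hypothetical helper: hands out proxies in round-robin order."""

    def __init__(self, proxy_urls: Iterable[str], fallback_to_direct: bool = True) -> None:
        self._proxies = list(proxy_urls)
        self._fallback = fallback_to_direct
        self._next = 0  # index of the proxy that the next request tries first

    def candidates(self) -> list[Optional[str]]:
        """Return the proxies to try for one request; None means 'connect directly'."""
        count = len(self._proxies)
        order: list[Optional[str]] = [
            self._proxies[(self._next + i) % count] for i in range(count)
        ]
        if count:
            # Rotate the starting point so consecutive requests spread across proxies.
            self._next = (self._next + 1) % count
        if self._fallback:
            order.append(None)
        return order


def fetch_with_rotation(rotator: ProxyRotator, fetch: Callable[[Optional[str]], str]) -> str:
    """Try each candidate in turn and return the first successful result."""
    last_error: Optional[Exception] = None
    for proxy in rotator.candidates():
        try:
            return fetch(proxy)
        except OSError as exc:  # connection-level failures move on to the next candidate
            last_error = exc
    raise ConnectionError("all proxies failed") from last_error
```

With two proxies configured, the first request would try proxy 1, then proxy 2, then a direct connection; the second request would start at proxy 2.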
Installation
pip install lncrawl-scraper
# optional extras:
pip install "lncrawl-scraper[impersonate]" # browser TLS/HTTP-2 impersonation (curl_cffi)
pip install "lncrawl-scraper[image]" # get_image() support (Pillow)
Quick start
from scraper import Scraper
s = Scraper(origin="https://example.com")
# HTML
soup = s.get_soup("https://example.com/page")
title = soup.select_one("h1.title").text # "" if not found, never raises
links = [a["href"] for a in soup.select("a")]
# JSON
data = s.get_json("https://example.com/api/data")
# File download
s.get_file("https://example.com/file.zip", output_file="file.zip")
# Image (returns PIL.Image)
img = s.get_image("https://example.com/cover.jpg")
Examples
Runnable examples live in examples/; run any with uv run python examples/<file>.py.
| Example | Shows |
|---|---|
| 01_basic_html.py | Fetch a page and extract data with get_soup / PageSoup |
| 02_pagesoup_parsing.py | PageSoup tour: CSS select, attrs, navigation, XPath |
| 03_json_api.py | get_json / post_json and raw Response access |
| 04_files_and_images.py | get_file (streamed, atomic) and get_image (Pillow) |
| 05_forms_cookies_headers.py | submit_form, set_header, set_cookie, reset |
| 06_configuration.py | ScraperConfig, default_config(), stealth, browser identity |
| 07_impersonation.py | Real browser TLS/HTTP-2 fingerprint via impersonate |
| 08_browser_clearance.py | Reuse a cf_clearance solved by a real browser |
| 09_proxies_and_tor.py | Proxy rotation and Tor identity refresh |
| 10_concurrency_and_abort.py | Threaded fetches and cooperative abort() |
| 11_error_handling.py | HTTP, Cloudflare, and abort error handling |
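The "cooperative abort()" pattern named in example 10 can be pictured with a shared threading.Event that every worker checks before doing work. The sketch below uses only the standard library; the Worker class and AbortedError are hypothetical stand-ins and say nothing about how the library implements abort() internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AbortedError(Exception):
    """Hypothetical error raised when work is cancelled."""


class Worker:
    """Stand-in for a scraper session that supports cooperative cancellation."""

    def __init__(self) -> None:
        self._abort = threading.Event()

    def abort(self) -> None:
        # Only sets a flag; running threads notice it at their next check.
        self._abort.set()

    def fetch(self, url: str, delay: float = 0.0) -> str:
        # Event.wait doubles as an interruptible sleep: it returns True as
        # soon as abort() is called, instead of blocking for the full delay.
        if self._abort.wait(delay):
            raise AbortedError(url)
        return f"fetched {url}"


worker = Worker()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(worker.fetch, ["page-1", "page-2"]))
```

After worker.abort() is called, any further worker.fetch(...) raises AbortedError rather than starting new work.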
Configuration
Pass a ScraperConfig for full control:
from scraper import Scraper
from scraper.config import ScraperConfig, ProxyConfig, StealthConfig, BrowserConfig
config = ScraperConfig(
    min_request_interval=2.0,
    max_concurrent_requests=1,
    rotate_tls_ciphers=True,
    stealth=StealthConfig(
        enabled=True,
        min_delay=1.0,
        max_delay=3.0,
        human_like_delays=True,
        randomize_headers=True,
        browser_quirks=True,
    ),
    proxy=ProxyConfig(
        proxy_urls=["http://proxy1:8080", "http://proxy2:8080"],
        fallback_to_direct=True,
    ),
    browser=BrowserConfig(browser="firefox", platform="windows", desktop=True),
)
s = Scraper(origin="https://example.com", config=config)
Or start from the library's tuned defaults and tweak:
from scraper import Scraper, default_config
config = default_config()
config.max_concurrent_requests = 4
s = Scraper(origin="https://example.com", config=config)
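As a rough illustration of what min_request_interval and max_concurrent_requests control, the following standard-library sketch spaces out request start times and caps how many requests are in flight. The RateLimiter class is hypothetical and only mirrors the two setting names; the library's actual mechanism may differ.

```python
import threading
import time


class RateLimiter:
    """Hypothetical limiter: minimum spacing between starts plus a concurrency cap."""

    def __init__(self, min_request_interval: float, max_concurrent_requests: int) -> None:
        self._interval = min_request_interval
        self._slots = threading.BoundedSemaphore(max_concurrent_requests)
        self._lock = threading.Lock()
        self._next_start = 0.0  # earliest monotonic time the next request may start

    def __enter__(self) -> "RateLimiter":
        self._slots.acquire()  # blocks while too many requests are in flight
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            # Reserve the following start slot before releasing the lock.
            self._next_start = max(now, self._next_start) + self._interval
        if wait:
            time.sleep(wait)
        return self

    def __exit__(self, *exc_info: object) -> bool:
        self._slots.release()
        return False  # never swallow exceptions from the request


limiter = RateLimiter(min_request_interval=0.05, max_concurrent_requests=1)
starts = []
for _ in range(3):
    with limiter:
        starts.append(time.monotonic())  # a real request would run here
```

Reserving the next start time under the lock, and sleeping outside it, keeps waiting threads from blocking each other longer than necessary.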
Browser fingerprint impersonation
A plain requests stack has a fixed OpenSSL TLS fingerprint and only speaks
HTTP/1.1 — both of which modern Cloudflare detects. Set impersonate (requires
the impersonate extra) to route requests through curl_cffi, reproducing a
real browser's TLS (JA3/JA4) and HTTP/2 fingerprint:
from scraper import Scraper, default_config
config = default_config()
config.impersonate = "chrome" # or "firefox", "chrome124", "safari", …
s = Scraper(origin="https://example.com", config=config)
The spoofed User-Agent family and Client Hints are aligned with the impersonation target automatically.
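To show what "aligned" could mean in practice, here is a small sketch that derives Chromium-style Client Hints from a User-Agent string. The client_hints function is hypothetical, not the library's code. The brand list is simplified: the "Not.A/Brand" entry is a placeholder for the GREASE brand that real Chrome varies by version, and Chromium-based browsers other than Chrome would need a different brand name.

```python
from __future__ import annotations

import re


def client_hints(user_agent: str) -> dict[str, str]:
    """Derive sec-ch-ua headers from a UA string (illustrative only)."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        # Firefox and Safari do not send sec-ch-ua headers at the time of
        # writing, so a non-Chromium UA gets no Client Hints at all.
        return {}
    major = match.group(1)

    # Android UAs also contain "Linux", so check for Android first.
    if "Android" in user_agent:
        platform = "Android"
    elif "Windows" in user_agent:
        platform = "Windows"
    elif "Macintosh" in user_agent:
        platform = "macOS"
    else:
        platform = "Linux"

    return {
        "sec-ch-ua": (
            f'"Chromium";v="{major}", "Google Chrome";v="{major}", '
            '"Not.A/Brand";v="99"'  # placeholder GREASE brand
        ),
        "sec-ch-ua-mobile": "?1" if "Mobile" in user_agent else "?0",
        "sec-ch-ua-platform": f'"{platform}"',
    }
```

The point of deriving the hints from the UA is consistency: a Chrome 124 User-Agent paired with hints for another version or platform is an easy mismatch for a server to flag.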
Browser-assisted clearance
For managed challenges / Turnstile that can't be solved headlessly, solve the
challenge once in a real browser (e.g. nodriver/Playwright), then hand the
cf_clearance cookie and the browser's exact User-Agent to the session:
s.apply_browser_clearance(
    "https://protected.example.com",
    cf_clearance="<value from the browser>",
    user_agent="<the browser's exact UA>",
    cookies={"__cf_bm": "<optional>"},
)
Scraper API
| Method | Description |
|---|---|
| get(url, **kwargs) | GET request, returns Response |
| post(url, **kwargs) | POST request, returns Response |
| ping(url, timeout=5) | HEAD request for reachability check |
| submit_form(url, data, ...) | POST with form encoding or multipart |
| get_json(url, headers, ...) | GET and parse response as JSON |
| post_json(url, data, ...) | POST and parse response as JSON |
| get_soup(url, headers, ...) | GET and return a PageSoup |
| post_soup(url, data, ...) | POST and return a PageSoup |
| get_image(url, ...) | GET and return a PIL.Image |
| get_file(url, output_file, ...) | Stream download to file (abort-safe) |
| make_soup(data, encoding, ...) | Parse Response, bytes, or str into PageSoup |
| set_header(key, value) | Set a default session header |
| set_cookie(name, value) | Set a session cookie |
| reset() | Clear cookies, headers, and state |
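One common way to make a streamed download abort-safe is to write chunks to a temporary file and rename it into place only after the last chunk arrives. The sketch below shows that general technique with the standard library; write_atomically is a hypothetical helper, and whether get_file works exactly this way is an assumption.

```python
import os
import tempfile
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], output_file: str) -> None:
    """Stream chunks to a temp file, then move it over output_file in one step."""
    directory = os.path.dirname(os.path.abspath(output_file))
    # The temp file lives in the target directory so the final rename stays
    # on one filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, output_file)
    except BaseException:
        # On any failure (including an abort mid-stream) discard the partial
        # file and leave any previous output_file untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader of output_file therefore sees either the old complete file or the new complete file, never a half-written one.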
PageSoup API
PageSoup wraps a BeautifulSoup Tag. Every selection method returns a PageSoup (never None); an empty PageSoup is falsy and returns safe defaults for all operations.
soup = s.get_soup("https://example.com")
# Selection
soup.select("ul li") # → List[PageSoup]
soup.select_one(".title") # → PageSoup (empty if not found)
soup.find("div", class_="content") # → PageSoup
soup.find_all("a") # → List[PageSoup]
soup.xpath("//div[@class='body']") # → List[PageSoup]
soup.closest(".container") # → nearest matching ancestor
soup.parents(".wrapper") # → generator of matching ancestors
# Attribute access
el["href"] # get_attr shorthand, returns "" if missing
el.get_attr("src", default="/")
el.has_attr("data-id")
# Text / HTML
el.text # stripped text, always str
el.get_text(separator="\n")
el.inner_html
el.outer_html
# Navigation
el.parent
el.children # List[PageSoup], excludes text nodes
el.next_sibling
el.previous_sibling
# Mutation
soup.decompose(".ads") # remove elements matching selector
el.replace_with(new_el)
el.append(child)
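The "never None" behaviour is an instance of the null-object pattern. The sketch below illustrates the pattern with the standard library's xml.etree.ElementTree rather than BeautifulSoup; SafeNode is a hypothetical class written for this example and is not PageSoup itself.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Optional


class SafeNode:
    """Hypothetical null-safe wrapper: lookups always return a SafeNode."""

    def __init__(self, element: Optional[ET.Element] = None) -> None:
        self._el = element

    def __bool__(self) -> bool:
        # An empty wrapper is falsy, so `if node:` still works as a presence check.
        return self._el is not None

    def find(self, path: str) -> SafeNode:
        if self._el is None:
            return SafeNode()  # chaining on an empty node stays empty
        return SafeNode(self._el.find(path))

    @property
    def text(self) -> str:
        if self._el is None:
            return ""
        return "".join(self._el.itertext()).strip()

    def __getitem__(self, name: str) -> str:
        if self._el is None:
            return ""
        return self._el.get(name, "")


root = SafeNode(ET.fromstring('<div><h1 class="title"> Hello </h1><a href="/x">link</a></div>'))
title = root.find("h1").text            # text of the element that exists
missing = root.find("span").find("b")   # no AttributeError on the missing element
```

Because every lookup returns a wrapper, call chains need no intermediate None checks; a single truthiness test at the end is enough when presence matters.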
Development
uv is required. Clone the repo and install all dependencies including dev extras:
git clone https://github.com/lncrawl/scraper.git
cd scraper
uv sync --all-groups --all-extras
Tasks are managed with poethepoet:
| Command | Description |
|---|---|
| uv run poe lint | Run ruff + pyright |
| uv run poe lint-fix | Auto-fix ruff violations and reformat |
| uv run poe test | Run the test suite |
| uv run poe build | Lint → test → build wheel |
| uv run poe publish | Build → publish to PyPI |
Testing
Tests live in tests/ and run with pytest:
uv run poe test
# or directly
uv run pytest
uv run pytest -v # verbose
uv run pytest tests/test_dummy.py # a single file
Mock HTTP with responses (a dev dependency) so tests make no real network calls.
License