Naked Web
The Swiss Army Knife for Web Scraping, Search, and Browser Automation
Dual-engine power: Selenium + Playwright - unified under one clean API.
Installation • Quick Start • Features • Selenium • Playwright • Search • Crawler • Config
What is Naked Web?
Naked Web is a production-grade Python toolkit that combines web scraping, search, and full browser automation into a single cohesive library. It wraps two powerful browser engines - Selenium (via undetected-chromedriver) and Playwright - so you can pick the right tool for every job without juggling separate libraries.
| Capability | Engine | Use Case |
|---|---|---|
| HTTP Scraping | requests + BeautifulSoup | Fast, lightweight page fetching |
| JS Rendering | Selenium (undetected-chromedriver) | Bot-protected sites, stealth scraping |
| Browser Automation | Playwright | Click, type, scroll, extract - full control |
| Google Search | Google CSE JSON API | Search with optional content enrichment |
| Site Crawling | Built-in BFS crawler | Multi-page crawling with depth/duration limits |
Why Naked Web?
- Two engines, one API - Selenium for stealth, Playwright for automation. No need to choose.
- Anti-detection built in - CDP script injection, mouse simulation, realistic scrolling, profile persistence.
- Zero-vision automation - Playwright's `AutoBrowser` indexes every interactive element by number. Click `[3]`, type into `[7]` - no screenshots, no coordinates, no CSS selectors needed.
- Structured extraction - Meta tags, headings, paragraphs, inline styles, assets with rich context metadata.
- HTML pagination - Line-based and character-based chunking for feeding content to LLMs.
- Pydantic models everywhere - Typed, validated, serializable data from every operation.
Installation
# Core (HTTP scraping, search, content extraction, crawling)
pip install -e .
# + Selenium engine (stealth scraping, JS rendering, bot bypass)
pip install -e ".[selenium]"
# + Playwright engine (browser automation, DOM interaction)
pip install -e ".[automation]"
playwright install chromium
# Everything
pip install -e ".[selenium,automation]"
playwright install chromium
Requirements: Python 3.9+
Core dependencies: requests, beautifulsoup4, lxml, pydantic
Features at a Glance
Scraping & Fetching
- Plain HTTP fetch with `requests` + `BeautifulSoup`
- Selenium JS rendering with undetected-chromedriver
- Enhanced stealth mode (CDP injection, mouse simulation, realistic scrolling)
- Persistent browser profiles for bot detection bypass
- `robots.txt` compliance (optional)
- Configurable timeouts, delays, and user agents
Browser Automation (Playwright)
- Launch Chromium, Firefox, or WebKit
- Navigate, click, type, scroll, send keyboard shortcuts
- DOM state extraction with indexed interactive elements
- Content extraction as clean Markdown
- Link extraction across the page
- Dropdown selection, screenshots, JavaScript execution
- Multi-tab management (open, switch, close, list)
- Persistent profile support (cookies, localStorage survive sessions)
Search & Discovery
- Google Custom Search JSON API integration
- Automatic content enrichment per search result
- Optional JS rendering for search result pages
Content Extraction
- Structured bundles: meta tags, headings, paragraphs, inline styles, CSS/font links
- Asset harvesting: stylesheets, scripts, images, media, fonts, links
- Rich context metadata per asset (alt text, captions, snippets, anchor text, source position)
Crawling & Analysis
- Breadth-first site crawler with depth, page count, and duration limits
- Configurable crawl delays to avoid rate limiting
- Regex/glob pattern search across crawled page text and HTML
- Asset pattern matching with contextual windows
Pagination
- Line-based HTML chunking with `next_start`/`has_more` cursors
- Character-based HTML chunking for LLM-sized windows
- Works on both HTML snapshots and raw text
Quick Start
from naked_web import NakedWebConfig, fetch_page
cfg = NakedWebConfig()
# Simple HTTP fetch
snap = fetch_page("https://example.com", cfg=cfg)
print(snap.text[:500])
print(snap.assets.images)
# With Selenium JS rendering
snap = fetch_page("https://example.com", cfg=cfg, use_js=True)
# With full stealth mode (bot-protected sites)
snap = fetch_page("https://example.com", cfg=cfg, use_stealth=True)
Scraping Engine (Selenium)
NakedWeb's Selenium integration uses undetected-chromedriver with layered anti-detection measures. Perfect for sites like Reddit, LinkedIn, and other bot-protected targets.
Basic JS Rendering
from naked_web import fetch_page, NakedWebConfig
cfg = NakedWebConfig()
snap = fetch_page("https://reddit.com/r/Python/", cfg=cfg, use_js=True)
print(snap.text[:500])
Stealth Mode
When use_stealth=True, NakedWeb activates the full anti-detection suite:
snap = fetch_page("https://reddit.com/r/Python/", cfg=cfg, use_stealth=True)
What stealth mode does:
| Layer | Technique |
|---|---|
| CDP Injection | Masks navigator.webdriver, mocks plugins, languages, and permissions |
| Mouse Simulation | Random, human-like cursor movements across the viewport |
| Realistic Scrolling | Variable-speed scrolling with pauses and occasional scroll-backs |
| Enhanced Headers | Proper Accept-Language, viewport config, plugin mocking |
| Profile Persistence | Reuse cookies, history, and cache across sessions |
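The CDP layer comes down to registering a script that runs before any page script does. A minimal stand-alone sketch of the idea (not the library's actual implementation; `execute_cdp_cmd` is available on Selenium's Chromium drivers):

```python
# JavaScript payload evaluated before each page's own scripts run.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
"""

def inject_stealth(driver) -> None:
    """Register the payload via CDP so it runs on every new document."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": STEALTH_JS},
    )
```

With a real driver you would call `inject_stealth(driver)` once, before the first `driver.get(...)`.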
Advanced: Direct Driver Control
from naked_web.utils.stealth import setup_stealth_driver, inject_stealth_scripts
from naked_web import NakedWebConfig
cfg = NakedWebConfig(
selenium_headless=False,
selenium_window_size="1920,1080",
humanize_delay_range=(1.5, 3.5),
)
driver = setup_stealth_driver(cfg, use_profile=False)
try:
driver.get("https://example.com")
html = driver.page_source
finally:
driver.quit()
Stealth Fetch Helper
from naked_web.utils.stealth import fetch_with_stealth
from naked_web import NakedWebConfig
cfg = NakedWebConfig(
selenium_headless=False,
humanize_delay_range=(1.5, 3.5),
)
html, headers, status, final_url = fetch_with_stealth(
"https://www.reddit.com/r/Python/",
cfg=cfg,
perform_mouse_movements=True,
perform_realistic_scrolling=True,
)
print(f"Fetched {len(html)} chars from {final_url}")
Browser Profile Persistence
Fresh browsers are a red flag for bot detectors. NakedWeb supports persistent browser profiles so cookies, history, and cache survive across sessions.
Warm up a profile:
# Create a default profile with organic browsing history
python scripts/warmup_profile.py
# Custom profile with longer warm-up
python scripts/warmup_profile.py --profile "profiles/reddit" --duration 3600
Use the warmed profile:
cfg = NakedWebConfig() # Uses default warmed profile automatically
snap = fetch_page("https://www.reddit.com/r/Python/", cfg=cfg, use_js=True)
Custom profile path:
cfg = NakedWebConfig(selenium_profile_path="profiles/reddit")
snap = fetch_page("https://www.reddit.com/r/Python/", cfg=cfg, use_js=True)
Profile rotation for heavy workloads:
import random
from pathlib import Path
profiles = list(Path("profiles").glob("reddit_*"))
cfg = NakedWebConfig(
selenium_profile_path=str(random.choice(profiles)),
crawl_delay_range=(10.0, 30.0),
)
Profiles store cookies, history, localStorage, cache, and more. Keep them secure and don't commit them to version control.
Automation Engine (Playwright)
The AutoBrowser class provides full browser automation powered by Playwright. It extracts every interactive element on the page and assigns each a numeric index - so you can click, type, and interact without writing CSS selectors or using vision models.
Launch and Navigate
from naked_web.automation import AutoBrowser
browser = AutoBrowser(headless=True, browser_type="chromium")
browser.launch()
browser.navigate("https://example.com")
DOM State Extraction
Get a structured snapshot of every interactive element on the page:
state = browser.get_state()
print(state.to_text())
Example output:
URL: https://example.com
Title: Example Domain
Scroll: 0% (800px viewport, 1200px total)
Interactive elements (3 total):
[1] a "More information..." -> https://www.iana.org/domains/example
[2] input type="text" placeholder="Search..."
[3] button "Submit"
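The indexed listing above is just a numbered rendering of the interactive elements; a stand-alone sketch of such a renderer (the real `PageState.to_text()` may differ in detail):

```python
from dataclasses import dataclass

@dataclass
class Element:
    tag: str
    label: str  # visible text, placeholder, or link target

def render_elements(elements: list) -> str:
    """Number each interactive element so actions can reference it by index."""
    lines = [f"Interactive elements ({len(elements)} total):"]
    for i, el in enumerate(elements, start=1):
        lines.append(f'[{i}] {el.tag} "{el.label}"')
    return "\n".join(lines)

print(render_elements([Element("a", "More information..."),
                       Element("button", "Submit")]))
```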
Interact by Index
browser.click(1) # Click element [1]
browser.type_text(2, "hello world") # Type into element [2]
browser.scroll(direction="down", amount=2) # Scroll down 2 pages
browser.send_keys("Enter") # Press Enter
browser.select_option(4, "Option A") # Select dropdown option
Extract Content
# Page content as clean Markdown
result = browser.extract_content()
print(result.extracted_content)
# All links on the page
links = browser.extract_links()
print(links.extracted_content)
# Take a screenshot
browser.screenshot("page.png")
# Run arbitrary JavaScript
result = browser.evaluate_js("document.title")
print(result.extracted_content)
Multi-Tab Management
browser.new_tab("https://google.com") # Open new tab
tabs = browser.list_tabs() # List all tabs
browser.switch_tab(0) # Switch to first tab
browser.close_tab(1) # Close second tab
Persistent Profiles (Playwright)
Stay logged in across sessions:
browser = AutoBrowser(
headless=False,
user_data_dir="profiles/my_session",
browser_type="chromium",
)
browser.launch()
# Cookies, localStorage, history all persist to disk
browser.navigate("https://example.com")
# ... interact ...
browser.close() # Data flushed to profile directory
Supported Browsers
| Engine | Install Command |
|---|---|
| Chromium | playwright install chromium |
| Firefox | playwright install firefox |
| WebKit | playwright install webkit |
browser = AutoBrowser(browser_type="firefox")
Full AutoBrowser API
| Method | Description |
|---|---|
| `launch()` | Start the browser |
| `close()` | Close browser and clean up |
| `navigate(url)` | Go to a URL |
| `go_back()` | Navigate back in history |
| `get_state(max_elements)` | Extract interactive DOM elements with indices |
| `click(index)` | Click element by index |
| `type_text(index, text, clear)` | Type into an input element |
| `scroll(direction, amount)` | Scroll up/down by pages |
| `send_keys(keys)` | Send keyboard shortcuts |
| `select_option(index, value)` | Select dropdown option |
| `wait(seconds)` | Wait for dynamic content |
| `extract_content()` | Extract page as Markdown |
| `extract_links()` | Extract all page links |
| `screenshot(path)` | Save screenshot to file |
| `evaluate_js(expression)` | Run JavaScript in page |
| `new_tab(url)` | Open a new tab |
| `switch_tab(tab_index)` | Switch to a tab |
| `close_tab(tab_index)` | Close a tab |
| `list_tabs()` | List all open tabs |
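Because every action addresses elements by index, higher-level helpers compose naturally. A sketch of one such helper (`fill_form` is hypothetical, not part of the library; `browser` is any object exposing the methods above):

```python
def fill_form(browser, values: dict, submit_index: int) -> None:
    """Type into each indexed field, then click the submit element."""
    for index, text in values.items():
        browser.type_text(index, text, clear=True)
    browser.click(submit_index)

# e.g. fill_form(browser, {2: "jane@example.com", 3: "hunter2"}, submit_index=4)
```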
Google Search Integration
Search the web via Google Custom Search JSON API with optional page content enrichment:
from naked_web import SearchClient, NakedWebConfig
cfg = NakedWebConfig(
google_api_key="YOUR_KEY",
google_cse_id="YOUR_CSE_ID",
)
client = SearchClient(cfg)
# Basic search
resp = client.search("python web scraping", max_results=5)
for r in resp["results"]:
print(f"{r['title']} - {r['url']}")
# Search + fetch page content for each result
resp = client.search(
"python selenium scraping",
max_results=3,
include_page_content=True,
use_js_for_pages=False,
)
Each result contains: title, snippet, url, score, and optionally content, status_code, final_url.
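Under the hood, the Custom Search JSON API is a single GET endpoint. Building a request by hand looks roughly like this (a sketch of what `SearchClient` abstracts away):

```python
from urllib.parse import urlencode

def build_cse_url(api_key: str, cse_id: str, query: str, num: int = 5) -> str:
    """Construct a Google Custom Search JSON API request URL."""
    params = {"key": api_key, "cx": cse_id, "q": query, "num": num}
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

print(build_cse_url("YOUR_KEY", "YOUR_CSE_ID", "python web scraping"))
```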
Structured Content Extraction
Pull structured data from any fetched page:
from naked_web import fetch_page, extract_content, NakedWebConfig
cfg = NakedWebConfig()
snap = fetch_page("https://example.com", cfg=cfg)
bundle = extract_content(
snap,
include_meta=True,
include_headings=True,
include_paragraphs=True,
include_inline_styles=True,
include_links=True,
)
print(bundle.title)
print(bundle.meta) # List of MetaTag objects
print(bundle.headings) # List of HeadingBlock objects (level + text)
print(bundle.paragraphs) # List of paragraph strings
print(bundle.css_links) # Stylesheet URLs
print(bundle.font_links) # Font URLs
print(bundle.inline_styles) # Raw CSS from <style> tags
One-Shot: Fetch + Extract + Paginate
from naked_web import collect_page
package = collect_page(
"https://example.com",
use_js=True,
include_line_chunks=True,
include_char_chunks=True,
line_chunk_size=250,
char_chunk_size=4000,
pagination_chunk_limit=5,
)
Asset Harvesting
Every fetched page comes with a full PageAssets breakdown:
snap = fetch_page("https://example.com", cfg=cfg)
snap.assets.stylesheets # CSS file URLs
snap.assets.scripts # JS file URLs
snap.assets.images # Image URLs (including srcset)
snap.assets.media # Video/audio URLs
snap.assets.fonts # Font file URLs (.woff, .woff2, .ttf, etc.)
snap.assets.links # All anchor href URLs
Each category also has a *_details list with rich AssetContext metadata:
for img in snap.assets.image_details:
print(img.url) # Resolved absolute URL
print(img.alt) # Alt text
print(img.caption) # figcaption text (if inside <figure>)
print(img.snippet) # Raw HTML snippet of the tag
print(img.context) # Surrounding text content
print(img.position) # Source line number
print(img.attrs) # All HTML attributes as dict
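The kind of per-asset context shown above can be gathered straight from the HTML; a minimal stand-alone sketch using only the standard library (the library's own extractor is considerably richer):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageHarvester(HTMLParser):
    """Collect each <img> src (resolved to an absolute URL) plus its alt text."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            if "src" in d:
                self.images.append({
                    "url": urljoin(self.base_url, d["src"]),
                    "alt": d.get("alt", ""),
                })

parser = ImageHarvester("https://example.com/page")
parser.feed('<p><img src="/logo.png" alt="Logo"></p>')
print(parser.images)  # [{'url': 'https://example.com/logo.png', 'alt': 'Logo'}]
```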
Download Assets
from naked_web import download_assets
download_assets(snap, output_dir="./mirror/assets", cfg=cfg)
HTML Pagination
Split large HTML into manageable chunks for LLM consumption:
from naked_web import get_html_lines, get_html_chars, slice_text_lines, slice_text_chars
# Line-based pagination
chunk = get_html_lines(snap, start_line=0, num_lines=50)
print(chunk["content"])
print(chunk["has_more"]) # True if more lines exist
print(chunk["next_start"]) # Starting line for next chunk
# Character-based pagination
chunk = get_html_chars(snap, start=0, length=4000)
print(chunk["content"])
print(chunk["next_start"])
# Also works on raw text strings
chunk = slice_text_lines("your raw text here", start_line=0, num_lines=100)
chunk = slice_text_chars("your raw text here", start=0, length=5000)
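The `has_more`/`next_start` cursor contract is easy to drive in a loop. A stand-alone reimplementation of line-based slicing illustrates the protocol (the library's functions behave analogously):

```python
def slice_lines(text: str, start_line: int, num_lines: int) -> dict:
    """Return one chunk plus cursors for fetching the next one."""
    lines = text.splitlines()
    end = start_line + num_lines
    return {
        "content": "\n".join(lines[start_line:end]),
        "has_more": end < len(lines),
        "next_start": min(end, len(lines)),
    }

text = "\n".join(f"line {i}" for i in range(10))
start, chunks = 0, []
while True:
    chunk = slice_lines(text, start, 4)
    chunks.append(chunk["content"])
    if not chunk["has_more"]:
        break
    start = chunk["next_start"]
print(len(chunks))  # 3 chunks of up to 4 lines each
```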
Site Crawler
Breadth-first crawler with fine-grained controls:
from naked_web import crawl_site, NakedWebConfig
cfg = NakedWebConfig(crawl_delay_range=(1.0, 2.5))
pages = crawl_site(
"https://example.com",
cfg=cfg,
max_pages=20,
max_depth=3,
max_duration=60, # Stop after 60 seconds
same_domain_only=True,
use_js=False,
delay_range=(0.5, 1.5), # Override per-crawl delay
)
for url, snapshot in pages.items():
print(f"{url} - {snapshot.status_code} - {len(snapshot.text)} chars")
Pattern Search Across Crawled Pages
from naked_web import find_text_matches, find_asset_matches
# Search page text with regex or glob patterns
text_hits = find_text_matches(
pages,
patterns=["*privacy*", r"cookie\s+policy"],
use_regex=True,
context_chars=90,
)
# Search asset metadata
asset_hits = find_asset_matches(
pages,
patterns=["*.css", "*analytics*"],
context_chars=140,
)
for url, matches in text_hits.items():
print(f"{url}: {len(matches)} matches")
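Glob patterns can be matched against page text by translating them to regular expressions; a stand-alone sketch of the matching step, including the context window (assumed semantics; the library's exact behavior may differ):

```python
import fnmatch
import re

def find_matches(text: str, patterns: list, use_regex: bool = False,
                 context_chars: int = 30) -> list:
    """Return (pattern, surrounding-context) pairs for each hit."""
    hits = []
    for pattern in patterns:
        # Glob patterns are translated to regex; regex patterns pass through.
        regex = pattern if use_regex else fnmatch.translate(pattern)
        for m in re.finditer(regex, text, flags=re.IGNORECASE):
            lo = max(0, m.start() - context_chars)
            hi = min(len(text), m.end() + context_chars)
            hits.append((pattern, text[lo:hi]))
    return hits

text = "Read our privacy statement and cookie policy before signing up."
print(find_matches(text, [r"cookie\s+policy"], use_regex=True))
```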
Configuration
All settings live on NakedWebConfig:
from naked_web import NakedWebConfig
cfg = NakedWebConfig(
# --- Google Search ---
google_api_key="YOUR_KEY",
google_cse_id="YOUR_CSE_ID",
# --- HTTP ---
user_agent="Mozilla/5.0 ...",
request_timeout=20,
max_text_chars=20000,
respect_robots_txt=False,
# --- Assets ---
max_asset_bytes=5_000_000,
asset_context_chars=320,
# --- Selenium ---
selenium_headless=False,
selenium_window_size="1366,768",
selenium_page_load_timeout=35,
selenium_wait_timeout=15,
selenium_profile_path=None, # Path to persistent Chrome profile
# --- Humanization ---
humanize_delay_range=(1.25, 2.75),
crawl_delay_range=(1.0, 2.5),
)
| Setting | Default | Description |
|---|---|---|
| `user_agent` | Chrome 120 UA string | HTTP and Selenium user agent |
| `request_timeout` | 20 | HTTP request timeout (seconds) |
| `max_text_chars` | 20000 | Max cleaned text characters per page |
| `respect_robots_txt` | False | Check robots.txt before fetching |
| `selenium_headless` | False | Run Chrome in headless mode |
| `selenium_window_size` | 1366,768 | Browser viewport dimensions |
| `selenium_page_load_timeout` | 35 | Selenium page load timeout (seconds) |
| `selenium_wait_timeout` | 15 | Selenium element wait timeout (seconds) |
| `selenium_profile_path` | None | Persistent browser profile directory |
| `humanize_delay_range` | (1.25, 2.75) | Random delay before navigation/scroll (seconds) |
| `crawl_delay_range` | (1.0, 2.5) | Delay between crawler page fetches (seconds) |
| `asset_context_chars` | 320 | Characters of HTML context captured per asset |
| `max_asset_bytes` | 5000000 | Max size for downloaded assets (bytes) |
Scripts & Testing
# Live fetch test - verify HTTP, JS rendering, and pagination
python scripts/live_fetch_test.py https://example.com --mode both --inline-styles --output payload.json
# Smoke test - quick sanity check
python scripts/smoke_test.py
# Stealth test against bot detection
python scripts/stealth_test.py
python scripts/stealth_test.py "https://www.reddit.com/r/Python/" --no-headless
python scripts/stealth_test.py --no-mouse --no-scroll --output reddit.html
# Profile warm-up
python scripts/warmup_profile.py
python scripts/warmup_profile.py --profile profiles/reddit --duration 1800
Architecture
naked_web/
__init__.py # Public API surface
scrape.py # HTTP fetch, Selenium rendering, asset extraction
search.py # Google Custom Search client
content.py # Structured content extraction
crawler.py # BFS site crawler + pattern search
pagination.py # Line/char-based HTML pagination
core/
config.py # NakedWebConfig dataclass
models.py # Pydantic models (PageSnapshot, PageAssets, etc.)
utils/
browser.py # Selenium helpers (scroll, wait)
stealth.py # Anti-detection (CDP injection, mouse, scrolling)
text.py # Text cleaning utilities
timing.py # Delay/jitter helpers
automation/ # Playwright-based browser automation
browser.py # AutoBrowser class
actions.py # Click, type, scroll, extract, screenshot
state.py # DOM state extraction engine
models.py # ActionResult, PageState, InteractiveElement, TabInfo
Public API Reference
Core Scraping
| Export | Description |
|---|---|
| `NakedWebConfig` | Global configuration dataclass |
| `fetch_page(url, cfg, use_js, use_stealth)` | Fetch a single page (HTTP / Selenium / Stealth) |
| `download_assets(snapshot, output_dir, cfg)` | Download assets from a snapshot to disk |
| `extract_content(snapshot, ...)` | Extract structured content bundle |
| `collect_page(url, ...)` | One-shot fetch + extract + paginate |
Search
| Export | Description |
|---|---|
| `SearchClient(cfg)` | Google Custom Search with content enrichment |
Crawling
| Export | Description |
|---|---|
| `crawl_site(url, cfg, ...)` | BFS crawler with depth/duration/throttle controls |
| `find_text_matches(pages, patterns, ...)` | Regex/glob search across crawled page text |
| `find_asset_matches(pages, patterns, ...)` | Regex/glob search across asset metadata |
Pagination
| Export | Description |
|---|---|
| `get_html_lines(snapshot, start_line, num_lines)` | Line-based HTML pagination |
| `get_html_chars(snapshot, start, length)` | Character-based HTML pagination |
| `slice_text_lines(text, start_line, num_lines)` | Line-based raw text pagination |
| `slice_text_chars(text, start, length)` | Character-based raw text pagination |
Stealth (Selenium)
| Export | Description |
|---|---|
| `fetch_with_stealth(url, cfg, ...)` | Full stealth fetch with humanization |
| `setup_stealth_driver(cfg, ...)` | Create a stealth-configured Chrome driver |
| `inject_stealth_scripts(driver)` | Inject CDP anti-detection scripts |
| `random_mouse_movement(driver)` | Simulate human-like mouse movements |
| `random_scroll_pattern(driver)` | Simulate realistic scrolling behavior |
Automation (Playwright)
| Export | Description |
|---|---|
| `AutoBrowser` | Full browser automation controller |
| `BrowserActionResult` | Result model for browser actions |
| `PageState` | Page state with indexed interactive elements |
| `InteractiveElement` | Single interactive DOM element model |
| `TabInfo` | Browser tab information model |
Models
| Export | Description |
|---|---|
| `PageSnapshot` | Complete page fetch result (HTML, text, assets, metadata) |
| `PageAssets` | Categorized asset URLs with context details |
| `AssetContext` | Rich metadata for a single asset |
| `PageContentBundle` | Structured content (meta, headings, paragraphs, styles) |
| `MetaTag` | Parsed meta tag |
| `HeadingBlock` | Heading level + text |
| `LineSlice` / `CharSlice` | Pagination result models |
| `SearchResult` | Single search result entry |
Limitations & Notes
- TLS fingerprinting - Chrome's TLS signature can be identified by advanced detectors.
- Canvas/WebGL - GPU rendering patterns may differ in automated contexts.
- IP reputation - Datacenter IPs are often flagged. Consider residential proxies for heavy use.
- Selenium and Playwright are optional - Core HTTP scraping works without either engine installed.
- Google Search requires API keys - Create an API key in the Google Cloud Console and a search engine ID in the Programmable Search Engine control panel.
License
MIT