Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content — with managed sessions, TLS/user-agent control, and a single shared request pipeline.
Abstract WebTools
Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.
Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher
configuration, user‑agent rotation, retries, HTML parsing, link extraction,
crawling, headless browsers, and media downloading — behind a set of small,
composable managers. The managers share a single URL → request → soup
pipeline, so a page is fetched once and reused everywhere downstream instead
of being re‑fetched by every layer.
- Author: putkoff (Abstract Endeavors)
- Source: https://github.com/AbstractEndeavors/abstract_webtools
- Python: 3.8+
- License: MIT
Table of contents
- Why
- Install
- Quick start
- Architecture: the manager chain
- The managers
- Common recipes
- Design notes
- Testing
- Contributing
Why
Most scraping code re‑implements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.
Abstract WebTools factors each concern into a manager and threads shared
instances through the chain. Pass an existing req_mgr (or source code) into
any higher‑level manager and it is reused as‑is — no rebuild, no second network
request.
Install
pip install abstract_webtools
Optional extras:
pip install "abstract_webtools[drivers]" # selenium + webdriver-manager
pip install "abstract_webtools[media]" # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]" # PyQt/PySimpleGUI helpers
Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media
features pull in selenium / playwright / yt-dlp as needed.
Quick start
from abstract_webtools import get_soup, get_source, linkManager
# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)
# Just the raw HTML
html = get_source("https://example.com")
# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)
Architecture: the manager chain
The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:
urlManager              normalize / validate / vary URLs
└─ requestManager       sessions, retries, TLS, UA ── networkManager ┐
   └─ soupManager       BeautifulSoup parsing                        ├─ userAgentManager
      ├─ linkManager    link / image extraction                      ├─ cipherManager
      └─ crawlManager   site crawling / sitemaps                     └─ sslManager + tlsAdapter
Every layer has a matching factory function that detects and reuses an existing instance:
| Factory | Returns | Reuses when given |
|---|---|---|
| get_url_mgr(url=, url_mgr=) | urlManager | url_mgr |
| get_req_mgr(url=, url_mgr=, source_code=, req_mgr=) | requestManager | req_mgr |
| get_source(...) | HTML string | source_code / req_mgr |
| get_soup_mgr(...) | soupManager | soup_mgr / req_mgr |
| get_soup(...) | BeautifulSoup | soup / soup_mgr / source_code |
| get_crawl_mgr(...) | crawlManager | req_mgr / url_mgr |
| get_managed_session(...) | requests.Session | req_mgr |
Because every factory short‑circuits on an instance you pass in, the whole chain is built once and shared:
from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager
req = get_req_mgr("https://example.com") # fetches once
soup_mgr = get_soup_mgr(req_mgr=req) # no re-fetch
links = linkManager(req_mgr=req) # no re-fetch
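The short-circuit behavior can be illustrated with a minimal, stdlib-only sketch. `SketchRequestManager`, `get_sketch_req_mgr`, and `fake_fetch` are hypothetical stand-ins invented for this example; they are not the package's classes and only mimic the "return the instance you were given" idea described above.

```python
from typing import Callable, List, Optional


class SketchRequestManager:
    """Hypothetical stand-in: performs its single fetch at construction."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self.source_code = fetch(url)  # the only place a "network" call happens


def get_sketch_req_mgr(
    url: Optional[str] = None,
    req_mgr: Optional[SketchRequestManager] = None,
    fetch: Optional[Callable[[str], str]] = None,
) -> SketchRequestManager:
    # Short-circuit: an instance that was passed in is returned untouched.
    if req_mgr is not None:
        return req_mgr
    if url is None or fetch is None:
        raise ValueError("need either req_mgr, or url plus fetch")
    return SketchRequestManager(url, fetch)


calls: List[str] = []


def fake_fetch(url: str) -> str:
    """Records each call so the number of fetches can be checked."""
    calls.append(url)
    return "<html><title>demo</title></html>"


first = get_sketch_req_mgr("https://example.com", fetch=fake_fetch)
second = get_sketch_req_mgr(req_mgr=first)  # reused, so no second fetch
print(second is first, len(calls))
```

Counting calls through an injected fetch function is also a convenient way to assert no-refetch behavior in a test without touching the network.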
The managers
| Manager | Responsibility |
|---|---|
| urlManager | Parse, validate, normalize and generate URL variants. |
| requestManager | requests.Session with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback. |
| networkManager | Mounts the TLS adapter and wires proxies/cookies/UA into the session. |
| userAgentManager | Realistic user agents and per‑URL headers (random or pinned by OS/browser). |
| cipherManager | Cipher‑suite strings for TLS. |
| sslManager / tlsAdapter | SSL context + HTTPAdapter for fine‑grained TLS control. |
| soupManager | BeautifulSoup parsing, meta/link extraction, attribute discovery. |
| linkManager | Internal/image link extraction with desired/undesired filters. |
| crawlManager | Recursive crawling, sitemap generation, domain link discovery. |
| middleManager | UnifiedWebManager — one lazy facade over the whole chain. |
| usurpManager | Full‑site mirror: pages + assets + styles, references rewritten for offline use. |
| videoDownloader | Video/media download via yt-dlp / m3u8, wired to the managed session/UA. |
| seleneumManager / playwriteManager | Headless‑browser source fetching for JS‑rendered pages. |
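As an illustration of what "generate URL variants" can mean, here is a small sketch built only on `urllib.parse`. The function name `url_variants` and the specific variant set (http/https crossed with www/non-www) are assumptions made for this example; the variants urlManager actually produces may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def url_variants(url: str) -> list:
    """Return scheme and www/non-www variants of a URL, preferred form first."""
    # Assume https when the caller omitted the scheme.
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("https", "http"):
        for netloc in (bare, "www." + bare):
            candidate = urlunsplit(
                (scheme, netloc, parts.path or "/", parts.query, "")
            )
            if candidate not in variants:
                variants.append(candidate)
    return variants


for variant in url_variants("example.com/blog"):
    print(variant)
```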
Common recipes
Get a page's source / soup
from abstract_webtools import get_source, get_soup, get_soup_mgr
html = get_source("https://example.com")
soup = get_soup("https://example.com")
# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)
# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))
Extract links
from abstract_webtools import linkManager
lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],  # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain()) # unique domains found
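The desired/undesired filtering idea can be sketched with the standard library alone. `LinkCollector` and `filter_links` are hypothetical names for this example, and the substring-matching rule is an assumption; linkManager's own matching logic may be richer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute hrefs from <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def filter_links(links, desired=(), undesired=()):
    """Keep links matching any desired substring and no undesired one."""
    kept = []
    for link in links:
        if desired and not any(s in link for s in desired):
            continue
        if any(s in link for s in undesired):
            continue
        if link not in kept:  # de-duplicate while preserving order
            kept.append(link)
    return kept


html = (
    '<a href="/blog/post-1">one</a><a href="/about">about</a>'
    '<a href="/blog/post-1">dup</a><a href="/blog/draft">draft</a>'
)
collector = LinkCollector("https://example.com")
collector.feed(html)
result = filter_links(collector.links, desired=["/blog/"], undesired=["draft"])
print(result)
```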
Crawl a site
from abstract_webtools import get_crawl_mgr, get_domain_crawl
crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
One shared context: UnifiedWebManager
UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code,
soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.
from abstract_webtools import UnifiedWebManager
web = UnifiedWebManager("https://example.com")
web.url_mgr # built on demand
web.source_code # fetched once
web.soup # parsed once
web.link_mgr # shares the same chain — no re-fetch
web.crawl_mgr
# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
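The lazy build-and-cache behavior can be approximated with `functools.cached_property` (available from Python 3.8). The sketch below is an assumption about the general pattern, not UnifiedWebManager's actual code; `LazyWebContext`, its `fetch` parameter, and `fetch_count` are hypothetical names used only here.

```python
from functools import cached_property
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Tiny parser that captures the text inside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LazyWebContext:
    """Hypothetical facade: each attribute is computed on first access only."""

    def __init__(self, url=None, source_code=None, fetch=None):
        self.url = url
        self._given_source = source_code
        self._fetch = fetch
        self.fetch_count = 0

    @cached_property
    def source_code(self):
        if self._given_source is not None:
            return self._given_source  # zero network: HTML was supplied
        if self.url is None or self._fetch is None:
            raise ValueError("need source_code, or url plus fetch")
        self.fetch_count += 1
        return self._fetch(self.url)

    @cached_property
    def title(self):
        parser = _TitleParser()
        parser.feed(self.source_code)  # reuses the cached source
        return parser.title


web = LazyWebContext(
    "https://example.com",
    fetch=lambda url: "<html><title>Demo</title></html>",
)
print(web.title, web.title, web.fetch_count)

offline = LazyWebContext(source_code="<html><title>Local</title></html>")
print(offline.title, offline.fetch_count)
```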
A managed requests.Session
Need a plain session, but configured with a real user agent, ciphers, the TLS
adapter and proxies? Ask the stack for one — it never fetches just to build it,
and reuses an existing req_mgr's session when given:
from abstract_webtools import get_managed_session
session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
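To show what "configured without fetching" involves, here is a standard-library analogue rather than the package's requests-based implementation. It builds an SSL context with a cipher string and pins a user agent on a `urllib.request` opener; `build_opener_sketch` and the cipher string chosen are assumptions for this example.

```python
import ssl
import urllib.request


def build_opener_sketch(user_agent: str, ciphers: str = None):
    """Assemble TLS settings and headers up front; no request is sent here."""
    context = ssl.create_default_context()  # certificate verification stays on
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if ciphers:
        context.set_ciphers(ciphers)  # OpenSSL cipher-list syntax
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context)
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, context


opener, context = build_opener_sketch("MyBot/1.0", ciphers="ECDHE+AESGCM")
print(opener.addheaders)
# A later opener.open("https://example.com") would use these settings.
```

With requests, the equivalent idea is usually an `HTTPAdapter` subclass mounted on the session that hands a prepared SSL context to the pool manager.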
Mirror an entire site (usurpManager)
usurpManager saves a working offline copy of a site — pages and styles
intact. By default it recursively captures the whole site: every
same‑domain page link and all referenced media. It follows CSS url(...) /
@import (including @font-face and cross‑domain CDN fonts), handles srcset,
inline style="" and <style> blocks, downloads scripts/images/linked files,
and rewrites every reference to a relative local path so the result renders
straight from file://.
from abstract_webtools import usurpit
# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")
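The reference-rewriting step can be pictured as a URL to local-path map plus a relative-path computation. The sketch below is one plausible layout (a folder per host, `index.html` for directory URLs); `local_path_for` and `relative_ref` are hypothetical helpers, and the layout usurpManager actually writes may differ.

```python
import posixpath
from urllib.parse import urlsplit


def local_path_for(url: str) -> str:
    """Map a URL to a path inside the mirror: <host>/<path>."""
    parts = urlsplit(url)
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"  # directory-style URLs become index pages
    return posixpath.join(parts.netloc, path.lstrip("/"))


def relative_ref(from_page_url: str, to_asset_url: str, url_map: dict) -> str:
    """Return the link to write into the page so it resolves from file://."""
    for url in (from_page_url, to_asset_url):
        # One shared map: each URL gets exactly one local path.
        url_map.setdefault(url, local_path_for(url))
    start_dir = posixpath.dirname(url_map[from_page_url])
    return posixpath.relpath(url_map[to_asset_url], start=start_dir)


url_map = {}
font_ref = relative_ref(
    "https://example.com/blog/post/",
    "https://cdn.example.net/fonts/a.woff2",
    url_map,
)
css_ref = relative_ref(
    "https://example.com/",
    "https://example.com/css/site.css",
    url_map,
)
print(font_ref)
print(css_ref)
```

Keeping every lookup in the same dictionary is what lets two pages that reference the same stylesheet point at a single downloaded copy.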
Or drive it directly for more control:
from abstract_webtools import usurpManager, get_req_mgr
req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                  # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,               # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,  # pull CDN css/fonts so styles work (default)
)
summary = site.main()
- The crawl is breadth‑first and unlimited‑depth by default (max_depth=None); the visited‑set keeps it finite/loop‑free. Pass an integer max_depth to bound it.
- Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on‑origin).
- A single url → local path map keeps references consistent and shared assets are fetched exactly once.
- For heavily JS‑rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
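The interaction between breadth-first order, the visited set, and `max_depth` can be sketched over an in-memory link graph. `bfs_crawl` and the `site` dictionary are hypothetical and exist only for this illustration; a real crawl would extract links from fetched pages instead.

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=None):
    """Visit pages breadth-first; max_depth=None means no depth limit."""
    visited = {start}  # guarantees termination even when links form cycles
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # page is kept, but its links are not followed
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny site with cycles: "/" <-> "/a" and "/a" <-> "/c".
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": ["/a"],
}
full = bfs_crawl("/", lambda url: site.get(url, []))
shallow = bfs_crawl("/", lambda url: site.get(url, []), max_depth=1)
print(full)
print(shallow)
```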
Download video / media
from abstract_webtools import get_video_info, downloadvideo
info = get_video_info("https://www.youtube.com/watch?v=...") # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")
The downloader pulls its user agent (and proxy) from the shared request stack
and threads them into yt-dlp, so downloads use the same identity as the rest
of your scrape. You can inject an existing req_mgr / ua_mgr:
from abstract_webtools import VideoDownloader, get_req_mgr
req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
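Threading an identity into yt-dlp generally comes down to the options dictionary handed to it. The sketch below only builds that dictionary, so it runs without yt-dlp installed; `build_ydl_opts` is a hypothetical helper, while `outtmpl`, `http_headers`, and `proxy` are option names yt-dlp accepts. How the package's downloader assembles its own options is not shown in this README.

```python
import os


def build_ydl_opts(user_agent: str, download_directory: str, proxy: str = None) -> dict:
    """Build yt-dlp options that carry the scrape's user agent and proxy."""
    opts = {
        # Output file name template inside the target directory.
        "outtmpl": os.path.join(download_directory, "%(title)s.%(ext)s"),
        # Headers yt-dlp sends with its HTTP requests.
        "http_headers": {"User-Agent": user_agent},
    }
    if proxy:
        opts["proxy"] = proxy
    return opts


opts = build_ydl_opts("MyBot/1.0", "videos", proxy="http://127.0.0.1:8080")
print(opts)
```

With yt-dlp installed, a dictionary like this would typically be passed as `yt_dlp.YoutubeDL(opts)` before calling `download([...])`.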
Design notes
- Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
- One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused, including by usurpManager and videoDownloader.
- Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
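"Imported defensively" usually means the import is attempted only when needed and a missing package produces a helpful message rather than a crash at import time. The helpers below are a generic sketch of that pattern; `optional_import` and `require` are hypothetical names, not functions exported by the package.

```python
import importlib
import importlib.util


def optional_import(module_name: str):
    """Return the module if it is installed, otherwise None."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


def require(module_name: str, extra: str):
    """Import a module or raise an error naming the extra that provides it."""
    module = optional_import(module_name)
    if module is None:
        raise ImportError(
            f"{module_name} is not installed; "
            f'try: pip install "abstract_webtools[{extra}]"'
        )
    return module


# A stdlib module is always present; the second name is deliberately bogus.
print(optional_import("json") is not None)
print(optional_import("no_such_module_abc123"))
```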
Testing
The repo ships dependency‑light regression tests (only requests +
beautifulsoup4 required) that load the real modules under a controlled
namespace and assert the no‑refetch behavior and the site mirror:
python tests/test_manager_chain.py # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py # managed session for video + usurp
python tests/test_usurp_mirror.py # full-site mirror with styles
Contributing
Issues and PRs welcome at AbstractEndeavors/abstract_webtools.
Please keep new functionality threaded through the shared manager chain (accept
and reuse url_mgr / req_mgr / source_code) rather than re‑fetching, and add
a dependency‑light test where practical.