Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content — with managed sessions, TLS/user-agent control, and a single shared request pipeline.

Abstract WebTools

Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.

Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher configuration, user‑agent rotation, retries, HTML parsing, link extraction, crawling, headless browsers, and media downloading — behind a set of small, composable managers. The managers share a single URL → request → soup pipeline, so a page is fetched once and reused everywhere downstream instead of being re‑fetched by every layer.


Why

Most scraping code re‑implements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.

Abstract WebTools factors each concern into a manager and threads shared instances through the chain. Pass an existing req_mgr (or source code) into any higher‑level manager and it is reused as‑is — no rebuild, no second network request.


Install

pip install abstract_webtools

Optional extras:

pip install "abstract_webtools[drivers]"   # selenium + webdriver-manager
pip install "abstract_webtools[media]"     # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]"       # PyQt/PySimpleGUI helpers

Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media features pull in selenium / playwright / yt-dlp as needed.


Quick start

from abstract_webtools import get_soup, get_source, linkManager

# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)

# Just the raw HTML
html = get_source("https://example.com")

# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)

Architecture: the manager chain

The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:

urlManager        normalize / validate / vary URLs
   └─ requestManager   sessions, retries, TLS, UA  ── networkManager ┐
        └─ soupManager       BeautifulSoup parsing                   ├─ userAgentManager
             ├─ linkManager       link / image extraction            ├─ cipherManager
             └─ crawlManager       site crawling / sitemaps          └─ sslManager + tlsAdapter

Every layer has a matching factory function that detects and reuses an existing instance:

Factory                                               Returns            Reuses when given
get_url_mgr(url=, url_mgr=)                           urlManager         url_mgr
get_req_mgr(url=, url_mgr=, source_code=, req_mgr=)   requestManager     req_mgr
get_source(...)                                       HTML string        source_code / req_mgr
get_soup_mgr(...)                                     soupManager        soup_mgr / req_mgr
get_soup(...)                                         BeautifulSoup      soup / soup_mgr / source_code
get_crawl_mgr(...)                                    crawlManager       req_mgr / url_mgr
get_managed_session(...)                              requests.Session   req_mgr

Because every factory short‑circuits on an instance you pass in, the whole chain is built once and shared:

from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager

req = get_req_mgr("https://example.com")        # fetches once
soup_mgr = get_soup_mgr(req_mgr=req)            # no re-fetch
links = linkManager(req_mgr=req)                # no re-fetch
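
The short-circuit idea can be illustrated in isolation. The following is a minimal sketch, not the package's implementation: RequestManagerSketch, get_req_mgr_sketch and fake_fetch are hypothetical names invented here, and the "network" is a counting stub so the reuse is observable without real I/O.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RequestManagerSketch:
    """Hypothetical stand-in for a request manager: fetches once, keeps the HTML."""
    url: Optional[str]
    fetch: Optional[Callable[[str], str]]
    source_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Fetch only when no HTML was supplied up front.
        if self.source_code is None:
            self.source_code = self.fetch(self.url)


def get_req_mgr_sketch(url=None, req_mgr=None, source_code=None, fetch=None):
    """Return req_mgr unchanged when given; otherwise build a new manager."""
    if req_mgr is not None:
        return req_mgr  # short-circuit: no rebuild, no second request
    return RequestManagerSketch(url=url, fetch=fetch, source_code=source_code)


calls = []


def fake_fetch(url: str) -> str:
    # Records each simulated network call so reuse can be checked offline.
    calls.append(url)
    return "<html><title>demo</title></html>"


req = get_req_mgr_sketch("https://example.com", fetch=fake_fetch)  # one fetch
same = get_req_mgr_sketch(req_mgr=req)                             # reused as-is
offline = get_req_mgr_sketch(source_code="<html></html>", fetch=fake_fetch)
```

After these three calls, calls holds a single URL: the second call returned the existing instance and the third was satisfied by the supplied HTML.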

The managers

Manager                              Responsibility
urlManager                           Parse, validate, normalize and generate URL variants.
requestManager                       requests.Session with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback.
networkManager                       Mounts the TLS adapter and wires proxies/cookies/UA into the session.
userAgentManager                     Realistic user agents and per‑URL headers (random or pinned by OS/browser).
cipherManager                        Cipher‑suite strings for TLS.
sslManager / tlsAdapter              SSL context + HTTPAdapter for fine‑grained TLS control.
soupManager                          BeautifulSoup parsing, meta/link extraction, attribute discovery.
linkManager                          Internal/image link extraction with desired/undesired filters.
crawlManager                         Recursive crawling, sitemap generation, domain link discovery.
middleManager                        UnifiedWebManager: one lazy facade over the whole chain.
usurpManager                         Full‑site mirror: pages + assets + styles, references rewritten for offline use.
videoDownloader                      Video/media download via yt-dlp / m3u8, wired to the managed session/UA.
seleneumManager / playwriteManager   Headless‑browser source fetching for JS‑rendered pages.
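
To make "normalize and generate URL variants" concrete, here is a rough standard-library sketch of what such a step might look like. It is an assumption about the general technique, not urlManager's actual logic; normalize_url and url_variants are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, default_scheme: str = "https") -> str:
    """Add a scheme if missing, lowercase scheme and host, and ensure a path.

    Assumes the URL carries no user:password part (the whole netloc is lowercased).
    """
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, parts.fragment))


def url_variants(url: str) -> list[str]:
    """Generate http/https and www/non-www variants, normalized form first."""
    parts = urlsplit(normalize_url(url))
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    variants: list[str] = []
    for scheme in (parts.scheme, "https", "http"):
        for netloc in (host, bare, f"www.{bare}"):
            candidate = urlunsplit((scheme, netloc, parts.path, parts.query, ""))
            if candidate not in variants:  # keep order, drop duplicates
                variants.append(candidate)
    return variants
```

A caller could try the variants in order until one responds, which is one common reason to generate them in the first place.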

Common recipes

Get a page's source / soup

from abstract_webtools import get_source, get_soup, get_soup_mgr

html = get_source("https://example.com")
soup = get_soup("https://example.com")

# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)

# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))

Extract links

from abstract_webtools import linkManager

lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],      # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain())                  # unique domains found
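
The desired/undesired filtering can be pictured with a small standard-library sketch. This is only an illustration of substring-based link filtering under assumed semantics (keep a link if it contains any desired string and no undesired one); LinkCollector and desired_links are hypothetical names, and linkManager's real matching rules may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href/src values from <a> and <img> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def desired_links(html: str, base_url: str, desired=(), undesired=()) -> list[str]:
    """Return unique page links that match a desired substring and no undesired one."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    kept: list[str] = []
    for link in collector.links:
        if desired and not any(token in link for token in desired):
            continue  # a desired filter is set and this link does not match it
        if any(token in link for token in undesired):
            continue  # explicitly excluded
        if link not in kept:
            kept.append(link)
    return kept
```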

Crawl a site

from abstract_webtools import get_crawl_mgr, get_domain_crawl

crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
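
For readers curious what a depth-bounded domain crawl involves, the sketch below shows a breadth-first traversal with a visited set. It is an assumption-laden illustration, not crawlManager's code: crawl_domain is a hypothetical function, get_links stands in for "fetch the page and extract its links", and max_depth is interpreted here as the number of link hops followed from the start URL.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_domain(start_url, get_links, max_depth=None):
    """Breadth-first crawl of same-domain links; returns URLs in discovery order."""
    domain = urlsplit(start_url).netloc
    visited = {start_url}            # guarantees termination on cyclic link graphs
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue                 # do not expand pages at the depth limit
        for link in get_links(url):
            if urlsplit(link).netloc != domain or link in visited:
                continue             # off-domain or already seen
            visited.add(link)
            order.append(link)
            queue.append((link, depth + 1))
    return order


# In-memory stand-in for a site, so the sketch runs without network access.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}
shallow = crawl_domain("https://example.com/", lambda u: SITE.get(u, []), max_depth=1)
full = crawl_domain("https://example.com/", lambda u: SITE.get(u, []))
```

With max_depth=1 only the start page is expanded; with no limit the visited set alone keeps the crawl finite.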

One shared context: UnifiedWebManager

UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code, soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.

from abstract_webtools import UnifiedWebManager

web = UnifiedWebManager("https://example.com")
web.url_mgr      # built on demand
web.source_code  # fetched once
web.soup         # parsed once
web.link_mgr     # shares the same chain — no re-fetch
web.crawl_mgr

# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
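
The lazy build-and-cache behavior can be sketched with functools.cached_property. This is an illustrative pattern under stated assumptions, not UnifiedWebManager itself: LazyWebContext, its fetch callable and the fetch_count counter are hypothetical, and a tiny title parser stands in for BeautifulSoup.

```python
from functools import cached_property
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Minimal parser that records the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


class LazyWebContext:
    """Hypothetical facade: each attribute is computed on first access, then cached."""

    def __init__(self, url=None, source_code=None, fetch=None):
        self.url = url
        self._fetch = fetch
        self.fetch_count = 0
        if source_code is not None:
            # Pre-seed the cache so the property below never triggers a fetch.
            self.__dict__["source_code"] = source_code

    @cached_property
    def source_code(self) -> str:
        self.fetch_count += 1
        return self._fetch(self.url)

    @cached_property
    def title(self) -> str:
        parser = _TitleParser()
        parser.feed(self.source_code)  # reuses the cached HTML
        return parser.title
```

Accessing title and then source_code on the same instance triggers one fetch in total; an instance built from source_code triggers none.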

A managed requests.Session

Need a plain session, but configured with a real user agent, ciphers, the TLS adapter and proxies? Ask the stack for one — it never fetches just to build it, and reuses an existing req_mgr's session when given:

from abstract_webtools import get_managed_session

session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
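
As background on what "ciphers" and "TLS adapter" configuration amounts to, here is a dependency-free sketch using only the standard library. It deliberately uses urllib instead of requests so it runs anywhere; with requests, an SSL context like this is typically handed to an HTTPAdapter subclass mounted on the session. The function names and the cipher string are illustrative assumptions, not values taken from the package.

```python
import ssl
import urllib.request


def build_ssl_context(ciphers: str = "ECDHE+AESGCM:ECDHE+CHACHA20") -> ssl.SSLContext:
    """Create a verifying TLS context restricted to the given cipher string."""
    context = ssl.create_default_context()            # certificate checks stay on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    context.set_ciphers(ciphers)                      # OpenSSL cipher-list syntax
    return context


def build_managed_opener(user_agent: str,
                         context: ssl.SSLContext) -> urllib.request.OpenerDirector:
    """Build an opener that sends the given User-Agent over the configured TLS context."""
    handler = urllib.request.HTTPSHandler(context=context)
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]  # replaces the default UA header
    return opener


context = build_ssl_context()
opener = build_managed_opener("MyBot/1.0", context)
# opener.open("https://example.com") would now use this context and header.
```

Building the context and opener performs no network traffic, which mirrors the "never fetches just to build it" property described above.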

Mirror an entire site (usurpManager)

usurpManager saves a working offline copy of a site — pages and styles intact. By default it recursively captures the whole site: every same‑domain page link and all referenced media. It follows CSS url(...) / @import (including @font-face and cross‑domain CDN fonts), handles srcset, inline style="" and <style> blocks, downloads scripts/images/linked files, and rewrites every reference to a relative local path so the result renders straight from file://.

from abstract_webtools import usurpit

# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")

Or drive it directly for more control:

from abstract_webtools import usurpManager, get_req_mgr

req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                      # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,                   # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,      # pull CDN css/fonts so styles work (default)
)
summary = site.main()

  • The crawl is breadth‑first and unlimited‑depth by default (max_depth=None); the visited‑set keeps it finite/loop‑free. Pass an integer max_depth to bound it.
  • Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on‑origin).
  • A single url → local path map keeps references consistent and shared assets are fetched exactly once.
  • For heavily JS‑rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
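
The "single url → local path map" and the reference rewriting can be sketched for the CSS case as follows. This is a simplified illustration, not usurpManager's implementation: local_path_for, rewrite_css and the host/path directory layout are assumptions made for the example, and query strings are ignored when choosing a local path.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Matches url(...) with optional quotes; good enough for a sketch, not a CSS parser.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def local_path_for(url: str) -> str:
    """Map an absolute URL to a path under the mirror root (assumed layout: host/path)."""
    parts = urlsplit(url)
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"
    return posixpath.join(parts.netloc, path.lstrip("/"))


def rewrite_css(css: str, css_url: str, url_map: dict) -> str:
    """Rewrite url(...) references to paths relative to the stylesheet's local copy."""
    css_dir = posixpath.dirname(local_path_for(css_url))

    def replace(match: re.Match) -> str:
        raw = match.group(1).strip()
        if raw.startswith("data:"):
            return match.group(0)              # inline data needs no download
        absolute = urljoin(css_url, raw)
        # One shared map: the same asset always resolves to the same local file.
        local = url_map.setdefault(absolute, local_path_for(absolute))
        return f"url({posixpath.relpath(local, css_dir)})"

    return CSS_URL_RE.sub(replace, css)


url_map: dict = {}
css = 'body{background:url("../img/bg.png")} @font-face{src:url(https://cdn.example.net/f.woff2)}'
rewritten = rewrite_css(css, "https://example.com/static/site.css", url_map)
```

After the call, url_map lists every asset that still needs downloading, keyed by absolute URL, so a shared asset is queued once no matter how many files reference it.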

Download video / media

from abstract_webtools import get_video_info, downloadvideo

info = get_video_info("https://www.youtube.com/watch?v=...")   # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")

The downloader pulls its user agent (and proxy) from the shared request stack and threads them into yt-dlp, so downloads use the same identity as the rest of your scrape. You can inject an existing req_mgr / ua_mgr:

from abstract_webtools import VideoDownloader, get_req_mgr

req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
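
To show what threading an identity into yt-dlp can look like, the sketch below builds an options dictionary using yt-dlp's documented outtmpl, http_headers and proxy options. The helper name build_ytdlp_options and the output template are assumptions for illustration; how VideoDownloader assembles its own options is not shown in this README.

```python
import os
from typing import Optional


def build_ytdlp_options(download_directory: str,
                        user_agent: str,
                        proxy: Optional[str] = None) -> dict:
    """Assemble yt-dlp options that carry the scraper's user agent and proxy."""
    options = {
        # Save as "<title>.<ext>" inside the chosen directory.
        "outtmpl": os.path.join(download_directory, "%(title)s.%(ext)s"),
        # Headers yt-dlp sends with its own HTTP requests.
        "http_headers": {"User-Agent": user_agent},
    }
    if proxy:
        options["proxy"] = proxy
    return options


options = build_ytdlp_options("videos", "MyBot/1.0", proxy="http://127.0.0.1:8080")
# With yt-dlp installed, the dictionary would be used like this:
#   import yt_dlp
#   with yt_dlp.YoutubeDL(options) as ydl:
#       ydl.download(["https://example.com/video.mp4"])
```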

Design notes

  • Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
  • One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused — including by usurpManager and videoDownloader.
  • Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
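
The defensive-import point can be illustrated with a common pattern: try the import, remember the result, and fail with a helpful message only when the feature is actually used. This is a generic sketch, not the package's code; optional_import and require are hypothetical helpers.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, otherwise None (never raises ImportError)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Importing the core package must not fail just because an extra is missing.
selenium = optional_import("selenium")


def require(module: Optional[ModuleType], feature: str, extra: str) -> ModuleType:
    """Raise a clear error at call time if an optional dependency is absent."""
    if module is None:
        raise RuntimeError(
            f'{feature} needs an optional dependency; '
            f'install it with: pip install "abstract_webtools[{extra}]"'
        )
    return module
```

A browser-backed function would call require(selenium, "Headless browsing", "drivers") as its first step, so users without the extra see an actionable message instead of an import-time crash.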

Testing

The repo ships dependency‑light regression tests (only requests + beautifulsoup4 required) that load the real modules under a controlled namespace and assert the no‑refetch behavior and the site mirror:

python tests/test_manager_chain.py        # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py    # managed session for video + usurp
python tests/test_usurp_mirror.py         # full-site mirror with styles

Contributing

Issues and PRs welcome at AbstractEndeavors/abstract_webtools. Please keep new functionality threaded through the shared manager chain (accept and reuse url_mgr / req_mgr / source_code) rather than re‑fetching, and add a dependency‑light test where practical.


