Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content — with managed sessions, TLS/user-agent control, and a single shared request pipeline.

Abstract WebTools

Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher configuration, user‑agent rotation, retries, HTML parsing, link extraction, crawling, headless browsers, and media downloading — behind a set of small, composable managers. The managers share a single URL → request → soup pipeline, so a page is fetched once and reused everywhere downstream instead of being re‑fetched by every layer.



Why

Most scraping code re‑implements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.

Abstract WebTools factors each concern into a manager and threads shared instances through the chain. Pass an existing req_mgr (or already-fetched source_code) into any higher-level manager and it is reused as-is: no rebuild, no second network request.


Install

pip install abstract_webtools

Optional extras:

pip install "abstract_webtools[drivers]"   # selenium + webdriver-manager
pip install "abstract_webtools[media]"     # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]"       # PyQt/PySimpleGUI helpers

Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media features pull in selenium / playwright / yt-dlp as needed.


Quick start

from abstract_webtools import get_soup, get_source, linkManager

# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)

# Just the raw HTML
html = get_source("https://example.com")

# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)

Architecture: the manager chain

The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:

urlManager        normalize / validate / vary URLs
   └─ requestManager   sessions, retries, TLS, UA  ── networkManager ┐
        └─ soupManager       BeautifulSoup parsing                   ├─ userAgentManager
             ├─ linkManager       link / image extraction            ├─ cipherManager
             └─ crawlManager       site crawling / sitemaps          └─ sslManager + tlsAdapter

Every layer has a matching factory function that detects and reuses an existing instance:

Factory                                              Returns           Reuses when given
get_url_mgr(url=, url_mgr=)                          urlManager        url_mgr
get_req_mgr(url=, url_mgr=, source_code=, req_mgr=)  requestManager    req_mgr
get_source(...)                                      HTML string       source_code / req_mgr
get_soup_mgr(...)                                    soupManager       soup_mgr / req_mgr
get_soup(...)                                        BeautifulSoup     soup / soup_mgr / source_code
get_crawl_mgr(...)                                   crawlManager      req_mgr / url_mgr
get_managed_session(...)                             requests.Session  req_mgr

Because every factory short‑circuits on an instance you pass in, the whole chain is built once and shared:

from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager

req = get_req_mgr("https://example.com")        # fetches once
soup_mgr = get_soup_mgr(req_mgr=req)            # no re-fetch
links = linkManager(req_mgr=req)                # no re-fetch
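
The short-circuit behaviour described above can be pictured with a minimal, standard-library-only sketch. FakeRequestManager, get_fake_req_mgr and the fetch counter are hypothetical names invented for illustration; this is not the library's implementation, only the reuse pattern it describes.

```python
from typing import Optional


class FakeRequestManager:
    """Hypothetical stand-in for a request manager; counts simulated fetches."""

    fetch_count = 0  # class-level counter so reuse is easy to observe

    def __init__(self, url: str) -> None:
        self.url = url
        # A real manager would perform the HTTP request here.
        FakeRequestManager.fetch_count += 1
        self.source_code = f"<html><title>{url}</title></html>"


def get_fake_req_mgr(url: Optional[str] = None,
                     req_mgr: Optional[FakeRequestManager] = None) -> FakeRequestManager:
    """Return req_mgr unchanged when supplied; otherwise build one from url."""
    if req_mgr is not None:
        return req_mgr  # short-circuit: no rebuild, no second fetch
    if url is None:
        raise ValueError("either url or req_mgr is required")
    return FakeRequestManager(url)


req = get_fake_req_mgr("https://example.com")   # builds and "fetches" once
same = get_fake_req_mgr(req_mgr=req)            # reuses the instance
print(same is req, FakeRequestManager.fetch_count)  # True 1
```

Checking for a supplied instance before doing any work is what lets every downstream layer share a single fetch.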

The managers

Manager                             Responsibility
urlManager                          Parse, validate, normalize and generate URL variants.
requestManager                      requests.Session with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback.
networkManager                      Mounts the TLS adapter and wires proxies/cookies/UA into the session.
userAgentManager                    Realistic user agents and per-URL headers (random or pinned by OS/browser).
cipherManager                       Cipher-suite strings for TLS.
sslManager / tlsAdapter             SSL context + HTTPAdapter for fine-grained TLS control.
soupManager                         BeautifulSoup parsing, meta/link extraction, attribute discovery.
linkManager                         Internal/image link extraction with desired/undesired filters.
crawlManager                        Recursive crawling, sitemap generation, domain link discovery.
middleManager                       UnifiedWebManager: one lazy facade over the whole chain.
usurpManager                        Full-site mirror: pages + assets + styles, references rewritten for offline use.
videoDownloader                     Video/media download via yt-dlp / m3u8, wired to the managed session/UA.
seleneumManager / playwriteManager  Headless-browser source fetching for JS-rendered pages.
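
To make "normalize and generate URL variants" concrete, here is a small sketch using only urllib.parse. The function url_variants and its ordering rules are assumptions made for illustration; urlManager may normalize differently.

```python
from urllib.parse import urlsplit, urlunsplit


def url_variants(url: str) -> list[str]:
    """Return scheme/www variants of a URL, starting with the normalized input."""
    if "://" not in url:
        url = "https://" + url  # assumption: default to https when no scheme is given
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    path = parts.path or "/"
    # Host as given first, then its www / non-www counterpart.
    hosts = [host, bare if host != bare else "www." + bare]
    other_scheme = "http" if parts.scheme == "https" else "https"
    variants = []
    for scheme in (parts.scheme, other_scheme):
        for h in hosts:
            variants.append(urlunsplit((scheme, h, path, parts.query, "")))
    return variants


for candidate in url_variants("Example.com/docs"):
    print(candidate)
```

A caller could try the variants in order until one responds, which is one plausible use for such a list.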

Common recipes

Get a page's source / soup

from abstract_webtools import get_source, get_soup, get_soup_mgr

html = get_source("https://example.com")
soup = get_soup("https://example.com")

# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)

# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))

Extract links

from abstract_webtools import linkManager

lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],      # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain())                  # unique domains found
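
The filtering idea behind the recipe above (keep links that contain a desired substring, then list the unique domains) can be sketched with the standard library alone. LinkCollector and keep_desired are hypothetical helpers, not part of abstract_webtools, and the matching rule is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute href/src values from <a> and <img> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def keep_desired(links, desired):
    """Keep links containing any of the desired substrings (order preserved)."""
    return [link for link in links if any(d in link for d in desired)]


html = """
<a href="/blog/first-post">First</a>
<a href="/about">About</a>
<a href="https://cdn.example.net/blog/feed">Feed</a>
<img src="/static/logo.png">
"""
collector = LinkCollector("https://example.com")
collector.feed(html)

blog_links = keep_desired(collector.links, ["/blog/"])
domains = sorted({urlsplit(link).netloc for link in collector.links})
print(blog_links)
print(domains)
```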

Crawl a site

from abstract_webtools import get_crawl_mgr, get_domain_crawl

crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
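
A depth-limited crawl is commonly written as a breadth-first traversal with a visited set. The sketch below shows that general technique over a hypothetical in-memory site (the SITE dict stands in for fetching and link extraction); it is not a description of how get_domain_crawl is implemented.

```python
from collections import deque
from typing import Optional

# Hypothetical in-memory "site": page URL -> links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def crawl(start: str, max_depth: Optional[int] = None) -> list[str]:
    """Breadth-first crawl; max_depth=None means no depth limit."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the depth cap
        for link in SITE.get(url, []):
            if link not in visited:  # visited set prevents loops and repeats
                visited.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("https://example.com/", max_depth=1))
print(len(crawl("https://example.com/")))  # 5
```

The back-link from /a to / is ignored because / is already in the visited set, which is what keeps the traversal finite.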

One shared context: UnifiedWebManager

UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code, soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.

from abstract_webtools import UnifiedWebManager

web = UnifiedWebManager("https://example.com")
web.url_mgr      # built on demand
web.source_code  # fetched once
web.soup         # parsed once
web.link_mgr     # shares the same chain — no re-fetch
web.crawl_mgr

# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
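
The "built on demand, fetched once" behaviour is the classic lazy-attribute pattern. The sketch below uses functools.cached_property to show it; LazyWeb, its fetches counter and the simulated fetch are all hypothetical, and UnifiedWebManager may cache its stages by other means.

```python
from functools import cached_property
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Tiny parser that records the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LazyWeb:
    """Hypothetical facade: each stage is computed on first access, then cached."""

    def __init__(self, url: Optional[str] = None,
                 source_code: Optional[str] = None) -> None:
        self.url = url
        self._given_source = source_code
        self.fetches = 0

    @cached_property
    def source_code(self) -> str:
        if self._given_source is not None:
            return self._given_source  # caller supplied HTML: zero network
        self.fetches += 1  # a real facade would issue the HTTP request here
        return f"<html><title>fetched {self.url}</title></html>"

    @cached_property
    def title(self) -> str:
        parser = _TitleParser()
        parser.feed(self.source_code)
        return parser.title


web = LazyWeb("https://example.com")
web.title, web.title, web.source_code      # repeated access, one simulated fetch
offline = LazyWeb(source_code="<html><title>Hi</title></html>")
print(web.fetches, offline.title, offline.fetches)  # 1 Hi 0
```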

A managed requests.Session

Need a plain session, but configured with a real user agent, ciphers, the TLS adapter and proxies? Ask the stack for one. Building the session triggers no network request, and when you pass an existing req_mgr its session is reused:

from abstract_webtools import get_managed_session

session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
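
For readers unfamiliar with what "configured" involves, here is a standard-library analogue: it assembles a TLS context, an optional proxy and a user agent into a urllib opener without sending anything. It deliberately uses urllib.request rather than requests so it runs with no third-party packages; make_opener and its defaults are assumptions, not what get_managed_session does internally.

```python
import ssl
import urllib.request
from typing import Optional


def make_opener(user_agent: str,
                ciphers: str = "ECDHE+AESGCM",
                proxy: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Assemble TLS context, proxy and user agent without sending a request."""
    context = ssl.create_default_context()
    # set_ciphers restricts the suites offered for TLS 1.2 and below.
    context.set_ciphers(ciphers)
    handlers = [urllib.request.HTTPSHandler(context=context)]
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]  # replaces the default UA
    return opener


opener = make_opener("MyBot/1.0")
print(opener.addheaders)
```

Nothing touches the network until opener.open(...) is called, which mirrors the "never fetches just to build it" property.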

Mirror an entire site (usurpManager)

usurpManager saves a working offline copy of a site, with pages and styles intact. By default it recursively captures the whole site: every same-domain page link and all referenced media. It follows CSS url(...) and @import references (including @font-face sources and cross-domain CDN fonts), handles srcset, inline style="" attributes and <style> blocks, downloads scripts, images and linked files, and rewrites every reference to a relative local path so the result renders straight from file://.

from abstract_webtools import usurpit

# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")
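
Following CSS url(...) and @import references amounts to pulling the referenced paths out of a stylesheet and resolving them against the stylesheet's own URL. The regex-based sketch below illustrates that step; css_references and both patterns are assumptions for illustration (a production mirror would need a more complete CSS tokenizer), not usurpManager's code.

```python
import re
from urllib.parse import urljoin

# url(...) with optional quotes, and @import followed by a bare quoted string.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)
CSS_IMPORT = re.compile(r"""@import\s+['"]([^'"]+)['"]""", re.IGNORECASE)


def css_references(css: str, base_url: str) -> list[str]:
    """Return absolute URLs referenced by a stylesheet, skipping data: URIs."""
    found = CSS_URL.findall(css) + CSS_IMPORT.findall(css)
    refs = []
    for ref in found:
        ref = ref.strip()
        if ref.startswith("data:"):
            continue  # inline payloads need no download
        absolute = urljoin(base_url, ref)  # resolve relative to the stylesheet
        if absolute not in refs:
            refs.append(absolute)
    return refs


css = """
@import "base.css";
@import url("https://cdn.example.net/theme.css");
@font-face { font-family: X; src: url(../fonts/x.woff2) format("woff2"); }
.hero { background: url('img/hero.png'); }
"""
for ref in css_references(css, "https://example.com/static/css/site.css"):
    print(ref)
```

Resolving against the stylesheet URL rather than the page URL matters: relative paths in CSS are relative to the CSS file.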

Or drive it directly for more control:

from abstract_webtools import usurpManager, get_req_mgr

req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                      # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,                   # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,      # pull CDN css/fonts so styles work (default)
)
summary = site.main()

  • The crawl is breadth-first and has no depth limit by default (max_depth=None); a visited set prevents pages from being revisited, so the crawl is loop-free and ends once every reachable in-scope page has been seen. Pass an integer max_depth to bound it.
  • Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on‑origin).
  • A single url → local path map keeps references consistent and shared assets are fetched exactly once.
  • For heavily JS‑rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
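
The single url → local path map mentioned above can be sketched as a small registry that assigns each URL one stable file path and computes relative references between files. PathMap, its host-as-top-folder layout and the index.html convention are assumptions for illustration, not usurpManager's actual layout.

```python
import posixpath
from urllib.parse import urlsplit


class PathMap:
    """Hypothetical url -> local path registry shared by pages and assets."""

    def __init__(self) -> None:
        self._paths: dict[str, str] = {}

    def local_path(self, url: str) -> str:
        """Return the stable local path for url, assigning one on first sight."""
        if url not in self._paths:
            parts = urlsplit(url)
            path = parts.path
            if not path or path.endswith("/"):
                path += "index.html"  # directory-style URLs become index pages
            # Host is the top folder so CDN assets cannot collide with site files.
            # Query strings are ignored in this sketch.
            self._paths[url] = posixpath.join(parts.netloc, path.lstrip("/"))
        return self._paths[url]

    def relative_ref(self, from_url: str, to_url: str) -> str:
        """Path to to_url's file relative to the directory of from_url's file."""
        start = posixpath.dirname(self.local_path(from_url))
        return posixpath.relpath(self.local_path(to_url), start)


paths = PathMap()
page = "https://example.com/blog/post/"
stylesheet = "https://cdn.example.net/lib/site.css"
print(paths.local_path(page))
print(paths.relative_ref(page, stylesheet))
```

Because every lookup goes through the same dictionary, two pages that reference the same asset get the same local file, and a downloader can skip any URL that already has an entry.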

Download video / media

from abstract_webtools import get_video_info, downloadvideo

info = get_video_info("https://www.youtube.com/watch?v=...")   # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")

The downloader pulls its user agent (and proxy) from the shared request stack and threads them into yt-dlp, so downloads use the same identity as the rest of your scrape. You can inject an existing req_mgr / ua_mgr:

from abstract_webtools import VideoDownloader, get_req_mgr

req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
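
Threading an identity into yt-dlp comes down to setting a few entries in its options dict. The sketch below only builds that dict, so it runs without yt-dlp installed; build_ytdlp_options is a hypothetical helper, while http_headers, proxy and outtmpl are regular yt-dlp options. How VideoDownloader assembles its own options is not shown in this document.

```python
from typing import Optional


def build_ytdlp_options(user_agent: str,
                        download_directory: str,
                        proxy: Optional[str] = None) -> dict:
    """Build a yt-dlp options dict that carries the shared identity."""
    options = {
        "http_headers": {"User-Agent": user_agent},
        # Output template: <download_directory>/<title>.<ext>
        "outtmpl": f"{download_directory}/%(title)s.%(ext)s",
    }
    if proxy:
        options["proxy"] = proxy
    return options


opts = build_ytdlp_options("MyBot/1.0", "videos", proxy="http://127.0.0.1:8080")
print(opts)

# With yt-dlp installed, the dict could be handed over like this:
#   import yt_dlp
#   with yt_dlp.YoutubeDL(opts) as ydl:
#       ydl.download(["https://example.com/video.mp4"])
```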

Design notes

  • Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
  • One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused — including by usurpManager and videoDownloader.
  • Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
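
Defensive importing, as mentioned in the last note, usually means catching ImportError at import time and raising a helpful error only when the optional feature is used. The helpers optional_import and require below are hypothetical; the package's own guard code may look different.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if available; return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(module: Optional[ModuleType], extra: str) -> ModuleType:
    """Raise a helpful error only when the optional feature is actually used."""
    if module is None:
        raise RuntimeError(f'install with: pip install "abstract_webtools[{extra}]"')
    return module


json_mod = optional_import("json")                          # stdlib, always present
missing = optional_import("surely_not_installed_pkg_xyz")   # hypothetical name

print(json_mod is not None, missing is None)  # True True
```

The core package can then import cleanly, and a call such as require(missing, "media") fails with an actionable message at the point of use.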

Testing

The repo ships dependency‑light regression tests (only requests + beautifulsoup4 required) that load the real modules under a controlled namespace and assert the no‑refetch behavior and the site mirror:

python tests/test_manager_chain.py        # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py    # managed session for video + usurp
python tests/test_usurp_mirror.py         # full-site mirror with styles

Contributing

Issues and PRs welcome at AbstractEndeavors/abstract_webtools. Please keep new functionality threaded through the shared manager chain (accept and reuse url_mgr / req_mgr / source_code) rather than re‑fetching, and add a dependency‑light test where practical.

Source Distribution

abstract_webtools-0.1.6.429.tar.gz (125.0 kB)

Built Distribution

abstract_webtools-0.1.6.429-py3-none-any.whl (166.0 kB)
