Treat websites as programmable objects (Wikipedia-Locked Beta)

WebC – Treat Websites as Python Objects

Version: 0.1.1
Author: Ashwin Prasanth


Overview

webc is a Python library that allows you to treat websites as programmable Python objects.

Instead of manually handling HTTP requests, parsing HTML, and writing repetitive scraping logic, WebC provides a structured, object-oriented interface to access semantic content, query elements, and perform intent-driven tasks.

The goal is simple:

  • Make web data feel native to Python
  • Provide meaningful abstractions over raw HTML
  • Encourage ethical and secure usage by default

⚠️ Developer Preview / Secure Beta

WebC v0.1.1 is a developer preview release intended for testing and feedback.

This version prioritizes security, architecture stability, and controlled usage.

APIs may change during the beta phase.


Installation

Install via pip:

pip install webc

Dependencies

  • requests
  • beautifulsoup4

Core Architecture

WebC is organized into four conceptual layers.


1. Resource Layer

Access a webpage as a Resource object:

from webc import web

site = web["https://en.wikipedia.org/wiki/Python_(programming_language)"]

  • Represents a single webpage
  • Uses lazy loading (fetches HTML only when needed)
  • Caches parsed content internally
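The lazy-loading and caching behaviour described above can be illustrated with a small sketch. This is not WebC's actual implementation: the `LazyPage` class, the injected `fetch` callable, and the `fetch_count` counter are hypothetical names used only to show the pattern without touching the network.

```python
from typing import Callable, Optional


class LazyPage:
    """Hypothetical stand-in for a lazily loaded page (not WebC's real class)."""

    def __init__(self, url: str, fetch: Callable[[str], str]) -> None:
        self.url = url
        self._fetch = fetch  # injected so the sketch runs without network access
        self._html: Optional[str] = None
        self.fetch_count = 0  # counts real fetches, for illustration only

    @property
    def html(self) -> str:
        # Fetch on first access only; later reads reuse the cached copy.
        if self._html is None:
            self._html = self._fetch(self.url)
            self.fetch_count += 1
        return self._html


def fake_fetch(url: str) -> str:
    """Placeholder for an HTTP request."""
    return f"<html><title>{url}</title></html>"


page = LazyPage("https://en.wikipedia.org/wiki/Example", fake_fetch)
fetches_before_access = page.fetch_count  # nothing is fetched at construction
first = page.html
second = page.html  # served from the cache, no second fetch
```

Creating the object costs nothing; the first attribute access pays for the fetch, and every later access is free.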

2. Structure Layer

Provides semantic, high-level content extracted from the page:

site.structure.title
site.structure.links
site.structure.images
site.structure.tables

Image Handling

  • Extracts from src, srcset, data-src, and <noscript>
  • Filters UI icons and SVG assets
  • Resolves relative URLs automatically
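As a rough illustration of the steps above, the sketch below splits a `srcset` value into candidate URLs, resolves them against the page URL with the standard library's `urljoin`, and drops SVG assets. The function names are hypothetical and the `srcset` parsing is deliberately simplified (it assumes URLs contain no commas); WebC's own filtering rules may differ.

```python
from typing import List
from urllib.parse import urljoin


def candidates_from_srcset(srcset: str) -> List[str]:
    """Return the URL part of each comma-separated srcset candidate."""
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if parts:
            urls.append(parts[0])  # parts[1], if present, is a descriptor like "2x"
    return urls


def resolve_image_urls(page_url: str, raw_urls: List[str]) -> List[str]:
    """Make URLs absolute and skip SVG files (assumed here to be UI assets)."""
    resolved = []
    for raw in raw_urls:
        absolute = urljoin(page_url, raw)
        if absolute.lower().split("?")[0].endswith(".svg"):
            continue
        resolved.append(absolute)
    return resolved


page_url = "https://en.wikipedia.org/wiki/Python"
raw = candidates_from_srcset("//upload.wikimedia.org/a.png 1.5x, //upload.wikimedia.org/b.png 2x")
raw.append("/static/logo.svg")
images = resolve_image_urls(page_url, raw)
```

Protocol-relative URLs (those starting with `//`) inherit the `https` scheme of the page they were found on.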

Download images:

site.structure.save_images(folder="python_images")

Table Extraction

  • Detects Wikipedia wikitable tables
  • Handles rowspan and colspan alignment
  • Removes citation brackets (e.g., [1])
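One way to picture rowspan/colspan alignment is to expand every spanning cell into a rectangular grid, so each row ends up with the same number of columns. The sketch below does this for cells given as `(text, rowspan, colspan)` tuples and strips numeric citation markers such as `[1]`. It is an illustrative approach, not WebC's code; `expand_spans`, `clean`, and the tuple layout are assumptions, and the regex only handles numeric markers.

```python
import re
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan)

CITATION = re.compile(r"\[\d+\]")  # matches markers like [1] or [23]


def clean(text: str) -> str:
    """Remove numeric citation brackets and surrounding whitespace."""
    return CITATION.sub("", text).strip()


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Copy each spanning cell into every grid slot it covers."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            value = clean(text)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = value
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("Year[1]", 1, 1), ("Version", 1, 1)],
    [("1991", 2, 1), ("0.9", 1, 1)],  # "1991" spans two rows
    [("1.0[2]", 1, 1)],
]
table = expand_spans(rows)
```

The resulting rectangular list of rows can be handed directly to `csv.writer(...).writerows(table)`.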

Save tables as CSV:

site.structure.save_tables(folder="wiki_data")

3. Query Layer

Provides direct DOM access via CSS selectors:

headings = site.query["h1, h2"]

for h in headings:
    print(h.get_text(strip=True))

  • Returns BeautifulSoup elements
  • Useful for custom extraction logic
  • Acts as an advanced access layer

4. Task Layer

Provides intent-driven actions:

summary = site.task.summarize(max_chars=500)
print(summary)

Currently supported:

  • summarize(max_chars=500)

More tasks will be introduced in future releases.
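The description does not say how `summarize` produces its output. Purely as an illustration of what a `max_chars` limit could mean, the sketch below trims text to the last full sentence (or, failing that, the last whole word) that fits within the limit. `truncate_summary` is a hypothetical helper, not part of WebC.

```python
def truncate_summary(text: str, max_chars: int = 500) -> str:
    """Trim text to at most max_chars, preferring a sentence boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Prefer ending at the last full sentence inside the limit.
    end = clipped.rfind(". ")
    if end != -1:
        return clipped[: end + 1]
    # Otherwise fall back to the last whole word.
    return clipped.rsplit(" ", 1)[0]


sample = "Python is a language. It is widely used for scripting and data work."
short = truncate_summary(sample, max_chars=30)
```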


Security & Usage Policy

This secure beta is intentionally restricted.

Platform Restrictions

  • Locked to Wikipedia.org only
  • Only HTTPS URLs are allowed
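A minimal sketch of how these two restrictions could be checked with the standard library is shown below. `is_allowed_url` is a hypothetical helper, not WebC's API. It compares the parsed hostname rather than searching the raw string, so look-alike hosts and URLs that hide the real host behind a `user@` prefix are rejected.

```python
from urllib.parse import urlsplit


def is_allowed_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host is wikipedia.org or a subdomain of it."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


checks = {
    "https://en.wikipedia.org/wiki/Python": is_allowed_url("https://en.wikipedia.org/wiki/Python"),
    "http://en.wikipedia.org/wiki/Python": is_allowed_url("http://en.wikipedia.org/wiki/Python"),
    "https://wikipedia.org.evil.example/": is_allowed_url("https://wikipedia.org.evil.example/"),
    "https://en.wikipedia.org@evil.example/": is_allowed_url("https://en.wikipedia.org@evil.example/"),
}
```

Only the first URL passes; the others fail on the scheme or on the hostname comparison.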

Built-in Protections

WebC includes safeguards against:

  • SSRF attacks
  • Path traversal
  • Unsafe file writes
  • Excessive downloads

Requests are controlled, and content is cached to avoid fetching the same page repeatedly.
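To make the path traversal and unsafe file write protections concrete, here is one common pattern, sketched with `pathlib`: keep only the final component of a remote file name and confirm the resolved target still sits inside the output folder. `safe_target` is a hypothetical helper; how WebC enforces this internally is not documented here.

```python
from pathlib import Path


def safe_target(folder: str, filename: str) -> Path:
    """Return a write path inside folder, or raise ValueError."""
    base = Path(folder).resolve()
    # Keep only the final path component, dropping any directories in the name.
    name = Path(filename).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable file name: {filename!r}")
    target = (base / name).resolve()
    # After resolving, the target must still live under the base folder.
    if base not in target.parents:
        raise ValueError(f"path escapes {base}: {filename!r}")
    return target


target = safe_target("python_images", "../../etc/passwd")
```

A name such as `../../etc/passwd` is reduced to `passwd` inside `python_images`, while a bare `..` is rejected outright.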


Responsible Use

WebC is designed for:

  ✔ Educational purposes
  ✔ Research
  ✔ Personal automation
  ✔ Ethical data access

It must not be used for:

  • Mass scraping
  • Circumventing website policies
  • Service disruption
  • Data abuse

Users are responsible for complying with website Terms of Service.


Full Usage Example

from webc import web

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
site = web[url]

print("=== STRUCTURE ===")
print(f"Title: {site.structure.title}")
print(f"Total Links: {len(site.structure.links)}")
print(f"First 5 links: {site.structure.links[:5]}")

print("\n--- Downloading Resources ---")
site.structure.save_images(folder="python_images")
site.structure.save_tables(folder="python_data")

print("\n=== QUERY ===")
headings = site.query["h1, h2"]
print(f"Found {len(headings)} headings:")

for h in headings[:3]:
    print(f" - {h.get_text(strip=True)}")

print("\n=== TASK ===")
summary = site.task.summarize(max_chars=500)
print(summary)

Roadmap

Planned future improvements:

  • Multi-domain support
  • Advanced rate limiting
  • Enhanced security layers
  • Plugin-based task extensions
  • Dataset export helpers
  • Cloud-safe scraping mode

License

MIT License © Ashwin Prasanth
