
Treat websites as programmable objects (Wikipedia-Locked Beta)


WebC – Treat Websites as Python Objects


Version: 0.1.1
Author: Ashwin Prasanth


Overview

webc is a Python library that allows you to treat websites as programmable Python objects.

Instead of manually handling HTTP requests, parsing HTML, and writing repetitive scraping logic, WebC provides a structured, object-oriented interface to access semantic content, query elements, and perform intent-driven tasks.

The goal is simple:

  • Make web data feel native to Python
  • Provide meaningful abstractions over raw HTML
  • Encourage ethical and secure usage by default

⚠️ Developer Preview / Secure Beta

WebC v0.1.1 is a developer preview release intended for testing and feedback.

This version prioritizes security, architecture stability, and controlled usage.

APIs may change during the beta phase.


Installation

Install via pip:

pip install webc

Dependencies

  • requests
  • beautifulsoup4

Core Architecture

WebC is organized into four conceptual layers.


1. Resource Layer

Access a webpage as a Resource object:

from webc import web

site = web["https://en.wikipedia.org/wiki/Python_(programming_language)"]

  • Represents a single webpage
  • Uses lazy loading (fetches HTML only when needed)
  • Caches parsed content internally
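
The lazy-loading and caching behaviour described above can be illustrated with a minimal sketch. This is not WebC's internal code: the class name LazyPage, the html attribute, and the injected fetch callable are all hypothetical, chosen so the example runs without network access.

```python
from functools import cached_property
from typing import Callable


class LazyPage:
    """Hypothetical stand-in for a resource object; not WebC's actual class."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self._fetch = fetch  # injected so the sketch runs offline

    @cached_property
    def html(self) -> str:
        # Runs on first access only; the result is then stored on the instance.
        return self._fetch(self.url)


calls = []


def fake_fetch(url: str) -> str:
    """Records each call so we can observe when fetching really happens."""
    calls.append(url)
    return "<html><title>Example</title></html>"


page = LazyPage("https://en.wikipedia.org/wiki/Example", fake_fetch)
assert calls == []        # nothing is fetched at construction time
page.html
page.html
assert len(calls) == 1    # the second access is served from the cache
```

Constructing the object costs nothing; the fetch happens once, on first use, and later reads reuse the stored result.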

2. Structure Layer

Provides semantic, high-level content extracted from the page:

site.structure.title
site.structure.links
site.structure.images
site.structure.tables

Image Handling

  • Extracts from src, srcset, data-src, and <noscript>
  • Filters UI icons and SVG assets
  • Resolves relative URLs automatically
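
As a rough illustration of the attribute handling and URL resolution listed above, the following sketch uses only the Python standard library. The names ImageCollector and extract_image_urls are hypothetical and WebC's own implementation may differ; <noscript> handling and UI-icon filtering beyond a simple .svg check are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collects candidate image URLs from <img> tags (illustrative only)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        candidates = [attr.get("src"), attr.get("data-src")]
        # srcset is a comma-separated list of "URL descriptor" entries.
        for entry in (attr.get("srcset") or "").split(","):
            parts = entry.split()
            if parts:
                candidates.append(parts[0])
        for raw in candidates:
            if not raw:
                continue
            # urljoin resolves relative and protocol-relative URLs.
            absolute = urljoin(self.base_url, raw)
            # Skip SVG assets and anything already collected.
            if urlparse(absolute).path.lower().endswith(".svg"):
                continue
            if absolute not in self.urls:
                self.urls.append(absolute)


def extract_image_urls(html: str, base_url: str) -> list:
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

For example, with a base URL of https://en.wikipedia.org/wiki/Page, a src of /a.png resolves to https://en.wikipedia.org/a.png.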

Download images:

site.structure.save_images(folder="python_images")

Table Extraction

  • Detects Wikipedia wikitable tables
  • Handles rowspan and colspan alignment
  • Removes citation brackets (e.g., [1])
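
The span alignment and citation cleanup can be sketched as follows. This is an assumption-laden illustration rather than WebC's code: the function names and the (text, rowspan, colspan) input shape are hypothetical, and overlapping or malformed spans are not handled.

```python
import re

CITATION = re.compile(r"\[\d+\]")


def clean_cell(text: str) -> str:
    """Remove bracketed numeric citations such as [1] and trim whitespace."""
    return CITATION.sub("", text).strip()


def expand_spans(rows):
    """Expand spanned cells so every row has one value per column.

    `rows` is a list of rows; each row is a list of
    (text, rowspan, colspan) tuples in document order.
    """
    grid = []
    pending = {}  # column index -> (text, rows still to fill)
    for row in rows:
        out = []
        i = 0
        col = 0
        while i < len(row) or col in pending:
            if col in pending:
                # A cell from an earlier row still covers this column.
                text, left = pending.pop(col)
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                col += 1
                continue
            text, rowspan, colspan = row[i]
            i += 1
            text = clean_cell(text)
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


rows = [
    [("Year[1]", 2, 1), ("Release", 1, 2)],
    [("3.0", 1, 1), ("3.1[2]", 1, 1)],
]
grid = expand_spans(rows)
```

Repeating the spanned value in every covered position keeps each CSV row the same width, which is what most downstream tools expect.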

Save tables as CSV:

site.structure.save_tables(folder="wiki_data")

3. Query Layer

Provides direct DOM access via CSS selectors:

headings = site.query["h1, h2"]

for h in headings:
    print(h.get_text(strip=True))

  • Returns BeautifulSoup elements
  • Useful for custom extraction logic
  • Acts as an advanced access layer

4. Task Layer

Provides intent-driven actions:

summary = site.task.summarize(max_chars=500)
print(summary)

Currently supported:

  • summarize(max_chars=500)

More tasks will be introduced in future releases.
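
The source does not say how summarize builds its output. One plausible reading of max_chars is a length cap that prefers to stop at a sentence boundary; the standalone function below sketches that idea and is an assumption, not the library's algorithm.

```python
def summarize(text: str, max_chars: int = 500) -> str:
    """Truncate text to at most max_chars, preferring a sentence boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Prefer cutting after the last full stop inside the window.
    end = clipped.rfind(". ")
    if end != -1:
        return clipped[: end + 1]
    # Otherwise fall back to the last word boundary.
    space = clipped.rfind(" ")
    return clipped[:space] if space > 0 else clipped


sample = "Python is a language. It is widely used. It has many libraries."
short = summarize(sample, max_chars=45)
```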


Security & Usage Policy

This secure beta is intentionally restricted.

Platform Restrictions

  • Locked to Wikipedia.org only
  • Only HTTPS URLs are allowed
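
A minimal sketch of how the two restrictions above could be checked, using only the standard library. The function name is_allowed_url and the constant ALLOWED_DOMAIN are hypothetical; WebC's actual validation is not shown in this description.

```python
from urllib.parse import urlparse

ALLOWED_DOMAIN = "wikipedia.org"  # assumed allow-list entry for this sketch


def is_allowed_url(url: str) -> bool:
    """Accept only HTTPS URLs on wikipedia.org or one of its subdomains."""
    parts = urlparse(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Match the domain itself or a true subdomain, so that
    # look-alikes such as "evilwikipedia.org" are rejected.
    return host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN)
```

Comparing the parsed hostname, rather than searching the raw string, avoids being fooled by URLs like https://en.wikipedia.org@evil.com/, where the real host is evil.com.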

Built-in Protections

WebC includes safeguards against:

  • SSRF attacks
  • Path traversal
  • Unsafe file writes
  • Excessive downloads

Requests are controlled and content is cached to prevent unnecessary repeated fetching.
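
To make the path traversal and unsafe file write protections concrete, here is one common way to confine output files to a target folder. The helper name safe_target and its sanitising rules are assumptions for illustration, not WebC's implementation.

```python
import re
from pathlib import Path


def safe_target(folder: str, filename: str) -> Path:
    """Return a path inside `folder`, or raise ValueError."""
    base = Path(folder).resolve()
    # Keep only the final path component and a conservative character set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", Path(filename).name)
    if not name or name in {".", ".."}:
        raise ValueError(f"unusable filename: {filename!r}")
    target = (base / name).resolve()
    # Defence in depth: the resolved path must still sit under base.
    if base not in target.parents:
        raise ValueError(f"path escapes {base}")
    return target
```

With this helper, a hostile name such as ../../etc/passwd is reduced to passwd inside the chosen folder instead of escaping it.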


Responsible Use

WebC is designed for:

  • Educational purposes
  • Research
  • Personal automation
  • Ethical data access

It must not be used for:

  • Mass scraping
  • Circumventing website policies
  • Service disruption
  • Data abuse

Users are responsible for complying with website Terms of Service.


Full Usage Example

from webc import web

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
site = web[url]

print("=== STRUCTURE ===")
print(f"Title: {site.structure.title}")
print(f"Total Links: {len(site.structure.links)}")
print(f"First 5 links: {site.structure.links[:5]}")

print("\n--- Downloading Resources ---")
site.structure.save_images(folder="python_images")
site.structure.save_tables(folder="python_data")

print("\n=== QUERY ===")
headings = site.query["h1, h2"]
print(f"Found {len(headings)} headings:")

for h in headings[:3]:
    print(f" - {h.get_text(strip=True)}")

print("\n=== TASK ===")
summary = site.task.summarize(max_chars=500)
print(summary)

Roadmap

Planned future improvements:

  • Multi-domain support
  • Advanced rate limiting
  • Enhanced security layers
  • Plugin-based task extensions
  • Dataset export helpers
  • Cloud-safe scraping mode

License

This project is licensed under the MIT License. See the LICENSE file for the full license text.

© 2026 Ashwin Prasanth

Download files


Source Distribution

webc-0.1.2.tar.gz (8.2 kB)

Uploaded Source

Built Distribution


webc-0.1.2-py3-none-any.whl (8.2 kB)

Uploaded Python 3

File details

Details for the file webc-0.1.2.tar.gz.

File metadata

  • Download URL: webc-0.1.2.tar.gz
  • Size: 8.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for webc-0.1.2.tar.gz
Algorithm Hash digest
SHA256 39fb625f4547bf80ced5c0a2a1ae8ba6656b4971f3f22956f1edb308ccaa5ee8
MD5 c5909506497965e238542bb47068ba5d
BLAKE2b-256 f7faf0d00be3d3e1baa4d7c9de1a8ccb852f1f7c61dd8080bf31e96f96660e84


File details

Details for the file webc-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: webc-0.1.2-py3-none-any.whl
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for webc-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 69df532a06ae78b8b30f369be90c0324f8bbe2220ceabd964b9e1a860c5cf318
MD5 ad31155ade35a902af06d4d1c8cfd823
BLAKE2b-256 1f67bb74b00578361e829b870391bc6b8da7fcc16f60aa05d82d7de629d916da

