Treat websites as programmable objects (Wikipedia-Locked Beta)
WebC – Treat Websites as Python Objects
Version: 0.1.1
Author: Ashwin Prasanth
Overview
webc is a Python library that allows you to treat websites as programmable Python objects.
Instead of manually handling HTTP requests, parsing HTML, and writing repetitive scraping logic, WebC provides a structured, object-oriented interface to access semantic content, query elements, and perform intent-driven tasks.
The goal is simple:
- Make web data feel native to Python
- Provide meaningful abstractions over raw HTML
- Encourage ethical and secure usage by default
⚠️ Developer Preview / Secure Beta
WebC v0.1.1 is a developer preview release intended for testing and feedback.
This version prioritizes security, architecture stability, and controlled usage.
APIs may change during the beta phase.
Installation
Install via pip:
pip install webc
Dependencies
- requests
- beautifulsoup4
Core Architecture
WebC is organized into four conceptual layers.
1. Resource Layer
Access a webpage as a Resource object:
from webc import web
site = web["https://en.wikipedia.org/wiki/Python_(programming_language)"]
- Represents a single webpage
- Uses lazy loading (fetches HTML only when needed)
- Caches parsed content internally
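The lazy-loading and caching behaviour listed above could work roughly as follows. This is a minimal sketch under stated assumptions, not webc's actual implementation: `LazyResource`, `fake_fetch`, and `fetch_count` are hypothetical names, and the fetcher is injected so the example runs without network access.

```python
from typing import Callable, Optional


class LazyResource:
    """Hypothetical stand-in for a page object; not webc's own class."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self._fetch = fetch                # injected so the sketch runs offline
        self._html: Optional[str] = None   # nothing is fetched at construction
        self.fetch_count = 0

    @property
    def html(self) -> str:
        # Fetch on first access only; later reads reuse the cached text.
        if self._html is None:
            self._html = self._fetch(self.url)
            self.fetch_count += 1
        return self._html


def fake_fetch(url: str) -> str:
    # Placeholder for a real HTTP request.
    return f"<html><title>{url}</title></html>"


page = LazyResource("https://en.wikipedia.org/wiki/Example", fake_fetch)
created_without_fetch = page.fetch_count == 0
first = page.html    # triggers the single fetch
second = page.html   # served from the cache
```

Creating the object costs nothing; the fetch happens only when `html` is first read, and repeated reads do not trigger another request.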
2. Structure Layer
Provides semantic, high-level content extracted from the page:
site.structure.title
site.structure.links
site.structure.images
site.structure.tables
Image Handling
- Extracts from src, srcset, data-src, and <noscript>
- Filters UI icons and SVG assets
- Resolves relative URLs automatically
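To illustrate the image-handling steps above, here is a minimal sketch using only the standard library. It is an assumption about how such extraction might be done, not webc's code: `ImageCollector` is a hypothetical name, the SVG filter is a simple file-extension check, and the `srcset` parsing is simplified (it splits on commas, so URLs containing commas are not handled).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Hypothetical helper: collects absolute image URLs from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        candidates = [attr.get("src"), attr.get("data-src")]
        # srcset is a comma-separated list of "URL descriptor" pairs.
        for entry in (attr.get("srcset") or "").split(","):
            parts = entry.split()
            if parts:
                candidates.append(parts[0])
        for raw in candidates:
            if not raw:
                continue
            # urljoin resolves relative and protocol-relative URLs.
            absolute = urljoin(self.base_url, raw)
            if absolute.lower().endswith(".svg"):
                continue  # crude stand-in for "filter SVG assets"
            if absolute not in self.urls:
                self.urls.append(absolute)


sample = (
    '<img src="/img/a.png" srcset="/img/a-2x.png 2x, //cdn.example.org/a-3x.png 3x">'
    '<img src="/static/icon.svg">'
    '<img data-src="b.jpg">'
)
collector = ImageCollector("https://en.wikipedia.org/wiki/Page")
collector.feed(sample)
image_urls = collector.urls
```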
Download images:
site.structure.save_images(folder="python_images")
Table Extraction
- Detects Wikipedia wikitable tables
- Handles rowspan and colspan alignment
- Removes citation brackets (e.g., [1])
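The rowspan/colspan alignment and citation cleanup can be sketched as below. This is an illustrative approach, not necessarily the one webc uses: `expand_table` and `clean` are hypothetical helpers, and each input cell is assumed to be a `(text, rowspan, colspan)` tuple already pulled out of the HTML.

```python
import re

# Matches numeric citation markers such as [1] or [23].
CITATION = re.compile(r"\[\d+\]")


def clean(text):
    return CITATION.sub("", text).strip()


def expand_table(rows):
    """Expand spanning cells so every row has one value per column."""
    filled = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        col = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, col) in filled:
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, col + dc)] = clean(text)
            col += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("Name", 1, 1), ("Released[1]", 1, 1), ("Notes", 1, 1)],
    [("Python 2", 2, 1), ("2000", 1, 1), ("Legacy[2]", 1, 1)],
    [("2010", 1, 2)],
]
grid = expand_table(rows)
```

Repeating the spanned value in every covered slot keeps each row the same width, which is what a CSV writer needs.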
Save tables as CSV:
site.structure.save_tables(folder="wiki_data")
3. Query Layer
Provides direct DOM access via CSS selectors:
headings = site.query["h1, h2"]
for h in headings:
    print(h.get_text(strip=True))
- Returns BeautifulSoup elements
- Useful for custom extraction logic
- Acts as an advanced access layer
4. Task Layer
Provides intent-driven actions:
summary = site.task.summarize(max_chars=500)
print(summary)
Currently supported:
summarize(max_chars=500)
More tasks will be introduced in future releases.
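The documentation does not say how `summarize` produces its output. One plausible reading of `max_chars` is a truncation that prefers sentence boundaries; the sketch below shows that idea only. The standalone `summarize` function here is hypothetical and is not webc's implementation.

```python
def summarize(text: str, max_chars: int = 500) -> str:
    """Hypothetical sketch: trim text to max_chars at a sentence boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Prefer ending on the last full sentence inside the limit.
    end = max(clipped.rfind(". "), clipped.rfind("! "), clipped.rfind("? "))
    if end != -1:
        return clipped[: end + 1]
    # Otherwise fall back to the last word boundary, then to a hard cut.
    space = clipped.rfind(" ")
    return clipped[:space] if space > 0 else clipped


article = (
    "Python is a language. It is widely used. "
    "It emphasizes readability and has a large standard library."
)
short = summarize(article, max_chars=45)
```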
Security & Usage Policy
This secure beta is intentionally restricted.
Platform Restrictions
- Locked to Wikipedia.org only
- Only HTTPS URLs are allowed
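A check enforcing the two restrictions above might look like the following. It is a sketch under assumptions, not webc's validation code: `is_allowed` and `ALLOWED_DOMAIN` are hypothetical names, and subdomains such as `en.wikipedia.org` are assumed to be permitted.

```python
from urllib.parse import urlsplit

ALLOWED_DOMAIN = "wikipedia.org"  # assumption: subdomains are accepted too


def is_allowed(url: str) -> bool:
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact match or a real subdomain; "evilwikipedia.org" must not pass.
    return host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN)


checks = {
    "https://en.wikipedia.org/wiki/Python_(programming_language)": is_allowed(
        "https://en.wikipedia.org/wiki/Python_(programming_language)"
    ),
    "http://en.wikipedia.org/wiki/Python": is_allowed(
        "http://en.wikipedia.org/wiki/Python"
    ),
    "https://evilwikipedia.org/": is_allowed("https://evilwikipedia.org/"),
    "https://wikipedia.org.evil.com/": is_allowed("https://wikipedia.org.evil.com/"),
}
```

Comparing the parsed hostname, rather than searching the raw URL string, avoids false matches on look-alike domains and on URLs that embed the allowed name in the path or userinfo.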
Built-in Protections
WebC includes safeguards against:
- SSRF attacks
- Path traversal
- Unsafe file writes
- Excessive downloads
Requests are controlled and content is cached to prevent unnecessary repeated fetching.
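As an illustration of the path-traversal protection, the sketch below rejects any file name that would resolve outside the chosen output folder. It shows the general technique only: `safe_target` is a hypothetical helper, and webc's own checks are not documented here.

```python
from pathlib import Path


def safe_target(folder: str, filename: str) -> Path:
    """Return the resolved output path, or raise if it escapes the folder."""
    base = Path(folder).resolve()
    candidate = (base / filename).resolve()
    # The resolved path must sit strictly inside the base folder.
    if base not in candidate.parents:
        raise ValueError(f"unsafe file name: {filename!r}")
    return candidate


ok_path = safe_target("wiki_data", "table_1.csv")

try:
    safe_target("wiki_data", "../secret.txt")
    traversal_blocked = False
except ValueError:
    traversal_blocked = True
```

Resolving both paths before comparing them means `..` segments and absolute file names are caught by the same check.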
Responsible Use
WebC is designed for:
✔ Educational purposes
✔ Research
✔ Personal automation
✔ Ethical data access
It must not be used for:
- Mass scraping
- Circumventing website policies
- Service disruption
- Data abuse
Users are responsible for complying with website Terms of Service.
Full Usage Example
from webc import web
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
site = web[url]
print("=== STRUCTURE ===")
print(f"Title: {site.structure.title}")
print(f"Total Links: {len(site.structure.links)}")
print(f"First 5 links: {site.structure.links[:5]}")
print("\n--- Downloading Resources ---")
site.structure.save_images(folder="python_images")
site.structure.save_tables(folder="python_data")
print("\n=== QUERY ===")
headings = site.query["h1, h2"]
print(f"Found {len(headings)} headings:")
for h in headings[:3]:
    print(f" - {h.get_text(strip=True)}")
print("\n=== TASK ===")
summary = site.task.summarize(max_chars=500)
print(summary)
Roadmap
Planned future improvements:
- Multi-domain support
- Advanced rate limiting
- Enhanced security layers
- Plugin-based task extensions
- Dataset export helpers
- Cloud-safe scraping mode
License
MIT License © Ashwin Prasanth
File details
Details for the file webc-0.1.1.tar.gz.
File metadata
- Download URL: webc-0.1.1.tar.gz
- Upload date:
- Size: 6.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e2f5a829aa7eb21677bbd7744a850126ae553f12d36863a3c2f9689526d8ff6 |
| MD5 | 2198891136357686241cc7bb851c7128 |
| BLAKE2b-256 | d9c780135e99c9bf52b06f42224252e5b4dc3166507611a3b36fee31b13f7031 |
File details
Details for the file webc-0.1.1-py3-none-any.whl.
File metadata
- Download URL: webc-0.1.1-py3-none-any.whl
- Upload date:
- Size: 6.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 788901c8624bf15f4bdfc080c3c16ef51d3119d5b391237d65b2278ebe4689a9 |
| MD5 | ffde23ff56b298bb1a9b9dec4fc6526f |
| BLAKE2b-256 | ccf1cc81c130b6c572dfb2ee9b1fa50ecabf147e04797ef84624453c77402553 |