A beginner-friendly Python library for simple, chainable web scraping.
ezweb
ezweb is a Python library that makes web scraping very simple. You don't need to understand requests, BeautifulSoup, or HTML parsing; just call the method you need.
Well suited for:
- 🐣 Python beginners
- 🤖 Automation programmers
- 💬 Bot developers
- 📊 Data scrapers
from ezweb import Website
web = Website("https://example.com")
print(web.title())
print(web.text())
print(web.links())
print(web.images())
Installation
pip install ezweb
Requirements: Python 3.8+. The dependencies (requests, beautifulsoup4, lxml) are installed automatically.
Basic Usage
from ezweb import Website
web = Website("https://example.com")
web.title() # "Example Domain"
web.text() # clean text from the page
web.html() # raw HTML, or the result of clean()
web.links() # all links, as absolute URLs
web.images() # all image URLs, as absolute URLs
web.headings() # {'h1': [...], 'h2': [...], ...}
web.metadata() # description, og:*, canonical, language, etc.
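To illustrate what "absolute, unique" links means, here is a minimal sketch using only the standard library. It is not ezweb's implementation; the `LinkCollector` class and `absolute_links` function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL.

    Hypothetical helper for illustration; not part of ezweb.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative paths such as "/about" against the page URL.
        absolute = urljoin(self.base_url, href)
        # Keep only the first occurrence so the order is preserved.
        if absolute not in self.links:
            self.links.append(absolute)


def absolute_links(html, base_url):
    """Return the unique absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


sample = (
    '<a href="/about">About</a> '
    '<a href="https://www.iana.org/domains/example">More</a> '
    '<a href="/about">About again</a>'
)
print(absolute_links(sample, "https://example.com"))
```

The duplicate `/about` link appears once in the result, and the relative path is expanded using the page URL.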
Method Chaining
Use Web (an alias of Website) for a more concise style:
from ezweb import Web
data = (
    Web("https://example.com")
    .clean()
    .articles()
    .export("hasil.json")
)
clean() and articles() can be chained because they return the object itself (self). export() returns the path of the saved file, so it ends the chain.
All Methods
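The chaining pattern itself can be sketched with a small stand-in class. The `Page` class below is hypothetical and does no scraping; it only shows how returning `self` makes intermediate steps chainable while a terminal step returns a different value.

```python
class Page:
    """Hypothetical stand-in for a chainable object (not ezweb's code)."""

    def __init__(self, text):
        self.text = text
        self.steps = []

    def clean(self):
        # Collapse runs of whitespace, then return self to allow chaining.
        self.text = " ".join(self.text.split())
        self.steps.append("clean")
        return self

    def articles(self):
        # A real implementation would extract article blocks here.
        self.steps.append("articles")
        return self

    def export(self, path):
        # Terminal step: returns the path instead of self, ending the chain.
        # A real implementation would write the file here.
        self.steps.append("export")
        return path


page = Page("  hello   world ")
result = page.clean().articles().export("hasil.json")
print(result)      # the path, not a Page object
print(page.steps)
```

Because `export()` returns a string, any further method call in the chain would operate on that string rather than on the page object.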
| Method | Description | Returns |
|---|---|---|
| Website(url) | Creates the object and fetches the page immediately | Website |
| .title() | Page title (<title>, falling back to <h1>) | str |
| .text() | Clean text without HTML tags | str |
| .html() | Current HTML (reflects clean()) | str |
| .links() | All hyperlinks (absolute, unique) | List[str] |
| .images() | All image URLs (absolute, unique) | List[str] |
| .headings() | Headings h1–h6, grouped by level | Dict[str, List[str]] |
| .metadata() | Meta description, Open Graph, canonical, etc. | Dict[str, str] |
| .clean() | Removes script/style/nav/footer/ads/popups/cookie banners | Website (chainable) |
| .download_images(folder) | Downloads all images to a local folder | List[str] (local paths) |
| .export(path) | Saves the result to .json / .txt / .html | str (path) |
| .articles() (basic) | Extracts article blocks (<article> or groups of <p>) | Website (chainable) |
| .tables() (basic) | Extracts HTML tables | List[List[List[str]]] |
| .forms() (basic) | Extracts form structure | List[Dict] |
| .videos() (basic) | Extracts video/embed URLs | List[str] |
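The nested return type of tables() is easiest to read with a worked example. The sketch below assumes each table is a list of rows and each row is a list of cell strings (an interpretation of `List[List[List[str]]]`, not confirmed by the source), and writes one table to CSV text with the standard `csv` module. The sample data and the `table_to_csv` helper are made up for illustration.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Assumed shape: a list of tables -> rows -> cells (sample data, not real output).
tables = [
    [
        ["Name", "Price"],
        ["Apple", "1.20"],
        ["Pear, green", "0.90"],
    ]
]

csv_text = table_to_csv(tables[0])
print(csv_text)
```

The `csv` module quotes the cell containing a comma, which is why using it is safer than joining cells by hand.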
clean()
Removes elements that usually get in the way:
- script, style, iframe, nav, footer, header, aside, form
- Common popups and cookie banners
- Elements with a class/id such as ad, ads, advert, advertisement, banner, sponsor
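As a rough illustration of the tag-based part of this cleanup, the sketch below extracts visible text while skipping everything inside the listed tags. It uses only the standard library, assumes well-formed markup, and does not cover the class/id matching or popup detection; `VisibleText` and `visible_text` are hypothetical names, not ezweb's internals.

```python
from html.parser import HTMLParser

# Tag names taken from the list above.
REMOVED_TAGS = {"script", "style", "iframe", "nav", "footer", "header", "aside", "form"}


class VisibleText(HTMLParser):
    """Collect text that is not nested inside any removed tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many removed tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string with removed tags skipped."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


page = (
    "<header>Menu</header>"
    "<p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<footer>(c) 2024</footer>"
)
print(visible_text(page))
```

Only the paragraph text survives; the header, script, and footer content is dropped.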
export()
The format is determined automatically from the file extension:
- .json → all data (title, text, html, links, images, metadata, headings, articles)
- .txt → clean text only
- .html → HTML only
Example Output
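Extension-based dispatch of this kind can be sketched as follows. This is an assumption about the general approach, not ezweb's exporter: the `export` function, the `page` dictionary, and the choice to raise `ValueError` for unknown extensions are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def export(data, path):
    """Write data to path, choosing the format from the file extension."""
    target = Path(path)
    suffix = target.suffix.lower()
    if suffix == ".json":
        content = json.dumps(data, ensure_ascii=False, indent=2)
    elif suffix == ".txt":
        content = data["text"]
    elif suffix == ".html":
        content = data["html"]
    else:
        # Assumed behavior for this sketch; the real library may differ.
        raise ValueError(f"Unsupported export format: {suffix!r}")
    target.write_text(content, encoding="utf-8")
    return str(target)


page = {
    "title": "Example Domain",
    "text": "Example Domain",
    "html": "<h1>Example Domain</h1>",
}

# Write into a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    saved = export(page, Path(folder) / "hasil.json")
    restored = json.loads(Path(saved).read_text(encoding="utf-8"))

print(restored["title"])
```

Returning the path as a string mirrors the `str (path)` return type listed for export().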
>>> web = Website("https://example.com")
>>> web.title()
'Example Domain'
>>> web.links()
['https://example.com/about', 'https://www.iana.org/domains/example']
>>> web.metadata()
{'description': 'Example Domain page', 'language': 'en'}
Error Handling
ezweb provides custom exceptions in ezweb.exceptions:
from ezweb import Website, InvalidURLException, RequestFailedException, ParseException
try:
    web = Website("bukan-url-valid")
except InvalidURLException as e:
    print("Invalid URL:", e)
- InvalidURLException: the URL is invalid or malformed
- RequestFailedException: the request failed (timeout, DNS error, 4xx/5xx status)
- ParseException: the HTML could not be parsed
Running the Tests
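One plausible way to detect a malformed URL before making a request is to check its scheme and host with `urllib.parse`. The sketch below is an assumption about how such validation could work, not ezweb's actual check; `InvalidURLError` and `validate_url` are hypothetical stand-ins, named differently from the library's exceptions on purpose.

```python
from urllib.parse import urlparse


class InvalidURLError(ValueError):
    """Hypothetical stand-in for an invalid-URL exception (not ezweb's class)."""


def validate_url(url):
    """Return the URL unchanged if it looks like an http(s) URL, else raise."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise InvalidURLError(f"Not a valid http(s) URL: {url!r}")
    return url


for candidate in ("https://example.com", "bukan-url-valid"):
    try:
        validate_url(candidate)
        print("ok:", candidate)
    except InvalidURLError as error:
        print("rejected:", error)
```

Failing early like this separates malformed input from network failures, which is the distinction the first two exceptions above draw.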
pip install -e ".[dev]"
pytest
Roadmap
v0.1.0 (current)
- title(), text(), html(), links(), images(), metadata(), headings()
- clean(), download_images(), export()
- Basic support for articles(), tables(), forms(), videos()
v0.2.0 (planned)
- Smarter articles() (readability-style detection of the main content)
- tables() → direct export to CSV/DataFrame
- forms() with field validation
- videos() for more embed platforms
- Generic download() (not only images)
- api(): scraping mode for JSON endpoints
- cache(): caches fetch results to avoid repeated requests
- session(), cookies(), headers(): advanced request control
Contributing
Pull requests are very welcome! The project is modular (parser.py, cleaner.py, extract.py, downloader.py, exporter.py), so new features can be added without changing the public Website/Web API.
License
MIT License. See the LICENSE file.