A beginner-friendly Python library for simple, chainable web scraping.

ezweb

ezweb is a Python library that makes web scraping very simple. You do not need to understand requests, BeautifulSoup, or HTML parsing; just call the method you need.

Well suited to:

  • 🐣 Python beginners
  • 🤖 Automation programmers
  • 💬 Bot developers
  • 📊 Data scrapers

from ezweb import Website

web = Website("https://example.com")

print(web.title())
print(web.text())
print(web.links())
print(web.images())

Installation

pip install ezweb

Requires Python 3.8+. The dependencies (requests, beautifulsoup4, lxml) are installed automatically.

Basic Usage

from ezweb import Website

web = Website("https://example.com")

web.title()      # "Example Domain"
web.text()       # clean text from the page
web.html()       # raw HTML, or the cleaned HTML once clean() has run
web.links()      # all links, as absolute URLs
web.images()     # all image URLs, as absolute URLs
web.headings()   # {'h1': [...], 'h2': [...], ...}
web.metadata()   # description, og:*, canonical, language, etc.

Method Chaining

Use Web (an alias of Website) for a more concise style:

from ezweb import Web

data = (
    Web("https://example.com")
        .clean()
        .articles()
        .export("hasil.json")
)

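clean() and articles() can be chained because they return the object itself (self). export() returns the path of the saved file, so it has to come last in a chain; in the example above, data holds that path rather than the scraped content.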
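Why the chain works can be shown with a small stand-in class. This is a sketch of the general return-self pattern and not ezweb's source: the class Page, its steps list, and the hard-coded script removal are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


class Page:
    """Hypothetical stand-in for a chainable scraper object (not ezweb's class)."""

    def __init__(self, html):
        self.html = html
        self.steps = []

    def clean(self):
        # Change internal state, then return self so another call can follow.
        self.html = self.html.replace("<script>x</script>", "")
        self.steps.append("clean")
        return self

    def articles(self):
        self.steps.append("articles")
        return self

    def export(self, path):
        # Returns a str path instead of self, so it must be the last call.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"html": self.html, "steps": self.steps}, fh)
        return path


out = os.path.join(tempfile.mkdtemp(), "result.json")
saved = Page("<p>hi</p><script>x</script>").clean().articles().export(out)
```

Calling another method after export() in this sketch would fail, because the chain would then continue on a plain string.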

All Methods

| Method | Description | Return |
|---|---|---|
| Website(url) | Creates the object and fetches the page immediately | Website |
| .title() | Page title (<title>, falling back to <h1>) | str |
| .text() | Clean text without HTML tags | str |
| .html() | Current HTML (reflects clean()) | str |
| .links() | All hyperlinks (absolute, unique) | List[str] |
| .images() | All image URLs (absolute, unique) | List[str] |
| .headings() | Headings h1 to h6, grouped by level | Dict[str, List[str]] |
| .metadata() | Meta description, Open Graph, canonical, etc. | Dict[str, str] |
| .clean() | Removes script/style/nav/footer/ads/popups/cookie banners | Website (chainable) |
| .download_images(folder) | Downloads all images to a local folder | List[str] (local paths) |
| .export(path) | Saves the result to .json / .txt / .html | str (path) |
| .articles() (basic) | Extracts article blocks (<article> or a group of <p>) | Website (chainable) |
| .tables() (basic) | Extracts HTML tables | List[List[List[str]]] |
| .forms() (basic) | Extracts the structure of forms | List[Dict] |
| .videos() (basic) | Extracts video/embed URLs | List[str] |
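To make "absolute, unique" concrete, here is a minimal standard-library sketch of how relative hrefs can be resolved and de-duplicated. It is not ezweb's implementation (ezweb depends on BeautifulSoup and lxml); the names LinkCollector and absolute_unique_links are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin turns "/about" or "about" into a full URL.
            self.found.append(urljoin(self.base_url, href))


def absolute_unique_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(collector.found))


sample = (
    '<a href="/about">About</a> '
    '<a href="about">About again</a> '
    '<a href="https://www.iana.org/domains/example">More</a>'
)
links = absolute_unique_links(sample, "https://example.com/")
```

Both "/about" and "about" resolve to the same absolute URL here, so the result contains it only once.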

clean()

Removes elements that usually get in the way:

  • script, style, iframe, nav, footer, header, aside, form
  • Common popups and common cookie banners
  • Elements whose class/id looks like ad, ads, advert, advertisement, banner, sponsor
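How ezweb matches class/id values against these keywords is not documented here. One plausible approach, sketched below as an assumption, is to split the attribute value into tokens and compare whole tokens, which avoids false hits on words that merely contain "ad". The names AD_TOKENS and looks_like_ad are hypothetical.

```python
import re

# Keywords taken from the list above.
AD_TOKENS = {"ad", "ads", "advert", "advertisement", "banner", "sponsor"}


def looks_like_ad(attr_value):
    """Return True if any token of a class/id value is one of the ad keywords."""
    # Split on whitespace, underscores, and hyphens: "top-banner" -> ["top", "banner"].
    tokens = re.split(r"[\s_-]+", attr_value.lower())
    return any(token in AD_TOKENS for token in tokens)


for value in ("top-banner", "sidebar ads", "download-link", "heading"):
    print(value, looks_like_ad(value))
```

With whole-token matching, "top-banner" and "sidebar ads" match while "download-link" and "heading" do not. The trade-off is that variants such as "sponsored" are missed unless they are added to the keyword set.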

export()

The format is chosen automatically from the file extension:

  • .json → all data (title, text, html, links, images, metadata, headings, articles)
  • .txt → clean text only
  • .html → HTML only
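The extension-based dispatch described above can be sketched with pathlib. This is an illustration under stated assumptions, not ezweb's exporter: the function signature, the data dictionary, and the choice to raise ValueError for unknown extensions are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def export(data, path):
    """Write `data` in the format implied by the extension of `path`."""
    target = Path(path)
    suffix = target.suffix.lower()  # ".JSON" and ".json" are treated alike
    if suffix == ".json":
        payload = json.dumps(data, ensure_ascii=False, indent=2)
    elif suffix == ".txt":
        payload = data["text"]
    elif suffix == ".html":
        payload = data["html"]
    else:
        # Assumption: unknown extensions are rejected rather than guessed.
        raise ValueError(f"unsupported export format: {suffix!r}")
    target.write_text(payload, encoding="utf-8")
    return str(target)


page = {
    "title": "Example Domain",
    "text": "Example Domain",
    "html": "<h1>Example Domain</h1>",
}
folder = Path(tempfile.mkdtemp())
json_path = export(page, folder / "result.json")
txt_path = export(page, folder / "result.TXT")
```

Returning the path as a string mirrors the documented return type of export().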

Example Output

>>> web = Website("https://example.com")
>>> web.title()
'Example Domain'

>>> web.links()
['https://example.com/about', 'https://www.iana.org/domains/example']

>>> web.metadata()
{'description': 'Example Domain page', 'language': 'en'}

Error Handling

ezweb provides dedicated exceptions in ezweb.exceptions; the example imports them from the top-level package:

from ezweb import Website, InvalidURLException, RequestFailedException, ParseException

try:
    web = Website("not-a-valid-url")
except InvalidURLException as e:
    print("Invalid URL:", e)

  • InvalidURLException: the URL is invalid or malformed
  • RequestFailedException: the request failed (timeout, DNS error, 4xx/5xx status)
  • ParseException: the HTML could not be parsed
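What counts as an invalid URL is not spelled out above. A common check, sketched here with urllib.parse, requires an http or https scheme and a non-empty host. The class InvalidURL is a local stand-in for InvalidURLException, and validate_url is a hypothetical helper; ezweb's actual rules may differ.

```python
from urllib.parse import urlparse


class InvalidURL(ValueError):
    """Local stand-in for ezweb's InvalidURLException (hypothetical)."""


def validate_url(url):
    """Return the URL unchanged if it is an http(s) URL with a host, else raise InvalidURL."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise InvalidURL(f"not a valid http(s) URL: {url!r}")
    return url


for candidate in ("https://example.com", "not-a-valid-url", "ftp://example.com"):
    try:
        validate_url(candidate)
        print(candidate, "-> ok")
    except InvalidURL as exc:
        print(candidate, "-> rejected:", exc)
```

Validating before any network call keeps malformed input separate from request failures such as timeouts or DNS errors.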

Running the Tests

pip install -e ".[dev]"
pytest

Roadmap

v0.1.0 (current)

  • title(), text(), html(), links(), images(), metadata(), headings()
  • clean(), download_images(), export()
  • Basic support for articles(), tables(), forms(), videos()

v0.2.0 (planned)

  • Smarter articles() (readability-style detection of the main content)
  • tables() → direct export to CSV/DataFrame
  • forms() with field validation
  • videos() for more embed platforms
  • Generic download() (not just images)
  • api(): a scraping mode for JSON endpoints
  • cache(): caches fetch results to avoid repeated requests
  • session(), cookies(), headers(): advanced request control
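Direct CSV export from tables() is listed only as planned. In the meantime, the nested lists it is documented to return can be written with the standard csv module. The sketch below assumes the shape implied by List[List[List[str]]] (a list of tables, each a list of rows, each a list of cell strings); table_to_csv is a hypothetical helper and the sample data is made up.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    # The csv module quotes cells that contain commas or quotes automatically.
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Made-up data in the assumed shape: tables -> rows -> cells.
tables = [
    [["Name", "Note"], ["Alice", "likes a, b"]],
    [["Year"], ["2024"]],
]
csv_texts = [table_to_csv(table) for table in tables]
```

Each entry of csv_texts can then be written to its own file, one per table.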

Contributing

Pull requests are very welcome! The project is modular (parser.py, cleaner.py, extract.py, downloader.py, exporter.py), so new features can be added without changing the public Website/Web API.

License

MIT License. See the LICENSE file.

File details

Details for the file ezwebb-0.1.0.tar.gz.

File metadata

  • Download URL: ezwebb-0.1.0.tar.gz
  • Upload date:
  • Size: 17.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ezwebb-0.1.0.tar.gz
Algorithm Hash digest
SHA256 63a4f42a580bc2e13fdc70c2e759ac9b828ba544bb4f40361f06b093051eef6f
MD5 06adb77d9d963744ef945d759828c538
BLAKE2b-256 491142757f53df94af7a65c07c138012b1e50027a978d7cb0b0dd61695b7da5e

Provenance

The following attestation bundles were made for ezwebb-0.1.0.tar.gz:

Publisher: publish.yml on C1BENK/ezweb

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ezwebb-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ezwebb-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ezwebb-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e09b4087947063991c9df25b52cac5dc18e438247e841a3a20df9ef74eaee7b8
MD5 9a1e798133ae03161553b4b3cd473344
BLAKE2b-256 fed35efc3651746de2bb5c0f54bc4a93434aefba4a2c306b353374e7e1ed7b98

Provenance

The following attestation bundles were made for ezwebb-0.1.0-py3-none-any.whl:

Publisher: publish.yml on C1BENK/ezweb

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
