A lightweight, Pythonic web scraping toolkit built on BeautifulSoup and Requests.
SnakyScraper
SnakyScraper is a lightweight, Pythonic web scraping toolkit built on top of BeautifulSoup and Requests. It gives you a clean interface for pulling structured HTML and metadata out of any web page (titles, Open Graph tags, headings, links, images, and arbitrary DOM selectors) with predictable, JSON-friendly return values.
Fast. Accurate. Snake-style scraping.
Table of Contents
- Features
- Installation
- Quick Start
- Handling Errors
- API Reference
- Custom DOM Filtering
- Project Structure
- Development
- Changelog
- Bahasa Indonesia
Features
- Extract metadata: title, description, keywords, author, charset, canonical URL, and more
- Built-in support for Open Graph, Twitter Card, and CSRF tags
- Extract HTML structures: h1-h6, p, ul, ol, images, links
- Powerful filter() method with tag, class, and ID-based selectors
- Parse raw HTML directly with html=, no network call required
- Proper error handling: inspect .error / .status_code or opt into exceptions with raise_on_error=True
- Custom headers, timeout, and requests.Session reuse for real-world scraping
- Fully type-hinted, ships with py.typed (PEP 561) for IDE & mypy support
- Zero bare except: blocks, no silent 404/500 pass-throughs
- Powered by BeautifulSoup4 and Requests, no heavyweight dependencies
Installation
pip install snakyscraper
Optional extras:
# Faster HTML parsing via lxml
pip install "snakyscraper[lxml]"
# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"
Requires Python 3.8 or later.
Quick Start
from snakyscraper import SnakyScraper
scraper = SnakyScraper("https://example.com")
if scraper.ok():
    print(scraper.title())        # "Example Domain"
    print(scraper.description())  # meta description, or None
    print(scraper.h1())           # ["Example Domain"]
    print(scraper.open_graph())   # {"og:title": ..., "og:image": ..., ...}
else:
    print("Failed:", scraper.error)
Parsing HTML you already have (no network call)
Useful for tests, cached pages, or HTML obtained from a headless browser:
scraper = SnakyScraper(html="<title>Hello</title><h1>Hi there</h1>")
scraper.title() # "Hello"
scraper.h1() # ["Hi there"]
Custom headers, timeout, and session reuse
import requests
from snakyscraper import SnakyScraper
session = requests.Session()
scraper = SnakyScraper(
    "https://example.com",
    timeout=15,
    headers={"Accept-Language": "id-ID,en;q=0.8"},
    session=session,  # reuse connections/cookies across multiple scrapes
)
Handling Errors
By default, SnakyScraper never raises: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are captured instead of thrown, so a single bad URL in a batch job won't crash the whole run.
scraper = SnakyScraper("https://example.com/this-page-does-not-exist")
scraper.ok() # False
scraper.status_code # 404
scraper.error # HTTPStatusError("'...' returned HTTP 404.")
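The batch-job pattern described above might look like the following sketch. The names scrape_titles and make_scraper are illustrative and not part of the library; in real use you would pass SnakyScraper itself as make_scraper. The sketch relies only on ok(), title(), .error, and .status_code as documented here.

```python
from typing import Any, Callable, Dict, Iterable, List


def scrape_titles(
    urls: Iterable[str],
    make_scraper: Callable[[str], Any],
) -> List[Dict[str, Any]]:
    """Scrape each URL and record either its title or the captured failure.

    make_scraper is a hypothetical hook: pass SnakyScraper in real use,
    or a stub in tests. Only ok(), title(), .error and .status_code are used.
    """
    results: List[Dict[str, Any]] = []
    for url in urls:
        scraper = make_scraper(url)
        if scraper.ok():
            results.append({"url": url, "ok": True, "title": scraper.title()})
        else:
            # The failure is stored on the object, so the loop keeps going.
            results.append({
                "url": url,
                "ok": False,
                "status_code": scraper.status_code,
                "error": str(scraper.error),
            })
    return results
```

Taking the constructor as a parameter keeps the loop testable without network access, since a stub object with the same four members can stand in for the real scraper.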
If you'd rather fail fast (e.g. while developing), pass raise_on_error=True:
from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError
try:
    scraper = SnakyScraper("https://example.com/missing", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("That URL is malformed.")
except FetchError as e:
    print("Network problem:", e)
Exception hierarchy:
SnakyScraperError
├── InvalidURLError        # bad/missing/non-http(s) URL
├── FetchError             # network failure (timeout, DNS, connection refused, ...)
│   └── HTTPStatusError    # non-2xx response (has .status_code)
└── ParseError             # HTML could not be parsed
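Because HTTPStatusError sits under FetchError in the tree above, the order in which the classes are checked matters. The sketch below uses local stand-in classes that mirror the tree so it runs without the library installed; the real classes are imported from snakyscraper. The HTTPStatusError constructor signature and the classify helper are assumptions made for illustration.

```python
class SnakyScraperError(Exception):
    """Stand-in for the library's base error (local copy for illustration)."""


class InvalidURLError(SnakyScraperError):
    pass


class FetchError(SnakyScraperError):
    pass


class HTTPStatusError(FetchError):
    # Assumed constructor; the README only states that .status_code exists.
    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class ParseError(SnakyScraperError):
    pass


def classify(exc: SnakyScraperError) -> str:
    """Map an error to a label, checking the most specific class first."""
    if isinstance(exc, HTTPStatusError):
        return f"http-{exc.status_code}"
    if isinstance(exc, FetchError):
        return "network"
    if isinstance(exc, InvalidURLError):
        return "bad-url"
    return "other"
```

Checking FetchError before HTTPStatusError would swallow the HTTP case, which is why the try/except example lists HTTPStatusError first.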
API Reference
Status
| Method | Returns | Description |
|---|---|---|
| ok() | bool | True if the page was fetched and parsed successfully |
| .error | Exception \| None | The exception captured during construction, if any |
| .status_code | int \| None | HTTP status code of the response, if a request was made |
Page Metadata
| Method | Returns |
|---|---|
| title() | str \| None |
| description() | str \| None |
| keywords() | list[str] \| None |
| keyword_string() | str \| None |
| charset() | str \| None; reads both <meta charset> and legacy http-equiv forms |
| canonical() | str \| None |
| content_type() | str \| None |
| author() | str \| None |
| csrf_token() | str \| None; checks meta tag, then hidden input |
| image() | str \| None; shortcut for og:image |
| viewport() | list[str] \| None |
| viewport_string() | str \| None |
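To make the two charset declarations concrete, here is a small sketch that detects both with the standard library's html.parser. It illustrates the two forms only and is not the library's implementation; CharsetFinder and find_charset are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Record the first charset declared by a <meta> tag (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if "charset" in attr:
            # Modern form: <meta charset="utf-8">
            self.charset = attr["charset"].strip() or None
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Legacy form:
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            for part in attr.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip()


def find_charset(html: str) -> Optional[str]:
    """Return the declared charset, or None when the page declares none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset
```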
Open Graph & Twitter Card
scraper.open_graph() # dict of common og:* properties
scraper.open_graph("og:title") # a single property
scraper.twitter_card() # dict of common twitter:* properties
scraper.twitter_card("twitter:title") # a single property
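Since both calls return plain dicts keyed by property name, their results are easy to combine. The sketch below picks a preview title by falling back from og:title to twitter:title to the page title. pick_preview_title is a hypothetical helper, and it assumes a missing property is either absent from the dict or set to None.

```python
from typing import Mapping, Optional


def pick_preview_title(
    og: Mapping[str, Optional[str]],
    twitter: Mapping[str, Optional[str]],
    page_title: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty value among og:title, twitter:title, page title."""
    for candidate in (og.get("og:title"), twitter.get("twitter:title"), page_title):
        if candidate:
            return candidate
    return None
```

With a scraper in hand, this could be called as pick_preview_title(scraper.open_graph(), scraper.twitter_card(), scraper.title()).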
Headings & Text
scraper.h1() # list[str]
scraper.h2()
scraper.h3()
scraper.h4()
scraper.h5()
scraper.h6()
scraper.p()
Lists
scraper.ul() # flattened text of every <li> in every <ul>
scraper.ol() # flattened text of every <li> in every <ol>
Images
scraper.images() # ["/img/1.jpg", "/img/2.jpg", ...]
scraper.image_details() # [{"url": ..., "alt_text": ..., "title": ...}, ...]
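The sample output above shows site-relative paths such as "/img/1.jpg". If images() returns src values as written in the page (an assumption based on that sample), they can be resolved against the page URL with the standard library. absolutize is a hypothetical helper, not a library function.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def absolutize(page_url: str, srcs: Iterable[str]) -> List[str]:
    """Resolve each image src against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


print(absolutize("https://example.com/blog/post", ["/img/1.jpg", "thumb.png"]))
# ['https://example.com/img/1.jpg', 'https://example.com/blog/thumb.png']
```

Already-absolute URLs pass through urljoin unchanged, so the helper is safe to apply to mixed lists.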
Links
scraper.links() # list of href strings (anchors with no href are skipped)
scraper.link_details() # list of dicts: url, protocol, text, title, target, rel, is_nofollow, ...
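The dicts returned by link_details() lend themselves to simple post-filtering. The sketch below keeps external links that are not marked nofollow, using only the url and is_nofollow keys named above. The sample records are hand-written stand-ins, the exact value formats are assumptions, and external_follow_links is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List
from urllib.parse import urlparse


def external_follow_links(links: Iterable[Dict[str, Any]], site_host: str) -> List[str]:
    """Return URLs that point off-site and are not marked nofollow."""
    kept: List[str] = []
    for link in links:
        host = urlparse(link.get("url") or "").netloc
        # Relative URLs have no host and are treated as internal.
        if host and host != site_host and not link.get("is_nofollow"):
            kept.append(link["url"])
    return kept


sample = [
    {"url": "https://example.com/about", "is_nofollow": False},
    {"url": "https://other.test/page", "is_nofollow": False},
    {"url": "https://ads.test/x", "is_nofollow": True},
    {"url": "/contact", "is_nofollow": False},
]
print(external_follow_links(sample, "example.com"))
# ['https://other.test/page']
```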
Custom DOM Filtering
Use filter() to target specific elements and optionally pull nested content out of them.
Single element
scraper.filter(
    element="div",
    attributes={"id": "main"},
    multiple=False,
    extract=[".title", "#description", "p"],
)
Multiple elements
scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)
extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).
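The key naming described above can be sketched as a small function. extract_key is a hypothetical helper written for illustration, not a library function. The tag-name case (key equal to the tag name) is an assumption, since the text only spells out the class and ID prefixes.

```python
def extract_key(selector: str) -> str:
    """Return the result key for an extract selector, per the naming rule above."""
    if selector.startswith("."):
        return "class__" + selector[1:]
    if selector.startswith("#"):
        return "id__" + selector[1:]
    # Assumption: plain tag names are used as keys unchanged.
    return selector


print([extract_key(s) for s in ["h3", ".subtitle", "#meta"]])
# ['h3', 'class__subtitle', 'id__meta']
```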
Clean text instead of raw HTML
scraper.filter(
    element="p",
    attributes={"class": "dark-text"},
    multiple=True,
    return_html=False,
)
Project Structure
snakyscraper/
├── snakyscraper/
│   ├── __init__.py        # public API surface (SnakyScraper, exceptions, __version__)
│   ├── core.py            # SnakyScraper implementation
│   ├── exceptions.py      # SnakyScraperError and subclasses
│   ├── _version.py        # single source of truth for the version string
│   └── py.typed           # PEP 561 marker
├── tests/
│   ├── conftest.py        # shared fixtures (sample HTML pages)
│   ├── test_metadata.py   # title, og, twitter, charset, csrf, ...
│   ├── test_content.py    # headings, lists, images, links
│   ├── test_filter.py     # filter() DOM queries
│   └── test_fetching.py   # URL validation, HTTP mocking, error handling
├── examples/
│   └── basic_usage.py
├── pyproject.toml         # build system + project metadata + tool config
├── LICENSE
└── README.md
This split keeps the public API (__init__.py) thin, the implementation (core.py) self-contained, and error types (exceptions.py) reusable without importing the whole scraping engine, which makes the codebase easier to navigate and extend.
Development
git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"
# Run the test suite (mocked HTTP, no real network calls)
pytest
# With coverage
pytest --cov=snakyscraper --cov-report=term-missing
# Type-check
mypy snakyscraper/
# Build distributable wheel/sdist
python -m build
Contributing
Found a bug or want to request a feature? Open an issue or submit a pull request.
Changelog
See CHANGELOG.md for the full version history. Highlights for v1.1.0:
- Restructured into a proper multi-module package (core, exceptions, _version)
- Fixed: HTTP error pages (404/500) no longer silently treated as successful
- Fixed: charset() now reads legacy http-equiv charset declarations
- Fixed: link_details() no longer breaks on anchors without href
- Fixed: title() now returns a clean str instead of a NavigableString
- Added: html= kwarg to parse raw HTML with no network call
- Added: typed exception hierarchy (InvalidURLError, FetchError, HTTPStatusError, ParseError)
- Added: .error, .status_code, .ok(), raise_on_error=, custom headers= / session=
- Added: full type hints + py.typed, full pytest suite (63 tests, 92% coverage)
- Renamed: project ownership moved from riodevnet to ioodev
License
MIT License © 2025-2026 ioodev
Related Projects
- BeautifulSoup4
- Requests
- lxml
- NodeScraper (@ioodev/nodescraper), the Node.js sibling of this library
- ElephScraper, the PHP sibling of this library
Why SnakyScraper?
Think of it as your Pythonic sniper, targeting HTML content with precision and elegance.
Bahasa Indonesia
SnakyScraper is a lightweight, Pythonic web scraping toolkit built on BeautifulSoup and Requests. The library provides a clean interface for pulling structured HTML and metadata from any web page (titles, Open Graph tags, headings, links, images, and custom DOM selectors) with consistent, JSON-friendly return values.
Features
- Metadata extraction: title, description, keywords, author, charset, canonical URL, and more
- Built-in support for Open Graph, Twitter Card, and CSRF tags
- HTML structure extraction: h1-h6, p, ul, ol, images, links
- Flexible filter() method with tag, class, and ID selectors
- Can parse HTML directly via html=, with no network request needed
- Clear error handling: check .error / .status_code, or enable exceptions with raise_on_error=True
- Support for custom headers, timeout, and requests.Session reuse for real-world scraping needs
- Complete type hints, with py.typed (PEP 561) included for IDE & mypy support
- No more bare except: blocks, no more 404/500 pages slipping through unnoticed
Installation
pip install snakyscraper
Optional extras:
# Faster HTML parsing with lxml
pip install "snakyscraper[lxml]"
# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"
Requires Python 3.8 or later.
Basic Usage
from snakyscraper import SnakyScraper
scraper = SnakyScraper("https://example.com")
if scraper.ok():
    print(scraper.title())
    print(scraper.description())
    print(scraper.h1())
    print(scraper.open_graph())
else:
    print("Failed:", scraper.error)
Parsing HTML you already have (no network request)
scraper = SnakyScraper(html="<title>Halo</title><h1>Selamat datang</h1>")
scraper.title() # "Halo"
scraper.h1() # ["Selamat datang"]
Error Handling
By default, SnakyScraper never raises an exception: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are caught internally, so a single problematic URL in a batch job will not stop the whole process.
scraper = SnakyScraper("https://example.com/halaman-tidak-ada")
scraper.ok() # False
scraper.status_code # 404
scraper.error # HTTPStatusError("'...' returned HTTP 404.")
If you want errors raised as exceptions right away (for example during development), use raise_on_error=True:
from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError
try:
    scraper = SnakyScraper("https://example.com/halaman-tidak-ada", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("Invalid URL.")
except FetchError as e:
    print("Network problem:", e)
Custom DOM Filtering
scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)
extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).
Development
git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"
pytest               # run the test suite
mypy snakyscraper/   # type-check
python -m build      # build wheel/sdist
License
MIT License © 2025-2026 ioodev