A lightweight, Pythonic web scraping toolkit built on BeautifulSoup and Requests.

🐍 SnakyScraper

SnakyScraper is a lightweight, Pythonic web scraping toolkit built on top of BeautifulSoup and Requests. It gives you a clean interface for pulling structured HTML and metadata out of any web page — titles, Open Graph tags, headings, links, images, and arbitrary DOM selectors — with predictable, JSON-friendly return values.

Fast. Accurate. Snake-style scraping. 🐍🎯

Language: English | Indonesian

🚀 Features

✅ Extract metadata: title, description, keywords, author, charset, canonical URL, and more
✅ Built-in support for Open Graph, Twitter Card, and CSRF tags
✅ Extract HTML structures: h1–h6, p, ul, ol, images, links
✅ Powerful filter() method with tag, class, and ID-based selectors
✅ Parse raw HTML directly with html= — no network call required
✅ Proper error handling: inspect .error / .status_code or opt into exceptions with raise_on_error=True
✅ Custom headers, timeout, and requests.Session reuse for real-world scraping
✅ Fully type-hinted, ships with py.typed (PEP 561) for IDE & mypy support
✅ Zero bare except: blocks, no silent 404/500 pass-throughs
✅ Powered by BeautifulSoup4 and Requests — no heavyweight dependencies

📦 Installation

pip install snakyscraper

Optional extras:

# Faster HTML parsing via lxml
pip install "snakyscraper[lxml]"

# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"

Requires Python 3.8 or later.

🛠️ Quick Start

from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())          # "Example Domain"
    print(scraper.description())    # meta description, or None
    print(scraper.h1())             # ["Example Domain"]
    print(scraper.open_graph())     # {"og:title": ..., "og:image": ..., ...}
else:
    print("Failed:", scraper.error)

Parsing HTML you already have (no network call)

Useful for tests, cached pages, or HTML obtained from a headless browser:

scraper = SnakyScraper(html="<title>Hello</title><h1>Hi there</h1>")
scraper.title()  # "Hello"
scraper.h1()     # ["Hi there"]

Custom headers, timeout, and session reuse

import requests
from snakyscraper import SnakyScraper

session = requests.Session()
scraper = SnakyScraper(
    "https://example.com",
    timeout=15,
    headers={"Accept-Language": "id-ID,en;q=0.8"},
    session=session,  # reuse connections/cookies across multiple scrapes
)

⚠️ Handling Errors

By default, SnakyScraper never raises — failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are captured instead of thrown, so a single bad URL in a batch job won't crash the whole run.

scraper = SnakyScraper("https://example.com/this-page-does-not-exist")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")
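
The batch-job pattern described above can be sketched as a loop that checks `ok()` and sets failures aside rather than wrapping every call in `try`/`except`. This is only a sketch: `scrape_titles`, `make_scraper`, and `StubScraper` are hypothetical names introduced here so the example runs without network access. In real use you would pass `SnakyScraper` itself as the factory.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def scrape_titles(
    urls: Iterable[str],
    make_scraper: Callable[[str], Any],
) -> Tuple[Dict[str, Optional[str]], List[Tuple[str, Any]]]:
    """Return titles for pages that loaded, plus (url, error) pairs for the rest."""
    titles: Dict[str, Optional[str]] = {}
    failures: List[Tuple[str, Any]] = []
    for url in urls:
        scraper = make_scraper(url)  # in real use: SnakyScraper(url)
        if scraper.ok():
            titles[url] = scraper.title()
        else:
            # Nothing was raised; the failure is stored on the instance.
            failures.append((url, scraper.error))
    return titles, failures


class StubScraper:
    """Hypothetical stand-in that mimics ok(), title() and .error."""

    def __init__(self, url: str) -> None:
        self.error = None if url.endswith("/ok") else RuntimeError("HTTP 404")

    def ok(self) -> bool:
        return self.error is None

    def title(self) -> str:
        return "Example Domain"


titles, failures = scrape_titles(
    ["https://example.com/ok", "https://example.com/missing"], StubScraper
)
print(titles)
print([url for url, _ in failures])
```

Passing the scraper class in as a factory keeps the loop testable offline, which matches how the project's own test suite avoids real network calls.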

If you'd rather fail fast (e.g. while developing), pass raise_on_error=True:

from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/missing", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("That URL is malformed.")
except FetchError as e:
    print("Network problem:", e)

Exception hierarchy:

SnakyScraperError
├── InvalidURLError    # bad/missing/non-http(s) URL
├── FetchError          # network failure (timeout, DNS, connection refused, ...)
│   └── HTTPStatusError # non-2xx response (has .status_code)
└── ParseError          # HTML could not be parsed
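
Because `HTTPStatusError` sits under `FetchError` in the tree above, the order of `except` clauses matters: the more specific class has to come first. The sketch below uses local stand-in classes that mirror the documented hierarchy. Their constructors are an assumption and may not match the library's own, and `describe` is a hypothetical helper.

```python
class SnakyScraperError(Exception):
    """Stand-in for the library's base error."""


class InvalidURLError(SnakyScraperError):
    """Stand-in: bad, missing, or non-http(s) URL."""


class FetchError(SnakyScraperError):
    """Stand-in: network failure."""


class HTTPStatusError(FetchError):
    """Stand-in: non-2xx response (constructor signature is assumed)."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class ParseError(SnakyScraperError):
    """Stand-in: HTML could not be parsed."""


def describe(exc: SnakyScraperError) -> str:
    """Classify an error, checking the most specific class first."""
    try:
        raise exc
    except HTTPStatusError as e:  # must precede FetchError, its parent
        return f"http {e.status_code}"
    except FetchError:
        return "network"
    except SnakyScraperError:  # catch-all for the rest of the tree
        return "other"


print(describe(HTTPStatusError("not found", 404)))
print(describe(FetchError("timed out")))
print(describe(ParseError("bad markup")))
```

If the `FetchError` clause came first, it would also swallow every `HTTPStatusError`, and the status code branch would never run.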

📖 API Reference

🔹 Status

| Method | Returns | Description |
| --- | --- | --- |
| `ok()` | `bool` | `True` if the page was fetched and parsed successfully |
| `.error` | `Exception \| None` | The exception captured during construction, if any |
| `.status_code` | `int \| None` | HTTP status code of the response, if a request was made |

🔹 Page Metadata

| Method | Returns | Notes |
| --- | --- | --- |
| `title()` | `str \| None` | |
| `description()` | `str \| None` | |
| `keywords()` | `list[str] \| None` | |
| `keyword_string()` | `str \| None` | |
| `charset()` | `str \| None` | reads both `<meta charset>` and legacy `http-equiv` forms |
| `canonical()` | `str \| None` | |
| `content_type()` | `str \| None` | |
| `author()` | `str \| None` | |
| `csrf_token()` | `str \| None` | checks meta tag, then hidden input |
| `image()` | `str \| None` | shortcut for `og:image` |
| `viewport()` | `list[str] \| None` | |
| `viewport_string()` | `str \| None` | |
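
To illustrate the two charset forms that `charset()` is said to read, here is a standard-library sketch that recognises both `<meta charset="...">` and the legacy `<meta http-equiv="Content-Type" content="...; charset=...">` declaration. It is not the library's implementation (which is built on BeautifulSoup). `CharsetFinder` and `find_charset` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # Modern form: <meta charset="utf-8">
            self.charset = attr["charset"].strip()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Legacy form: content="text/html; charset=ISO-8859-1"
            for part in attr.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip()


def find_charset(html: str) -> Optional[str]:
    finder = CharsetFinder()
    finder.feed(html)
    finder.close()
    return finder.charset


print(find_charset('<meta charset="utf-8">'))
print(find_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
))
```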

🔹 Open Graph & Twitter Card

scraper.open_graph()              # dict of common og:* properties
scraper.open_graph("og:title")    # a single property

scraper.twitter_card()                  # dict of common twitter:* properties
scraper.twitter_card("twitter:title")   # a single property

🔹 Headings & Text

scraper.h1()  # list[str]
scraper.h2()
scraper.h3()
scraper.h4()
scraper.h5()
scraper.h6()
scraper.p()

🔹 Lists

scraper.ul()  # flattened text of every <li> in every <ul>
scraper.ol()  # flattened text of every <li> in every <ol>

🔹 Images

scraper.images()         # ["/img/1.jpg", "/img/2.jpg", ...]
scraper.image_details()  # [{"url": ..., "alt_text": ..., "title": ...}, ...]
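
The sample output above shows site-relative paths such as `/img/1.jpg`. The README does not say whether the library resolves them, so, assuming `images()` returns `src` values as written in the page, a caller can make them absolute with the standard library's `urljoin`. `absolutize` is a hypothetical helper name.

```python
from typing import List
from urllib.parse import urljoin


def absolutize(page_url: str, paths: List[str]) -> List[str]:
    """Resolve each path against the URL of the page it was scraped from."""
    return [urljoin(page_url, path) for path in paths]


resolved = absolutize(
    "https://example.com/blog/post",
    ["/img/1.jpg", "img/2.jpg", "https://cdn.example.net/3.jpg"],
)
print(resolved)
```

Root-relative paths resolve against the host, document-relative paths against the page's directory, and already-absolute URLs pass through unchanged.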

🔹 Links

scraper.links()         # list of href strings (anchors with no href are skipped)
scraper.link_details()  # list of dicts: url, protocol, text, title, target, rel, is_nofollow, ...

🔍 Custom DOM Filtering

Use filter() to target specific elements and optionally pull nested content out of them.

▸ Single element

scraper.filter(
    element="div",
    attributes={"id": "main"},
    multiple=False,
    extract=[".title", "#description", "p"],
)

▸ Multiple elements

scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)

extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).
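
The naming rule above can be sketched as a small function, which may help when reading values back out of a `filter()` result. `extract_key` is a hypothetical helper, not part of the library. The class and ID mappings come from the rule above, while using the bare tag name as the key for tag selectors is an assumption.

```python
def extract_key(selector: str) -> str:
    """Map an extract selector to the result key described above."""
    if selector.startswith("."):
        return "class__" + selector[1:]  # .title -> class__title
    if selector.startswith("#"):
        return "id__" + selector[1:]  # #meta -> id__meta
    return selector  # assumption: a tag selector is keyed by its own name


for selector in ["h3", ".subtitle", "#meta"]:
    print(selector, "->", extract_key(selector))
```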

▸ Clean text instead of raw HTML

scraper.filter(
    element="p",
    attributes={"class": "dark-text"},
    multiple=True,
    return_html=False,
)
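
To make the difference between raw HTML and clean text concrete, the sketch below strips the tags from a fragment using only the standard library. It illustrates the distinction and is not how the library produces its output. The exact whitespace handling of `return_html=False` is not documented here, so the normalisation shown is an assumption, and `TextOnly` and `clean_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class TextOnly(HTMLParser):
    """Collect text nodes and ignore all markup."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def clean_text(fragment: str) -> str:
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace into single spaces (an assumed convention).
    return " ".join("".join(parser.parts).split())


raw = '<p class="dark-text">Hello <b>world</b></p>'
print(raw)              # the raw HTML form
print(clean_text(raw))  # the text-only form
```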

🗂 Project Structure

snakyscraper/
├── snakyscraper/
│   ├── __init__.py       # public API surface (SnakyScraper, exceptions, __version__)
│   ├── core.py           # SnakyScraper implementation
│   ├── exceptions.py     # SnakyScraperError and subclasses
│   ├── _version.py       # single source of truth for the version string
│   └── py.typed          # PEP 561 marker
├── tests/
│   ├── conftest.py       # shared fixtures (sample HTML pages)
│   ├── test_metadata.py  # title, og, twitter, charset, csrf, ...
│   ├── test_content.py   # headings, lists, images, links
│   ├── test_filter.py    # filter() DOM queries
│   └── test_fetching.py  # URL validation, HTTP mocking, error handling
├── examples/
│   └── basic_usage.py
├── pyproject.toml        # build system + project metadata + tool config
├── LICENSE
└── README.md

This split keeps the public API (__init__.py) thin, the implementation (core.py) self-contained, and error types (exceptions.py) reusable without importing the whole scraping engine — making the codebase easier to navigate and extend.

🧑‍💻 Development

git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

# Run the test suite (mocked HTTP, no real network calls)
pytest

# With coverage
pytest --cov=snakyscraper --cov-report=term-missing

# Type-check
mypy snakyscraper/

# Build distributable wheel/sdist
python -m build

Contributing

Found a bug or want to request a feature? Open an issue or submit a pull request.

📝 Changelog

See CHANGELOG.md for the full version history. Highlights for v1.1.0:

Restructured into a proper multi-module package (core, exceptions, _version)
Fixed: HTTP error pages (404/500) no longer silently treated as successful
Fixed: charset() now reads legacy http-equiv charset declarations
Fixed: link_details() no longer breaks on anchors without href
Fixed: title() now returns a clean str instead of a NavigableString
Added: html= kwarg to parse raw HTML with no network call
Added: typed exception hierarchy (InvalidURLError, FetchError, HTTPStatusError, ParseError)
Added: .error, .status_code, .ok(), raise_on_error=, custom headers=/session=
Added: full type hints + py.typed, full pytest suite (63 tests, 92% coverage)
Renamed: project ownership moved from riodevnet to ioodev

📄 License

🔗 Related Projects

BeautifulSoup4
Requests
lxml
NodeScraper (@ioodev/nodescraper) — the Node.js sibling of this library
ElephScraper — the PHP sibling of this library

💡 Why SnakyScraper?

Think of it as your Pythonic sniper — targeting HTML content with precision and elegance.

🇮🇩 Bahasa Indonesia

SnakyScraper is a lightweight, Pythonic web scraping toolkit built on top of BeautifulSoup and Requests. The library provides a clean interface for extracting structured HTML and metadata from any web page (titles, Open Graph tags, headings, links, images, and custom DOM selectors) with consistent, JSON-friendly return values.

🚀 Features

Metadata extraction: title, description, keywords, author, charset, canonical URL, and more
Built-in support for Open Graph, Twitter Card, and CSRF tags
HTML structure extraction: h1–h6, p, ul, ol, images, links
Flexible filter() method with tag, class, and ID selectors
Parse HTML directly via html=, with no network request needed
Clear error handling: check .error / .status_code, or enable exceptions with raise_on_error=True
Support for custom headers, timeout, and requests.Session reuse for real-world scraping
Complete type hints, with py.typed (PEP 561) included for IDE and mypy support
No more bare except: blocks, and no more 404/500 pages slipping through unnoticed

📦 Installation

pip install snakyscraper

Optional extras:

# Faster HTML parsing with lxml
pip install "snakyscraper[lxml]"

# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"

Requires Python 3.8 or later.

🛠️ Basic Usage

from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())
    print(scraper.description())
    print(scraper.h1())
    print(scraper.open_graph())
else:
    print("Failed:", scraper.error)

Parsing HTML you already have (no network request)

scraper = SnakyScraper(html="<title>Halo</title><h1>Selamat datang</h1>")
scraper.title()  # "Halo"
scraper.h1()     # ["Selamat datang"]

⚠️ Error Handling

By default, SnakyScraper never raises an exception: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are caught internally, so one problematic URL in a batch job will not stop the whole run.

scraper = SnakyScraper("https://example.com/halaman-tidak-ada")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")

If you want errors raised immediately as exceptions (for example, during development), use raise_on_error=True:

from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/halaman-tidak-ada", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("That URL is invalid.")
except FetchError as e:
    print("Network problem:", e)

🔍 Custom DOM Filtering

scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)

extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).

🧑‍💻 Development

git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

pytest                  # run the test suite
mypy snakyscraper/      # type-check
python -m build         # build wheel/sdist

📄 License

Project details

This version

1.1.0

Jun 23, 2026

Download files

Source Distribution

snakyscraper-1.1.0.tar.gz (22.2 kB)

Uploaded Jun 23, 2026 Source

Built Distribution

snakyscraper-1.1.0-py3-none-any.whl (14.9 kB)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file snakyscraper-1.1.0.tar.gz.

File metadata

Download URL: snakyscraper-1.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for snakyscraper-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0ea7b5736c4b615705c08350879a542e815f10d7156e38e75dbc752b7e8ef9e6`
MD5	`fb3a6039a8578a8f0c0607fc965b9476`
BLAKE2b-256	`70a3b98b66114e23a677fa47a2915ab2858b8b9dd963e4db21098a86e9887c33`

File details

Details for the file snakyscraper-1.1.0-py3-none-any.whl.

File metadata

Download URL: snakyscraper-1.1.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for snakyscraper-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1d576db6857594b7196b9ecfb7b00492a95dc9857ba78d8a0449479955e9e4d4`
MD5	`aae0718639ceea5c5f450336d5e06acc`
BLAKE2b-256	`3a46929da2098162f83c408f1815f3a6c9ac953dfc2384891a1e1cef8b1f9722`
