A lightweight, Pythonic web scraping toolkit built on BeautifulSoup and Requests.

🐍 SnakyScraper

SnakyScraper is a lightweight, Pythonic web scraping toolkit built on top of BeautifulSoup and Requests. It gives you a clean interface for pulling structured HTML and metadata out of any web page (titles, Open Graph tags, headings, links, images, and arbitrary DOM selectors) with predictable, JSON-friendly return values.

Fast. Accurate. Snake-style scraping. 🐍🎯

🚀 Features

  • ✅ Extract metadata: title, description, keywords, author, charset, canonical URL, and more
  • ✅ Built-in support for Open Graph, Twitter Card, and CSRF tags
  • ✅ Extract HTML structures: h1–h6, p, ul, ol, images, links
  • ✅ Powerful filter() method with tag, class, and ID-based selectors
  • ✅ Parse raw HTML directly with html= (no network call required)
  • ✅ Proper error handling: inspect .error / .status_code or opt into exceptions with raise_on_error=True
  • ✅ Custom headers, timeout, and requests.Session reuse for real-world scraping
  • ✅ Fully type-hinted, ships with py.typed (PEP 561) for IDE & mypy support
  • ✅ Zero bare except: blocks, no silent 404/500 pass-throughs
  • ✅ Powered by BeautifulSoup4 and Requests, with no heavyweight dependencies

📦 Installation

pip install snakyscraper

Optional extras:

# Faster HTML parsing via lxml
pip install "snakyscraper[lxml]"

# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"

Requires Python 3.8 or later.


🛠️ Quick Start

from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())          # "Example Domain"
    print(scraper.description())    # meta description, or None
    print(scraper.h1())             # ["Example Domain"]
    print(scraper.open_graph())     # {"og:title": ..., "og:image": ..., ...}
else:
    print("Failed:", scraper.error)

Parsing HTML you already have (no network call)

Useful for tests, cached pages, or HTML obtained from a headless browser:

scraper = SnakyScraper(html="<title>Hello</title><h1>Hi there</h1>")
scraper.title()  # "Hello"
scraper.h1()     # ["Hi there"]

Custom headers, timeout, and session reuse

import requests
from snakyscraper import SnakyScraper

session = requests.Session()
scraper = SnakyScraper(
    "https://example.com",
    timeout=15,
    headers={"Accept-Language": "id-ID,en;q=0.8"},
    session=session,  # reuse connections/cookies across multiple scrapes
)

⚠️ Handling Errors

By default, SnakyScraper never raises: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are captured instead of thrown, so a single bad URL in a batch job won't crash the whole run.

scraper = SnakyScraper("https://example.com/this-page-does-not-exist")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")
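
The capture-instead-of-raise behaviour lends itself to a simple batch pattern: build every scraper, then split the results by ok(). The sketch below is only an illustration. StubScraper is a hypothetical stand-in that mimics the documented ok() / .error / .status_code surface so the example runs without network access, and its ok() logic (no captured error means success) is an assumption. With the real library you would pass SnakyScraper instances instead.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class StubScraper:
    """Hypothetical stand-in exposing ok(), .error and .status_code."""
    url: str
    status_code: Optional[int] = None
    error: Optional[Exception] = None

    def ok(self) -> bool:
        # Assumption: a scrape succeeded when no error was captured.
        return self.error is None


def split_results(scrapers: Iterable) -> Tuple[List, List]:
    """Partition scrapers into (succeeded, failed) without raising."""
    succeeded, failed = [], []
    for scraper in scrapers:
        (succeeded if scraper.ok() else failed).append(scraper)
    return succeeded, failed


batch = [
    StubScraper("https://example.com", status_code=200),
    StubScraper(
        "https://example.com/missing",
        status_code=404,
        error=RuntimeError("returned HTTP 404"),
    ),
]
good, bad = split_results(batch)
for scraper in bad:
    print(f"skipped {scraper.url}: {scraper.status_code} {scraper.error}")
```

Because nothing is thrown, the loop visits every URL and the failures can be logged or retried afterwards.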

If you'd rather fail fast (e.g. while developing), pass raise_on_error=True:

from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/missing", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("That URL is malformed.")
except FetchError as e:
    print("Network problem:", e)

Exception hierarchy:

SnakyScraperError
├── InvalidURLError    # bad/missing/non-http(s) URL
├── FetchError          # network failure (timeout, DNS, connection refused, ...)
│   └── HTTPStatusError # non-2xx response (has .status_code)
└── ParseError          # HTML could not be parsed
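
Since HTTPStatusError sits under FetchError in this tree, the order of except clauses matters: the more specific class has to come first. The sketch below re-declares the hierarchy locally so it runs on its own. These classes are stand-ins, not the library's own, and the HTTPStatusError constructor signature is an assumption.

```python
# Local stand-ins mirroring the documented hierarchy (not the real classes).
class SnakyScraperError(Exception):
    pass


class InvalidURLError(SnakyScraperError):
    pass


class FetchError(SnakyScraperError):
    pass


class HTTPStatusError(FetchError):
    # Assumed constructor; the docs only state that .status_code exists.
    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class ParseError(SnakyScraperError):
    pass


def classify(exc: Exception) -> str:
    """Show which except clause handles a given error."""
    try:
        raise exc
    except HTTPStatusError as e:   # most specific first
        return f"http:{e.status_code}"
    except FetchError:             # remaining network failures
        return "network"
    except SnakyScraperError:      # anything else from the library
        return "other"


print(classify(HTTPStatusError("not found", 404)))
print(classify(FetchError("timed out")))
print(classify(ParseError("bad markup")))
```

If the FetchError clause came first, it would also swallow HTTPStatusError and the status code branch would never run.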

📖 API Reference

🔹 Status

Method          Returns             Description
ok()            bool                True if the page was fetched and parsed successfully
.error          Exception | None    The exception captured during construction, if any
.status_code    int | None          HTTP status code of the response, if a request was made

🔹 Page Metadata

Method               Returns              Notes
title()              str | None
description()        str | None
keywords()           list[str] | None
keyword_string()     str | None
charset()            str | None           reads both <meta charset> and legacy http-equiv forms
canonical()          str | None
content_type()       str | None
author()             str | None
csrf_token()         str | None           checks meta tag, then hidden input
image()              str | None           shortcut for og:image
viewport()           list[str] | None
viewport_string()    str | None
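
For readers unfamiliar with the two charset declarations mentioned for charset(), here is a standard-library sketch that recognises both. It is not SnakyScraper's implementation (which is built on BeautifulSoup); CharsetFinder and find_charset are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Records the first charset declared by a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # Modern form: <meta charset="utf-8">
            self.charset = attr["charset"].strip()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Legacy form:
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            for part in attr.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip()


def find_charset(html: str) -> Optional[str]:
    """Return the declared charset, or None when no declaration is found."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


print(find_charset('<meta charset="utf-8">'))
print(find_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
))
```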

🔹 Open Graph & Twitter Card

scraper.open_graph()              # dict of common og:* properties
scraper.open_graph("og:title")    # a single property

scraper.twitter_card()                  # dict of common twitter:* properties
scraper.twitter_card("twitter:title")   # a single property

🔹 Headings & Text

scraper.h1()  # list[str]
scraper.h2()
scraper.h3()
scraper.h4()
scraper.h5()
scraper.h6()
scraper.p()

🔹 Lists

scraper.ul()  # flattened text of every <li> in every <ul>
scraper.ol()  # flattened text of every <li> in every <ol>

🔹 Images

scraper.images()         # ["/img/1.jpg", "/img/2.jpg", ...]
scraper.image_details()  # [{"url": ..., "alt_text": ..., "title": ...}, ...]

🔹 Links

scraper.links()         # list of href strings (anchors with no href are skipped)
scraper.link_details()  # list of dicts: url, protocol, text, title, target, rel, is_nofollow, ...
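
The sample output for images() shows site-relative paths such as "/img/1.jpg". Whether SnakyScraper resolves these against the page URL is not stated here, so the sketch below assumes they are returned as written in the HTML and resolves them with the standard library. The absolutize helper is a hypothetical name, not part of the package.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def absolutize(base_url: str, urls: Iterable[str]) -> List[str]:
    """Resolve each URL against the page it was scraped from."""
    return [urljoin(base_url, url) for url in urls]


resolved = absolutize(
    "https://example.com/blog/",
    ["/img/1.jpg", "post-2.html", "https://cdn.example.org/a.png"],
)
print(resolved)
# ['https://example.com/img/1.jpg',
#  'https://example.com/blog/post-2.html',
#  'https://cdn.example.org/a.png']
```

Absolute URLs pass through unchanged, so the helper is safe to apply to a mixed list.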

🔍 Custom DOM Filtering

Use filter() to target specific elements and optionally pull nested content out of them.

▸ Single element

scraper.filter(
    element="div",
    attributes={"id": "main"},
    multiple=False,
    extract=[".title", "#description", "p"],
)

▸ Multiple elements

scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)

extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).
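
The key naming described above can be sketched as a small mapping function. selector_key is a hypothetical helper written for illustration, not a function exported by SnakyScraper, and the assumption that a plain tag selector is keyed by the tag name itself is not confirmed by the text.

```python
def selector_key(selector: str) -> str:
    """Map an extract selector to the result key it is reported under."""
    if selector.startswith("."):
        return "class__" + selector[1:]   # .title -> class__title
    if selector.startswith("#"):
        return "id__" + selector[1:]      # #meta  -> id__meta
    return selector                       # assumed: tag name used as-is


keys = [selector_key(s) for s in ["h3", ".subtitle", "#meta"]]
print(keys)  # ['h3', 'class__subtitle', 'id__meta']
```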

▸ Clean text instead of raw HTML

scraper.filter(
    element="p",
    attributes={"class": "dark-text"},
    multiple=True,
    return_html=False,
)

🗂 Project Structure

snakyscraper/
├── snakyscraper/
│   ├── __init__.py       # public API surface (SnakyScraper, exceptions, __version__)
│   ├── core.py           # SnakyScraper implementation
│   ├── exceptions.py     # SnakyScraperError and subclasses
│   ├── _version.py       # single source of truth for the version string
│   └── py.typed          # PEP 561 marker
├── tests/
│   ├── conftest.py       # shared fixtures (sample HTML pages)
│   ├── test_metadata.py  # title, og, twitter, charset, csrf, ...
│   ├── test_content.py   # headings, lists, images, links
│   ├── test_filter.py    # filter() DOM queries
│   └── test_fetching.py  # URL validation, HTTP mocking, error handling
├── examples/
│   └── basic_usage.py
├── pyproject.toml        # build system + project metadata + tool config
├── LICENSE
└── README.md

This split keeps the public API (__init__.py) thin, the implementation (core.py) self-contained, and error types (exceptions.py) reusable without importing the whole scraping engine, which makes the codebase easier to navigate and extend.


🧑‍💻 Development

git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

# Run the test suite (mocked HTTP, no real network calls)
pytest

# With coverage
pytest --cov=snakyscraper --cov-report=term-missing

# Type-check
mypy snakyscraper/

# Build distributable wheel/sdist
python -m build

Contributing

Found a bug or want to request a feature? Open an issue or submit a pull request.


📝 Changelog

See CHANGELOG.md for the full version history. Highlights for v1.1.0:

  • Restructured into a proper multi-module package (core, exceptions, _version)
  • Fixed: HTTP error pages (404/500) no longer silently treated as successful
  • Fixed: charset() now reads legacy http-equiv charset declarations
  • Fixed: link_details() no longer breaks on anchors without href
  • Fixed: title() now returns a clean str instead of a NavigableString
  • Added: html= kwarg to parse raw HTML with no network call
  • Added: typed exception hierarchy (InvalidURLError, FetchError, HTTPStatusError, ParseError)
  • Added: .error, .status_code, .ok(), raise_on_error=, custom headers=/session=
  • Added: full type hints + py.typed, full pytest suite (63 tests, 92% coverage)
  • Renamed: project ownership moved from riodevnet to ioodev

📄 License

MIT License © 2025–2026 ioodev


💡 Why SnakyScraper?

Think of it as your Pythonic sniper, targeting HTML content with precision and elegance.


🇮🇩 Bahasa Indonesia

SnakyScraper is a lightweight, Pythonic web scraping toolkit built on top of BeautifulSoup and Requests. The library provides a clean interface for retrieving structured HTML and metadata from any web page (titles, Open Graph tags, headings, links, images, and custom DOM selectors) with consistent, JSON-friendly return values.

🚀 Features

  • Metadata extraction: title, description, keywords, author, charset, canonical URL, and more
  • Built-in support for Open Graph, Twitter Card, and CSRF tags
  • HTML structure extraction: h1–h6, p, ul, ol, images, links
  • Flexible filter() method with tag, class, and ID selectors
  • Parses HTML directly via html=, with no network request needed
  • Clear error handling: check .error / .status_code, or enable exceptions with raise_on_error=True
  • Support for custom headers, timeout, and requests.Session reuse for real-world scraping
  • Complete type hints, shipping py.typed (PEP 561) for IDE & mypy support
  • No more bare except: blocks, and no more 404/500 pages slipping through unnoticed

📦 Installation

pip install snakyscraper

Optional extras:

# Faster HTML parsing with lxml
pip install "snakyscraper[lxml]"

# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"

Requires Python 3.8 or later.

🛠️ Basic Usage

from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())
    print(scraper.description())
    print(scraper.h1())
    print(scraper.open_graph())
else:
    print("Failed:", scraper.error)

Parsing HTML you already have (no network request)

scraper = SnakyScraper(html="<title>Hello</title><h1>Welcome</h1>")
scraper.title()  # "Hello"
scraper.h1()     # ["Welcome"]

⚠️ Error Handling

By default, SnakyScraper never raises an exception: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are caught internally, so one problematic URL in a batch job will not stop the whole process.

scraper = SnakyScraper("https://example.com/halaman-tidak-ada")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")

If you want errors raised as exceptions right away (for example during development), use raise_on_error=True:

from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/halaman-tidak-ada", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("The URL is not valid.")
except FetchError as e:
    print("Network problem:", e)

🔍 Custom DOM Filtering

scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)

extract selectors: a tag name ("h3"), a class (.title → key class__title), or an ID (#meta → key id__meta).

🧑‍💻 Development

git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

pytest                  # run the test suite
mypy snakyscraper/      # type-check
python -m build         # build wheel/sdist

📄 License

MIT License © 2025–2026 ioodev
