
RusticSoup 🦀🍲

Lightning-fast HTML parser and data extractor built in Rust


🚀 Why RusticSoup?

| Feature | BeautifulSoup | RusticSoup | Advantage |
|---------|---------------|------------|-----------|
| Google Shopping | 8.1ms | 3.9ms | 2.1x faster |
| Product grids | 14ms | 1.2ms | 12x faster |
| Bulk processing | Sequential | Parallel | Up to 100x faster |
| Attribute extraction | Manual loops | @href syntax | Zero loops needed |
| WebPage API | ❌ | ✅ | web-poet inspired |
| CSS selectors | ✅ | ✅ | Same API |
| Memory usage | High | Low | Rust efficiency |

⚡ Quick Start

pip install rusticsoup

Option 1: WebPage API (Recommended - web-poet style)

from rusticsoup import WebPage

html = """
<div class="product">
    <h2>Amazing Product</h2>
    <span class="price">$29.99</span>
    <a href="/buy" class="buy-btn">Buy Now</a>
    <img src="/image.jpg" alt="product">
</div>
"""

# Create a WebPage
page = WebPage(html, url="https://example.com/products")

# Extract single values
title = page.text("h2")                    # "Amazing Product"
price = page.text("span.price")            # "$29.99"
link = page.attr("a.buy-btn", "href")      # "/buy"

# Or extract structured data
product = page.extract({
    "title": "h2",
    "price": "span.price",
    "link": "a.buy-btn@href",   # @ syntax for attributes
    "image": "img@src"
})
# {'title': 'Amazing Product', 'price': '$29.99', 'link': '/buy', 'image': '/image.jpg'}

Option 2: Universal Extraction (Original API)

import rusticsoup

# Define what you want to extract
field_mappings = {
    "title": "h2",              # Text content
    "price": "span.price",      # Text content
    "link": "a.buy-btn@href",   # Attribute extraction with @
    "image": "img@src"          # Any attribute: @src, @href, @alt, etc.
}

# Extract data - no manual loops, no site-specific logic
products = rusticsoup.extract_data(html, "div.product", field_mappings)

print(products)
# [{"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"}]

📚 Documentation & Examples

🎯 Core Features

🌟 NEW: WebPage API (web-poet inspired)

High-level, declarative API for web scraping:

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

# Simple extraction
title = page.text("h1")
links = page.attr_all("a", "href")

# Extract multiple items at once
products = page.extract_all(".product", {
    "name": "h2",
    "price": ".price",
    "url": "a@href"
})

# Check existence
if page.has("nav.menu"):
    nav_items = page.text_all("nav.menu a")

# URL resolution
absolute_url = page.absolute_url("/products/123")

📖 Full WebPage API Documentation | 🚀 Quick Start Guide | 🆘 Help Center | 🧪 Examples

✅ Universal Extraction

Works with any HTML structure - no site-specific parsers needed:

# Google Shopping
rusticsoup.extract_data(html, 'tr[data-is-grid-offer="true"]', {
    'seller': 'a.b5ycib',
    'price': 'span.g9WBQb',
    'link': 'a.UxuaJe@href'
})

# Amazon Products
rusticsoup.extract_data(html, '[data-component-type="s-search-result"]', {
    'title': 'h2 a span',
    'price': '.a-price-whole',
    'rating': '.a-icon-alt',
    'url': 'h2 a@href'
})

# Any website
rusticsoup.extract_data(html, 'your-container-selector', {
    'any_field': 'any.css.selector',
    'any_attribute': 'element@attribute_name'
})

✅ Bulk Processing

Process multiple pages in parallel:

# Process 100 pages simultaneously
pages = [html1, html2, html3, ...]  # List of HTML strings
results = rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)

# Each page processed in parallel using Rust's Rayon
# 10-100x faster than sequential processing

✅ Attribute Extraction

No more manual loops to collect href, src, and other attributes:

# Before (BeautifulSoup)
links = []
for element in soup.select('a'):
    if element.get('href'):
        links.append(element['href'])

# After (RusticSoup)
data = rusticsoup.extract_data(html, 'div', {'links': 'a@href'})

✅ Browser-Grade Parsing

Built on html5ever - the browser-grade HTML parser from Mozilla's Servo project:

  • Handles malformed HTML the way browsers do (see the sketch below)
  • WHATWG HTML5 compliant
  • Native-code performance
  • Memory safe (Rust)
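
A quick sketch of this error recovery in practice (the broken markup is contrived; the exact recovered text depends on html5ever's repair rules):

import rusticsoup

# Deliberately broken markup: unquoted attribute, unclosed tags
messy = '<div class=product><h2>Widget<span class="price">$5'

# html5ever rebuilds a sensible DOM using WHATWG error-recovery rules,
# so CSS selectors still match against the repaired tree
items = rusticsoup.extract_data(messy, "div.product", {
    "title": "h2",
    "price": "span.price",
})
print(items)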

📊 Performance Benchmarks

Real-world scraping performance vs BeautifulSoup:

# Google Shopping: 30 ads per page
BeautifulSoup:  8.1ms per page
RusticSoup:     3.9ms per page  (2.1x faster)

# Product grids: 50 products per page
BeautifulSoup:  14ms per page
RusticSoup:     1.2ms per page  (12x faster)

# Bulk processing: 100 pages
BeautifulSoup:  Sequential ~1.4s
RusticSoup:     Parallel ~14ms   (100x faster)
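
To sanity-check these numbers against your own pages, a rough timing harness (sample_page.html, the container selector, and the field mappings are placeholders):

import time
import rusticsoup

html = open("sample_page.html").read()   # placeholder: any saved page
mappings = {"title": "h2", "price": ".price"}

start = time.perf_counter()
for _ in range(100):
    rusticsoup.extract_data(html, "div.product", mappings)
print(f"{(time.perf_counter() - start) / 100 * 1000:.2f} ms per page")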

🛠️ API Reference

Two Powerful APIs

RusticSoup provides two complementary APIs:

  1. WebPage API - High-level, object-oriented (Recommended for new projects)
  2. Universal Extraction API - Function-based, great for batch processing

WebPage API

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

Key Methods (demonstrated in the sketch after this list):

  • text(selector) - Extract text from first match
  • text_all(selector) - Extract text from all matches
  • attr(selector, attribute) - Extract attribute from first match
  • attr_all(selector, attribute) - Extract attribute from all matches
  • extract(mappings) - Extract structured data
  • extract_all(container, mappings) - Extract multiple items
  • has(selector) - Check if selector matches
  • count(selector) - Count matching elements
  • absolute_url(url) - Convert relative to absolute URL
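
A quick tour of the methods not demonstrated above, as a short sketch (the HTML is illustrative):

from rusticsoup import WebPage

html = '<ul class="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
page = WebPage(html, url="https://example.com")

print(page.text_all("li"))        # ["A", "B"]
print(page.count("a"))            # 2
print(page.has("nav"))            # False
print(page.absolute_url("/b"))    # "https://example.com/b"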

📖 Full WebPage Documentation

🔄 Field Transforms (NEW in v0.2.2)

Apply transformations to extracted data automatically:

from rusticsoup import WebPage, Field
from rusticsoup_helpers import ItemPage

class Article(ItemPage):
    # Single transform
    title = Field(css="h1", transform=str.upper)

    # Chain multiple transforms
    author = Field(
        css=".author",
        transform=[
            str.strip,
            str.title,
            lambda s: s.replace("by ", "")
        ]
    )

    # Transform with attribute extraction
    price = Field(
        css=".price",
        transform=[
            str.strip,
            lambda s: float(s.replace("$", ""))
        ]
    )

    # Transform lists
    tags = Field(
        css=".tag",
        get_all=True,
        transform=lambda tags: [t.upper() for t in tags]
    )

page = WebPage(html)
article = Article(page)

print(article.title)   # "UNDERSTANDING RUST"
print(article.author)  # "Jane Smith"
print(article.price)   # 19.99
print(article.tags)    # ["PYTHON", "RUST", "WEB"]

Benefits:

  • ✅ No manual post-processing needed
  • ✅ Clean, declarative field definitions
  • ✅ Reusable transform functions
  • ✅ Chain multiple transforms in order
  • ✅ Works with single values, lists, and attributes

📖 Full Transform Documentation

Universal Extraction API

extract_data(html, container_selector, field_mappings)

Universal HTML data extraction - works with any website structure.

Parameters:

  • html: HTML string to parse
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of dictionaries with extracted data

extract_data_bulk(html_pages, container_selector, field_mappings)

Parallel processing of multiple HTML pages.

Parameters:

  • html_pages: List of HTML strings
  • container_selector: CSS selector for container elements
  • field_mappings: Dict mapping field names to CSS selectors

Returns: List of lists - one result list per input page
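
Because each input page yields its own result list, flattening is a common follow-up when page boundaries don't matter; a minimal sketch:

import rusticsoup

pages = [
    '<div class="product"><h2>A</h2></div>',
    '<div class="product"><h2>B</h2></div>',
]
results = rusticsoup.extract_data_bulk(pages, "div.product", {"title": "h2"})
print(results)  # [[{'title': 'A'}], [{'title': 'B'}]] - one list per page

# Flatten across pages when per-page grouping isn't needed
flat = [item for page_items in results for item in page_items]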

parse_html(html)

Low-level HTML parsing - returns WebScraper object for manual DOM traversal.

Parameters:

  • html: HTML string to parse

Returns: WebScraper object with select(), text(), attr() methods
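
A minimal sketch of the low-level path; the method names follow the description above, but the exact signatures are assumptions:

import rusticsoup

scraper = rusticsoup.parse_html('<p class="intro">Hello <a href="/x">link</a></p>')

# Assumed from the description: text()/attr() take a selector and
# operate on the first matching element
print(scraper.text("p.intro"))     # text of the first matching element
print(scraper.attr("a", "href"))   # "/x"
links = scraper.select("a")        # matched nodes for manual traversal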

Selector Syntax

| Syntax | Description | Example |
|--------|-------------|---------|
| "selector" | Extract text content | "h1" → "Page Title" |
| "selector@attr" | Extract attribute | "a@href" → "/page.html" |
| "selector@get_all" | Extract all text | "p@get_all" → ["P1", "P2"] |
| "complex selector" | Any CSS selector | "div.class > p:first-child" |

Supported Attributes

Any HTML attribute: @href, @src, @alt, @class, @id, @data-*, etc.
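
A short sketch combining the @get_all and @data-* forms from the table above (the HTML is illustrative):

import rusticsoup

html = """
<div class="card">
  <p>First</p><p>Second</p>
  <a data-sku="A1" href="/a">Buy</a>
</div>
"""

data = rusticsoup.extract_data(html, "div.card", {
    "paragraphs": "p@get_all",   # all matching text, returned as a list
    "sku": "a@data-sku",         # any data-* attribute
})
print(data)  # [{'paragraphs': ['First', 'Second'], 'sku': 'A1'}]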

๐Ÿ—๏ธ Advanced Usage

Custom Processing

# Extract data then post-process
ads = rusticsoup.extract_data(html, "tr.ad", {
    "price": "span.price",
    "link": "a@href"
})

# Post-process the results
for ad in ads:
    # Clean price: "$29.99" → 29.99
    ad["price"] = float(ad["price"].replace("$", ""))

    # Convert relative URLs to absolute
    if ad["link"].startswith("/"):
        ad["link"] = f"https://example.com{ad['link']}"

Table Extraction

# Extract HTML tables easily
table_data = rusticsoup.extract_table_data(html, "table.data")
# Returns: [["Header1", "Header2"], ["Row1Col1", "Row1Col2"], ...]

Error Handling

try:
    data = rusticsoup.extract_data(html, "div.product", field_mappings)
except Exception as e:
    print(f"Parsing error: {e}")
    data = []

🆚 Migration from BeautifulSoup

Option 1: WebPage API (Recommended)

# BeautifulSoup - Imperative, verbose
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
products = []

for product in soup.select('div.product'):
    title = product.select_one('h2')
    price = product.select_one('span.price')
    link = product.select_one('a')

    products.append({
        'title': title.text if title else '',
        'price': price.text if price else '',
        'link': link.get('href') if link else ''
    })

# RusticSoup WebPage - Declarative, concise
from rusticsoup import WebPage

page = WebPage(html)
products = page.extract_all('div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

Option 2: Universal Extraction API

# RusticSoup Universal API - Function-based
import rusticsoup

products = rusticsoup.extract_data(html, 'div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

90% less code, 2-10x faster, handles attributes automatically!

web-poet to RusticSoup

RusticSoup's WebPage API is compatible with web-poet patterns:

# web-poet (page objects; to_item is async, selection is sync parsel)
from web_poet import WebPage

class ArticlePage(WebPage):
    async def to_item(self):
        title = self.css("h1::text").get()
        links = self.css("a::attr(href)").getall()
        return {"title": title, "links": links}

# RusticSoup WebPage (sync, faster - no async needed!)
from rusticsoup import WebPage

def parse(html: str):
    page = WebPage(html)
    title = page.text("h1")
    links = page.attr_all("a", "href")
    return {"title": title, "links": links}

🔧 Installation

From PyPI (Recommended)

pip install rusticsoup

From Source

# Requires Rust toolchain
git clone https://github.com/yourusername/rusticsoup
cd rusticsoup
maturin develop --release

System Requirements

  • Python 3.11+
  • No additional dependencies (self-contained)

📈 Use Cases

Perfect for:

  • Web scraping - Extract data from any website
  • Data mining - Process large amounts of HTML
  • Price monitoring - Track e-commerce prices
  • Content aggregation - Collect articles, posts, listings
  • SEO analysis - Extract meta tags, titles, links
  • API alternatives - Scrape when no API exists

๐Ÿค Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

  1. Fork the repository
  2. Create your feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on html5ever - Mozilla's HTML5 parser
  • Powered by scraper - CSS selector support
  • Inspired by BeautifulSoup - the original HTML parsing library
  • WebPage API inspired by web-poet - declarative web scraping

Made with 🦀 and ❤️ - RusticSoup: Where Rust meets HTML parsing perfection
