Lightning-fast HTML parser and data extractor with WebPage API - BeautifulSoup alternative built in Rust

RusticSoup 🦀🍲

Lightning-fast HTML parser and data extractor built in Rust

🚀 Why RusticSoup?

Feature	BeautifulSoup	RusticSoup	Difference
Google Shopping	8.1ms	3.9ms	2.1x faster
Product grids	14ms	1.2ms	12x faster
Bulk processing	Sequential	Parallel	Up to 100x faster
Attribute extraction	Manual loops	`@href` syntax	Zero loops needed
WebPage API	❌	✅	web-poet inspired
CSS selectors	✅	✅	Same API
Memory usage	High	Low	Rust efficiency

⚡ Quick Start

pip install rusticsoup

Option 1: WebPage API (Recommended - web-poet style)

from rusticsoup import WebPage

html = """
<div class="product">
    <h2>Amazing Product</h2>
    <span class="price">$29.99</span>
    <a href="/buy" class="buy-btn">Buy Now</a>
    <img src="/image.jpg" alt="product">
</div>
"""

# Create a WebPage
page = WebPage(html, url="https://example.com/products")

# Extract single values
title = page.text("h2")                    # "Amazing Product"
price = page.text("span.price")            # "$29.99"
link = page.attr("a.buy-btn", "href")      # "/buy"

# Or extract structured data
product = page.extract({
    "title": "h2",
    "price": "span.price",
    "link": "a.buy-btn@href",   # @ syntax for attributes
    "image": "img@src"
})
# {'title': 'Amazing Product', 'price': '$29.99', 'link': '/buy', 'image': '/image.jpg'}

Option 2: Universal Extraction (Original API)

import rusticsoup

# Define what you want to extract
field_mappings = {
    "title": "h2",              # Text content
    "price": "span.price",      # Text content
    "link": "a.buy-btn@href",   # Attribute extraction with @
    "image": "img@src"          # Any attribute: @src, @href, @alt, etc.
}

# Extract data - no manual loops, no site-specific logic
products = rusticsoup.extract_data(html, "div.product", field_mappings)

print(products)
# [{"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"}]

🎯 Core Features

🌟 NEW: WebPage API (web-poet inspired)

High-level, declarative API for web scraping:

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

# Simple extraction
title = page.text("h1")
links = page.attr_all("a", "href")

# Extract multiple items at once
products = page.extract_all(".product", {
    "name": "h2",
    "price": ".price",
    "url": "a@href"
})

# Check existence
if page.has("nav.menu"):
    nav_items = page.text_all("nav.menu a")

# URL resolution
absolute_url = page.absolute_url("/products/123")

📖 Full WebPage API Documentation | 🚀 Quick Start Guide

✅ Universal Extraction

Works with any HTML structure - no site-specific parsers needed:

# Google Shopping
rusticsoup.extract_data(html, 'tr[data-is-grid-offer="true"]', {
    'seller': 'a.b5ycib',
    'price': 'span.g9WBQb',
    'link': 'a.UxuaJe@href'
})

# Amazon Products
rusticsoup.extract_data(html, '[data-component-type="s-search-result"]', {
    'title': 'h2 a span',
    'price': '.a-price-whole',
    'rating': '.a-icon-alt',
    'url': 'h2 a@href'
})

# Any website
rusticsoup.extract_data(html, 'your-container-selector', {
    'any_field': 'any.css.selector',
    'any_attribute': 'element@attribute_name'
})

✅ Bulk Processing

Process multiple pages in parallel:

# Process 100 pages simultaneously
pages = [html1, html2, html3, ...]  # List of HTML strings
results = rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)

# Each page processed in parallel using Rust's Rayon
# 10-100x faster than sequential processing

✅ Attribute Extraction

No more manual loops for getting href, src, etc:

# Before (BeautifulSoup)
links = []
for element in soup.select('a'):
    if element.get('href'):
        links.append(element['href'])

# After (RusticSoup)
data = rusticsoup.extract_data(html, 'div', {'links': 'a@href'})

✅ Browser-Grade Parsing

Built on html5ever - the HTML parser developed for the Servo browser engine:

Handles malformed HTML using the HTML5 error-recovery rules that browsers follow
WHATWG HTML5 compliant
Native-code parsing speed
Memory safe (Rust)

📊 Performance Benchmarks

Real-world scraping performance vs BeautifulSoup:

# Google Shopping: 30 ads per page
BeautifulSoup:  8.1ms per page
RusticSoup:     3.9ms per page  (2.1x faster)

# Product grids: 50 products per page
BeautifulSoup:  14ms per page
RusticSoup:     1.2ms per page  (12x faster)

# Bulk processing: 100 pages
BeautifulSoup:  Sequential ~1.4s
RusticSoup:     Parallel ~14ms   (100x faster)
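
The speedup figures follow directly from the per-page timings quoted above. To collect comparable numbers on your own pages, a small timing harness is enough. The sketch below is an assumption about how one might measure, not the project's benchmark script: `mean_ms_per_page` and `speedup` are hypothetical helper names, and the demo extractor is a stand-in that only counts a substring.

```python
import time
from typing import Callable, Sequence


def mean_ms_per_page(extract: Callable[[str], object],
                     pages: Sequence[str],
                     repeat: int = 5) -> float:
    """Return the best-of-`repeat` mean wall-clock time per page, in milliseconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for html in pages:
            extract(html)
        best = min(best, time.perf_counter() - start)
    return best / len(pages) * 1000.0


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Ratios implied by the timings reported above.
reported = {
    "Google Shopping": speedup(8.1, 3.9),
    "Product grids": speedup(14.0, 1.2),
    "Bulk, 100 pages": speedup(1400.0, 14.0),
}

if __name__ == "__main__":
    # Stand-in extractor so the harness runs without any parser installed.
    pages = ["<div class='product'></div>" * 50] * 20
    print(f"{mean_ms_per_page(lambda h: h.count('product'), pages):.4f} ms/page")
    for name, ratio in reported.items():
        print(f"{name}: {ratio:.1f}x")
```

To compare two parsers, pass each one's extraction function as `extract` over the same `pages` list and divide the results with `speedup`.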

🛠️ API Reference

Two Powerful APIs

RusticSoup provides two complementary APIs:

WebPage API - High-level, object-oriented (Recommended for new projects)
Universal Extraction API - Function-based, great for batch processing

WebPage API

from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

Key Methods:

text(selector) - Extract text from first match
text_all(selector) - Extract text from all matches
attr(selector, attribute) - Extract attribute from first match
attr_all(selector, attribute) - Extract attribute from all matches
extract(mappings) - Extract structured data
extract_all(container, mappings) - Extract multiple items
has(selector) - Check if selector matches
count(selector) - Count matching elements
absolute_url(url) - Convert relative to absolute URL
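
Relative-to-absolute conversion follows standard URL resolution rules. The sketch below uses Python's `urllib.parse.urljoin` to show what that resolution produces for a few kinds of link. It illustrates the concept only and makes no claim about how `absolute_url` is implemented internally; the `base` URL is an example value.

```python
from urllib.parse import urljoin

base = "https://example.com/products/list?page=2"

# Root-relative path: replaces the whole path of the base URL.
root_relative = urljoin(base, "/products/123")

# Document-relative path: resolved against the base URL's directory.
doc_relative = urljoin(base, "item/123")

# Already-absolute URL: returned unchanged.
absolute = urljoin(base, "https://cdn.example.com/image.jpg")

# Protocol-relative URL: inherits only the scheme from the base.
protocol_relative = urljoin(base, "//cdn.example.com/image.jpg")

print(root_relative)      # https://example.com/products/123
print(doc_relative)       # https://example.com/products/item/123
print(absolute)           # https://cdn.example.com/image.jpg
print(protocol_relative)  # https://cdn.example.com/image.jpg
```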

📖 Full WebPage Documentation

🔄 Field Transforms (NEW in v0.2.2)

Apply transformations to extracted data automatically:

from rusticsoup import WebPage, Field
from rusticsoup_helpers import ItemPage

class Article(ItemPage):
    # Single transform
    title = Field(css="h1", transform=str.upper)

    # Chain multiple transforms (applied in order)
    author = Field(
        css=".author",
        transform=[
            str.strip,
            lambda s: s.replace("by ", ""),  # drop the prefix before title-casing
            str.title
        ]
    )

    # Convert extracted text to a number
    price = Field(
        css=".price",
        transform=[
            str.strip,
            lambda s: float(s.replace("$", ""))
        ]
    )

    # Transform lists
    tags = Field(
        css=".tag",
        get_all=True,
        transform=lambda tags: [t.upper() for t in tags]
    )

page = WebPage(html)
article = Article(page)

print(article.title)   # "UNDERSTANDING RUST"
print(article.author)  # "Jane Smith"
print(article.price)   # 19.99
print(article.tags)    # ["PYTHON", "RUST", "WEB"]

Benefits:

✅ No manual post-processing needed
✅ Clean, declarative field definitions
✅ Reusable transform functions
✅ Chain multiple transforms in order
✅ Works with single values, lists, and attributes
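
A transform chain runs in list order, with each function receiving the previous function's output, so the order of the steps matters. The helper below is a plain-Python sketch of that behavior, assuming a `transform` is either one callable or a list of callables. `apply_transforms` is a hypothetical name and is not part of RusticSoup.

```python
from typing import Any, Callable, Iterable, Union

Transform = Union[Callable[[Any], Any], Iterable[Callable[[Any], Any]]]


def apply_transforms(value: Any, transform: Transform) -> Any:
    """Apply one callable, or each callable of a sequence in order."""
    steps = [transform] if callable(transform) else list(transform)
    for step in steps:
        value = step(value)
    return value


raw_author = "  by jane smith "

# Remove the prefix before title-casing: "by " is still lowercase here.
good = apply_transforms(raw_author, [
    str.strip,
    lambda s: s.replace("by ", ""),
    str.title,
])

# Title-casing first turns "by " into "By ", so the replace no longer matches.
bad = apply_transforms(raw_author, [
    str.strip,
    str.title,
    lambda s: s.replace("by ", ""),
])

print(good)  # Jane Smith
print(bad)   # By Jane Smith
```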

📖 Full Transform Documentation

Universal Extraction API

`extract_data(html, container_selector, field_mappings)`

Universal HTML data extraction - works with any website structure.

Parameters:

html: HTML string to parse
container_selector: CSS selector for container elements
field_mappings: Dict mapping field names to CSS selectors

Returns: List of dictionaries with extracted data
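
Because the result is a plain list of dictionaries, it can be handed straight to standard tooling. As a sketch, the snippet below writes records of the shape shown in the Quick Start to CSV using only the standard library. The `records` literal stands in for a real `extract_data` call, and `records_to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List


def records_to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize flat dicts to CSV text, using the first record's keys as columns."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Stand-in for: rusticsoup.extract_data(html, "div.product", field_mappings)
records = [
    {"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"},
]

csv_text = records_to_csv(records)
print(csv_text)
```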

`extract_data_bulk(html_pages, container_selector, field_mappings)`

Parallel processing of multiple HTML pages.

Parameters:

html_pages: List of HTML strings
container_selector: CSS selector for container elements
field_mappings: Dict mapping field names to CSS selectors

Returns: List of lists - one result list per input page
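
Since the bulk result is nested (one list per input page), a common follow-up step is flattening it while recording which page each record came from. A minimal sketch, assuming the result lists come back in the same order as the input pages. `flatten_bulk` and the `source_url` key are illustrative names, and the `results` literal stands in for a real `extract_data_bulk` call.

```python
from typing import Dict, List


def flatten_bulk(results: List[List[Dict[str, str]]],
                 urls: List[str]) -> List[Dict[str, str]]:
    """Flatten per-page result lists into one list, tagging each record with its page URL."""
    if len(results) != len(urls):
        raise ValueError("expected one URL per result list")
    flat = []
    for url, page_records in zip(urls, results):
        for record in page_records:
            flat.append({**record, "source_url": url})
    return flat


# Stand-in for: rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)
results = [
    [{"title": "A", "price": "$1.00"}, {"title": "B", "price": "$2.00"}],
    [],  # a page with no matching containers
    [{"title": "C", "price": "$3.00"}],
]
urls = ["https://example.com/p1", "https://example.com/p2", "https://example.com/p3"]

flat = flatten_bulk(results, urls)
print(len(flat))  # 3
```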

`parse_html(html)`

Low-level HTML parsing - returns WebScraper object for manual DOM traversal.

Parameters:

html: HTML string to parse

Returns: WebScraper object with select(), text(), attr() methods

Selector Syntax

Syntax	Description	Example
`"selector"`	Extract text content	`"h1"` → "Page Title"
`"selector@attr"`	Extract attribute	`"a@href"` → "/page.html"
`"selector@get_all"`	Extract all text	`"p@get_all"` → ["P1", "P2"]
`"complex selector"`	Any CSS selector	`"div.class > p:first-child"`

Supported Attributes

Any HTML attribute: @href, @src, @alt, @class, @id, @data-*, etc.
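
To make the `@` convention concrete, here is one way a field mapping could be split into a CSS selector and an attribute name. This is an illustrative sketch, not RusticSoup's actual parser. `split_field_spec` is a hypothetical helper, and treating `get_all` as a reserved keyword is inferred from the table above.

```python
import re
from typing import Optional, Tuple

# A trailing "@name" where name looks like an attribute: letters, digits, "_", ":", "-".
_ATTR_SUFFIX = re.compile(r"^(?P<css>.+)@(?P<attr>[A-Za-z_][\w:-]*)$")


def split_field_spec(spec: str) -> Tuple[str, Optional[str], bool]:
    """Split "selector@attr" into (css, attr, get_all).

    attr is None when text content is requested; get_all is True for "@get_all".
    """
    spec = spec.strip()
    match = _ATTR_SUFFIX.match(spec)
    if match is None:
        return spec, None, False
    css, attr = match.group("css").strip(), match.group("attr")
    if attr == "get_all":
        return css, None, True
    return css, attr, False


print(split_field_spec("h1"))                    # ('h1', None, False)
print(split_field_spec("a.buy-btn@href"))        # ('a.buy-btn', 'href', False)
print(split_field_spec("p@get_all"))             # ('p', None, True)
print(split_field_spec("div.card > a@data-id"))  # ('div.card > a', 'data-id', False)
```

Anchoring the pattern to the end of the string means an `@` inside a quoted attribute value, as in `a[title="x@y"]`, is left as part of the selector.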

🏗️ Advanced Usage

Custom Processing

# Extract data then post-process
ads = rusticsoup.extract_data(html, "tr.ad", {
    "price": "span.price",
    "link": "a@href"
})

# Post-process the results
for ad in ads:
    # Clean price: "$29.99" → 29.99
    ad["price"] = float(ad["price"].replace("$", ""))

    # Convert relative URLs to absolute
    if ad["link"].startswith("/"):
        ad["link"] = f"https://example.com{ad['link']}"
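
The `float(...)` conversion above works for prices like `$29.99` but raises `ValueError` on values such as `$1,299.00` or an empty string. A slightly more defensive sketch using `decimal.Decimal` follows. `parse_price` is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Parse a US-formatted price string; return None if it holds no valid number."""
    # Keep digits, dots and minus signs; drop currency symbols, commas and spaces.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price("$29.99"))     # 29.99
print(parse_price("$1,299.00"))  # 1299.00
print(parse_price(""))           # None
print(parse_price("Call us"))    # None
```

`Decimal` avoids binary floating-point rounding, which matters if the prices are later summed or compared.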

Table Extraction

# Extract HTML tables easily
table_data = rusticsoup.extract_table_data(html, "table.data")
# Returns: [["Header1", "Header2"], ["Row1Col1", "Row1Col2"], ...]
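
Given the row-list shape described in the comment above, converting the rows to a list of dictionaries takes one `zip` per row. The sketch assumes the first row is the header. `rows_to_dicts` is a hypothetical helper, and the `table_data` literal stands in for a real `extract_table_data` call.

```python
from typing import Dict, List


def rows_to_dicts(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as column names and map each later row onto them."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Stand-in for: rusticsoup.extract_table_data(html, "table.data")
table_data = [
    ["Header1", "Header2"],
    ["Row1Col1", "Row1Col2"],
    ["Row2Col1", "Row2Col2"],
]

table_records = rows_to_dicts(table_data)
print(table_records)
```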

Error Handling

try:
    data = rusticsoup.extract_data(html, "div.product", field_mappings)
except Exception as e:
    print(f"Parsing error: {e}")
    data = []

🆚 Migration from BeautifulSoup

Option 1: WebPage API (Recommended)

# BeautifulSoup - Imperative, verbose
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
products = []

for product in soup.select('div.product'):
    title = product.select_one('h2')
    price = product.select_one('span.price')
    link = product.select_one('a')

    products.append({
        'title': title.text if title else '',
        'price': price.text if price else '',
        'link': link.get('href') if link else ''
    })

# RusticSoup WebPage - Declarative, concise
from rusticsoup import WebPage

page = WebPage(html)
products = page.extract_all('div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

Option 2: Universal Extraction API

# RusticSoup Universal API - Function-based
import rusticsoup

products = rusticsoup.extract_data(html, 'div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})

Far less code, roughly 2-12x faster per page in the benchmarks above, and attribute extraction is handled by the `@attr` syntax.

web-poet to RusticSoup

RusticSoup's WebPage API covers the same extraction patterns as web-poet:

# web-poet (css() returns a parsel SelectorList; get()/getall() are synchronous)
from web_poet import WebPage

def parse(page: WebPage):
    title = page.css("h1::text").get()
    links = page.css("a::attr(href)").getall()
    return {"title": title, "links": links}

# RusticSoup WebPage
from rusticsoup import WebPage

def parse(html: str):
    page = WebPage(html)
    title = page.text("h1")
    links = page.attr_all("a", "href")
    return {"title": title, "links": links}

🔧 Installation

From PyPI (Recommended)

pip install rusticsoup

From Source

# Requires the Rust toolchain and maturin (pip install maturin)
git clone https://github.com/yourusername/rusticsoup
cd rusticsoup
maturin develop --release

System Requirements

Python 3.11+ (the published 0.2.26 wheels are tagged CPython 3.9+ abi3)
No additional dependencies (self-contained)

📈 Use Cases

Perfect for:

Web scraping - Extract data from any website
Data mining - Process large amounts of HTML
Price monitoring - Track e-commerce prices
Content aggregation - Collect articles, posts, listings
SEO analysis - Extract meta tags, titles, links
API alternatives - Scrape when no API exists

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

Fork the repository
Create your feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built on html5ever - the Servo project's HTML5 parser
Powered by scraper - CSS selector support
Inspired by BeautifulSoup - the original HTML parsing library
WebPage API inspired by web-poet - declarative web scraping

Made with 🦀 and ❤️ - RusticSoup: Where Rust meets HTML parsing perfection

Release history

0.5.0

Dec 9, 2025

0.4.0

Dec 2, 2025

0.3.0

Nov 8, 2025

0.2.27

Nov 8, 2025

This version

0.2.26

Nov 8, 2025

0.2.25

Nov 8, 2025

0.2.2

Nov 7, 2025

0.2.1

Nov 7, 2025

0.2.0

Nov 7, 2025

0.1.0

Oct 29, 2025

Download files

Source Distribution

rusticsoup-0.2.26.tar.gz (4.3 MB)

Uploaded Nov 8, 2025 Source

Built Distributions

rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)

Uploaded Nov 8, 2025 CPython 3.9+, manylinux: glibc 2.17+ x86-64

rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB)

Uploaded Nov 8, 2025 CPython 3.9+, manylinux: glibc 2.17+ ARM64

rusticsoup-0.2.26-cp39-abi3-macosx_11_0_arm64.whl (989.5 kB)

Uploaded Nov 8, 2025 CPython 3.9+, macOS 11.0+ ARM64

File details

Details for the file rusticsoup-0.2.26.tar.gz.

File metadata

Download URL: rusticsoup-0.2.26.tar.gz
Upload date: Nov 8, 2025
Size: 4.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for rusticsoup-0.2.26.tar.gz
Algorithm	Hash digest
SHA256	`a55c6c86f41165debb256919a3ee50d8a010ccef41af7858a0cc92f9d8cc353d`
MD5	`1e32b157f2936b2977cdb3129a4c5f3b`
BLAKE2b-256	`d37a31b931d3bfad279155c5ba111b90f3bed4c7c94ded329fc6f691b2101d94`

File details

Details for the file rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Nov 8, 2025
Size: 1.1 MB
Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`65711043f2b2dfd903a7516008123988db8644b560c9b6e70a9ea5e54fc62db9`
MD5	`56ca384c4dde3ca95b651dc2cd883de4`
BLAKE2b-256	`e46649829d27352cefea8bdcd6eb1dfe44cf594cf84a1906b9e3bb7cb4c4eb04`

File details

Details for the file rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Nov 8, 2025
Size: 1.1 MB
Tags: CPython 3.9+, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for rusticsoup-0.2.26-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`7232f78f3c36952c43c119d17109bcf8413d4f8231ce88a99dcdd39051bebcb9`
MD5	`85acaf9b14ea3988749813e20cd343b7`
BLAKE2b-256	`bcb7f8dcd6116173282088a1d6e14a5ef4784f71c2841d95aabd29218c38ef04`

File details

Details for the file rusticsoup-0.2.26-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: rusticsoup-0.2.26-cp39-abi3-macosx_11_0_arm64.whl
Upload date: Nov 8, 2025
Size: 989.5 kB
Tags: CPython 3.9+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for rusticsoup-0.2.26-cp39-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`6b0a53ec97fe1a234d0be5775c261395c617a14129adb349641327f20c74b2be`
MD5	`c6c40e2098a757a636ba668588a9abd9`
BLAKE2b-256	`575935d07bf9ad8c053278bfcbfbb7718bd946eb3855cfeb4fb70ceb60e849bb`
