RusticSoup 🦀🍲
Lightning-fast HTML parser and data extractor with a WebPage API - a BeautifulSoup alternative built in Rust
🚀 Why RusticSoup?
| Feature | BeautifulSoup | RusticSoup | Speedup |
|---|---|---|---|
| Google Shopping | 8.1ms | 3.9ms | 2.1x faster |
| Product grids | 14ms | 1.2ms | 12x faster |
| Bulk processing | Sequential | Parallel | Up to 100x faster |
| Attribute extraction | Manual loops | @href syntax | Zero loops needed |
| WebPage API | ❌ | ✅ | web-poet inspired |
| CSS selectors | ✅ | ✅ | Same API |
| Memory usage | High | Low | Rust efficiency |
⚡ Quick Start
```shell
pip install rusticsoup
```
Option 1: WebPage API (Recommended - web-poet style)
```python
from rusticsoup import WebPage

html = """
<div class="product">
    <h2>Amazing Product</h2>
    <span class="price">$29.99</span>
    <a href="/buy" class="buy-btn">Buy Now</a>
    <img src="/image.jpg" alt="product">
</div>
"""

# Create a WebPage
page = WebPage(html, url="https://example.com/products")

# Extract single values
title = page.text("h2")                # "Amazing Product"
price = page.text("span.price")        # "$29.99"
link = page.attr("a.buy-btn", "href")  # "/buy"

# Or extract structured data
product = page.extract({
    "title": "h2",
    "price": "span.price",
    "link": "a.buy-btn@href",  # @ syntax for attributes
    "image": "img@src"
})
# {'title': 'Amazing Product', 'price': '$29.99', 'link': '/buy', 'image': '/image.jpg'}
```
Option 2: Universal Extraction (Original API)
```python
import rusticsoup

# Define what you want to extract
field_mappings = {
    "title": "h2",             # Text content
    "price": "span.price",     # Text content
    "link": "a.buy-btn@href",  # Attribute extraction with @
    "image": "img@src"         # Any attribute: @src, @href, @alt, etc.
}

# Extract data - no manual loops, no site-specific logic
products = rusticsoup.extract_data(html, "div.product", field_mappings)
print(products)
# [{"title": "Amazing Product", "price": "$29.99", "link": "/buy", "image": "/image.jpg"}]
```
🎯 Core Features
🌟 NEW: WebPage API (web-poet inspired)
High-level, declarative API for web scraping:
```python
from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")

# Simple extraction
title = page.text("h1")
links = page.attr_all("a", "href")

# Extract multiple items at once
products = page.extract_all(".product", {
    "name": "h2",
    "price": ".price",
    "url": "a@href"
})

# Check existence
if page.has("nav.menu"):
    nav_items = page.text_all("nav.menu a")

# URL resolution
absolute_url = page.absolute_url("/products/123")
```
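URL resolution combines a relative URL with the page's URL. Assuming standard RFC 3986 resolution semantics, the stdlib's `urljoin` sketches the behavior you can expect from `absolute_url`:

```python
from urllib.parse import urljoin

# Stand-in for page.absolute_url() on a page at this URL
base = "https://example.com/catalog/index.html"

print(urljoin(base, "/products/123"))  # https://example.com/products/123
print(urljoin(base, "item?id=7"))      # https://example.com/catalog/item?id=7
```

Root-relative paths replace everything after the host, while bare relative paths resolve against the page's directory.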
📖 Full WebPage API Documentation | 🚀 Quick Start Guide
✅ Universal Extraction
Works with any HTML structure - no site-specific parsers needed:
```python
# Google Shopping
rusticsoup.extract_data(html, 'tr[data-is-grid-offer="true"]', {
    'seller': 'a.b5ycib',
    'price': 'span.g9WBQb',
    'link': 'a.UxuaJe@href'
})

# Amazon products
rusticsoup.extract_data(html, '[data-component-type="s-search-result"]', {
    'title': 'h2 a span',
    'price': '.a-price-whole',
    'rating': '.a-icon-alt',
    'url': 'h2 a@href'
})

# Any website
rusticsoup.extract_data(html, 'your-container-selector', {
    'any_field': 'any.css.selector',
    'any_attribute': 'element@attribute_name'
})
```
✅ Bulk Processing
Process multiple pages in parallel:
```python
# Process 100 pages simultaneously
pages = [html1, html2, html3, ...]  # List of HTML strings
results = rusticsoup.extract_data_bulk(pages, "div.product", field_mappings)
# Each page is processed in parallel using Rust's Rayon
# 10-100x faster than sequential processing
```
✅ Attribute Extraction
No more manual loops to collect href, src, and other attributes:
```python
# Before (BeautifulSoup)
links = []
for element in soup.select('a'):
    if element.get('href'):
        links.append(element['href'])

# After (RusticSoup)
data = rusticsoup.extract_data(html, 'div', {'links': 'a@href'})
```
✅ Browser-Grade Parsing
Built on html5ever - the HTML parser from Mozilla's Servo project:
- Handles malformed HTML gracefully
- WHATWG HTML5 compliant
- Native-code speed
- Memory safe (Rust)
📊 Performance Benchmarks
Real-world scraping performance vs BeautifulSoup:
```
# Google Shopping: 30 ads per page
BeautifulSoup: 8.1ms per page
RusticSoup:    3.9ms per page (2.1x faster)

# Product grids: 50 products per page
BeautifulSoup: 14ms per page
RusticSoup:    1.2ms per page (12x faster)

# Bulk processing: 100 pages
BeautifulSoup: sequential, ~1.4s
RusticSoup:    parallel, ~14ms (100x faster)
```
🛠️ API Reference
Two Powerful APIs
RusticSoup provides two complementary APIs:
- WebPage API - High-level, object-oriented (Recommended for new projects)
- Universal Extraction API - Function-based, great for batch processing
WebPage API
```python
from rusticsoup import WebPage

page = WebPage(html, url="https://example.com")
```
Key methods:
- `text(selector)` - Extract text from the first match
- `text_all(selector)` - Extract text from all matches
- `attr(selector, attribute)` - Extract an attribute from the first match
- `attr_all(selector, attribute)` - Extract an attribute from all matches
- `extract(mappings)` - Extract structured data
- `extract_all(container, mappings)` - Extract multiple items
- `has(selector)` - Check whether the selector matches
- `count(selector)` - Count matching elements
- `absolute_url(url)` - Convert a relative URL to an absolute one
🔄 Field Transforms (NEW in v0.2.2)
Apply transformations to extracted data automatically:
```python
from rusticsoup import WebPage, Field
from rusticsoup_helpers import ItemPage

class Article(ItemPage):
    # Single transform
    title = Field(css="h1", transform=str.upper)

    # Chain multiple transforms (applied in list order:
    # strip whitespace, drop the "by " prefix, then title-case)
    author = Field(
        css=".author",
        transform=[
            str.strip,
            lambda s: s.replace("by ", ""),
            str.title
        ]
    )

    # Transform to a typed value
    price = Field(
        css=".price",
        transform=[
            str.strip,
            lambda s: float(s.replace("$", ""))
        ]
    )

    # Transform lists
    tags = Field(
        css=".tag",
        get_all=True,
        transform=lambda tags: [t.upper() for t in tags]
    )

page = WebPage(html)
article = Article(page)

print(article.title)   # "UNDERSTANDING RUST"
print(article.author)  # "Jane Smith"
print(article.price)   # 19.99
print(article.tags)    # ["PYTHON", "RUST", "WEB"]
```
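Conceptually, a transform chain is just function application in list order. A minimal pure-Python sketch of the idea (the `apply_transforms` helper is hypothetical, not part of the RusticSoup API):

```python
def apply_transforms(value, transform):
    """Apply a single callable, or a list of callables in order."""
    transforms = transform if isinstance(transform, list) else [transform]
    for fn in transforms:
        value = fn(value)
    return value

# " $19.99 " -> "$19.99" -> 19.99
price = apply_transforms(" $19.99 ", [str.strip, lambda s: float(s.replace("$", ""))])
print(price)  # 19.99
```

Because each step receives the previous step's output, order matters: cleaning (strip, replace) usually comes before type conversion.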
Benefits:
- ✅ No manual post-processing needed
- ✅ Clean, declarative field definitions
- ✅ Reusable transform functions
- ✅ Chain multiple transforms in order
- ✅ Works with single values, lists, and attributes
📖 Full Transform Documentation
Universal Extraction API
extract_data(html, container_selector, field_mappings)
Universal HTML data extraction - works with any website structure.
Parameters:
- `html`: HTML string to parse
- `container_selector`: CSS selector for container elements
- `field_mappings`: Dict mapping field names to CSS selectors
Returns: List of dictionaries with extracted data
extract_data_bulk(html_pages, container_selector, field_mappings)
Parallel processing of multiple HTML pages.
Parameters:
- `html_pages`: List of HTML strings
- `container_selector`: CSS selector for container elements
- `field_mappings`: Dict mapping field names to CSS selectors
Returns: List of lists - one result list per input page
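Because the bulk call returns one result list per input page, flattening into a single item list is a common follow-up. A sketch on illustrative data:

```python
# Shape of a bulk result: one list of item dicts per input page
per_page = [
    [{"title": "A"}, {"title": "B"}],  # items from page 1
    [{"title": "C"}],                  # items from page 2
]

# Flatten while preserving page order
all_items = [item for page_items in per_page for item in page_items]
print(all_items)  # [{'title': 'A'}, {'title': 'B'}, {'title': 'C'}]
```

The per-page nesting also makes it easy to keep track of which source page each item came from before flattening.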
parse_html(html)
Low-level HTML parsing - returns WebScraper object for manual DOM traversal.
Parameters:
- `html`: HTML string to parse
Returns: WebScraper object with select(), text(), attr() methods
Selector Syntax
| Syntax | Description | Example |
|---|---|---|
| `"selector"` | Extract text content | `"h1"` → `"Page Title"` |
| `"selector@attr"` | Extract attribute | `"a@href"` → `"/page.html"` |
| `"selector@get_all"` | Extract all text | `"p@get_all"` → `["P1", "P2"]` |
| `"complex selector"` | Any CSS selector | `"div.class > p:first-child"` |
Supported Attributes
Any HTML attribute: @href, @src, @alt, @class, @id, @data-*, etc.
🏗️ Advanced Usage
Custom Processing
```python
# Extract data, then post-process
ads = rusticsoup.extract_data(html, "tr.ad", {
    "price": "span.price",
    "link": "a@href"
})

# Post-process the results
for ad in ads:
    # Clean price: "$29.99" → 29.99
    ad["price"] = float(ad["price"].replace("$", ""))
    # Convert relative URLs to absolute
    if ad["link"].startswith("/"):
        ad["link"] = f"https://example.com{ad['link']}"
```
Table Extraction
```python
# Extract HTML tables easily
table_data = rusticsoup.extract_table_data(html, "table.data")
# Returns: [["Header1", "Header2"], ["Row1Col1", "Row1Col2"], ...]
```
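Since the first row is typically the header, the returned row lists convert naturally into dicts with `zip`. A small sketch on illustrative data:

```python
# Rows in the shape returned by a table extraction: header first, then data rows
rows = [["Name", "Price"], ["Widget", "$5.00"], ["Gadget", "$9.50"]]

header, *body = rows
records = [dict(zip(header, row)) for row in body]
print(records[0])  # {'Name': 'Widget', 'Price': '$5.00'}
```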
Error Handling
```python
try:
    data = rusticsoup.extract_data(html, "div.product", field_mappings)
except Exception as e:
    print(f"Parsing error: {e}")
    data = []
```
🆚 Migration from BeautifulSoup
Option 1: WebPage API (Recommended)
```python
# BeautifulSoup - imperative, verbose
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
products = []
for product in soup.select('div.product'):
    title = product.select_one('h2')
    price = product.select_one('span.price')
    link = product.select_one('a')
    products.append({
        'title': title.text if title else '',
        'price': price.text if price else '',
        'link': link.get('href') if link else ''
    })

# RusticSoup WebPage - declarative, concise
from rusticsoup import WebPage

page = WebPage(html)
products = page.extract_all('div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})
```
Option 2: Universal Extraction API
```python
# RusticSoup Universal API - function-based
import rusticsoup

products = rusticsoup.extract_data(html, 'div.product', {
    'title': 'h2',
    'price': 'span.price',
    'link': 'a@href'
})
```
90% less code, 2-10x faster, handles attributes automatically!
web-poet to RusticSoup
RusticSoup's WebPage API is compatible with web-poet patterns:
```python
# web-poet (typically used inside an async framework)
from web_poet import WebPage

async def parse(page: WebPage):
    title = page.css("h1::text").get()
    links = page.css("a::attr(href)").getall()
    return {"title": title, "links": links}

# RusticSoup WebPage (sync - no async needed!)
from rusticsoup import WebPage

def parse(html: str):
    page = WebPage(html)
    title = page.text("h1")
    links = page.attr_all("a", "href")
    return {"title": title, "links": links}
```
🔧 Installation
From PyPI (Recommended)
```shell
pip install rusticsoup
```
From Source
```shell
# Requires the Rust toolchain and maturin
git clone https://github.com/yourusername/rusticsoup
cd rusticsoup
maturin develop --release
```
System Requirements
- Python 3.11+
- No additional dependencies (self-contained)
📈 Use Cases
Perfect for:
- Web scraping - Extract data from any website
- Data mining - Process large amounts of HTML
- Price monitoring - Track e-commerce prices
- Content aggregation - Collect articles, posts, listings
- SEO analysis - Extract meta tags, titles, links
- API alternatives - Scrape when no API exists
🤝 Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
- Fork the repository
- Create your feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- Built on html5ever - Mozilla's HTML5 parser
- Powered by scraper - CSS selector support
- Inspired by BeautifulSoup - the original HTML parsing library
- WebPage API inspired by web-poet - declarative web scraping
Made with 🦀 and ❤️ - RusticSoup: Where Rust meets HTML parsing perfection