Scrapery: A fast, lightweight library to scrape HTML, XML, and JSON using XPath, CSS selectors, and intuitive DOM navigation.
Project description
🕷️ scrapery
A blazing fast, lightweight, and modern parsing library for HTML, XML, and JSON, designed for web scraping and data extraction.
It supports both XPath and CSS selectors, along with seamless DOM navigation, making parsing and extracting data straightforward and intuitive.
✨ Features
- ⚡ Blazing Fast Performance – Optimized for high-speed HTML, XML, and JSON parsing
- 🎯 Dual Selector Support – Use XPath or CSS selectors for flexible extraction
- 🛡 Comprehensive Error Handling – Detailed exceptions for different error scenarios
- 🔄 Async Support – Built-in async utilities for high-concurrency scraping
- 🧩 Robust Parsing – Encoding detection and content normalization for reliable results
- 🧑‍💻 Function-Based API – Clean and intuitive interface for ease of use
- 📦 Multi-Format Support – Parse HTML, XML, and JSON in a single library
⚡ Performance Comparison
The following benchmarks were run on sample HTML and JSON data to compare scrapery with other popular Python libraries.
| Library | HTML Parse Time | JSON Parse Time |
|---|---|---|
| scrapery | 12 ms | 8 ms |
| Other library | 120 ms | N/A |
⚠️ Actual performance may vary depending on your environment; these results are illustrative only. None of the libraries compared are endorsed by or affiliated with scrapery.
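Since results vary by environment, it is easy to measure parse times yourself. The sketch below uses only the standard library's `timeit` and `json` modules as a baseline harness; the same pattern can wrap any parser call you want to compare.

```python
import json
import timeit

payload = '{"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}'

# Time 10,000 parses of a small JSON document and report the per-parse cost.
total = timeit.timeit(lambda: json.loads(payload), number=10_000)
per_parse_ms = total / 10_000 * 1000
print(f"stdlib json.loads: {per_parse_ms:.4f} ms per parse")
```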
📦 Installation
pip install scrapery
# -------------------------------
# HTML Example
# -------------------------------
import scrapery as scrape
html_content = """
<html>
<body>
<h1>Welcome</h1>
<p>Hello<br>World</p>
<a href="/about">About Us</a>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>30</td></tr>
<tr><td>Jane</td><td>25</td></tr>
</table>
</body>
</html>
"""
# Parse HTML content
doc = scrape.parse_html(html_content)
# Extract text
# CSS selector: First <h1>
print(scrape.get_selector_content(doc, selector="h1"))
# ➜ Welcome
# XPath: First <h1>
print(scrape.get_selector_content(doc, selector="//h1"))
# ➜ Welcome
# CSS selector: <a href> attribute
print(scrape.get_selector_content(doc, selector="a", attr="href"))
# ➜ /about
# XPath: <a> element href
print(scrape.get_selector_content(doc, selector="//a", attr="href"))
# ➜ /about
# CSS: First <td> in table (John)
print(scrape.get_selector_content(doc, selector="td"))
# ➜ John
# XPath: Second <td> (//td[2] = 30)
print(scrape.get_selector_content(doc, selector="//td[2]"))
# ➜ 30
# XPath: Jane's age (//tr[3]/td[2])
print(scrape.get_selector_content(doc, selector="//tr[3]/td[2]"))
# ➜ 25
# No CSS selector or XPath: full document text
print(scrape.get_selector_content(doc))
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25
# Root element attribute (lang is not set here, so None is returned)
print(scrape.get_selector_content(doc, attr="lang"))
# ➜ None
#-------------------------
# DOM navigation
#-------------------------
# Example 1: parent, children, siblings
p_elem = scrape.select_one(doc, "p")
print("Parent tag of <p>:", scrape.parent(p_elem).tag)
print("Children of <p>:", [c.tag for c in scrape.children(p_elem)])
print("Siblings of <p>:", [s.tag for s in scrape.siblings(p_elem)])
# Example 2: next_sibling, prev_sibling
print("Next sibling of <p>:", scrape.next_sibling(p_elem).tag)
print("Previous sibling of <p>:", scrape.prev_sibling(p_elem).tag)
# Example 3: ancestors and descendants
ancs = scrape.ancestors(p_elem)
print("Ancestor tags of <p>:", [a.tag for a in ancs])
desc = scrape.descendants(scrape.select_one(doc, "table"))
print("Descendant tags of <table>:", [d.tag for d in desc])
# Example 4: class utilities
div_html = '<div class="card primary"></div>'
div_elem = scrape.parse_html(div_html)
print("Has class 'card'? ->", scrape.has_class(div_elem, "card"))
print("Classes:", scrape.get_classes(div_elem))
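The class utilities above follow the HTML convention that the `class` attribute is a whitespace-separated token list, so membership checks reduce to a split. A plain-Python illustration of that rule (not scrapery's internals, which may differ):

```python
# HTML's class attribute is a whitespace-separated token list,
# so a class check is simply membership in the split tokens.
class_attr = "card primary"
print("card" in class_attr.split())  # → True
print(class_attr.split())            # → ['card', 'primary']
```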
# -------------------------------
# Resolve relative URLs
# -------------------------------
html = """
<html>
<body>
<a href="/about">About</a>
<img src="/images/logo.png">
</body>
</html>
"""
doc = scrape.parse_html(html)
base = "https://example.com"
# Resolve the <a> href against the base URL
print(scrape.get_absolute_url(doc, "a", base_url=base))
# → 'https://example.com/about'
# Resolve the <img> src against the base URL
print(scrape.get_absolute_url(doc, "img", base_url=base, attr="src"))
# → 'https://example.com/images/logo.png'
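Relative-URL resolution follows the standard rules implemented by the standard library's `urllib.parse.urljoin`; the two calls above can be reproduced directly with it:

```python
from urllib.parse import urljoin

base = "https://example.com"
# Join relative paths against the base, per RFC 3986 resolution rules.
print(urljoin(base, "/about"))            # → https://example.com/about
print(urljoin(base, "/images/logo.png"))  # → https://example.com/images/logo.png
```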
# Extract tables (re-parse the first HTML document, which contains a <table>)
doc = scrape.parse_html(html_content)
tables = scrape.get_table_content(doc, as_dicts=True)
print("Tables:", tables)
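The `as_dicts=True` form presumably pairs header cells with each row's cells; the transformation itself is easy to sketch in plain Python (row data taken from the sample table above; scrapery's exact output shape may differ):

```python
# Header row followed by data rows, as found in the sample table.
header = ["Name", "Age"]
rows = [["John", "30"], ["Jane", "25"]]

# Zip each data row against the header to build one dict per row.
records = [dict(zip(header, row)) for row in rows]
print(records)
# → [{'Name': 'John', 'Age': '30'}, {'Name': 'Jane', 'Age': '25'}]
```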
# -------------------------------
# XML Example
# -------------------------------
xml_content = """
<users>
<user id="1"><name>John</name></user>
<user id="2"><name>Jane</name></user>
</users>
"""
xml_doc = scrape.parse_xml(xml_content)
users = scrape.find_xml_all(xml_doc, "//user")
for u in users:
    print(u.attrib, u.xpath("./name/text()")[0])
# Convert XML to dict
xml_dict = scrape.xml_to_dict(xml_doc)
print(xml_dict)
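For comparison, the XML-to-dict idea can be sketched with the standard library's `xml.etree.ElementTree`. This is a simplified recursion of my own, not scrapery's implementation, so the output shape may differ:

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an Element into a plain dict (simplified sketch)."""
    node = dict(elem.attrib)                 # start with the element's attributes
    children = list(elem)
    if children:
        for child in children:               # group repeated child tags into lists
            node.setdefault(child.tag, []).append(element_to_dict(child))
    elif elem.text and elem.text.strip():
        node["text"] = elem.text.strip()     # leaf element: keep its text
    return node

root = ET.fromstring("""
<users>
  <user id="1"><name>John</name></user>
  <user id="2"><name>Jane</name></user>
</users>
""")
print(element_to_dict(root))
```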
# -------------------------------
# JSON Example
# -------------------------------
json_content = '{"users":[{"name":"John","age":30},{"name":"Jane","age":25}]}'
data = scrape.parse_json(json_content)
# Access using path
john_age = scrape.json_get_value(data, "users.0.age")
print("John's age:", john_age)
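A dotted path such as `"users.0.age"` can be resolved by walking the parsed structure one segment at a time, treating numeric segments as list indices. A minimal sketch of that idea in plain Python (a hypothetical `get_path` helper, not scrapery's implementation):

```python
import json

def get_path(data, path):
    """Walk a dotted path, using numeric segments as list indices."""
    current = data
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]   # list: segment is an index
        else:
            current = current[segment]        # dict: segment is a key
    return current

data = json.loads('{"users":[{"name":"John","age":30},{"name":"Jane","age":25}]}')
print(get_path(data, "users.0.age"))
# → 30
```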
# Extract all names
names = scrape.json_extract_values(data, "name")
print("Names:", names)
# Flatten JSON
flat = scrape.json_flatten(data)
print("Flattened JSON:", flat)
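Flattening typically maps each leaf value to its dotted path. A compact stdlib sketch of the technique (scrapery's exact key format may differ):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted_path: leaf_value}."""
    flat = {}
    # Dicts iterate over key/value pairs; lists over index/value pairs.
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))  # recurse into containers
        else:
            flat[path] = value                 # record the leaf
    return flat

data = {"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}
print(flatten(data))
# → {'users.0.name': 'John', 'users.0.age': 30, 'users.1.name': 'Jane', 'users.1.age': 25}
```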
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file scrapery-0.0.2.tar.gz.
File metadata
- Download URL: scrapery-0.0.2.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc1dd4f188ff0a12f80c7f0cb2baa378ad1a7dbf259f061060739c9f76a12fb2 |
| MD5 | e46430d49b24ea5c440a4d84957722f0 |
| BLAKE2b-256 | 5bbcc0e2192ac1b90ff6b6b13cee75afb0fc055ef0d34b1c50e3a95890d01632 |
File details
Details for the file scrapery-0.0.2-py3-none-any.whl.
File metadata
- Download URL: scrapery-0.0.2-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06df029000e9c52a50c45cc48597050e2fa918222eb108d8610cb73e5cc04f0f |
| MD5 | a5f018603fe557c67880fd1af26fddaf |
| BLAKE2b-256 | 185419c8dc870199e6296888ed9f52903c0badf428b19e95ea40e9cf7da45f7a |