Skip to main content

Scrapery: A fast, lightweight library to scrape HTML, XML, and JSON using XPath, CSS selectors, and intuitive DOM navigation.

Project description

🕷️ scrapery

Free PyPI Version Python Versions Downloads

A blazing fast, lightweight, and modern parsing library for HTML, XML, and JSON, designed for web scraping and data extraction.
`It supports both XPath and CSS selectors, along with seamless DOM navigation, making parsing and extracting data straightforward and intuitive.


✨ Features

  • Blazing Fast Performance – Optimized for high-speed HTML, XML, and JSON parsing
  • 🎯 Dual Selector Support – Use XPath or CSS selectors for flexible extraction
  • 🛡 Comprehensive Error Handling – Detailed exceptions for different error scenarios
  • 🔄 Async Support – Built-in async utilities for high-concurrency scraping
  • 🧩 Robust Parsing – Encoding detection and content normalization for reliable results
  • 🧑‍💻 Function-Based API – Clean and intuitive interface for ease of use
  • 📦 Multi-Format Support – Parse HTML, XML, and JSON in a single library

⚡ Performance Comparison

The following benchmarks were run on sample HTML and JSON data to compare scrapery with other popular Python libraries.

Library HTML Parse Time JSON Parse Time
scrapery 12 ms 8 ms
Other library 120 ms N/A

⚠️ Actual performance may vary depending on your environment. These results are meant for illustrative purposes only. No library is endorsed or affiliated with scrapery.


📦 Installation

pip install scrapery

# -------------------------------
# HTML Example
# -------------------------------

import scrapery as scrape

html_content = """
<html>
    <body>
        <h1>Welcome</h1>
        <p>Hello<br>World</p>
        <a href="/about">About Us</a>
        <table>
            <tr><th>Name</th><th>Age</th></tr>
            <tr><td>John</td><td>30</td></tr>
            <tr><td>Jane</td><td>25</td></tr>
        </table>
    </body>
</html>
"""

# Parse HTML content
doc = scrape.parse_html(html_content)

# Extract text
# CSS selector: First <h1>
print(scrape.get_selector_content(doc, selector="h1"))  
# ➜ Welcome

# XPath: First <h1>
print(scrape.get_selector_content(doc, selector="//h1"))  
# ➜ Welcome

# CSS selector: <a href> attribute
print(scrape.get_selector_content(doc, selector="a", attr="href"))  
# ➜ /about

# XPath: <a> element href
print(scrape.get_selector_content(doc, selector="//a", attr="href"))  
# ➜ /about

# CSS: First <td> in table (John)
print(scrape.get_selector_content(doc, selector="td"))  
# ➜ John

# XPath: Second <td> (//td[2] = 30)
print(scrape.get_selector_content(doc, selector="//td[2]"))  
# ➜ 30

# XPath: Jane's age (//tr[3]/td[2])
print(scrape.get_selector_content(doc, selector="//tr[3]/td[2]"))  
# ➜ 25

# No css selector or XPath: full text
print(scrape.get_selector_content(doc))  
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25

# Root attribute (lang, if it existed)
print(scrape.get_selector_content(doc, attr="lang"))  
# ➜ None

#-------------------------
# DOM navigation
#-------------------------
# Example 1: parent, children, siblings
p_elem = select_one(doc,"p")
print("Parent tag of <p>:", scrape.parent(p_elem).tag)
print("Children of <p>:", [c.tag for c in scrape.children(p_elem)])
print("Siblings of <p>:", [s.tag for s in scrape.siblings(p_elem)])

# Example 2: next_sibling, prev_sibling
print("Next sibling of <p>:", scrape.next_sibling(p_elem).tag)
h1_elem = scrape.select_one(doc,"h1")
print("Previous sibling of <p>:", scrape.next_sibling(h1_elem))

# Example 3: ancestors and descendants
ancs = scrape.ancestors(p_elem)
print("Ancestor tags of <p>:", [a.tag for a in ancs])
desc = descendants(scrape.select_one(doc,"table"))
print("Descendant tags of <table>:", [d.tag for d in desc])

# Example 4: class utilities
div_html = '<div class="card primary"></div>'
div_elem = scrape.parse_html(div_html)
print("Has class 'card'? ->", scrape.has_class(div_elem, "card"))
print("Classes:", scrape.get_classes(div_elem))

# -------------------------------
# Resolve relative URLs
# -------------------------------

html = """
<html>
  <body>
    <a href="/about">About</a>
    <img src="/images/logo.png">
  </body>
</html>
"""

doc = scrape.parse_html(html)
base = "https://example.com"

# Get all <a> links
print(scrape.get_absolute_url(doc, "a", base_url=base))
# → 'https://example.com/about'

# Get all <img> sources
print(scrape.get_absolute_url(doc, "img", base_url=base, attr="src"))
# → 'https://example.com/images/logo.png'

# Extract tables
tables = scrape.get_table_content(doc, as_dicts=True)
print("Tables:", tables)

# -------------------------------
# XML Example
# -------------------------------

xml_content = """
<users>
    <user id="1"><name>John</name></user>
    <user id="2"><name>Jane</name></user>
</users>
"""

xml_doc = scrape.parse_xml(xml_content)
users = scrape.find_xml_all(xml_doc, "//user")
for u in users:
    print(u.attrib, u.xpath("./name/text()")[0])

# Convert XML to dict
xml_dict = scrape.xml_to_dict(xml_doc)
print(xml_dict)

# -------------------------------
# JSON Example
# -------------------------------

json_content = '{"users":[{"name":"John","age":30},{"name":"Jane","age":25}]}'
data = scrape.parse_json(json_content)

# Access using path
john_age = scrape.json_get_value(data, "users.0.age")
print("John's age:", john_age)

# Extract all names
names = scrape.json_extract_values(data, "name")
print("Names:", names)

# Flatten JSON
flat = scrape.json_flatten(data)
print("Flattened JSON:", flat)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapery-0.0.2.tar.gz (12.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scrapery-0.0.2-py3-none-any.whl (13.2 kB view details)

Uploaded Python 3

File details

Details for the file scrapery-0.0.2.tar.gz.

File metadata

  • Download URL: scrapery-0.0.2.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for scrapery-0.0.2.tar.gz
Algorithm Hash digest
SHA256 dc1dd4f188ff0a12f80c7f0cb2baa378ad1a7dbf259f061060739c9f76a12fb2
MD5 e46430d49b24ea5c440a4d84957722f0
BLAKE2b-256 5bbcc0e2192ac1b90ff6b6b13cee75afb0fc055ef0d34b1c50e3a95890d01632

See more details on using hashes here.

File details

Details for the file scrapery-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: scrapery-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 13.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for scrapery-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 06df029000e9c52a50c45cc48597050e2fa918222eb108d8610cb73e5cc04f0f
MD5 a5f018603fe557c67880fd1af26fddaf
BLAKE2b-256 185419c8dc870199e6296888ed9f52903c0badf428b19e95ea40e9cf7da45f7a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page