Unified data extraction — CSS, XPath, Regex, and JMESPath behind one query interface.

ChadSelect

One query. Any format. Every selector.

Unified data extraction — CSS Selectors, XPath 1.0, Regex, and JMESPath behind one query interface. Load your content, prefix your query, get results. Never raises.

from chadselect import ChadSelect

cs = ChadSelect()
cs.add_html('<span class="price">$49.99</span>')

price = cs.select(0, "css:.price")
assert price == "$49.99"

Install

pip install chadselect

Query Syntax

Every query uses an engine:expression prefix. No prefix defaults to regex.

| Prefix | Engine | Content Types | Backed By |
| --- | --- | --- | --- |
| css: | CSS Selectors | HTML | selectolax (lexbor) |
| xpath: | XPath 1.0 | HTML, Text | lxml (libxml2) |
| regex: | Regular Expressions | All | re (stdlib) |
| json: | JMESPath | JSON | jmespath |
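
The prefix rule can be sketched in a few lines of plain Python. This is an illustration of the routing described above, not ChadSelect's actual implementation; `split_prefix` and `ENGINES` are hypothetical names, and the sketch assumes that only the four documented prefixes are recognized, so a regex that happens to contain a colon still falls through to the regex engine.

```python
# Hypothetical sketch of engine:expression routing (not the library's code).
ENGINES = ("css", "xpath", "regex", "json")


def split_prefix(query):
    """Split a query into (engine, expression); unprefixed queries are regex."""
    prefix, sep, expression = query.partition(":")
    if sep and prefix in ENGINES:
        return prefix, expression
    # No recognized prefix: treat the whole string as a regular expression.
    return "regex", query


print(split_prefix("css:.price"))
print(split_prefix(r"a: (\d+)"))
```

Checking the prefix against a fixed set, rather than splitting on the first colon unconditionally, is what lets an unprefixed pattern such as `a: (\d+)` stay intact.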

The index Parameter

Every query method takes an index argument that controls which match to return:

| Value | Behavior |
| --- | --- |
| -1 | Return all matches across every loaded document |
| 0 | Return only the first match |
| N | Return only the Nth match (0-based) |

cs = ChadSelect()
cs.add_html("<ul><li>A</li><li>B</li><li>C</li></ul>")

all_items = cs.query(-1, "css:li")  # ["A", "B", "C"]
first     = cs.query(0,  "css:li")  # ["A"]
third     = cs.query(2,  "css:li")  # ["C"]
oob       = cs.query(99, "css:li")  # []  (out of bounds — never raises)

# select() wraps query() — returns a single str
s = cs.select(0, "css:li")           # "A"
s = cs.select(-1, "css:li")          # "A" (first of all matches)

When multiple documents are loaded, -1 aggregates results from all compatible documents before indexing.
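
The index rules can be modeled on plain lists. The sketch below is an assumption-laden illustration, not the library's code: `aggregate_then_index` and `select_from` are hypothetical helpers, and it assumes that non-negative indexes are applied to the same aggregated list that -1 returns.

```python
# Hypothetical model of the index parameter (not ChadSelect's implementation).
def aggregate_then_index(per_document_matches, index):
    """Flatten matches from every document, then apply the index."""
    matches = [m for doc_matches in per_document_matches for m in doc_matches]
    if index == -1:
        return matches                 # all matches
    if 0 <= index < len(matches):
        return [matches[index]]        # only the Nth match (0-based)
    return []                          # out of bounds: empty, no exception


def select_from(per_document_matches, index):
    """Single-string variant: first result, or "" when nothing matched."""
    results = aggregate_then_index(per_document_matches, index)
    return results[0] if results else ""


docs = [["A", "B"], ["C"]]             # matches found in two documents
print(aggregate_then_index(docs, -1))
print(aggregate_then_index(docs, 99))
print(select_from(docs, -1))
```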


Content Management

Load one or more documents. Each document is tagged by type and only queried by compatible engines.

from chadselect import ChadSelect

cs = ChadSelect()

# HTML — compatible with css:, xpath:, regex:
cs.add_html("""
<html>
  <body>
    <h1 class="title">2024 Honda Civic</h1>
    <span class="price">$28,500</span>
    <div class="details">
      <div class="item"><span class="label">VIN:</span> 1HGFE2F59PA000001</div>
      <div class="item"><span class="label">Exterior:</span> Blue Metallic</div>
      <div class="item"><span class="label">Interior:</span> Black Leather</div>
      <div class="item"><span class="label">Mileage:</span> 12,345 mi</div>
    </div>
    <a class="dealer-link" href="https://example.com/dealer/42">View Dealer</a>
  </body>
</html>
""")

# JSON — compatible with json:, regex:
cs.add_json("""{
  "inventory": [
    {"id": 1, "name": "Civic",   "price": 28500, "tags": ["sedan", "honda"]},
    {"id": 2, "name": "Accord",  "price": 34000, "tags": ["sedan", "honda"]},
    {"id": 3, "name": "CR-V",    "price": 32500, "tags": ["suv",   "honda"]}
  ],
  "dealer": {"name": "Metro Honda", "rating": 4.8}
}""")

# Plain text — compatible with regex:, xpath:
cs.add_text("Order #12345 confirmed. Total: $99.50")

assert cs.content_count() == 3

cs.clear()  # remove all content
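
The compatibility rules in the comments above amount to a small lookup table. The following sketch restates them in plain Python; `COMPATIBLE_ENGINES` and `documents_for` are hypothetical names used for illustration and say nothing about how ChadSelect stores its documents internally.

```python
# Hypothetical lookup of which engines may query which content type.
COMPATIBLE_ENGINES = {
    "html": {"css", "xpath", "regex"},
    "json": {"json", "regex"},
    "text": {"regex", "xpath"},
}


def documents_for(engine, documents):
    """Return the contents of the documents that the given engine can query."""
    return [
        content
        for kind, content in documents
        if engine in COMPATIBLE_ENGINES.get(kind, set())
    ]


documents = [
    ("html", "<h1>Civic</h1>"),
    ("json", '{"dealer": "Metro Honda"}'),
    ("text", "Order #12345 confirmed."),
]
print(documents_for("css", documents))    # only the HTML document
print(len(documents_for("regex", documents)))  # regex sees every document
```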

CSS Selectors

Standard CSS selectors, plus custom text pseudo-selectors for scraping.

cs = ChadSelect()
cs.add_html("""
<ul class="products">
  <li class="product" data-id="1"><span class="name">Widget</span><span class="price">$19.99</span></li>
  <li class="product" data-id="2"><span class="name">Gadget</span><span class="price">$49.99</span></li>
  <li class="product" data-id="3"><span class="name">Doohickey</span><span class="price">$9.99</span></li>
</ul>
""")

# Basic selectors
first_name = cs.select(0, "css:.product .name")
assert first_name == "Widget"

# All matches — index -1
all_prices = cs.query(-1, "css:.product .price")
assert all_prices == ["$19.99", "$49.99", "$9.99"]

# Nth match — index 2 (0-based)
third = cs.query(2, "css:.product .name")
assert third == ["Doohickey"]

# Attribute extraction via get-attr()
id_val = cs.select(0, "css:.product >> get-attr('data-id')")
assert id_val == "1"

Text Pseudo-Selectors

These work like Playwright's pseudo-selectors — match elements by text content.

| Pseudo-Selector | Behavior |
| --- | --- |
| :has-text('x') | Element or its descendants contain the text |
| :contains-text('x') | Element's own text contains the text |
| :text-equals('x') | Element's text exactly equals |
| :text-starts('x') | Element's text starts with |
| :text-ends('x') | Element's text ends with |

cs = ChadSelect()
cs.add_html("""
<div class="specs">
  <div class="row"><span class="label">Exterior</span><span class="value">Blue Metallic</span></div>
  <div class="row"><span class="label">Interior</span><span class="value">Black Leather</span></div>
  <div class="row"><span class="label">Engine</span><span class="value">2.0L Turbo</span></div>
</div>
""")

# :has-text — matches the .row whose subtree contains "Exterior"
color = cs.select(0, "css:.row:has-text('Exterior') .value")
assert color == "Blue Metallic"

# :text-equals — exact match on element text
engine_label = cs.select(0, "css:.label:text-equals('Engine')")
assert engine_label == "Engine"

# :text-starts — prefix match
starts_e = cs.select(0, "css:.label:text-starts('Ext')")
assert starts_e == "Exterior"

# :text-ends — suffix match
ends_or = cs.select(0, "css:.label:text-ends('ior')")
assert ends_or == "Exterior"

# Combine with function piping
upper_interior = cs.select(0, "css:.row:has-text('Interior') .value >> uppercase()")
assert upper_interior == "BLACK LEATHER"
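
The difference between :has-text and :contains-text is easiest to see on a parsed tree. The sketch below uses the standard library's ElementTree rather than ChadSelect, so it is only a model of the table above: `own_text`, `subtree_text`, and `PSEUDO` are hypothetical names, and treating the equals/starts/ends variants as comparisons on the element's own, stripped text is an assumption.

```python
import xml.etree.ElementTree as ET


def own_text(elem):
    """Text directly inside the element, excluding text of child elements."""
    return (elem.text or "") + "".join(child.tail or "" for child in elem)


def subtree_text(elem):
    """Text of the element and all of its descendants."""
    return "".join(elem.itertext())


# Hypothetical predicates modeling the five text pseudo-selectors.
PSEUDO = {
    "has-text": lambda elem, x: x in subtree_text(elem),
    "contains-text": lambda elem, x: x in own_text(elem),
    "text-equals": lambda elem, x: own_text(elem).strip() == x,
    "text-starts": lambda elem, x: own_text(elem).strip().startswith(x),
    "text-ends": lambda elem, x: own_text(elem).strip().endswith(x),
}

row = ET.fromstring(
    '<div class="row"><span class="label">Exterior</span>'
    '<span class="value">Blue Metallic</span></div>'
)
label = row[0]

# The row has no text of its own; "Exterior" lives in a child <span>.
print(PSEUDO["has-text"](row, "Exterior"))
print(PSEUDO["contains-text"](row, "Exterior"))
```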

XPath 1.0

Full XPath 1.0 support including axes, predicates, and XPath functions.

cs = ChadSelect()
cs.add_html("""
<html>
  <body>
    <h1 id="title">  2024 Honda Civic  </h1>
    <table class="specs">
      <tr><td>VIN</td><td>1HGFE2F59PA000001</td></tr>
      <tr><td>Price</td><td>$28,500</td></tr>
      <tr><td>Mileage</td><td>12,345 mi</td></tr>
    </table>
  </body>
</html>
""")

# text() extraction
title = cs.select(0, "xpath://h1[@id='title']/text()")
assert title == "  2024 Honda Civic  "

# With normalize-space
clean_title = cs.select(0, "xpath:normalize-space(//h1[@id='title'])")
assert clean_title == "2024 Honda Civic"

# Predicate-based selection — find the <td> after "VIN"
vin = cs.select(0, "xpath://tr[td='VIN']/td[2]/text()")
assert vin == "1HGFE2F59PA000001"

# All values from the second column
all_values = cs.query(-1, "xpath://table[@class='specs']//tr/td[2]/text()")
assert all_values == ["1HGFE2F59PA000001", "$28,500", "12,345 mi"]

# XPath string() on attribute
title_id = cs.select(0, "xpath:string(//h1/@id)")
assert title_id == "title"

Regex

Capture groups or full matches. Works on HTML, JSON, and plain text content.

cs = ChadSelect()
cs.add_text("VIN: 1HGFE2F59PA000001 | Stock #: A12345 | Price: $28,500")

# Capture group — returns the group, not the full match
vin = cs.select(0, r"regex:VIN:\s*([A-HJ-NPR-Z0-9]{17})")
assert vin == "1HGFE2F59PA000001"

# Full match — no capture group
stock = cs.select(0, r"regex:Stock #:\s*\S+")
assert stock == "Stock #: A12345"

# Capture group after an escaped "$": returns the group (with multiple groups, the first one is returned)
price_digits = cs.select(0, r"regex:Price:\s*\$([0-9,]+)")
assert price_digits == "28,500"

# All matches
all_numbers = cs.query(-1, r"regex:\d+")
# Returns all digit sequences found in the text

# No prefix — defaults to regex
vin2 = cs.select(0, r"[A-HJ-NPR-Z0-9]{17}")
assert vin2 == "1HGFE2F59PA000001"
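
The capture-group rule can be reproduced with the standard re module, which the table above lists as the backing engine. This is a sketch of the documented behavior, not ChadSelect's code; `regex_matches` is a hypothetical helper.

```python
import re


def regex_matches(pattern, text):
    """Return group 1 of each match if the pattern has groups, else the full match."""
    compiled = re.compile(pattern)
    group = 1 if compiled.groups >= 1 else 0
    return [
        m.group(group)
        for m in compiled.finditer(text)
        if m.group(group) is not None  # skip groups that did not participate
    ]


text = "VIN: 1HGFE2F59PA000001 | Stock #: A12345 | Price: $28,500"

print(regex_matches(r"VIN:\s*([A-HJ-NPR-Z0-9]{17})", text))  # group only
print(regex_matches(r"Stock #:\s*\S+", text))                # full match
print(regex_matches(r"\d+", text))                           # every digit run
```

Note that `\d+` splits the VIN and the price at every non-digit character, so the "all matches" query returns short fragments rather than whole numbers.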

Regex on HTML

Regex runs on the raw HTML string, not parsed text — useful for extracting from attributes, comments, or script tags.

cs = ChadSelect()
cs.add_html("<script>var price = 28500;</script>")

price = cs.select(0, r"regex:var price\s*=\s*(\d+)")
assert price == "28500"

JMESPath (JSON)

Full JMESPath expression support for structured JSON extraction.

cs = ChadSelect()
cs.add_json("""{
  "inventory": [
    {"id": 1, "name": "Civic",   "price": 28500, "tags": ["sedan", "honda"]},
    {"id": 2, "name": "Accord",  "price": 34000, "tags": ["sedan", "honda"]},
    {"id": 3, "name": "CR-V",    "price": 32500, "tags": ["suv",   "honda"]}
  ],
  "dealer": {"name": "Metro Honda", "rating": 4.8}
}""")

# Simple field access
dealer = cs.select(0, "json:dealer.name")
assert dealer == "Metro Honda"

# Array indexing
first = cs.select(0, "json:inventory[0].name")
assert first == "Civic"

# Projection — all names
names = cs.query(-1, "json:inventory[*].name")
assert names == ["Civic", "Accord", "CR-V"]

# Filter expression
expensive = cs.query(-1, "json:inventory[?price > `30000`].name")
assert expensive == ["Accord", "CR-V"]

# Nested access
rating = cs.select(0, "json:dealer.rating")
assert rating == "4.8"

# Flatten nested arrays
all_tags = cs.query(-1, "json:inventory[*].tags[]")
assert all_tags == ["sedan", "honda", "sedan", "honda", "suv", "honda"]
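
For readers less familiar with JMESPath, the expressions above correspond roughly to the following plain-Python comprehensions. This uses only the json module and is meant as a reading aid; it does not show how ChadSelect or jmespath evaluate the queries. Converting the rating with str() mirrors the string result asserted above.

```python
import json

data = json.loads("""{
  "inventory": [
    {"id": 1, "name": "Civic",   "price": 28500, "tags": ["sedan", "honda"]},
    {"id": 2, "name": "Accord",  "price": 34000, "tags": ["sedan", "honda"]},
    {"id": 3, "name": "CR-V",    "price": 32500, "tags": ["suv",   "honda"]}
  ],
  "dealer": {"name": "Metro Honda", "rating": 4.8}
}""")

# inventory[*].name: project one field out of every array element
names = [item["name"] for item in data["inventory"]]

# inventory[?price > `30000`].name: filter, then project
expensive = [item["name"] for item in data["inventory"] if item["price"] > 30000]

# inventory[*].tags[]: flatten the nested tag arrays into one list
all_tags = [tag for item in data["inventory"] for tag in item["tags"]]

# dealer.rating: nested access, stringified
rating = str(data["dealer"]["rating"])

print(names, expensive, all_tags, rating)
```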

Post-Processing Functions

Pipe results through text transformations using >>. This operator was chosen over | because | is reserved by XPath (union) and JMESPath (pipe).

css:.selector >> function1() >> function2()
xpath://path/text() >> trim() >> uppercase()
regex:pattern >> replace('$', 'USD ')

| Function | Description | Example |
| --- | --- | --- |
| normalize-space() | Trim + collapse internal whitespace | css:.desc >> normalize-space() |
| trim() | Trim leading/trailing whitespace | css:.title >> trim() |
| uppercase() | Convert to UPPER CASE | css:.vin >> uppercase() |
| lowercase() | Convert to lower case | css:.name >> lowercase() |
| substring(start, len) | Extract substring (0-based) | css:.code >> substring(0, 3) |
| substring-after('delim') | Text after first delimiter | css:.info >> substring-after('VIN: ') |
| substring-before('delim') | Text before first delimiter | css:.info >> substring-before(': ') |
| replace('find', 'repl') | Replace all occurrences | css:.price >> replace('$', 'USD ') |
| get-attr('name') | Element attribute (CSS only) | css:a.link >> get-attr('href') |

Chaining Functions

Functions execute left-to-right. Empty results are filtered after each step.

cs = ChadSelect()
cs.add_html('<div class="info">  VIN: 1HGFE2F59PA000001  </div>')

# Chain: extract text → get everything after "VIN: " → first 3 chars → lowercase
result = cs.select(0, "css:.info >> substring-after('VIN: ') >> substring(0, 3) >> lowercase()")
assert result == "1hg"

cs = ChadSelect()
cs.add_html('<a class="link" href="/inventory/123">View Car</a>')

# Attribute extraction
href = cs.select(0, "css:a.link >> get-attr('href')")
assert href == "/inventory/123"

cs = ChadSelect()
cs.add_html('<span class="price">  $ 28,500  </span>')

# Clean + transform
clean_price = cs.select(0, "css:.price >> normalize-space() >> replace('$ ', '$')")
assert clean_price == "$28,500"
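
The left-to-right evaluation and the empty-result filtering can be sketched as a small interpreter over a list of strings. Everything here is hypothetical (`FUNCTIONS`, `parse_step`, `run_pipeline`); it covers only the text functions, assumes arguments are single-quoted strings or non-negative integers, and splits naively on >>, so it is a model of the behavior described above rather than the library's parser.

```python
import re

# Hypothetical implementations of the text functions from the table above.
FUNCTIONS = {
    "trim": lambda s: s.strip(),
    "normalize-space": lambda s: " ".join(s.split()),
    "uppercase": lambda s: s.upper(),
    "lowercase": lambda s: s.lower(),
    "substring": lambda s, start, length: s[start:start + length],
    "substring-after": lambda s, delim: s.partition(delim)[2],
    "substring-before": lambda s, delim: s.partition(delim)[0] if delim in s else "",
    "replace": lambda s, find, repl: s.replace(find, repl),
}

STEP = re.compile(r"([a-z-]+)\((.*)\)")   # name(args)
ARG = re.compile(r"'([^']*)'|(\d+)")      # 'quoted string' or integer


def parse_step(step):
    """Parse one "name(args)" step into (function, args); None if unknown."""
    m = STEP.fullmatch(step.strip())
    if m is None or m.group(1) not in FUNCTIONS:
        return None
    args = [
        a.group(1) if a.group(1) is not None else int(a.group(2))
        for a in ARG.finditer(m.group(2))
    ]
    return FUNCTIONS[m.group(1)], args


def run_pipeline(values, pipeline):
    """Apply each step left to right, dropping empty results after every step."""
    for step in pipeline.split(">>"):
        parsed = parse_step(step)
        if parsed is None:
            return []                     # unknown step: empty, no exception
        func, args = parsed
        try:
            values = [func(v, *args) for v in values]
        except TypeError:                 # wrong number or type of arguments
            return []
        values = [v for v in values if v]
    return values


print(run_pipeline(
    ["  VIN: 1HGFE2F59PA000001  "],
    "substring-after('VIN: ') >> substring(0, 3) >> lowercase()",
))
```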

API Reference

Core Query Methods

from chadselect import ChadSelect

html = '<span class="price">$28,500</span>'  # sample input; any HTML string works

cs = ChadSelect()
cs.add_html(html)

# query() — returns list[str], never raises
all_matches = cs.query(-1, "css:.price")   # all results
first_only  = cs.query(0,  "css:.price")   # list with 1st result or []
third       = cs.query(2,  "css:.price")   # list with 3rd result or []

# select() — returns str, empty on no match
price = cs.select(0, "css:.price")          # first valid result or ""

Fallback Chains — select_first

Try queries in priority order. Returns the first result set where all values pass validation.

cs = ChadSelect()
cs.add_html('<span class="alt-price">$28,500</span>')

# #exact-id doesn't exist, falls through to .alt-price
result = cs.select_first([
    (0, "css:#exact-id"),
    (0, "css:.alt-price"),
    (0, r"regex:\$[\d,]+"),
])
assert result == ["$28,500"]
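
The fallback rule can be expressed independently of any query engine. In the sketch below, `first_valid` is a hypothetical helper that takes the result lists the queries would have produced; the default validator (non-blank string) is an assumption, since the text above does not spell out what "pass validation" means when no callback is given.

```python
# Hypothetical model of a fallback chain over precomputed result lists.
def first_valid(result_sets, is_valid=lambda s: s.strip() != ""):
    """Return the first non-empty result list whose values all pass is_valid."""
    for results in result_sets:
        if results and all(is_valid(value) for value in results):
            return results
    return []  # nothing qualified: empty, no exception


# css:#exact-id found nothing, css:.alt-price matched, regex also matched.
print(first_valid([[], ["$28,500"], ["$28,500"]]))

# With a custom validator, a rejected result falls through to the next query.
print(first_valid([["0"], ["42"]], lambda s: s != "0"))
```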

Multi-Source — select_many

Combine unique results from multiple queries.

cs = ChadSelect()
cs.add_html("""
<span class="msrp">$30,000</span>
<span class="sale">$28,500</span>
""")

prices = cs.select_many([
    (0, "css:.msrp"),
    (0, "css:.sale"),
])
# Contains both "$30,000" and "$28,500" (unique, order preserved)
assert "$30,000" in prices
assert "$28,500" in prices
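
"Unique, order preserved" is the standard order-preserving de-duplication. A minimal sketch, assuming the results of each query are merged in query order (`unique_in_order` is a hypothetical name, not part of the library):

```python
# Hypothetical merge step: concatenate result lists, keep first occurrences.
def unique_in_order(result_sets):
    """Merge result lists, dropping duplicates while keeping first-seen order."""
    merged = (value for results in result_sets for value in results)
    return list(dict.fromkeys(merged))  # dicts preserve insertion order


print(unique_in_order([["$30,000"], ["$28,500"], ["$30,000"]]))
```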

Custom Validators — select_where

Filter results with a callback. The _where variants exist for select, select_first, and select_many.

cs = ChadSelect()
cs.add_html('<span class="price">0</span><span class="price">28500</span>')

# Reject "0" as a valid price
price = cs.select_where(0, "css:.price", lambda s: s != "0")
assert price == ""  # first match "0" rejected, no fallback within select_where

# With select_first_where — falls through to next query
cs2 = ChadSelect()
cs2.add_text("a: 0\nb: 42")

r = cs2.select_first_where(
    [(0, r"a: (\d+)"), (0, r"b: (\d+)")],
    lambda s: s != "0",
)
assert r == ["42"]

Batch Queries — query_batch

Execute many queries in one call. Returns list[list[str]] in input order.

cs = ChadSelect()
cs.add_html("<h1>Civic</h1><span class='price'>$28,500</span>")
cs.add_json('{"dealer": "Metro Honda"}')

results = cs.query_batch([
    (0, "css:h1"),
    (0, "css:.price"),
    (0, "json:dealer"),
])
assert results[0] == ["Civic"]
assert results[1] == ["$28,500"]
assert results[2] == ["Metro Honda"]

Multi-Content Queries

When multiple documents are loaded, queries search across all compatible content. Use query(-1, ...) to get results from every document.

cs = ChadSelect()

cs.add_html('<span class="title">Page 1</span>')
cs.add_html('<span class="title">Page 2</span>')

# Searches both HTML documents
titles = cs.query(-1, "css:.title")
assert titles == ["Page 1", "Page 2"]

# Mixing content types
cs.add_json('{"title": "JSON Title"}')

# css: only queries HTML content — JSON is skipped
html_titles = cs.query(-1, "css:.title")
assert html_titles == ["Page 1", "Page 2"]

# json: only queries JSON content
json_title = cs.select(0, "json:title")
assert json_title == "JSON Title"

# regex: searches everything
all_results = cs.query(-1, r"regex:(?:Page \d|JSON Title)")
assert len(all_results) == 3

Error Handling

ChadSelect never raises. Every invalid query, malformed content, or out-of-bounds index returns empty results.

cs = ChadSelect()
cs.add_html("<div>hello</div>")

# Invalid CSS selector — returns ""
r = cs.select(0, "css:][invalid")
assert r == ""

# Out of bounds index — returns []
r = cs.query(999, "css:div")
assert r == []

# Wrong engine for content type: returns ""
cs.add_json('{"a": 1}')
r = cs.select(0, "css:.something")  # css: doesn't apply to JSON
# Only the HTML is searched, no ".something" found → ""
assert r == ""
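
The never-raise contract can be illustrated with the standard re module: compile and match inside a try block and map any pattern error to an empty list. This is a generic sketch of the pattern, with `safe_regex_query` as a hypothetical name; it does not show how ChadSelect handles errors internally.

```python
import re


def safe_regex_query(pattern, text):
    """Return all full matches, or [] if the pattern is invalid."""
    try:
        return [m.group(0) for m in re.finditer(pattern, text)]
    except re.error:
        return []  # invalid pattern: empty result instead of an exception


print(safe_regex_query(r"\d+", "a1b22"))  # valid pattern
print(safe_regex_query("(", "abc"))       # unbalanced parenthesis
```

The trade-off of this design is that a typo in a query looks the same as a query that matched nothing, so callers that need to tell the two apart have to validate queries separately.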

Design Principles

  1. Never raise — invalid queries, malformed content, and out-of-bounds indices all return empty results
  2. Prefix routing — the query string declares the engine; no mode switching or builder patterns
  3. >> function pipe — unambiguous across all engines; XPath | and JMESPath | work natively
  4. Batteries included — post-processing, text pseudo-selectors, validators, and index selection are all built in

Also Available

ChadSelect is also available as a Rust crate with an identical API and query syntax.

License

MIT
