A collection of utility functions for working with lxml and XPath.

Supports Python 2 and Python 3.

Functions

  • inner_text(element: ElementBase) -> str: Extracts the combined text content of an element and its descendants, similar to reading the innerText property of a DOM element in JavaScript.
  • find_deepest_elements_containing_target_text(element: ElementBase, target_text: str) -> Iterator[ElementBase]: A generator that yields the deepest elements containing target text, starting from leaf nodes.
  • get_xpath(element: ElementBase, relative_to: Optional[ElementBase] = None) -> str: Generates the absolute or relative XPath expression for a given lxml element. If relative_to is specified, the XPath will be relative to that element.
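To make the behaviour of the first two functions concrete, here is a minimal sketch of one way they could be written. It is not the package's actual implementation: the names `sketch_inner_text` and `sketch_deepest` are hypothetical, and the sketch assumes that "containing" means a plain substring match against an element's combined text. It uses only `itertext()` and child iteration, which lxml elements share with the standard library's `xml.etree.ElementTree`, so the import falls back to the latter if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The calls used below also exist in the standard library's ElementTree.
    import xml.etree.ElementTree as etree


def sketch_inner_text(element):
    """Join the text of an element and all of its descendants (hypothetical helper)."""
    return "".join(element.itertext())


def sketch_deepest(element, target_text):
    """Yield the deepest elements whose combined text contains target_text.

    Assumption: "contains" is a plain substring test on the combined text.
    """
    if target_text not in sketch_inner_text(element):
        return
    found_deeper = False
    for child in element:
        for match in sketch_deepest(child, target_text):
            found_deeper = True
            yield match
    if not found_deeper:
        # No child contains the text on its own, so this element is the deepest match.
        yield element


doc = etree.fromstring(
    "<div><p>has target text</p><section><p>no match</p>"
    "<span>more target text</span></section></div>"
)
matches = [m.tag for m in sketch_deepest(doc, "target text")]
print(matches)  # ['p', 'span']
```

Note that in this sketch, text split across several children (for example `target <b>text</b>`) is attributed to the parent, because no single child contains the whole phrase.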

Usage

Here's a brief example of how to use the functions:

from __future__ import print_function

from lxml import etree

from lxml_xpath_utils import inner_text, find_deepest_elements_containing_target_text, get_xpath

html = """<html>
    <body>
        <div id="main">
            <p>This is a paragraph with target text.</p>
            <div class="nested">
                <span>Some text here</span>
                <p>Another paragraph with target text in it.</p>
            </div>
            <p>No match here</p>
        </div>
        <div id="other">
            <p>target text appears again here</p>
        </div>
    </body>
</html>"""

target_text = "target text"

parser = etree.HTMLParser()
root = etree.fromstring(html, parser)

matching_elements = find_deepest_elements_containing_target_text(root, target_text)

print("Elements containing '%s':" % target_text)
for elem in matching_elements:
    absolute_xpath = get_xpath(elem)
    print(absolute_xpath)
    assert elem == root.xpath(absolute_xpath)[0]
    print("  Text: %s\n" % inner_text(elem))

print("Elements containing '%s' in /html/body/div[@id='main']:" % target_text)
new_root = root.xpath('/html/body/div[@id="main"]')[0]
new_matching_elements = find_deepest_elements_containing_target_text(new_root, target_text)

for new_elem in new_matching_elements:
    relative_xpath = get_xpath(new_elem, new_root)
    print(relative_xpath)
    assert new_elem == new_root.xpath(relative_xpath)[0]
    print("  Text: %s\n" % inner_text(new_elem))

Output:

Elements containing 'target text':
/html/body/div[1]/p[1]
  Text: This is a paragraph with target text.

/html/body/div[1]/div/p
  Text: Another paragraph with target text in it.

/html/body/div[2]/p
  Text: target text appears again here

Elements containing 'target text' in /html/body/div[@id='main']:
./p[1]
  Text: This is a paragraph with target text.

./div/p
  Text: Another paragraph with target text in it.
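The paths in this output carry a positional index only where one is needed: `p[1]` appears because the main div has more than one `p` child, while `div` and `p` stay bare where the tag is unique among its siblings. The sketch below shows one way such relative paths could be built. It is an illustration under that assumption, not the package's code: `sketch_relative_xpath` is a hypothetical name, and unlike `get_xpath` it searches downward from the ancestor so that it also runs on the standard library's `xml.etree.ElementTree` when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Child iteration, .tag and .find() also exist in the standard library.
    import xml.etree.ElementTree as etree


def sketch_relative_xpath(ancestor, target):
    """Return a './'-prefixed path from ancestor to target, or None if not found."""
    if ancestor is target:
        return "."
    for child in ancestor:
        sub_path = sketch_relative_xpath(child, target)
        if sub_path is None:
            continue
        same_tag = [c for c in ancestor if c.tag == child.tag]
        step = child.tag
        if len(same_tag) > 1:
            # XPath positions are 1-based and count same-tag siblings only.
            step = "%s[%d]" % (child.tag, same_tag.index(child) + 1)
        # sub_path is "." or "./rest"; drop its leading dot before joining.
        return "./" + step + sub_path[1:]
    return None


main = etree.fromstring(
    "<div><p>first</p><div><span>x</span><p>nested</p></div><p>last</p></div>"
)
first_p = main[0]
nested_p = main[1][1]

print(sketch_relative_xpath(main, first_p))   # ./p[1]
print(sketch_relative_xpath(main, nested_p))  # ./div/p

# Round trip: the generated path resolves back to the same element.
assert main.find(sketch_relative_xpath(main, nested_p)) is nested_p
```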

Contributing

Feel free to contribute to this project by submitting pull requests or opening issues.

License

This project is licensed under the MIT License.
