Simple Python library for HTML parsing

Leaf

What is this?

This is a simple wrapper around lxml that adds a few conveniences for everyday HTML parsing. It covers all of my own HTML parsing needs.

Dependencies

lxml obviously :3

Features

  • Nice jQuery-like CSS selectors
  • Simple access to element attributes
  • Easy way to convert HTML to other formats (bbcode, markdown, etc.)
  • A few nice functions for working with text
  • And, of course, all original features of lxml

Description

The main function of the module (for my purposes) is leaf.parse. This function takes an HTML string as argument, and returns a leaf.Parser object, which wraps an lxml object.

With this object you can do anything you want, for example:

document = leaf.parse(sample)
# get the links from the DIV with id 'menu' using CSS selectors
links = document('div#menu a')

Or you can do this:

# get first link or return None
link = document.get('div#menu a')
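The "first match or None" behaviour can be pictured with a small stand-alone helper. This is only a sketch of the semantics, not leaf's code: `pick` is a hypothetical name, and it works on a plain list of matches rather than on a selector.

```python
def pick(matches, index=0, default=None):
    """Return matches[index], or `default` when there is no such match.

    Hypothetical helper for illustration; leaf's own `get` takes a
    CSS selector instead of a ready-made list.
    """
    try:
        return matches[index]
    except IndexError:
        return default


first = pick(['home', 'about'])            # first match
second = pick(['home', 'about'], index=1)  # match by index
missing = pick([])                         # no match: falls back to None
fallback = pick([], default='n/a')         # no match: custom default
```

Returning a default instead of raising keeps calling code free of try/except blocks when an element is optional.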

And you can get attributes from these results like this:

print(link.onclick)
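One way such attribute access could work is a thin wrapper that forwards unknown attribute lookups to the wrapped element. The sketch below uses the standard library's `xml.etree.ElementTree` in place of lxml, and `Node` is a hypothetical class, so treat it as an illustration of the idea rather than leaf's actual implementation.

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical wrapper for illustration, not leaf's Parser class."""

    def __init__(self, element):
        # Write through __dict__ so our own __setattr__ is not triggered.
        self.__dict__['element'] = element

    def __getattr__(self, name):
        # Runs only when normal lookup fails. Prefer the wrapped element's
        # own members (tag, text, ...), then fall back to its attributes.
        element = self.__dict__['element']
        if hasattr(element, name):
            return getattr(element, name)
        return element.get(name)  # None when the attribute is absent

    def __setattr__(self, name, value):
        # node.href = '/blah' sets the attribute on the wrapped element.
        self.__dict__['element'].set(name, value)


link = Node(ET.fromstring('<a href="/home" onclick="go()">Home</a>'))
print(link.onclick)
link.href = '/blah'
print(link.href)
```

A side effect of this design is that an attribute whose name collides with an element member (for example `text`) resolves to the member, not to the attribute.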

You can also use standard lxml methods like object.xpath, and they return results as leaf.Parser objects.

My favorite feature is parsing HTML into bbcode (markdown, etc.):

# Let's define a simple formatter that passes text through
# and wraps links in [url][/url] (like bbcode)
def code_formatter(element, children):
    # Replace <br> tag with line break
    if element.tag == 'br':
        return '\n'
    # Wrap links into [url][/url]
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
    # Return children only for other elements.
    if children:
        return children

This function is called recursively for every element. It receives the element and children, a string holding the already-formatted result of the element's child nodes.

Now let's run this formatter on a leaf.Parser object:

document.parse(code_formatter)
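To make the recursion concrete, here is a minimal sketch of a depth-first walk that feeds a formatter in the way described above. It uses the standard library's `xml.etree.ElementTree` instead of lxml, and `render` is a hypothetical function. leaf's own traversal may differ in detail, for instance in how it treats text that follows a closing tag.

```python
import xml.etree.ElementTree as ET


def render(element, formatter):
    """Format the children first, then hand the result to the formatter."""
    # Text that appears before the first child element.
    parts = [element.text or '']
    for child in element:
        # Recurse into the child, then keep the text after its closing tag.
        parts.append(render(child, formatter) or '')
        parts.append(child.tail or '')
    return formatter(element, ''.join(parts))


def code_formatter(element, children):
    # Same logic as the formatter above, but reading href via .get()
    # because plain ElementTree has no attribute shortcuts.
    if element.tag == 'br':
        return '\n'
    if element.tag == 'a':
        return '[url={link}]{text}[/url]'.format(
            link=element.get('href'), text=children)
    if children:
        return children


sample = '<div>See <a href="/docs">the docs</a><br/>for details.</div>'
result = render(ET.fromstring(sample), code_formatter)
print(result)
```

Because each element only sees the finished string of its children, a formatter for another target format (markdown, for example) only needs to change the per-tag rules, not the traversal.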

More detailed examples are available in the tests.

Finally, this library has some nice functions for working with text:

Name              Description
to_unicode        Convert string to unicode string
strip_accents     Strip accents from a string
strip_symbols     Strip ugly unicode symbols from a string
strip_spaces      Strip excess spaces from a string
strip_linebreaks  Strip excess line breaks from a string
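For a feel of what this kind of clean-up involves, the sketch below implements three similar helpers with only the standard library. The names `remove_accents`, `collapse_spaces` and `collapse_linebreaks` are hypothetical, and the exact rules (what counts as "excess") are assumptions; leaf's own functions may behave differently.

```python
import re
import unicodedata


def remove_accents(text):
    # Decompose characters (e.g. e-acute -> 'e' + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_spaces(text):
    # Collapse runs of spaces and tabs into one space and trim the ends.
    return re.sub(r'[ \t]+', ' ', text).strip()


def collapse_linebreaks(text):
    # Collapse runs of line breaks into one newline and trim the ends.
    return re.sub(r'[\r\n]+', '\n', text).strip()


print(remove_accents(u'Caf\u00e9 na\u00efve'))
print(collapse_spaces('  too    many \t spaces  '))
print(collapse_linebreaks('first\n\n\nsecond\r\n'))
```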

Change log

1.0.7

  • Fix badges in README.md
  • cleanup CHANGES.md

1.0.6

  • Fix installation script on LICENSE file

1.0.4

  • Convert documentation to Markdown
  • Add support for universal wheel

1.0.1

  • 100% test coverage
  • fixed bug in result wrapping (etree._Element has __iter__ too!)

1.0

  • add python3 support
  • first production release

0.4.4

  • fix inner_html method
  • added **kwargs to the parse function, added inner_html method to the Parser class
  • cssselect in deps

0.4.2

  • Node attribute modification via node.href = '/blah'
  • Custom default value for get: document.get(selector, default=None)
  • Get element by index: document.get(selector, index)

0.4.1

  • bool(node) returns True if element exists and False if element is None

0.4

  • First public version
