Simple Python library for HTML parsing
Project description
Leaf
What is this?
Leaf is a simple wrapper around lxml that adds a few conveniences to make working with lxml more pleasant. It covers all my HTML parsing needs.
Dependencies
lxml obviously :3
Features
- Nice jquery-like CSS selectors
- Simple access to element attributes
- Easy way to convert HTML to other formats (bbcode, markdown, etc.)
- A few nice functions for working with text
- And, of course, all original features of lxml
Description
The main function of the module (for my purposes) is leaf.parse. This function takes an HTML string as its argument and returns a leaf.Parser object, which wraps an lxml object.
With this object you can do anything you want, for example:
document = leaf.parse(sample)
# get the links from the DIV with id 'menu' using CSS selectors
links = document('div#menu a')
Or you can do this:
# get first link or return None
link = document.get('div#menu a')
And you can get attributes from these results like this:
print(link.onclick)
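To illustrate how attribute-style access like this can be layered on top of an element, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the Node class is hypothetical: it is not leaf's actual Parser implementation, only one way the idea could work.

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical wrapper for illustration; not leaf's actual Parser class."""

    def __init__(self, element):
        # Bypass our own __setattr__ so the wrapped element lives on the instance.
        object.__setattr__(self, "_element", element)

    def __getattr__(self, name):
        # Called only when normal lookup fails: fall back to the markup attribute.
        # Missing attributes yield None, as Element.get() does.
        return self._element.get(name)

    def __setattr__(self, name, value):
        # Assignment writes through to the underlying element's attribute.
        self._element.set(name, value)


link = Node(ET.fromstring('<a href="/home" onclick="track()">Home</a>'))
print(link.onclick)  # track()
link.href = "/blah"
print(link.href)     # /blah
```

The write-through assignment mirrors the node.href = '/blah' behaviour mentioned in the change log, under the assumption that it maps directly onto element attributes.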
You can also use standard lxml methods like object.xpath, and they return results as leaf.Parser objects.
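For readers more used to CSS selectors than XPath, the sketch below shows what the selector 'div#menu a' from the earlier example corresponds to as a path expression. It runs on the standard library's xml.etree.ElementTree (which supports only a subset of XPath) so that it has no dependencies; the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
sample = (
    '<html><body>'
    '<div id="menu"><ul>'
    '<li><a href="/a">A</a></li>'
    '<li><a href="/b">B</a></li>'
    '</ul></div>'
    '<div id="footer"><a href="/c">C</a></div>'
    '</body></html>'
)

root = ET.fromstring(sample)

# 'div#menu a' means: every <a> that is a descendant of the <div> whose id is 'menu'.
menu_links = root.findall(".//div[@id='menu']//a")
hrefs = [a.get("href") for a in menu_links]
print(hrefs)  # ['/a', '/b']
```

With lxml itself, the same descendant-and-attribute pattern should work when passed to an xpath method, although full XPath offers many more features than this subset.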
My favorite feature is parsing HTML into bbcode (markdown, etc.):
# Let's define a simple formatter, which passes text through
# and wraps links in [url][/url] (like bbcode)
def code_formatter(element, children):
    # Replace the <br> tag with a line break
    if element.tag == 'br':
        return '\n'
    # Wrap links in [url][/url]
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
    # Return only the children for other elements.
    if children:
        return children
This function is called recursively for each element, with the element itself and children, a string holding the already-formatted result for that element's children.
So, let's run this formatter on a leaf.Parser object:
document.parse(code_formatter)
More detailed examples are available in the tests.
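To make the recursion concrete, here is a rough sketch of how a depth-first driver could feed a formatter of this shape. It is an assumption about the general technique, not leaf's actual parse method: the walk function is hypothetical, and it uses the standard library's xml.etree.ElementTree, so the formatter reads the link target with element.get('href') instead of element.href.

```python
import xml.etree.ElementTree as ET


def code_formatter(element, children):
    # Same shape as the formatter above, adapted to plain ElementTree elements.
    if element.tag == 'br':
        return '\n'
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.get('href'), text=children)
    if children:
        return children


def walk(element, formatter):
    """Hypothetical depth-first driver; not leaf's actual implementation."""
    parts = [element.text or '']
    for child in element:
        # Format the child's subtree first, then keep the text that follows it.
        parts.append(walk(child, formatter) or '')
        parts.append(child.tail or '')
    # The element is formatted last, once its children are already a string.
    return formatter(element, ''.join(parts))


html = '<p>See <a href="http://example.com">the docs</a><br/>for more.</p>'
result = walk(ET.fromstring(html), code_formatter)
print(result)
```

Because children are formatted before their parent, the formatter only ever has to decide how to wrap a ready-made string, which is what keeps formatters like code_formatter so short.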
Finally, this library has some nice functions for working with text:
Name | Description |
---|---|
to_unicode | Convert string to unicode string |
strip_accents | Strip accents from a string |
strip_symbols | Strip ugly unicode symbols from a string |
strip_spaces | Strip excess spaces from a string |
strip_linebreaks | Strip excess line breaks from a string |
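The table does not spell out how these helpers behave, so the following is only a sketch of what functions of this kind commonly do, written with the standard library. The *_sketch names are hypothetical and the exact rules (for example, what counts as "excess") are assumptions; leaf's own implementations may differ.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    # Decompose characters (e.g. an accented 'e' becomes 'e' plus a combining mark),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_spaces_sketch(text):
    # Assumption: "excess" means runs of spaces or tabs, plus leading/trailing whitespace.
    return re.sub(r'[ \t]+', ' ', text).strip()


def strip_linebreaks_sketch(text):
    # Assumption: "excess" means three or more consecutive line breaks,
    # which are collapsed into a single blank line.
    return re.sub(r'\n{3,}', '\n\n', text)


print(strip_accents_sketch(u'Caf\u00e9 \u00e0 la cr\u00e8me'))  # Cafe a la creme
print(strip_spaces_sketch('  too    many   spaces '))           # too many spaces
print(repr(strip_linebreaks_sketch('a\n\n\n\nb')))              # 'a\n\nb'
```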
Change log
1.0.7
- Fix badges in README.md
- cleanup CHANGES.md
1.0.6
- Fix installation script on LICENSE file
1.0.4
- Convert documentation to Markdown
- Add support for universal wheel
1.0.1
- 100% test coverage
- fixed bug in result wrapping (etree._Element has __iter__ too!)
1.0
- add python3 support
- first production release
0.4.4
- fix inner_html method
- added **kwargs to the parse function, added inner_html method to the Parser class
- cssselect in deps
0.4.2
- Node attribute modification via node.href = '/blah'
- Custom default value for get: document.get(selector, default=None)
- Get element by index: document.get(selector, index)
0.4.1
- bool(node) returns True if element exists and False if element is None
0.4
- First public version
File details
Details for the file leaf-1.0.7.tar.gz.
File metadata
- Download URL: leaf-1.0.7.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.7.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 38c7fdef9de1a67961794d981260cd2dc5c16bb705aa11c746565f9b52856aa9 |
MD5 | 58df91645a06b97eda494758de834fa5 |
BLAKE2b-256 | 18a45c8c5caac9e03ea33b2384d16f5167c474cd7194cb2d7718de1d4d6156c4 |
File details
Details for the file leaf-1.0.7-py2.py3-none-any.whl.
File metadata
- Download URL: leaf-1.0.7-py2.py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.40.0 CPython/3.7.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | d3ea38bf05e1cb4caee373192fc30c53a09c7890f2a000baf7b473df0a989910 |
MD5 | 77b50f83d8d0b5dbbe59423c26c1e712 |
BLAKE2b-256 | 0105dc58afe5bd51f3016a1329f7e891f77daf5b63abe518643be1b8cd9c4623 |