
A small library to smartly extract text from HTML and, optionally, rebuild the HTML


Boned html is a small Python library.

It helps you extract text from an HTML document (given as an lxml tree), process or classify that text, and reinject it into the HTML with specific CSS classes.

The typical use is annotating an HTML document with classes. For example, if you are categorizing text, you may want the user to visualize those categories on the original HTML.

The text is extracted in a smart way: a chunk does not stop at inline semantic tags (<i>, <em>, etc.), only at other tags (<h1>, <p>, etc.).

When you reinject the text, the semantic tags are added back around it, and the general HTML layout is preserved.
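To illustrate the extraction rule described above, here is a minimal sketch that uses only the standard library's `html.parser`. It is not how boned-html is implemented (the library works on an lxml tree); the names `BlockChunker`, `chunk_text` and the `INLINE_TAGS` set are assumptions made purely for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical set of inline tags that should NOT end a text chunk.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u", "small"}


class BlockChunker(HTMLParser):
    """Collect text chunks, splitting only at non-inline tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._current = []

    def _flush(self):
        # Close the chunk being built, skipping whitespace-only chunks.
        text = "".join(self._current).strip()
        if text:
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags is appended to the current chunk.
        self._current.append(data)


def chunk_text(html):
    """Return the list of text chunks found in an HTML string."""
    parser = BlockChunker()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last tag
    return parser.chunks
```

With this sketch, `<p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>` yields a single chunk, because `<b>` and `<em>` do not interrupt the text, while each `<p>` starts a new chunk.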


pip install boned-html


The functionality is provided by the class boned_html.Chunker, with the methods:

  • chunk_tree to get text chunks from an lxml tree.
  • unchunk to put back chunks together providing css classes for pieces of text.

A quick example: imagine we have a function that detects a telephone number in a sentence:

>>> import re
>>> from itertools import cycle
>>> def get_tel(text):
...    splits = re.split(r"(\+?(?:\d\s*){8,13})", text)
...    return list(zip(splits, cycle([None, "tel"])))
>>> get_tel("call +33 00 00 00 00")
[('call ', None), ('+33 00 00 00 00', 'tel'), ('', None)]

And an HTML document:

>>> html = '''
... <html>
...   <head><title>call +33 00 00 00 00</title></head>
...   <body>
...     <p>To get an operator <em>call</em></p>
...     <p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>
...   </body>
... </html>
... '''

We chunk:

>>> import lxml.html
>>> from boned_html import HtmlBoner
>>> tree = lxml.html.fromstring(html)
>>> boned = HtmlBoner(tree)

We evaluate each text chunk and assign the “tel” class to the part that contains a telephone number, if there is one:

>>> for i, text in enumerate(boned):
...     if text is not None:
...         boned.set_classes(i, get_tel(text))

We now rebuild the tree:

>>> boned.tree
<Element html ...>
>>> print(boned)
  <head><title>call +33 00 00 00 00</title></head>
    <p>To get an operator <em>call</em></p>
    <p><b>call</b> <em>(country) </em><span class="tel" id="chunk-6-1"><em>+33</em> 00 00 00 00</span></p>

The number is now wrapped in a dedicated span, the em tag was closed and reopened at the span boundary, and the phone number in head/title is left unchanged.
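The (text, class) pairs returned by get_tel map naturally onto spans. The following is a simplified sketch of that idea for plain text only, assuming no nested inline tags (which the library handles for you, as shown above). The function name `render_pieces` is hypothetical and not part of boned-html; `get_tel` is repeated so the block runs on its own.

```python
import re
from html import escape
from itertools import cycle


def get_tel(text):
    # Same helper as in the example above: alternate plain text and "tel" parts.
    splits = re.split(r"(\+?(?:\d\s*){8,13})", text)
    return list(zip(splits, cycle([None, "tel"])))


def render_pieces(pieces):
    """Wrap each classified piece of text in a span (hypothetical helper)."""
    out = []
    for text, css_class in pieces:
        if not text:
            continue  # skip empty pieces produced by re.split
        if css_class is None:
            out.append(escape(text))
        else:
            out.append(
                '<span class="{}">{}</span>'.format(
                    escape(css_class, quote=True), escape(text)
                )
            )
    return "".join(out)


print(render_pieces(get_tel("call +33 00 00 00 00")))
```

Unlike the library output above, this sketch does not generate ids or re-open inline tags such as em across the span boundary.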

