A small library that smartly extracts text from HTML and later rebuilds the HTML

Project description

Boned html is a small Python library.

It helps you extract text from an HTML document (in the form of an lxml tree), process that text (for example, to classify it), and reinject it into the HTML with specific CSS classes.

The typical use is annotating HTML with classes. For example, if you are categorizing text, you may want the user to see those categories on the original HTML.

The text is extracted in a smart way: extraction does not stop at semantic tags (<i>, <em>, etc.) but does stop at other tags (<h1>, <p>, etc.).
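To make the chunking rule concrete, here is a minimal sketch using only the standard library's html.parser. It is not the library's implementation: the INLINE_TAGS set, the BlockChunker class and the chunk_text helper are hypothetical names chosen for illustration, and the real list of tags the library treats as semantic may differ.

```python
from html.parser import HTMLParser

# Assumption: an illustrative set of inline tags, not the library's actual list.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u", "small", "sub", "sup"}


class BlockChunker(HTMLParser):
    """Collect text chunks, breaking at every tag that is not inline."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._current = []

    def _flush(self):
        # Close the chunk being built; drop chunks that are only whitespace.
        text = "".join(self._current)
        if text.strip():
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags is simply appended to the current chunk.
        self._current.append(data)


def chunk_text(html):
    """Return the list of text chunks found in an HTML string."""
    parser = BlockChunker()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.chunks


sample = (
    "<p>To get an operator <em>call</em></p>"
    "<p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>"
)
print(chunk_text(sample))
# ['To get an operator call', 'call (country) +33 00 00 00 00']
```

The second paragraph comes out as a single chunk even though the number is split across an <em> boundary, which is what lets a classifier see the whole phone number at once.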

When you reinject the text, the semantic tags are added back around it, and the general HTML layout is preserved.

Installation

pip install boned-html

Usage

The functionality is provided by the class boned_html.Chunker, with these methods:

  • chunk_tree to get text chunks from an lxml tree.
  • unchunk to put chunks back together, providing CSS classes for pieces of text.

A quick example: imagine we have a function that detects a telephone number in a sentence:

>>> import re
>>> from itertools import cycle
>>> def get_tel(text):
...     splits = re.split(r"(\+?(?:\d\s*){8,13})", text)
...     return list(zip(splits, cycle([None, "tel"])))
>>> get_tel("call +33 00 00 00 00")
[('call ', None), ('+33 00 00 00 00', 'tel'), ('', None)]
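One detail of the pattern above: because each digit is followed by \s*, a number in the middle of a sentence also captures the whitespace that follows it. If that matters for your use, a possible variant (a sketch, with get_tel_strict and TEL_RE being hypothetical names, not part of the library) only allows whitespace between digits:

```python
import re
from itertools import cycle

# Same 8 to 13 digits as the original pattern, but whitespace is only
# accepted between two digits, never after the last one.
TEL_RE = re.compile(r"(\+?\d(?:\s*\d){7,12})")


def get_tel_strict(text):
    """Split text into (piece, css_class) pairs; class is "tel" or None."""
    splits = TEL_RE.split(text)
    # re.split with one capturing group alternates non-match, match, non-match...
    return list(zip(splits, cycle([None, "tel"])))


print(get_tel_strict("call +33 00 00 00 00 now"))
# [('call ', None), ('+33 00 00 00 00', 'tel'), (' now', None)]
```

In both versions the pieces, joined in order, give back the original text, so the classification never loses characters.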

And an HTML document:

>>> html = '''
... <html>
...   <head><title>call +33 00 00 00 00</title></head>
...   <body>
...     <p>To get an operator <em>call</em></p>
...     <p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>
...   </body>
... </html>
... '''

We chunk:

>>> import lxml.html
>>> from boned_html import HtmlBoner
>>> tree = lxml.html.fromstring(html)
>>> boned = HtmlBoner(tree)

We evaluate each text chunk and assign the “tel” class to any part that contains a telephone number:

>>> for i, text in enumerate(boned):
...     if text is not None:
...         boned.set_classes(i, get_tel(text))

We now rebuild the tree:

>>> boned.tree
<Element html ...>
>>> print(boned)
<html>
  <head><title>call +33 00 00 00 00</title></head>
  <body>
    <p>To get an operator <em>call</em></p>
    <p><b>call</b> <em>(country) </em><span class="tel" id="chunk-6-1"><em>+33</em> 00 00 00 00</span></p>
  </body>
</html>

Our number is now wrapped in a dedicated span, the opening and closing of the em tag were handled correctly, and the phone number in head/title is left unchanged.
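The way the em tag is closed before the span and reopened inside it can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: reinject is a hypothetical helper, a chunk is modelled as a list of (text, inline_tags) runs, the classification as a list of (text, css_class) pieces covering the same text, and the id attribute from the output above is left out.

```python
from html import escape


def reinject(runs, pieces):
    """Rebuild markup for one chunk.

    runs:   list of (text, tags), tags being a tuple of inline tag names,
            outermost first, that wrapped this text in the original HTML.
    pieces: list of (text, css_class or None) covering the same text.
    """
    if "".join(t for t, _ in runs) != "".join(t for t, _ in pieces):
        raise ValueError("runs and pieces must cover the same text")
    out = []
    run_idx, run_pos = 0, 0
    for text, css in pieces:
        if not text:
            continue  # empty pieces (e.g. at the end of re.split) add nothing
        inner = []
        remaining = len(text)
        while remaining > 0:
            run_text, tags = runs[run_idx]
            take = min(remaining, len(run_text) - run_pos)
            fragment = escape(run_text[run_pos:run_pos + take])
            # Re-wrap the fragment in its inline tags, innermost first, so a
            # tag cut by a piece boundary is closed and reopened.
            for tag in reversed(tags):
                fragment = f"<{tag}>{fragment}</{tag}>"
            inner.append(fragment)
            remaining -= take
            run_pos += take
            if run_pos == len(run_text):
                run_idx, run_pos = run_idx + 1, 0
        body = "".join(inner)
        out.append(f'<span class="{css}">{body}</span>' if css else body)
    return "".join(out)


runs = [("call", ("b",)), (" ", ()), ("(country) +33", ("em",)), (" 00 00 00 00", ())]
pieces = [("call (country) ", None), ("+33 00 00 00 00", "tel"), ("", None)]
print(reinject(runs, pieces))
# <b>call</b> <em>(country) </em><span class="tel"><em>+33</em> 00 00 00 00</span>
```

Putting the span outside the inline tags keeps the markup well nested even when a classified piece starts in the middle of an em element.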
