
HTML-friendly spaCy Tokenizer

It's not an HTML tokenizer, but a tokenizer that works with text that happens to be embedded in HTML.

How it works

Under the hood we use selectolax to parse the HTML. From there, common elements used for styling within traditional text elements (e.g. <b> or <span> inside of a <p>) are unwrapped, meaning the text contained within them is merged into their parent elements. You can change which tags are unwrapped with the unwrapped_tags argument to the constructor. Tags used for non-text content, such as <script> and <style>, are removed. The text is then extracted from each remaining terminal node that contains text, tokenized with the standard tokenizer defaults, and combined into a single Doc. The end result is a Doc in which each element's text from the original document is also a sentence, so you can iterate through the element texts with doc.sents.
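The sketch below approximates this pipeline with selectolax and spaCy directly. It is illustrative only, not the package's actual implementation: the tag lists, the CSS selector, and the use of Doc.from_docs to combine the per-element Docs are assumptions.

import spacy
from selectolax.parser import HTMLParser
from spacy.tokens import Doc

nlp = spacy.blank("en")

html = "<p>Some <b>bold</b> text.</p><script>var x = 1;</script><p>Another element</p>"

tree = HTMLParser(html)
tree.strip_tags(["script", "style"])       # drop non-text content
tree.unwrap_tags(["b", "span", "i", "a"])  # merge styling tags into their parents

# Tokenize the text of each remaining text-bearing element and mark the
# first token of each chunk as a sentence start.
docs = []
for node in tree.css("p"):  # illustrative selector; the real tokenizer walks terminal nodes
    text = node.text(deep=True).strip()
    if not text:
        continue
    doc = nlp(text)
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0
    docs.append(doc)

# Combine the per-element Docs; sentence starts carry over, so each
# element's text is one sentence in the combined Doc.
combined = Doc.from_docs(docs, ensure_whitespace=True)
print([sent.text for sent in combined.sents])
# ['Some bold text.', 'Another element']

To change which tags are unwrapped, pass them to the constructor, e.g. something like create_html_tokenizer(unwrapped_tags=["b", "span"])(nlp).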

Example

import spacy
from spacy_html_tokenizer import create_html_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = create_html_tokenizer()(nlp)

html = """<h2>An Ordered HTML List</h2>
<ol>
    <li><b>Good</b> coffee. There's another sentence here</li>
    <li>Tea and honey</li>
    <li>Milk</li>
</ol>"""

doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))

# An Ordered HTML List -- N Tokens: 4
# Good coffee. There's another sentence here -- N Tokens: 8
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1

In the prior example we didn't have any other sentence boundary detection components, so each element's text became exactly one sentence. The tokenizer also works with downstream sentence boundary detection components -- e.g. a pretrained pipeline whose parser sets sentence boundaries:

nlp = spacy.load("en_core_web_sm")  # has parser for sentence boundary detection
nlp.tokenizer = create_html_tokenizer()(nlp)

doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))

# An Ordered HTML List -- N Tokens: 4
# Good coffee. -- N Tokens: 3
# There's another sentence here -- N Tokens: 5
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1

Comparison

We'll compare parsing Explosion's About page with and without the HTML tokenizer.

import requests
import spacy
from spacy_html_tokenizer import create_html_tokenizer
from selectolax.parser import HTMLParser

about_page_html = requests.get("https://explosion.ai/about").text

nlp_default = spacy.load("en_core_web_lg")
nlp_html = spacy.load("en_core_web_lg")
nlp_html.tokenizer = create_html_tokenizer()(nlp_html)

# text from HTML - used for non-HTML default tokenizer
about_page_text = HTMLParser(about_page_html).text()

doc_default = nlp_default(about_page_text)
doc_html = nlp_html(about_page_html)

View the first sentences of each

With the standard tokenizer on text extracted from HTML

list(sent.text for sent in doc_default.sents)[:5]
['AboutSoftware & DemosCustom SolutionsBlog & NewsAbout usExplosion is a software company specializing in developer tools for Artificial\nIntelligence and Natural Language Processing.',
'We’re the makers of\nspaCy, one of the leading open-source libraries for advanced\nNLP and Prodigy, an annotation tool for radically efficient\nmachine teaching.',
'\n\n',
'Ines Montani CEO, FounderInes is a co-founder of Explosion and a core developer of the spaCy NLP library and the Prodigy annotation tool.',
'She has helped set a new standard for user experience in developer tools for AI engineers and researchers.']

With the HTML tokenizer on raw HTML

list(sent.text for sent in doc_html.sents)[:10]
['About us · Explosion',
 'About',
 'Software',
 '&',
 'Demos',
 'Custom Solutions',
 'Blog & News',
 'About us',
 'Explosion is a software company specializing in developer tools for Artificial Intelligence and Natural Language Processing.',
 'We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP and Prodigy, an annotation tool for radically efficient machine teaching.']

What about the last sentence?

list(sent.text for sent in doc_default.sents)[-1]

# We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP.NavigationHomeAbout usSoftware & DemosCustom SolutionsBlog & NewsOur SoftwarespaCy · Industrial-strength NLPProdigy · Radically efficient annotationThinc · Functional deep learning© 2016-2022 Explosion · Legal & Imprint/*<![CDATA[*/window.pagePath="/about";/*]]>*//*<![CDATA[*/window.___chunkMapping={"app":["/app-ac229f07fa81f29e0f2d.js"],"component---node-modules-gatsby-plugin-offline-app-shell-js":["/component---node-modules-gatsby-plugin-offline-app-shell-js-461e7bc49c6ae8260783.js"],"component---src-components-post-js":["/component---src-components-post-js-cf4a6bf898db64083052.js"],"component---src-pages-404-js":["/component---src-pages-404-js-b7a6fa1d9d8ca6c40071.js"],"component---src-pages-blog-js":["/component---src-pages-blog-js-1e313ce0b28a893d3966.js"],"component---src-pages-index-js":["/component---src-pages-index-js-175434c68a53f68a253a.js"],"component---src-pages-spacy-tailored-pipelines-js":["/component---src-pages-spacy-tailored-pipelines-js-028d0c6c19584ef0935f.js"]};/*]]>*/

Yikes. How about the HTML tokenizer?

list(sent.text for sent in doc_html.sents)[-1]

# '© 2016-2022 Explosion · Legal & Imprint'

Download files

Download the file for your platform.

Source Distribution

spacy-html-tokenizer-0.1.0.tar.gz (5.3 kB)

Built Distribution

spacy_html_tokenizer-0.1.0-py3-none-any.whl (5.4 kB)

File details

Details for the file spacy-html-tokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: spacy-html-tokenizer-0.1.0.tar.gz
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.8.11 Darwin/21.3.0

File hashes

Hashes for spacy-html-tokenizer-0.1.0.tar.gz:

  • SHA256: 3cb63df12b36880bd8f9a28dcfc4f73dcd87a8b8e27a3687cc189af4b9439abf
  • MD5: 3996c9dd60960c726412b07487a997d8
  • BLAKE2b-256: e561012445e4f82d185542a936e306e7dcfea0a227789813ca722c25f4332d4e

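As a quick sanity check, a downloaded file's digest can be compared against the published values with the standard library's hashlib (a minimal sketch; the filename assumes the sdist sits in the working directory):

import hashlib

# Compute the SHA256 digest of the downloaded sdist and compare it to the
# published value above.
with open("spacy-html-tokenizer-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "3cb63df12b36880bd8f9a28dcfc4f73dcd87a8b8e27a3687cc189af4b9439abf"
print("OK" if digest == expected else "MISMATCH")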

File details

Details for the file spacy_html_tokenizer-0.1.0-py3-none-any.whl.

File hashes

Hashes for spacy_html_tokenizer-0.1.0-py3-none-any.whl:

  • SHA256: 82602a371e1b30161da8395beaa65b90e7e3c1ef2bbec1c2a68fbb92a79c6ca5
  • MD5: bdd521480b12970407f5872714995aa2
  • BLAKE2b-256: 6619c25f5c8ed70c2005704b7cd16fdf3059bb3e6d2a5219290aa16e6336f314

