HTML-friendly spaCy Tokenizer
It's not an HTML tokenizer, but a tokenizer that works with text that happens to be embedded in HTML.
How it works
Under the hood we use selectolax to parse the HTML. From there, common elements used for styling within traditional text elements (e.g. <b> or <span> inside of a <p>) are unwrapped, meaning those tags are removed and their text becomes part of the parent element. You can change which tags are unwrapped with the unwrapped_tags argument to the constructor.
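For example, a sketch of customizing it (we assume the argument accepts a list of tag names; the list shown is illustrative, not the package default):
import spacy
from spacy_html_tokenizer import create_html_tokenizer
nlp = spacy.blank("en")
# Only treat <b> and <i> as styling tags to merge into their parents;
# any other tag's text stays a separate element
nlp.tokenizer = create_html_tokenizer(unwrapped_tags=["b", "i"])(nlp)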
Tags used for non-text content, such as <script> and <style>, are removed. The text is then extracted from each remaining terminal node that contains text. These texts are tokenized with the standard tokenizer defaults and combined into a single Doc. The end result is a Doc in which each element's text from the original document is also a sentence, so you can iterate through each element's text with doc.sents.
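To make those steps concrete, here is a simplified sketch using selectolax directly -- it illustrates the approach described above, not the package's actual implementation, and the CSS selector list is a stand-in for the real terminal-node traversal:
from selectolax.parser import HTMLParser
html = "<p>Hello <b>world</b></p><script>var x = 1;</script><p>Bye</p>"
tree = HTMLParser(html)
# 1. Merge styling tags into their parents, so "Hello world" stays one text
tree.unwrap_tags(["b", "i", "em", "strong", "span"])
# 2. Drop non-text content entirely
tree.strip_tags(["script", "style"])
# 3. Extract the text of each remaining text-bearing element; each of these
#    would become one sentence in the combined Doc
texts = [node.text(deep=True).strip() for node in tree.css("p, h1, h2, h3, li")]
print([t for t in texts if t])
# ['Hello world', 'Bye']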
Example
import spacy
from spacy_html_tokenizer import create_html_tokenizer
nlp = spacy.blank("en")
nlp.tokenizer = create_html_tokenizer()(nlp)
html = """<h2>An Ordered HTML List</h2>
<ol>
<li><b>Good</b> coffee. There's another sentence here</li>
<li>Tea and honey</li>
<li>Milk</li>
</ol>"""
doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))
# An Ordered HTML List -- N Tokens: 4
# Good coffee. There's another sentence here -- N Tokens: 8
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1
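Within each element, the text is tokenized with spaCy's standard English defaults, which is why the second sentence counts eight tokens -- the contraction splits in two:
sents = list(doc.sents)
# Standard English tokenization splits "There's" into "There" + "'s"
print([token.text for token in sents[1]])
# ['Good', 'coffee', '.', 'There', "'s", 'another', 'sentence', 'here']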
In the prior example, the pipeline had no sentence boundary detection components, so each element's text was exactly one sentence. The tokenizer also works with downstream sentence boundary detection components -- e.g. the parser in a pretrained pipeline:
nlp = spacy.load("en_core_web_sm") # has parser for sentence boundary detection
nlp.tokenizer = create_html_tokenizer()(nlp)
doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))
# An Ordered HTML List -- N Tokens: 4
# Good coffee. -- N Tokens: 3
# There's another sentence here -- N Tokens: 5
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1
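The element boundaries survive because they are fixed at tokenization time, while the parser only adds boundaries within an element -- presumably via preset sentence starts. You can inspect where each boundary landed with standard spaCy attributes:
# Print every token that begins a sentence, whether the boundary came from
# the HTML structure or from the parser
for token in doc:
    if token.is_sent_start:
        print(token.i, repr(token.text))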
Comparison
We'll compare parsing Explosion's About page with and without the HTML tokenizer.
import requests
import spacy
from spacy_html_tokenizer import create_html_tokenizer
from selectolax.parser import HTMLParser
about_page_html = requests.get("https://explosion.ai/about").text
nlp_default = spacy.load("en_core_web_lg")
nlp_html = spacy.load("en_core_web_lg")
nlp_html.tokenizer = create_html_tokenizer()(nlp_html)
# text from HTML - used for non-HTML default tokenizer
about_page_text = HTMLParser(about_page_html).text()
doc_default = nlp_default(about_page_text)
doc_html = nlp_html(about_page_html)
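A quick count shows how differently the two pipelines segment the page (a sketch; the numbers depend on the live page, so none are shown here):
# The default pipeline sees one long text; the HTML pipeline sees one
# candidate sentence per text-bearing element
print("default tokenizer:", sum(1 for _ in doc_default.sents))
print("html tokenizer:", sum(1 for _ in doc_html.sents))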
View first sentences of each
With standard tokenizer on text extracted from HTML
list(sent.text for sent in doc_default.sents)[:5]
['AboutSoftware & DemosCustom SolutionsBlog & NewsAbout usExplosion is a software company specializing in developer tools for Artificial\nIntelligence and Natural Language Processing.',
'We’re the makers of\nspaCy, one of the leading open-source libraries for advanced\nNLP and Prodigy, an annotation tool for radically efficient\nmachine teaching.',
'\n\n',
'Ines Montani CEO, FounderInes is a co-founder of Explosion and a core developer of the spaCy NLP library and the Prodigy annotation tool.',
'She has helped set a new standard for user experience in developer tools for AI engineers and researchers.']
With HTML Tokenizer on HTML
list(sent.text for sent in doc_html.sents)[:10]
['About us · Explosion',
'About',
'Software',
'&',
'Demos',
'Custom Solutions',
'Blog & News',
'About us',
'Explosion is a software company specializing in developer tools for Artificial Intelligence and Natural Language Processing.',
'We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP and Prodigy, an annotation tool for radically efficient machine teaching.']
What about the last sentence?
list(sent.text for sent in doc_default.sents)[-1]
# We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP.NavigationHomeAbout usSoftware & DemosCustom SolutionsBlog & NewsOur SoftwarespaCy · Industrial-strength NLPProdigy · Radically efficient annotationThinc · Functional deep learning© 2016-2022 Explosion · Legal & Imprint/*<![CDATA[*/window.pagePath="/about";/*]]>*//*<![CDATA[*/window.___chunkMapping={"app":["/app-ac229f07fa81f29e0f2d.js"],"component---node-modules-gatsby-plugin-offline-app-shell-js":["/component---node-modules-gatsby-plugin-offline-app-shell-js-461e7bc49c6ae8260783.js"],"component---src-components-post-js":["/component---src-components-post-js-cf4a6bf898db64083052.js"],"component---src-pages-404-js":["/component---src-pages-404-js-b7a6fa1d9d8ca6c40071.js"],"component---src-pages-blog-js":["/component---src-pages-blog-js-1e313ce0b28a893d3966.js"],"component---src-pages-index-js":["/component---src-pages-index-js-175434c68a53f68a253a.js"],"component---src-pages-spacy-tailored-pipelines-js":["/component---src-pages-spacy-tailored-pipelines-js-028d0c6c19584ef0935f.js"]};/*]]>*/
Yikes. How about the HTML tokenizer?
list(sent.text for sent in doc_html.sents)[-1]
# '© 2016-2022 Explosion · Legal & Imprint'
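Since navigation labels surface as short standalone sentences, one practical follow-up (a sketch, not part of the package) is to filter by sentence length before downstream processing:
# Keep only sentences long enough to be prose; the threshold is illustrative
content_sentences = [sent.text for sent in doc_html.sents if len(sent) > 3]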