Fast C-based HTML 5 parsing for Python
Project description
A fast implementation of the HTML 5 parsing spec. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be as low as a thirtieth of html5lib's, a speedup of roughly 30x.
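As a minimal sketch of what this looks like from Python, the snippet below assumes the package exposes a parse function in the html5_parser module and that, by default, it returns the root element of an lxml tree. Neither detail is stated in the description above, so treat both as assumptions to check against the project documentation. The helper name first_paragraph_text is hypothetical.

```python
def first_paragraph_text(html):
    """Return the text of the first <p> element, or None if there is none."""
    # Imported lazily so this sketch can be loaded even where html5-parser
    # is not installed. Assumption: html5_parser.parse returns the lxml
    # root element (<html>) by default.
    from html5_parser import parse

    root = parse(html)
    paragraph = root.find('.//p')
    return None if paragraph is None else paragraph.text


if __name__ == '__main__':
    # The HTML 5 parsing algorithm supplies the missing <html>, <head>
    # and <body> elements, so a bare fragment is enough here.
    print(first_paragraph_text('<p>Hello, world'))
```

Because the result is an ordinary lxml element, the usual lxml calls such as find should then be available on it.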
Installation
Unix
On a Unix-y system, with a working compiler, simply run:
pip install --no-binary lxml html5-parser
It is important that lxml is installed with the --no-binary flag. Without it, lxml uses a statically linked copy of libxml2. For html5-parser to work, it must use the same libxml2 implementation as lxml, which is only possible if libxml2 is loaded dynamically.
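If you want to see which libxml2 your lxml installation is using, the following sketch may help. It relies on the LIBXML_COMPILED_VERSION and LIBXML_VERSION attributes of lxml.etree. It does not prove that libxml2 is loaded dynamically; it only reports the version lxml was built against and the version it sees at runtime, which can be a useful first clue when diagnosing a mismatch. The function name lxml_libxml2_versions is hypothetical.

```python
def lxml_libxml2_versions():
    """Report the libxml2 versions lxml was compiled against and runs with."""
    # Imported lazily so the sketch can be loaded without lxml installed.
    from lxml import etree

    return {
        'compiled': etree.LIBXML_COMPILED_VERSION,
        'runtime': etree.LIBXML_VERSION,
    }


if __name__ == '__main__':
    versions = lxml_libxml2_versions()
    print('lxml compiled against libxml2', versions['compiled'])
    print('lxml running with libxml2    ', versions['runtime'])
```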
You can set up html5-parser to run from a source checkout as follows:
git clone https://github.com/kovidgoyal/html5-parser && cd html5-parser
pip install --no-binary lxml 'lxml>=3.8.0' --user
python setup.py develop --user
Windows
On Windows, installation is a little more involved. There is a 200-line script that is used to install html5-parser and all its dependencies on the Windows continuous integration server. Using that script, installation can be done by running the following commands in a Visual Studio 2015 Command Prompt:
python.exe win-ci.py install_deps
python.exe win-ci.py test
This will install all dependencies and html5-parser in the sw sub-directory. You will need to add sw\bin to PATH and sw\python\Lib\site-packages to PYTHONPATH, or copy the files into your system Python's directories.
Benchmarking
There is a benchmark script named benchmark.py that compares the time taken to parse a large (~5.7 MB) HTML document with html5lib and with html5-parser. On my system the results show a speedup of 28x. The output from the script is:
Testing with HTML file of 5,956,815 bytes
Parsing repeatedly with html5-parser
html5-parser took an average of : 0.491 seconds to parse it
Parsing repeatedly with html5lib
html5lib took an average of : 13.744 seconds to parse it
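The 28x figure follows directly from the two averages in the output above. The sketch below shows the arithmetic, together with one way such an average could be measured using the standard timeit module. It is not the project's benchmark.py; the names average_parse_time and speedup are hypothetical.

```python
import timeit


def average_parse_time(parse_func, data, repeats=3):
    """Average wall-clock seconds taken by one call of parse_func(data)."""
    total = timeit.timeit(lambda: parse_func(data), number=repeats)
    return total / repeats


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is than the slow one."""
    return slow_seconds / fast_seconds


# Averages copied from the benchmark output above.
HTML5LIB_SECONDS = 13.744
HTML5_PARSER_SECONDS = 0.491

if __name__ == '__main__':
    ratio = speedup(HTML5LIB_SECONDS, HTML5_PARSER_SECONDS)
    print('speedup: %.1fx' % ratio)
```

Any callable that accepts the document can be passed as parse_func, so the same helper could time both parsers on the same input.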
There is further potential for speedup. Currently the gumbo subsystem uses its own cache for tag and attribute names and the libxml2 subsystem uses its own cache. Unifying the two to use the libxml2 cache should yield significant performance and memory consumption gains.
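To illustrate why a single cache could help, here is a toy model. It is purely illustrative Python, not the C data structures used by gumbo or libxml2, and the class and function names are hypothetical. Each cache keeps one stored copy per distinct name, so two separate caches hold every name twice while a shared cache holds it once.

```python
class NameCache:
    """A toy intern table: one stored copy per distinct name."""

    def __init__(self):
        self._names = {}

    def intern(self, name):
        # setdefault returns the stored copy if the name is already present.
        return self._names.setdefault(name, name)

    def __len__(self):
        return len(self._names)


def count_cached_entries(names, shared):
    """Intern every name in both subsystems; return total entries stored."""
    gumbo_cache = NameCache()
    libxml2_cache = gumbo_cache if shared else NameCache()
    for name in names:
        gumbo_cache.intern(name)    # stands in for the parsing step
        libxml2_cache.intern(name)  # stands in for the tree-building step
    if shared:
        return len(gumbo_cache)
    return len(gumbo_cache) + len(libxml2_cache)


if __name__ == '__main__':
    names = ['div', 'class', 'p', 'div', 'href', 'p']
    print('separate caches:', count_cached_entries(names, shared=False))
    print('shared cache:   ', count_cached_entries(names, shared=True))
```

The model only counts stored entries. It says nothing about the lookup cost or the actual memory layout in the C code, which is where the real gains would have to be measured.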