Python HTML/XML parser for easy web scraping.
Project description
What is it?
DHTMLParser is a lightweight HTML/XML parser created for one purpose - quick and easy picking selected tags from DOM.
It can be very useful when you are in need to write own “guerilla” API for some webpage, or a scrapper.
If you want, you can also create HTML/XML documents more easily than by joining strings.
Documentation
Full module documentation can be found here: http://pyDHTMLParser.rtfd.org
Changelog
2.2.3
2020-04-12 Fix by #25 (thx https://github.com/fm4d).
2.2.2
Attempt to fix strange recursive inheritance problem.
2.2.0
Rewritten for compatibility with python3.
2.1.0 - 2.1.8
State parser fixed - it can now recover from invalid html like <invalid tag=something">.
Rewritten to use StateEnum in parser for better readability.
Garbage collector is now disabled during _raw_split().
Fixed #16 - recovery after tags which don’t ends with > (</code for example).
Closed #17 - implementation of ignoring of < in usage as is smaller than sign.
Restored support of multiline attributes.
.parseString() now doesn’t try to parse HTML element parameters.
Implemented first() getter.
License changed to MIT.
Fixed #18: bug which in some cases caused invalid output.
Added HTMLElement.__repr__().
Added test_coverage.sh.
Added extended test_equality() coverage.
Formatting improvements.
Improved constructor handling, which is now much more readable.
Updated formatting of the setup.py.
Added more tests.
Fixed #22; bug in the SpecialDict.
Fixed some nasty unicode problems.
Fixed python 2 / 3 problem in docs/__init__.py.
getVersion() -> get_version().
2.0.10
Added more tests of removeTags().
run_tests.sh now gets arguments.
Check for string in removeTags() changed to basestring from str.
2.0.6 - 2.0.9
Fixed behaviour of toString() and tagToString().
SpecialDict is now derived from OrderedDict.
Changed and added tests of .params attribute (OrderedDict is now used).
Fixed bug in _repair_tags().
Removed _repair_tags() - it wasn’t really necessary.
Fixed nasty bug which could cause invalid XML output.
2.0.1 - 2.0.5
Fixed bugs in .match().
Fixed broken links in documentation.
Fixed bugs in .isAlmostEqual().
.find(); Fixed bug which prevented tag_name to be None.
Added op .__eq__() to the SpecialDict.
Added new method .containsParamSubset() to HTMLElement.
2.0.0
Rewritten, refactored, splitted to multiple files.
Added unittest coverage of almost 100% of the code.
Added better selector methods (.wfind(), .match)
Added Sphinx documentation.
Fixed a lot of bugs.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pyDHTMLParser-2.2.3-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46c5dd6e82378d81ba6e666902eea95ae611a113eaea7c9f91a47c454778ba37 |
|
MD5 | 666f672f736ff019deb355b21fdbde67 |
|
BLAKE2b-256 | 5b9f9d91e41eb7810483f67ee675241f538c6fba9badd9ef064886ec3997d9fd |