Web Data Extraction Library Written in Python
Wextracto is a toolkit for command-line web data extraction.
$ pip install wextracto
Kicking the Tyres
$ echo -e "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps" > entry_points.txt
$ wex "http://www.ebay.com/robots.txt"
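The entry_points.txt file written above uses the usual "name = module:attribute" convention inside a [wex] group. As a rough illustration of what that line means, here is a minimal sketch that parses the file content and resolves a "module:attribute" target to a Python object. The helper names parse_entry_points and load_target are hypothetical and are not part of Wextracto; this is not a description of how wex itself discovers extractors.

```python
import configparser
import importlib

# Same content as the file created by the echo command above.
ENTRY_POINTS_TEXT = "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps\n"


def parse_entry_points(text):
    """Return {group: {name: (module, attribute)}} for entry_points.txt content."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep entry point names case-sensitive
    parser.read_string(text)
    groups = {}
    for group in parser.sections():
        groups[group] = {}
        for name, target in parser.items(group):
            module, _, attribute = target.partition(":")
            groups[group][name] = (module.strip(), attribute.strip())
    return groups


def load_target(module, attribute):
    """Import `module` and walk the dotted `attribute` path to the final object."""
    obj = importlib.import_module(module)
    for part in attribute.split("."):
        obj = getattr(obj, part)
    return obj


entries = parse_entry_points(ENTRY_POINTS_TEXT)
print(entries["wex"]["sitemaps"])
```

With wextracto installed, load_target(*entries["wex"]["sitemaps"]) would import wex.sitemaps and return its urls_from_sitemaps attribute. Without it, the same helper can be tried on a standard-library target such as load_target("json", "dumps").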
The documentation can be found here:
- Add warc_protocol, warc_version, warc_headers to wex response
- Some (partial) support for Python 2.6
- Fix errors in PhantomJS responses
- Handle non utf-8 urls
- Ensure utf-8 is tried first even if not declared
- Support onInitialized in PhantomJS required modules
- Add --label argument for easy process-wide labelling
- Fix shutdown error caused by the daemon thread used for timeouts with PhantomJS
- Fix handling of directories in tarfiles read from stdin (-)
- Small fix to avoid a non-integer status code when an error occurs with PhantomJS
- Support ‘params’ keyword argument on URL.get
- Fix bug in handling HTML comments when fixing numeric character references
- Fix bug when using nested Cache objects
- Add support for reading WARC response format
- Fix bug in handling of invalid numeric character references
- Allow utf-8 in HTTP headers (only applies to PY2)
- Fix bug in HTTP decode caused by magic bytes handling.
- Add magic_bytes to Response for more reliable wex.http:decode behaviour.
- Re-worked encoding for HTML to pre-parse
- Better proxy support
- Now we flatten labels and values.
- href and src become href_url and src_url.
- Some API changes + switch to “tab-separated JSON”.
- Uploaded sdist to PyPI for “pip install wextracto” simplicity.
- Initial release as open source
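The change notes above mention a switch to "tab-separated JSON" output. A minimal sketch of reading and writing such lines follows, assuming each line is a sequence of JSON-encoded fields separated by tab characters. The function names and the sample fields are hypothetical, and the exact field layout that wex emits is not specified here.

```python
import json


def format_tsj_line(fields):
    """Encode each field as JSON and join the results with tabs.

    json.dumps escapes tabs and newlines inside strings, so a literal
    tab in the output can only be a field separator.
    """
    return "\t".join(json.dumps(field) for field in fields)


def parse_tsj_line(line):
    """Split a line on tabs and decode each field from JSON."""
    return [json.loads(field) for field in line.rstrip("\n").split("\t")]


# Hypothetical sample: a URL, a label and a value containing a tab.
sample = ["http://example.com/", "title", "A\tB"]
line = format_tsj_line(sample)
print(line)
print(parse_tsj_line(line))
```

Because every field is valid JSON on its own, such output can be processed line by line with ordinary text tools and then decoded field by field.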