Web Data Extraction Library Written in Python
Wextracto is a toolkit for command-line web data extraction.
$ pip install wextracto
Kicking the Tyres
$ echo -e "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps" > entry_points.txt
$ wex "http://www.ebay.com/robots.txt"
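The `entry_points.txt` file written above uses the usual entry-point layout: a `[wex]` group whose entries map a name to a `module:attribute` target. As a rough illustration of how such a file can be read, here is a minimal sketch. The helper `parse_entry_points` is hypothetical and is not part of Wextracto; it only splits the text and does not import `wex` or run any extractor.

```python
import configparser

# Same content that the echo command above writes to entry_points.txt.
ENTRY_POINTS_TEXT = "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps\n"


def parse_entry_points(text, group="wex"):
    """Return {name: (module, attribute)} for one entry-point group.

    Hypothetical helper for illustration only.
    """
    # Use "=" as the only delimiter so the ":" in "module:attribute"
    # is kept as part of the value.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep entry names case-sensitive
    parser.read_string(text)

    targets = {}
    for name, target in parser.items(group):
        module_name, _, attribute = target.partition(":")
        targets[name] = (module_name.strip(), attribute.strip())
    return targets


extractors = parse_entry_points(ENTRY_POINTS_TEXT)
print(extractors)
```

Under these assumptions, the `sitemaps` entry resolves to the attribute `urls_from_sitemaps` in the module `wex.sitemaps`, which is the name the `wex` command is pointed at in the example above.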
The documentation can be found here:
- Fix errors in PhantomJS responses
- Handle non utf-8 urls
- Ensure utf-8 is tried first even if not declared
- Support onInitialized in PhantomJS required modules
- Add --label argument for easy process-wide labelling
- Fix shutdown error caused by daemon thread for timeout with phantomjs
- Fix handling of directories in tarfiles read from stdin (-)
- Small fix to avoid non-integer status codes when errors occur with PhantomJS
- Support ‘params’ keyword argument on URL.get
- Fix bug in handling HTML comments when fixing numeric character references
- Fix bug when using nested Cache objects
- Add support for reading WARC response format
- Fix bug in handling of invalid numeric character references
- Allow utf-8 in HTTP headers (only applies to PY2)
- Fix bug in HTTP decode caused by magic bytes handling.
- Add magic_bytes to Response for more reliable wex.http:decode behaviour.
- Re-worked encoding for HTML to pre-parse
- Better proxy support
- Now we flatten labels and values.
- href and src become href_url and src_url.
- Some API changes + switch to “tab-separated JSON”.
- Uploaded sdist to PyPI for “pip install wextracto” simplicity.
- Initial release as open source
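The entries above on flattened labels and “tab-separated JSON” can be pictured with a small sketch. The helper name `to_tsj_line` and the exact field layout (labels first, value last, each field JSON-encoded) are assumptions made for illustration; they are not confirmed as Wextracto's actual output format.

```python
import json


def to_tsj_line(labels, value):
    """Encode labels and a value as one tab-separated line of JSON fields.

    Hypothetical helper: the layout is an assumption, not Wextracto's API.
    """
    fields = list(labels) + [value]
    # json.dumps escapes literal tabs inside strings as "\t" (backslash, t),
    # so a real tab character only ever appears between fields.
    return "\t".join(json.dumps(field) for field in fields)


line = to_tsj_line(["product", "name"], "Widget")
print(line)

# Reading a line back is a split on tabs plus a JSON decode per field.
decoded = [json.loads(field) for field in line.split("\t")]
print(decoded)
```

One reason such a format is convenient on the command line is that each line stays a single record that tools like `cut` or `sort` can split on tabs, while each field keeps its JSON type.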
|File Name||Version||File Type||Upload Date|
|Wextracto-0.14.1-py2.py3-none-any.whl (50.6 kB)||2.7||Wheel||Aug 17, 2017|
|Wextracto-0.14.1.tar.gz (45.0 kB)||–||Source||Aug 17, 2017|