An HTML/XML web scraping tool
Project description
Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, the extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
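To make the "repetitive elements" assumption concrete, here is a minimal sketch of the idea using only the standard library. It is not libextract's actual implementation: the function name `rank_repetitive_parents`, the scoring rule (count of the most common child tag), and the sample document are all assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_repetitive_parents(root, count=5):
    """Return up to `count` (element, tag, repeats) tuples, most repetitive first.

    Hypothetical helper, not part of libextract: for every element, find its
    most common child tag and how many times that tag repeats.
    """
    scored = []
    for parent in root.iter():
        child_tags = Counter(child.tag for child in parent)
        if not child_tags:
            continue  # leaf element, nothing to count
        tag, repeats = child_tags.most_common(1)[0]
        scored.append((parent, tag, repeats))
    # list.sort is stable, so elements with equal scores keep document order
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:count]


# Made-up sample document: a small navigation block and a larger table
SAMPLE = """
<html><body>
  <div id="nav"><a>Home</a><a>About</a></div>
  <table id="heights">
    <tr><th>Country</th><th>Height</th></tr>
    <tr><td>A</td><td>170</td></tr>
    <tr><td>B</td><td>172</td></tr>
    <tr><td>C</td><td>168</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
top = rank_repetitive_parents(root, count=2)
best_parent, best_tag, best_repeats = top[0]
print(best_parent.get("id"), best_tag, best_repeats)  # heights tr 4
```

The table wins because it holds four repeated `tr` children, while the navigation block holds only two links. A real extractor would likely weigh more signals than a raw child count.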
Overview
- libextract.api.extract(document, encoding='utf-8', count=5)
Given an HTML document, and optionally its encoding, return a list of the nodes most likely to contain data (5 by default).
Installation
pip install libextract
Usage
Because our definition of “data” is so simple, we expose a single function. Post-processing is up to you.
from requests import get
from libextract.api import extract
r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))
Using lxml’s built-in methods for post-processing:
>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...
The extraction algorithm is agnostic to content type; it handles tabular data just as it does article text:
height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
'Average male height',
'Average female height',
...]
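Since post-processing is left to the caller, one possible next step for a tabular result is turning it into row records. The sketch below is an illustration only: `table_to_records` and `clean_text` are hypothetical helpers, and the sample table stands in for a node such as `tabs[0]` with invented placeholder values (not real height data). It uses the standard library's ElementTree; lxml elements expose the same `iter`, `findall`, and `itertext` methods, so the helper should carry over.

```python
import xml.etree.ElementTree as ET


def clean_text(elem):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


def table_to_records(table):
    """Convert a table element into a list of dicts keyed by header text.

    Hypothetical helper, not part of libextract.
    """
    headers = [clean_text(th) for th in table.iter("th")]
    records = []
    for row in table.iter("tr"):
        cells = [clean_text(td) for td in row.findall("td")]
        # Skip the header row (no td cells) and rows that do not line up
        if cells and len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


# Stand-in for an extracted table node; all values are invented placeholders
SAMPLE_TABLE = """
<table>
  <tr><th>Country/Region</th><th>Average male height</th><th>Average female height</th></tr>
  <tr><td>Exampleland</td><td>1.75 m</td><td>1.62 m</td></tr>
  <tr><td>Placeholderia</td><td>1.70 m</td><td>1.60 m</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE_TABLE.strip())
records = table_to_records(table)
for record in records:
    print(record)
```

Rows whose cell count does not match the header count are skipped rather than padded, which keeps the sketch simple but would drop rows that use `colspan`.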
Dependencies
- lxml
- statscounter
Disclaimer
This project is still in its infancy; any advice and suggestions as to what this library could and should be would be greatly appreciated :)
Download files
Source Distribution
File details
Details for the file libextract-0.0.12.zip.
File metadata
- Download URL: libextract-0.0.12.zip
- Upload date:
- Size: 49.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc |
MD5 | 869acc9725a9d883a412c4e74ce400d3 |
BLAKE2b-256 | 88c8434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c |