An HTML/XML web scraping tool

Project description

    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, the extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
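
To make the "repetitive elements" assumption concrete, here is a minimal sketch using only the standard library. It is not libextract's actual implementation: the helper name rank_repetitive_parents, the scoring rule (frequency of the most common child tag), and the sample document are all assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_repetitive_parents(root, count=5):
    """Return up to `count` (element, tag, frequency) tuples, most repetitive first.

    Hypothetical helper: scores each element by how often its most common
    child tag repeats. This is an illustration, not libextract's algorithm.
    """
    scored = []
    for parent in root.iter():
        child_tags = Counter(child.tag for child in parent)
        if not child_tags:
            continue  # leaf element: nothing repeats under it
        tag, freq = child_tags.most_common(1)[0]
        scored.append((parent, tag, freq))
    # Stable sort: elements with equal frequency keep document order.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:count]


# Made-up sample document (well-formed, so the stdlib XML parser accepts it).
SAMPLE = """
<html><body>
  <div><h1>Title</h1><p>Intro</p></div>
  <table>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
    <tr><td>c</td><td>3</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
top = rank_repetitive_parents(root, count=2)
best_parent, best_tag, best_freq = top[0]
print(best_parent.tag, best_tag, best_freq)  # table tr 3
```

The table wins because its three tr children repeat more than any other child tag in the sample. A real-world document would need an HTML-tolerant parser such as lxml.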

Overview

libextract.api.extract(document, encoding='utf-8', count=5)

Given an HTML document and, optionally, its encoding, return a list of the nodes most likely to contain data (5 by default).

Installation

pip install libextract

Usage

Because our definition of “data” is so simple, we expose a single entry point. Post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml’s built-in methods for post-processing:

>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

The extraction algorithm is agnostic to content type; it handles tabular data just as it handles article text:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
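
Since post-processing is left to the caller, one possible next step is to flatten an extracted table into rows of cell text. The sketch below is an assumption-laden illustration: table_to_rows is a hypothetical helper, not part of libextract, and it runs on a stand-in table built with the standard library so that it needs no network access. It relies only on iter() and itertext(), which lxml elements also provide. The cell values are made-up placeholders, not real data.

```python
import xml.etree.ElementTree as ET


def table_to_rows(table):
    """Turn a table-like element into a list of rows of stripped cell text.

    Hypothetical helper; works on any element exposing iter() and itertext().
    """
    rows = []
    for tr in table.iter("tr"):
        cells = [
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("th", "td")  # ignore anything that is not a cell
        ]
        if cells:
            rows.append(cells)
    return rows


# Stand-in for an extracted table node; the body row is placeholder data.
table = ET.fromstring(
    "<table>"
    "<tr><th>Country/Region</th><th>Average male height</th></tr>"
    "<tr><td>Exampleland</td><td> 1.75 m </td></tr>"
    "</table>"
)
rows = table_to_rows(table)
header, body = rows[0], rows[1:]
print(header)  # ['Country/Region', 'Average male height']
print(body)    # [['Exampleland', '1.75 m']]
```

Treating the first row as the header is a simplification; tables with multi-row headers or row-spanning cells would need more care.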

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; any advice and suggestions as to what this library could and should be would be greatly appreciated.

:)

Download files


Source Distribution

libextract-0.0.12.zip (49.8 kB view details)

Uploaded Source

File details

Details for the file libextract-0.0.12.zip.

File metadata

  • Download URL: libextract-0.0.12.zip
  • Upload date:
  • Size: 49.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for libextract-0.0.12.zip
Algorithm Hash digest
SHA256 053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc
MD5 869acc9725a9d883a412c4e74ce400d3
BLAKE2b-256 88c8434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c
