This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

A HT/XML web scraping tool

Project Description
    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and written in Python. Originating from eatiht, the extraction algorithm works by making one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.

Overview

libextract.api.extract(document, encoding=’utf-8’, count=5)
Given an html document, and optionally the encoding, return a list of nodes likely containing data (5 by default).

Installation

pip install libextract

Usage

Due to our simple definition of “data”, we open up a single interfaceable method. Post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml’s built-in methods for post-processing:

>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

The extraction algo is agnostic to article text as it is with tabular data:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; and advice and suggestions as to what this library could and should be would be greatly appreciated

:)

Release History

Release History

This version
History Node

0.0.12

History Node

0.0.11

History Node

0.0.1

History Node

0.0.0

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
libextract-0.0.12.zip (49.8 kB) Copy SHA256 Checksum SHA256 Source Jun 28, 2015

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting