A ridiculously simple HT/XML article-text extractor

    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistical extraction library for HTML and XML documents, written in Python and originating from eatiht. Its aim is to provide simple, declaratively composed, pipelined functions with which users describe their extraction algorithms.

Overview

libextract.extract(doc)

Extracts text (by default) from a given HT/XML string doc. What is extracted, and how, can be configured with the strategy parameter, which accepts an iterable of functions that are piped into one another: each function's result becomes the next function's argument.
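The piping behaviour described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not libextract's own implementation: run_pipeline, drop_tags, split_words, join_words and DEMO_STRATEGY are hypothetical names invented for this example.

```python
import re
from functools import reduce


def run_pipeline(doc, strategy):
    """Feed doc through each function in strategy, in order.

    Hypothetical helper: the result of one step is the argument of the next.
    """
    return reduce(lambda result, step: step(result), strategy, doc)


# Hypothetical stages, for illustration only (not libextract strategies).
def drop_tags(doc):
    # Crude tag removal with a regex; good enough for a demo string.
    return re.sub(r"<[^>]+>", " ", doc)


def split_words(text):
    return text.split()


def join_words(words):
    return " ".join(words)


DEMO_STRATEGY = (drop_tags, split_words, join_words)

result = run_pipeline("<p>Hello <b>world</b></p>", DEMO_STRATEGY)
print(result)
```

Because a strategy in this sketch is just an ordered iterable of callables, swapping, adding or removing a stage changes what is extracted without touching the runner.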

Installation

pip install libextract

Usage

Extracting the text from a Wikipedia page:

from requests import get
from libextract import extract

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

Getting the node that most likely contains the article's text nodes:

from libextract.strategies import ARTICLE_NODE

node = extract(r.content, strategy=ARTICLE_NODE)

To serialize the node into JSON format:

>>> from libextract.formatters import node_json
>>> node_json(node, depth=1)
{'children': [...],
 'class': ['mw-content-ltr'],
 'id': ['mw-content-text'],
 'tag': 'div',
 'text': None,
 'xpath': '/html/body/div[3]/div[3]/div[4]'}
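To give a feel for what a depth-limited serializer of this kind might do, here is a rough sketch built on the standard library's xml.etree.ElementTree rather than lxml. It is an assumption-laden illustration, not the code behind node_json: node_to_dict is a hypothetical name, the output shape is simplified (no xpath key, id returned as a plain string), and the sample markup is made up.

```python
import json
import xml.etree.ElementTree as ET


def node_to_dict(node, depth=1):
    """Hypothetical serializer: describe node and descend depth levels."""
    if depth > 0:
        children = [node_to_dict(child, depth - 1) for child in node]
    else:
        # Depth budget exhausted: do not descend any further.
        children = []
    return {
        "tag": node.tag,
        "id": node.get("id"),
        "class": (node.get("class") or "").split(),
        "text": (node.text or "").strip() or None,
        "children": children,
    }


# Made-up sample markup, only for the demonstration.
sample = ET.fromstring(
    '<div id="main" class="a b"><p>Hi</p><p>There <i>x</i></p></div>'
)
as_dict = node_to_dict(sample, depth=1)
print(json.dumps(as_dict, indent=1))
```

With depth=1 the two p children are described, but their own children are left out.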

Using tabular extraction to get the nodes that contain tabular data in the HT/XML document:

from libextract.strategies import TABULAR

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content, strategy=TABULAR))

To convert an HT/XML element to a Python list:

>>> from libextract.formatters import table_list
>>> table_list(tabs[0])
[['Country/Region',
  'Average male height',
  'Average female height',
  'Stature ratio (male to female)',
  'Sample population / age range',
  ...]]
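For readers curious how a table element can be flattened into nested lists, the following is a small sketch using only the standard library. It is not the implementation of table_list, and the shape that table_list returns may differ (for example, it may group by column rather than by row). table_rows and the SAMPLE markup are hypothetical, made up for this example.

```python
import xml.etree.ElementTree as ET


def table_rows(table):
    """Hypothetical helper: one list of cell strings per <tr> in table."""
    rows = []
    for tr in table.iter("tr"):
        cells = [
            # itertext() gathers text from nested tags such as <b> or <a>.
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        rows.append(cells)
    return rows


# Made-up sample data, only for the demonstration.
SAMPLE = """
<table>
  <tr><th>Country</th><th>Height</th></tr>
  <tr><td>A</td><td><b>170</b> cm</td></tr>
</table>
"""

rows = table_rows(ET.fromstring(SAMPLE.strip()))
print(rows)
```

Real-world HTML is often not well-formed XML, so a sketch like this would need a forgiving parser such as lxml.html in practice.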

Viewing the table in your browser:

from lxml.html import open_in_browser
open_in_browser(tabs[0])


Source distribution: libextract-0.0.1.zip (47.0 kB)
