Extract structured data from HTML and XML like a boss.

NOTE: xextract is still under construction.

xextract is simple enough for writing a one-line parser, yet powerful enough to be a part of a big project.

Features

  • Simple declarative style of parsers

  • Parsing of HTML and XML documents

  • Supports xpath and css selectors

  • Built-in self-validation to let you know when the structure of the website has changed

  • Speed - under the hood the library uses lxml library with compiled xpath selectors
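The speed claim rests on lxml's ability to pre-compile xpath expressions. The sketch below uses plain lxml (not xextract's internals) to illustrate the idea; the selector and HTML are illustrative:

```python
from lxml import etree

# Compile the xpath expression once...
find_titles = etree.XPath('//h4/text()')

# ...then reuse it across many documents without re-parsing the expression.
root = etree.HTML('<div><h4>President</h4><h4>US Senator</h4></div>')
print(find_titles(root))  # ['President', 'US Senator']
```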

A little taste of it

# fetch the website
import requests
response = requests.get('https://www.linkedin.com/in/barackobama')

# parse like a boss
from xextract import Prefix, Group, String, Url
Prefix(css='#profile', children=[
    String(name='name', css='.full-name', quant=1),
    String(name='title', css='.title', quant=1),
    Group(name='jobs', css='#background-experience .section-item', quant='*', children=[
        String(name='title', css='h4', quant=1),
        String(name='company', css='h5', quant=1, attr='_all_text'),
        Url(name='company_url', css='h5 a', quant='?'),
        String(name='description', css='.description', quant='?')
    ])
]).parse(response.text, url=response.url)

Output:

{'name': u'Barack Obama',
 'title': u'President of the United States of America',
 'jobs': [
     {'company': u'United States of America',
      'company_url': None,
      'description': u'I am serving as the 44th President of the United States of America.',
      'title': u'President'},
     {'company': u'US Senate (IL-D)',
      'company_url': u'http://www.linkedin.com/company/4560?trk=ppro_cprof',
      'description': u'In the U.S. Senate, I sought to focus on tackling the challenges of a globalized, 21st century world with fresh thinking and a politics that no longer settles for the lowest common denominator.',
      'title': u'US Senator'},
     {'company': u'Illinois State Senate',
      'company_url': None,
      'description': u"Proudly representing the 13th District on Chicago's south side.",
      'title': u'State Senator'},
     {'company': u'University of Chicago Law School',
      'company_url': u'http://www.linkedin.com/company/3878?trk=ppro_cprof',
      'description': None,
      'title': u'Senior Lecturer in Law'}]}

Installation

To install xextract, simply:

$ pip install xextract

Requirements: six, lxml, cssselect

Supported Python versions are 2.6, 2.7, 3.x.

Basic usage

In the examples below we will demonstrate how to parse data from a Linkedin profile, so add the following code at the top of your file:

from xextract import *
import requests
response = requests.get('https://www.linkedin.com/in/barackobama')
html, url = response.text, response.url

To extract the name from a Linkedin profile, call:

>>> String(name='name', css='.full-name', quant=1).parse(html)
{'name': u'Barack Obama'}

You can see that the parsed data are returned in a dictionary.

Parameters we passed to the parser have the following meaning:

  • name (required) - dictionary key under which to store the parsed data.

  • css (required) - css selector to the HTML element containing the data.

  • quant (optional) - number of HTML elements we expect to match with the css selector. In the above case we expect exactly one element. If the number of elements doesn’t match, a ParsingError exception is raised:

    >>> String(name='name', css='.full-name', quant=2).parse(html)
    xextract.selectors.ParsingError: Number of "name" elements, 1, does not match the expected quantity "2".

If you don’t pass the quant parameter, two things happen. First, the number of matched elements is not validated, i.e. you can match zero or more elements and no exception is raised. Second, the extracted data are returned as a (possibly empty) list of values (for more details see the quant reference):

>>> String(name='name', css='.full-name').parse(html)
{'name': [u'Barack Obama']}  # note that the extracted data are in the list

In the previous example we could have used xpath instead of css selector:

>>> String(name='name', xpath='//*[@class="full-name"]', quant=1).parse(html)
{'name': u'Barack Obama'}

By default, String extracts the text content of the element. To extract the data from an HTML attribute, use attr parameter:

>>> String(name='css-class', css='span', quant=1, attr='class').parse('<span class="hello"></span>')
{'css-class': u'hello'}

Suppose you want to extract the whole text “500+ connections” from the following HTML structure:

<div class="member-connections">
    <strong>500+</strong>
    connections
</div>

By default, the String parser extracts only the text directly inside the matched elements, not the text of their descendants:

>>> String(name='connections', css='.member-connections', quant=1).parse(html)
{'connections': u' connections'}

To extract and concatenate the text of every descendant element, use the attr parameter with the special value ‘_all_text’:

>>> String(name='connections', css='.member-connections', quant=1, attr='_all_text').parse(html)
{'connections': u'500+ connections'}

To extract the url of the profile picture, use Url parser instead of String:

>>> Url(name='profile-picture', css='.profile-picture img', quant=1, attr='src').parse(html, url=url)
{'profile-picture': u'https://media.licdn.com/mpr/mpr/shrink_200_200/p/2/000/1a3/129/3a73f4c.jpg'}

When you use the Url parser and pass the url parameter to the parse() method, the parser will return absolute url addresses. By default, Url extracts the value of the href attribute of the matched element. To extract the value of a different attribute (e.g. src), pass it as the attr parameter.
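The absolute-url behavior matches what the standard library’s urljoin does; a rough equivalent (not xextract’s actual code):

```python
from urllib.parse import urljoin

# Resolve a relative href against the page url, the way Url does
# when parse() receives the url parameter.
base = 'https://www.linkedin.com/in/barackobama'
print(urljoin(base, '/company/4560'))  # https://www.linkedin.com/company/4560
```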


To extract the list of jobs, storing the company name and the job title for each job, use the Group parser to group the job data together:

>>> Group(name='jobs', css='#background-experience .section-item', quant='+', children=[
...     String(name='title', css='h4', quant=1),
...     String(name='company', css='h5', quant=1, attr='_all_text')
... ]).parse(html)
{'jobs': [
    {'company': u'United States of America', 'title': u'President'},
    {'company': u'US Senate (IL-D)', 'title': u'US Senator'},
    {'company': u'Illinois State Senate', 'title': u'State Senator'},
    {'company': u'University of Chicago Law School', 'title': u'Senior Lecturer in Law'}]}

In this case Group matched four elements, each containing a single h4 and a single h5 element.

Parser reference

String

Parameters: name (required), css / xpath (required), quant (optional, default '*'), attr (optional, default '_text'), namespaces (optional)

Returns the raw string extracted from the matched element. Returned value is always unicode.

Use attr parameter to extract the data from an HTML attribute.

By default, String extracts the text content of the matched element, but not its descendants. To extract and concatenate the text out of every descendant element, use attr parameter with the special value ‘_all_text’:

Example:

>>> String(name='text', css='span', quant=1).parse('<span>Hello <b>world!</b></span>')
{'text': u'Hello '}

>>> String(name='text', css='span', quant=1, attr='_all_text').parse('<span>Hello <b>world!</b></span>')
{'text': u'Hello world!'}

Url

Parameters: name (required), css / xpath (required), quant (optional, default '*'), attr (optional, default 'href'), namespaces (optional)

Behaves like String parser, but with two exceptions:

  • default value for attr parameter is 'href'

  • if you pass url parameter to parse() method, the absolute urls will be extracted and returned

Example:

>>> html = '<a href="/test">Link</a>'
>>> Url(name='url', css='a', quant=1).parse(html)
{'url': u'/test'}  # without url passed, Url parser behaves just like the String parser

>>> Url(name='url', css='a', quant=1).parse(html, url='http://github.com/Mimino666')
{'url': u'http://github.com/test'}  # absolute url address. Told ya!

DateTime

Parameters: name (required), css / xpath (required), format (required), quant (optional, default '*'), attr (optional, default '_text'), namespaces (optional)

Returns the datetime object constructed out of the parsed data with: datetime.strptime(value, format).

Use format parameter to specify how to parse the datetime object. Syntax is described in the Python documentation.
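Internally this boils down to a plain strptime call, so you can prototype the format string on its own:

```python
from datetime import datetime

# The same conversion DateTime performs on the extracted string.
print(datetime.strptime('24.12.2015', '%d.%m.%Y'))  # 2015-12-24 00:00:00
```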

Example:

>>> DateTime(name='christmas', css='span', quant=1, format='%d.%m.%Y').parse('<span>24.12.2015</span>')
{'christmas': datetime.datetime(2015, 12, 24, 0, 0)}

Element

Parameters: name (required), css / xpath (required), quant (optional, default '*'), namespaces (optional)

Returns the instance of lxml.etree._Element.

This parser doesn’t extract any value, but returns the matched element itself.

Example:

>>> Element(name='span', css='span', quant=1).parse('<span>Hello</span>')
{'span': <Element span at 0x2ac2990>}

Group

Parameters: name (required), css / xpath (required), children (required), quant (optional, default '*'), namespaces (optional)

Returns the dictionary containing the data extracted by the parsers listed in children parameter.

Typical use case for this parser is when you want to parse a list of user profiles and each profile further contains additional fields like name, address, etc. Use Group parser to group the fields of each profile together.

Example:

>>> html = '<ul><li id="id1">Hello</li> <li id="id2">world!</li></ul>'
>>> Group(name='data', css='li', quant=2, children=[
...     String(name='id', xpath='self::*', quant=1, attr='id'),
...     String(name='text', xpath='self::*', quant=1)
... ]).parse(html)
{'data': [
    {'text': u'Hello', 'id': u'id1'},
    {'text': u'world!', 'id': u'id2'}]}

Prefix

Parameters: css / xpath (required), children (required), namespaces (optional)

Behaves like the Group parser, but the matched elements are not stored under a name. Use it to narrow down the part of the document in which the children parsers look for data (see the example in the first code snippet above).

Parser parameters

name

Parsers: String, Url, DateTime, Element, Group

Specifies the dictionary key under which to store the extracted data.

If multiple parsers under the Group or Prefix parser have the same name, the behavior is undefined.

css / xpath

Parsers: String, Url, DateTime, Element, Group, Prefix

Use either css or xpath parameter (but not both) to select the elements from which to extract the data.

Under the hood css selectors are translated into equivalent xpath selectors.

For the children of the Prefix or Group parsers the elements are selected relative to the elements matched by the parent parser. For example:

Prefix(xpath='//*[@id="profile"]', children=[
    # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="full-name"]
    String(name='name', css='.full-name', quant=1),
    # equivalent to: //*[@id="profile"]/*[@class="title"]
    String(name='title', xpath='*[@class="title"]', quant=1),
    # equivalent to: //*[@class="subtitle"]
    String(name='subtitle', xpath='//*[@class="subtitle"]', quant=1)
])

quant

Parsers: String, Url, DateTime, Element, Group

Default value: '*'

Number of matched elements is validated against the quant parameter. If the number of elements doesn’t match the expected quantity, ParsingError exception is raised. In practice you can use this to be notified when the website changed its HTML structure.

The syntax for quant mimics regular expressions. You can pass the value as a string, a single integer, or a tuple of two integers.

Depending on the value of quant, the extracted data are returned either as a single value or a list of values.

Value of quant    Meaning                                        Extracted data
----------------  ---------------------------------------------  -----------------------
'*' (default)     Zero or more elements.                         List of values
'+'               One or more elements.                          List of values
'?'               Zero or one element.                           Single value or None
num               Exactly num elements. You can pass either a    num == 0: None
                  string or an integer.                          num == 1: single value
                                                                 num > 1: list of values
(num1, num2)      Between num1 and num2 elements, inclusive.     List of values
                  You can pass either a string or a tuple.

Examples:

>>> String(name='name', css='.name', quant=1).parse(html)
{'name': u'Barack Obama'}

>>> String(name='name', css='.name', quant='1').parse(html)  # same as above
{'name': u'Barack Obama'}

>>> String(name='name', css='.name', quant=(1,2)).parse(html)
{'name': [u'Barack Obama']}

>>> String(name='name', css='.name', quant='1,2').parse(html)  # same as above
{'name': [u'Barack Obama']}

>>> String(name='middle-name', css='.middle', quant='?').parse(html)
{'middle-name': None}

>>> String(name='job-titles', css='#background-experience .section-item h4', quant='+').parse(html)
{'job-titles': [u'President', u'US Senator', u'State Senator', u'Senior Lecturer in Law']}

>>> String(name='friends', css='.friend', quant='*').parse(html)
{'friends': []}

>>> String(name='friends', css='.friend', quant='+').parse(html)
xextract.selectors.ParsingError: Number of "friends" elements, 0, does not match the expected quantity "+".

attr

Parsers: String, Url, DateTime

Default value: 'href' for Url parser. '_text' otherwise.

Use attr parameter to specify what to extract from the matched element.

Value of attr    Meaning
---------------  ------------------------------------------------------------------
'_text'          Extract the text content of the matched element.
'_all_text'      Extract and concatenate the text content of the matched element
                 and all its descendants.
att_name         Extract the value of the att_name attribute of the matched
                 element. If the attribute doesn’t exist, an empty string is
                 returned.

For the following HTML structure:

<span class="name">Barack <strong>Obama</strong> III.</span>
<a href="/test">Link</a>

Here are a few examples:

>>> String(name='name', css='.name', quant=1).parse(html)
{'name': u'Barack  III.'}

>>> String(name='name', css='.name', quant=1, attr='_text').parse(html)  # same as above
{'name': u'Barack  III.'}

>>> String(name='full-name', css='.name', quant=1, attr='_all_text').parse(html)
{'full-name': u'Barack Obama III.'}

>>> String(name='link', css='a', quant='1').parse(html)  # String extracts text content by default
{'link': u'Link'}

>>> Url(name='link', css='a', quant='1').parse(html)  # Url extracts href by default
{'link': u'/test'}

>>> String(name='id', css='a', quant='1', attr='id').parse(html)  # non-existent attributes return empty string
{'id': u''}

children

Parsers: Group, Prefix

List of parsers applied to each of the elements matched by the parent parser. Their css / xpath selectors are evaluated relative to the matched parent element.

namespaces

Optional mapping of namespace prefixes to namespace URIs. Pass it when parsing XML documents that use namespaces, so that prefixed names can be used in the selectors.
