
Extract embedded metadata from HTML markup

Project Description

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct supports Microdata, JSON-LD and RDFa (RDFa support is experimental).

The microdata algorithm revisits a Scrapinghub blog post that shows how to use EXSLT extensions.

Roadmap

Installation

pip install extruct

Usage

All-in-one extraction

The simplest way to use extruct is to call extruct.extract(htmlstring, url) with an HTML string and the URL it was fetched from.

Let’s try this on a page on eBay which uses microdata and RDFa (with ogp).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import requests
>>> from pprint import pprint

>>> r = requests.get('http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487')

>>> import extruct
>>> data = extruct.extract(r.text, r.url)

>>> pprint(data)
{'json-ld': [],
 'microdata': [{'properties': {'image': ['http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg',
                                         'http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg'],
                               'name': 'Details about  \xa0HERBERT TERRY 2 '
                                       'STEP ANGLEPOISE LAMP MODEL1227',
                               'offers': {'properties': {'areaServed': 'United '
                                                                       'Kingdom '
                                                                       'and '
                                                                       'many '
                                                                       'other '
                                                                       'countries \n'
                                                                       '\t\t\t\t\t\t'
                                                                       '|  See '
                                                                       'details',
                                                         'availability': 'http://schema.org/InStock',
                                                         'availableAtOrFrom': 'Stockport, '
                                                                              'United '
                                                                              'Kingdom',
                                                         'itemCondition': '--not '
                                                                          'specified',
                                                         'price': '150.0',
                                                         'priceCurrency': 'GBP'},
                                          'type': 'http://schema.org/Offer'}},
                'type': 'http://schema.org/Product'},
               {'properties': {'itemListElement': [{'properties': {'item': 'http://www.ebay.com/sch/Antiques-/20081/i.html',
                                                                   'name': 'Antiques',
                                                                   'position': '1'},
                                                    'type': 'http://schema.org/ListItem'},
                                                   (...)
                                                   {'properties': {'item': 'http://www.ebay.com/sch/20th-Century-/66861/i.html',
                                                                   'name': '20th '
                                                                           'Century',
                                                                   'position': '4'},
                                                    'type': 'http://schema.org/ListItem'}]},
                'type': 'http://schema.org/BreadcrumbList'}],
 'rdfa': [{'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#w1-31-_topHelpTxt',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
          (...)
          {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487',
           'http://opengraphprotocol.org/schema/description': [{'@value': 'On '
                                                                          'one '
                                                                          'side '
                                                                          'of '
                                                                          'the '
                                                                          'base '
                                                                          'is '
                                                                          'a '
                                                                          'metal '
                                                                          'label '
                                                                          'from '
                                                                          'UMIST, '
                                                                          'where '
                                                                          'it '
                                                                          'was '
                                                                          'in '
                                                                          'use. '
                                                                          '| '
                                                                          'eBay!'}],
           'http://opengraphprotocol.org/schema/image': [{'@value': 'http://i.ebayimg.com/images/i/282478964487-0-1/s-l1000.jpg'}],
           'http://opengraphprotocol.org/schema/site_name': [{'@value': 'eBay'}],
           'http://opengraphprotocol.org/schema/title': [{'@value': 'HERBERT '
                                                                    'TERRY 2 '
                                                                    'STEP '
                                                                    'ANGLEPOISE '
                                                                    'LAMP '
                                                                    'MODEL1227  '
                                                                    '| eBay'}],
           'http://opengraphprotocol.org/schema/type': [{'@value': 'ebay-objects:item'}],
           'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487'}],
           'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '102628213125203'}]},
          {'@id': '_:Na28391785e4e48bb92849fccbe758c6b',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
          (...)
          {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#glbfooter',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#contentinfo'}]}]}
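
The microdata part of the result is a list of nested items, each with a 'type' and a 'properties' dict. A minimal sketch of walking that structure follows; it uses a trimmed copy of the eBay output above, and the helper names iter_items and items_of_type are hypothetical, not part of extruct.

```python
# Trimmed copy of the result shown above (only a few properties kept).
data = {
    'json-ld': [],
    'microdata': [
        {'type': 'http://schema.org/Product',
         'properties': {
             'name': 'Details about  \xa0HERBERT TERRY 2 STEP ANGLEPOISE LAMP MODEL1227',
             'offers': {'type': 'http://schema.org/Offer',
                        'properties': {'price': '150.0',
                                       'priceCurrency': 'GBP'}}}},
        {'type': 'http://schema.org/BreadcrumbList',
         'properties': {'itemListElement': [
             {'type': 'http://schema.org/ListItem',
              'properties': {'name': 'Antiques', 'position': '1'}}]}},
    ],
    'rdfa': [],
}


def iter_items(value):
    """Yield every microdata item (a dict with 'properties'), nested ones included."""
    if isinstance(value, list):
        for element in value:
            yield from iter_items(element)
    elif isinstance(value, dict) and 'properties' in value:
        yield value
        for prop in value['properties'].values():
            yield from iter_items(prop)


def items_of_type(data, item_type):
    """Hypothetical helper: all microdata items whose 'type' equals item_type."""
    return [item for item in iter_items(data['microdata'])
            if item.get('type') == item_type]


offers = items_of_type(data, 'http://schema.org/Offer')
print(offers[0]['properties']['price'], offers[0]['properties']['priceCurrency'])
# prints: 150.0 GBP
```

Property values are plain strings, nested items, or lists of either, so the walker only needs to recurse into lists and into dicts that carry 'properties'.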

Another example with a page from SongKick containing RDFa and JSON-LD metadata:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')

>>> data = extruct.extract(r.text, r.url)

>>> pprint(data)
{'json-ld': [{'@context': 'http://schema.org',
              '@type': 'MusicEvent',
              'location': {'@type': 'Place',
                           'address': {'@type': 'PostalAddress',
                                       'addressCountry': 'US',
                                       'addressLocality': 'Brooklyn',
                                       'addressRegion': 'NY',
                                       'postalCode': '11225',
                                       'streetAddress': '497 Rogers Ave'},
                           'geo': {'@type': 'GeoCoordinates',
                                   'latitude': 40.660109,
                                   'longitude': -73.953193},
                           'name': 'The Owl Music Parlor',
                           'sameAs': 'http://www.theowl.nyc'},
              'name': 'Elysian Fields',
              'performer': [{'@type': 'MusicGroup',
                             'name': 'Elysian Fields',
                             'sameAs': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}],
              'startDate': '2017-06-10T19:30:00-0400',
              'url': 'http://www.songkick.com/concerts/30173984-elysian-fields-at-owl-music-parlor?utm_medium=organic&utm_source=microformat'},
             (...)
             {'@context': 'http://schema.org',
              '@type': 'MusicGroup',
              'image': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar',
              'interactionCount': '5557 UserLikes',
              'logo': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar',
              'name': 'Elysian Fields',
              'url': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}],
 'microdata': [],
 'rdfa': [{'@id': 'http://www.songkick.com/artists/236156-elysian-fields',
           'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
           'al:ios:app_store_id': [{'@value': '438690886'}],
           'al:ios:url': [{'@value': 'songkick://artists/236156-elysian-fields'}],
           'http://ogp.me/ns#description': [{'@value': 'Buy tickets for an '
                                                       'upcoming Elysian '
                                                       'Fields concert near '
                                                       'you. List of all '
                                                       'Elysian Fields tickets '
                                                       'and tour dates for '
                                                       '2017.'}],
           'http://ogp.me/ns#image': [{'@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
           'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
           'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
           'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
           'http://ogp.me/ns#url': [{'@value': 'http://www.songkick.com/artists/236156-elysian-fields'}],
           'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '308540029359'}]}]}
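
Since the 'json-ld' entry is a list of ordinary dicts, picking out one kind of object is a matter of filtering on '@type'. The sketch below works on a trimmed copy of the SongKick output above; by_type is a hypothetical helper, and it assumes '@type' is a single string as in this page (JSON-LD also allows a list there).

```python
from datetime import datetime

# Trimmed copy of the 'json-ld' list shown above.
json_ld = [
    {'@context': 'http://schema.org',
     '@type': 'MusicEvent',
     'name': 'Elysian Fields',
     'startDate': '2017-06-10T19:30:00-0400',
     'location': {'@type': 'Place', 'name': 'The Owl Music Parlor'}},
    {'@context': 'http://schema.org',
     '@type': 'MusicGroup',
     'name': 'Elysian Fields'},
]


def by_type(items, wanted):
    """Hypothetical helper: keep the objects whose '@type' equals wanted."""
    return [item for item in items if item.get('@type') == wanted]


events = by_type(json_ld, 'MusicEvent')
for event in events:
    # The date format is the one used on this page; other sites may differ.
    start = datetime.strptime(event['startDate'], '%Y-%m-%dT%H:%M:%S%z')
    print(start.isoformat(), event['location']['name'])
```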

You can also use each extractor individually. See below.

Microdata extraction

>>> from pprint import pprint
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]
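
Note that the 'work' values are absolute URLs although the markup uses relative src paths. The output is consistent with resolving them against a base URL of http://www.example.com/; that base is inferred from the output, not stated in the text. The standard library shows the same resolution:

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.example.com/'  # assumed base, inferred from the output above


def resolve_all(base, references):
    """Resolve each (possibly relative) reference against base."""
    return [urljoin(base, ref) for ref in references]


resolved = resolve_all(BASE_URL, [
    'images/house.jpeg',
    'images/mailbox.jpeg',
    'http://www.opensource.org/licenses/mit-license.php',  # already absolute
])
for url in resolved:
    print(url)
# http://www.example.com/images/house.jpeg
# http://www.example.com/images/mailbox.jpeg
# http://www.opensource.org/licenses/mit-license.php
```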

JSON-LD extraction

>>> from pprint import pprint
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]
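
To make the idea concrete, here is a standard-library-only sketch of what JSON-LD extraction amounts to: find every script element of type application/ld+json and decode its content as JSON. This illustrates the technique only; it is not extruct's implementation, and the class name JsonLdScriptParser is hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect the decoded JSON of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._buffer is not None:
            self.documents.append(json.loads(''.join(self._buffer)))
            self._buffer = None


sample = """<html><body>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "John Doe"}
</script>
</body></html>"""

parser = JsonLdScriptParser()
parser.feed(sample)
parser.close()
print(parser.documents)
```

A real extractor also has to cope with malformed JSON and character encodings, which this sketch ignores.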

RDFa extraction (experimental)

>>> from pprint import pprint
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pprint(
...     rdfae.extract(html, url='http://www.example.com/index.html')
... )
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You’ll get a list of expanded JSON-LD nodes.
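
In expanded form, every property maps to a list of objects that hold either a literal ('@value') or a reference to another node ('@id'). A minimal sketch of flattening one node into plain strings, using part of the output above; simplify is a hypothetical helper, not part of extruct.

```python
# Part of the node from the output above.
node = {
    '@id': 'http://www.example.com/alice/posts/trouble_with_bob',
    '@type': ['http://schema.org/BlogPosting'],
    'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
    'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
}


def simplify(node):
    """Map each property IRI to a list of literal values or referenced IRIs."""
    simple = {}
    for key, values in node.items():
        if key.startswith('@'):
            continue  # '@id' and '@type' are keywords, not properties
        simple[key] = [v.get('@value', v.get('@id')) for v in values]
    return simple


print(simplify(node))
```

Flattening loses the distinction between literals and references, so keep the original node when that difference matters.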

REST API service

extruct also ships with a REST API service to test its output from URLs.

Dependencies

Usage

python -m extruct.service

launches an HTTP server listening on port 10005.

Methods supported

/extruct/<URL>
method = GET


/extruct/batch
method = POST
params:
    urls - a list of URLs separated by newlines
    urlsfile - a file with one URL per line

E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412

will output something like this:

{
   "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
   "status":"ok",
   "microdata":[
         {
            "type":"http://schema.org/Product",
            "properties":{
               "name":"Susket",
               "color":[
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
               ],
               "brand":"http://www.sarenza.com/i-love-shoes",
               "aggregateRating":{
                  "type":"http://schema.org/AggregateRating",
                  "properties":{
                     "description":"Soyez le premier \u00e0 donner votre avis"
                  }
               },
               "offers":{
                  "type":"http://schema.org/AggregateOffer",
                  "properties":{
                     "lowPrice":"59,00 \u20ac",
                     "price":"A partir de\r\n                  59,00 \u20ac",
                     "priceCurrency":"EUR",
                     "highPrice":"59,00 \u20ac",
                     "availability":"http://schema.org/InStock"
                  }
               },
               "size":[
                  "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "39 - Derni\u00e8re paire !",
                  "40",
                  "41",
                  "42 - Derni\u00e8re paire !"
               ],
               "image":[
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                  "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
               ],
               "description":""
            }
         }
   ]
}
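
A client for the service can be sketched with the standard library alone. The helpers below only build the request pieces and read a response; the function names are hypothetical, and sending the batch parameters form-encoded is an assumption, since the text does not say how they are transmitted.

```python
import json
from urllib.parse import urlencode

SERVICE = 'http://localhost:10005'  # port taken from the text above


def single_url(page_url, service=SERVICE):
    """URL for GET /extruct/<URL>; the page URL is appended as-is, as in the example."""
    return '{}/extruct/{}'.format(service, page_url)


def batch_body(urls):
    """Body for POST /extruct/batch: the 'urls' parameter, one URL per line.

    Form encoding is an assumption, not confirmed by the text.
    """
    return urlencode({'urls': '\n'.join(urls)}).encode('ascii')


def microdata_types(response_text):
    """List the microdata item types in a response shaped like the one above."""
    payload = json.loads(response_text)
    if payload.get('status') != 'ok':
        return []
    return [item['type'] for item in payload.get('microdata', [])]


print(single_url('http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412'))
print(microdata_types('{"status": "ok", "microdata": [{"type": "http://schema.org/Product", "properties": {}}]}'))
```

The resulting URL or body can then be passed to urllib.request.urlopen or to requests while the service is running.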

Command Line Tool

extruct provides a command line tool that fetches a page and extracts its metadata.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads “http://example.com” and outputs the Microdata, JSON-LD and RDFa metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD and RDFa). If you want to restrict the output to just one or a subset of those, you can use the individual switches.

For example, this command extracts only Microdata and JSON-LD metadata from “http://example.com”:

extruct --microdata --jsonld "http://example.com"
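
The same invocation can be driven from a Python script. This is a sketch that assumes the extruct command is on the PATH (installed with the cli extra); it uses only the two switches shown above, and the function names are hypothetical.

```python
import subprocess


def extruct_command(url, microdata=False, jsonld=False):
    """Build the argument list for the command line tool."""
    command = ['extruct']
    if microdata:
        command.append('--microdata')
    if jsonld:
        command.append('--jsonld')
    command.append(url)
    return command


def run_extruct(url, **switches):
    """Run the tool and return whatever it writes to stdout as text."""
    return subprocess.check_output(extruct_command(url, **switches)).decode('utf-8')


print(extruct_command('http://example.com', microdata=True, jsonld=True))
# ['extruct', '--microdata', '--jsonld', 'http://example.com']
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.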

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

Versioning

Use bumpversion to conveniently change project version:

bumpversion patch  # 0.0.0 -> 0.0.1
bumpversion minor  # 0.0.1 -> 0.1.0
bumpversion major  # 0.1.0 -> 1.0.0
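
The arithmetic behind the three commands can be sketched as follows. This only illustrates the numbering scheme for plain X.Y.Z versions; it is not how bumpversion is implemented, and it does not handle pre-release versions such as 0.3.0a2.

```python
def bump(version, part):
    """Return the next version after bumping 'major', 'minor' or 'patch'."""
    major, minor, patch = (int(number) for number in version.split('.'))
    if part == 'major':
        return '{}.0.0'.format(major + 1)
    if part == 'minor':
        return '{}.{}.0'.format(major, minor + 1)
    if part == 'patch':
        return '{}.{}.{}'.format(major, minor, patch + 1)
    raise ValueError('unknown part: {!r}'.format(part))


print(bump('0.0.0', 'patch'))  # 0.0.1
print(bump('0.0.1', 'minor'))  # 0.1.0
print(bump('0.1.0', 'major'))  # 1.0.0
```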

Release History

0.4.0 (this version)
0.3.1
0.3.0
0.3.0a2
0.3.0a1
0.3.0a0
0.2.0
0.1.0
0.0.0

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name                                      Version   File Type   Upload Date
extruct-0.4.0-py2.py3-none-any.whl (10.0 kB)   py2.py3   Wheel       Jun 20, 2017
extruct-0.4.0.tar.gz (11.8 kB)                           Source      Jun 20, 2017
