Extract embedded metadata from HTML markup

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct supports Microdata, JSON-LD and RDFa (experimental).

The microdata algorithm revisits a Scrapinghub blog post that shows how to use EXSLT extensions.

Roadmap

Installation

pip install extruct

Usage

All-in-one extraction

The simplest way to use extruct is to call extruct.extract(htmlstring, url) with an HTML string and the URL it was fetched from.

Let’s try this on an eBay page that uses microdata and RDFa (with Open Graph protocol properties).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import requests
>>> from pprint import pprint

>>> r = requests.get('http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487')

>>> import extruct
>>> data = extruct.extract(r.text, r.url)

>>> pprint(data)
{'json-ld': [],
 'microdata': [{'properties': {'image': ['http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg',
                                         'http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg'],
                               'name': 'Details about  \xa0HERBERT TERRY 2 '
                                       'STEP ANGLEPOISE LAMP MODEL1227',
                               'offers': {'properties': {'areaServed': 'United '
                                                                       'Kingdom '
                                                                       'and '
                                                                       'many '
                                                                       'other '
                                                                       'countries \n'
                                                                       '\t\t\t\t\t\t'
                                                                       '|  See '
                                                                       'details',
                                                         'availability': 'http://schema.org/InStock',
                                                         'availableAtOrFrom': 'Stockport, '
                                                                              'United '
                                                                              'Kingdom',
                                                         'itemCondition': '--not '
                                                                          'specified',
                                                         'price': '150.0',
                                                         'priceCurrency': 'GBP'},
                                          'type': 'http://schema.org/Offer'}},
                'type': 'http://schema.org/Product'},
               {'properties': {'itemListElement': [{'properties': {'item': 'http://www.ebay.com/sch/Antiques-/20081/i.html',
                                                                   'name': 'Antiques',
                                                                   'position': '1'},
                                                    'type': 'http://schema.org/ListItem'},
                                                   (...)
                                                   {'properties': {'item': 'http://www.ebay.com/sch/20th-Century-/66861/i.html',
                                                                   'name': '20th '
                                                                           'Century',
                                                                   'position': '4'},
                                                    'type': 'http://schema.org/ListItem'}]},
                'type': 'http://schema.org/BreadcrumbList'}],
 'rdfa': [{'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#w1-31-_topHelpTxt',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
          (...)
          {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487',
           'http://opengraphprotocol.org/schema/description': [{'@value': 'On '
                                                                          'one '
                                                                          'side '
                                                                          'of '
                                                                          'the '
                                                                          'base '
                                                                          'is '
                                                                          'a '
                                                                          'metal '
                                                                          'label '
                                                                          'from '
                                                                          'UMIST, '
                                                                          'where '
                                                                          'it '
                                                                          'was '
                                                                          'in '
                                                                          'use. '
                                                                          '| '
                                                                          'eBay!'}],
           'http://opengraphprotocol.org/schema/image': [{'@value': 'http://i.ebayimg.com/images/i/282478964487-0-1/s-l1000.jpg'}],
           'http://opengraphprotocol.org/schema/site_name': [{'@value': 'eBay'}],
           'http://opengraphprotocol.org/schema/title': [{'@value': 'HERBERT '
                                                                    'TERRY 2 '
                                                                    'STEP '
                                                                    'ANGLEPOISE '
                                                                    'LAMP '
                                                                    'MODEL1227  '
                                                                    '| eBay'}],
           'http://opengraphprotocol.org/schema/type': [{'@value': 'ebay-objects:item'}],
           'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487'}],
           'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '102628213125203'}]},
          {'@id': '_:Na28391785e4e48bb92849fccbe758c6b',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
          (...)
          {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#glbfooter',
           'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#contentinfo'}]}]}

Another example with a page from SongKick containing RDFa and JSON-LD metadata:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')

>>> data = extruct.extract(r.text, r.url)

>>> pprint(data)
{'json-ld': [{'@context': 'http://schema.org',
              '@type': 'MusicEvent',
              'location': {'@type': 'Place',
                           'address': {'@type': 'PostalAddress',
                                       'addressCountry': 'US',
                                       'addressLocality': 'Brooklyn',
                                       'addressRegion': 'NY',
                                       'postalCode': '11225',
                                       'streetAddress': '497 Rogers Ave'},
                           'geo': {'@type': 'GeoCoordinates',
                                   'latitude': 40.660109,
                                   'longitude': -73.953193},
                           'name': 'The Owl Music Parlor',
                           'sameAs': 'http://www.theowl.nyc'},
              'name': 'Elysian Fields',
              'performer': [{'@type': 'MusicGroup',
                             'name': 'Elysian Fields',
                             'sameAs': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}],
              'startDate': '2017-06-10T19:30:00-0400',
              'url': 'http://www.songkick.com/concerts/30173984-elysian-fields-at-owl-music-parlor?utm_medium=organic&utm_source=microformat'},
             (...)
             {'@context': 'http://schema.org',
              '@type': 'MusicGroup',
              'image': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar',
              'interactionCount': '5557 UserLikes',
              'logo': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar',
              'name': 'Elysian Fields',
              'url': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}],
 'microdata': [],
 'rdfa': [{'@id': 'http://www.songkick.com/artists/236156-elysian-fields',
           'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
           'al:ios:app_store_id': [{'@value': '438690886'}],
           'al:ios:url': [{'@value': 'songkick://artists/236156-elysian-fields'}],
           'http://ogp.me/ns#description': [{'@value': 'Buy tickets for an '
                                                       'upcoming Elysian '
                                                       'Fields concert near '
                                                       'you. List of all '
                                                       'Elysian Fields tickets '
                                                       'and tour dates for '
                                                       '2017.'}],
           'http://ogp.me/ns#image': [{'@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
           'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
           'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
           'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
           'http://ogp.me/ns#url': [{'@value': 'http://www.songkick.com/artists/236156-elysian-fields'}],
           'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '308540029359'}]}]}
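
The returned dictionary is keyed by syntax ('json-ld', 'microdata', 'rdfa'), and microdata items can nest other items inside their 'properties', as the Offer inside the Product does in the eBay output. The following is a minimal sketch of how such a result might be walked to collect every item of a given type. `find_items` is a hypothetical helper, not part of extruct, and the `data` literal is a trimmed copy of the eBay output rather than a live result.

```python
def find_items(items, item_type):
    """Yield microdata items (including nested ones) whose 'type' matches."""
    for item in items:
        # Plain property values (strings) are skipped; only items are walked.
        if not isinstance(item, dict) or 'properties' not in item:
            continue
        if item.get('type') == item_type:
            yield item
        for value in item['properties'].values():
            # A property may hold a single value or a list of values.
            values = value if isinstance(value, list) else [value]
            yield from find_items(values, item_type)


# Trimmed copy of the eBay result shown above; in practice this would be
# the return value of extruct.extract(r.text, r.url).
data = {
    'json-ld': [],
    'microdata': [
        {'type': 'http://schema.org/Product',
         'properties': {
             'offers': {'type': 'http://schema.org/Offer',
                        'properties': {'price': '150.0',
                                       'priceCurrency': 'GBP'}}}},
    ],
    'rdfa': [],
}

offers = list(find_items(data['microdata'], 'http://schema.org/Offer'))
for offer in offers:
    props = offer['properties']
    print(props['price'], props['priceCurrency'])
```

The helper assumes 'type' is a single string, as in the output shown here.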

You can also use each extractor individually. See below.

Microdata extraction

>>> from pprint import pprint
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction

>>> from pprint import pprint
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]
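
Since the extractor returns a plain list of dictionaries, picking out nodes of one kind is ordinary list processing. The sketch below is one possible way to select top-level nodes by '@type'. `by_type` is a hypothetical helper, not an extruct function, and the `data` literal is shaped like the output above. It allows '@type' to be either a string or a list, which JSON-LD permits.

```python
def by_type(nodes, wanted):
    """Return the top-level JSON-LD nodes whose '@type' equals or contains `wanted`."""
    matches = []
    for node in nodes:
        node_type = node.get('@type', [])
        # '@type' may be a single string or a list of strings.
        types = [node_type] if isinstance(node_type, str) else list(node_type)
        if wanted in types:
            matches.append(node)
    return matches


# Shaped like the JsonLdExtractor output shown above.
data = [{'@context': 'http://schema.org',
         '@type': 'Person',
         'name': 'John Doe',
         'address': {'@type': 'PostalAddress',
                     'addressLocality': 'Wonderland',
                     'addressRegion': 'Georgia',
                     'streetAddress': '1234 Peach Drive'}}]

for person in by_type(data, 'Person'):
    address = person.get('address', {})
    print(person['name'], '-', address.get('addressLocality'))
```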

RDFa extraction (experimental)

>>> from pprint import pprint
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pprint(
...     rdfae.extract(html, url='http://www.example.com/index.html')
... )
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You’ll get a list of expanded JSON-LD nodes.
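
In the expanded form, every predicate maps to a list of {'@value': ...} or {'@id': ...} entries, which is verbose to read. The following sketch shows one way to collapse a node into plain values. `flatten_node` is a hypothetical helper, not part of extruct, and `nodes` is a shortened copy of the output above.

```python
def flatten_node(node):
    """Collapse an expanded JSON-LD node into {predicate: [plain values]}."""
    flat = {}
    for key, values in node.items():
        if key == '@id':
            flat[key] = values
        elif key == '@type':
            flat[key] = list(values)
        else:
            # Each entry is {'@value': literal} or {'@id': reference}.
            flat[key] = [v.get('@value', v.get('@id')) for v in values]
    return flat


# Shortened copy of the RDFaExtractor output shown above.
nodes = [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
          '@type': ['http://schema.org/BlogPosting'],
          'http://purl.org/dc/terms/creator': [
              {'@id': 'http://www.example.com/index.html#me'}],
          'http://purl.org/dc/terms/title': [
              {'@value': 'The trouble with Bob'}]}]

flat = flatten_node(nodes[0])
print(flat['http://purl.org/dc/terms/title'])
print(flat['http://purl.org/dc/terms/creator'])
```

Flattening discards the distinction between literals and references, so it suits quick inspection more than further RDF processing.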

REST API service

extruct also ships with a REST API service to test its output from URLs.

Dependencies

Usage

python -m extruct.service

launches an HTTP server listening on port 10005.

Methods supported

/extruct/<URL>
method = GET


/extruct/batch
method = POST
params:
    urls - a list of URLs separated by newlines
    urlsfile - a file with one URL per line

E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412

will output something like this:

{
   "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
   "status":"ok",
   "microdata":[
         {
            "type":"http://schema.org/Product",
            "properties":{
               "name":"Susket",
               "color":[
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
               ],
               "brand":"http://www.sarenza.com/i-love-shoes",
               "aggregateRating":{
                  "type":"http://schema.org/AggregateRating",
                  "properties":{
                     "description":"Soyez le premier \u00e0 donner votre avis"
                  }
               },
               "offers":{
                  "type":"http://schema.org/AggregateOffer",
                  "properties":{
                     "lowPrice":"59,00 \u20ac",
                     "price":"A partir de\r\n                  59,00 \u20ac",
                     "priceCurrency":"EUR",
                     "highPrice":"59,00 \u20ac",
                     "availability":"http://schema.org/InStock"
                  }
               },
               "size":[
                  "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "39 - Derni\u00e8re paire !",
                  "40",
                  "41",
                  "42 - Derni\u00e8re paire !"
               ],
               "image":[
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                  "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
               ],
               "description":""
            }
         }
   ]
}
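
The GET endpoint can also be queried from a script. The sketch below assumes the service is running locally on port 10005 and that it answers with UTF-8 encoded JSON like the sample above. `service_url` and `fetch_metadata` are hypothetical helper names, not part of extruct. The page URL is appended unescaped, as in the example request above; URLs with query strings may need extra care.

```python
import json
from urllib.request import urlopen


def service_url(page_url, host='localhost', port=10005):
    """Build the GET endpoint for one page: /extruct/<URL>."""
    return 'http://{}:{}/extruct/{}'.format(host, port, page_url)


def fetch_metadata(page_url, **kwargs):
    """Query a running extruct service and decode its JSON response."""
    with urlopen(service_url(page_url, **kwargs)) as response:
        return json.loads(response.read().decode('utf-8'))


print(service_url('http://example.com'))

# With the service running (python -m extruct.service):
# result = fetch_metadata('http://example.com')
# print(result['status'], len(result.get('microdata', [])))
```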

Command Line Tool

extruct provides a command line tool that fetches a page and extracts its metadata.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads “http://example.com” and outputs the Microdata, JSON-LD and RDFa metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD and RDFa). If you want to restrict the output to just one or a subset of those, you can use the individual switches.

For example, this command extracts only Microdata and JSON-LD metadata from “http://example.com”:

extruct --microdata --jsonld "http://example.com"

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

Versioning

Use bumpversion to conveniently change project version:

bumpversion patch  # 0.0.0 -> 0.0.1
bumpversion minor  # 0.0.1 -> 0.1.0
bumpversion major  # 0.1.0 -> 1.0.0
