This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct only supports W3C’s HTML Microdata and embedded JSON-LD.

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Roadmap

Installation

pip install extruct

Usage

Microdata extraction

>>> from pprint import pprint
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pprint(data)
{'items': [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The house I found.',
                           'work': 'http://www.example.com/images/house.jpeg'},
            'type': 'http://n.whatwg.org/work'},
           {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The mailbox.',
                           'work': 'http://www.example.com/images/mailbox.jpeg'},
            'type': 'http://n.whatwg.org/work'}]}

JSON-LD extraction

>>> from pprint import pprint
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pprint(data)
{'items': [{u'@context': u'http://schema.org',
            u'@type': u'Person',
            u'additionalName': u'Johnny',
            u'address': {u'@type': u'PostalAddress',
                         u'addressLocality': u'Wonderland',
                         u'addressRegion': u'Georgia',
                         u'streetAddress': u'1234 Peach Drive'},
            u'affiliation': u'University of Dreams',
            u'jobTitle': u'Graduate research assistant',
            u'name': u'John Doe',
            u'url': u'http://www.example.com'}]}

REST API service

extruct also ships with a REST API service to test its output from URLs.

Dependencies

Usage

python -m extruct.service

launches an HTTP server listening on port 10005.

Methods supported

/extruct/<URL>
method = GET


/extruct/batch
method = POST
params:
    urls - a list of URLs separted by newlines
    urlsfile - a file with one URL per line

E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412

will output something like this:

{
   "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
   "status":"ok",
   "microdata":{
      "items":[
         {
            "type":"http://schema.org/Product",
            "properties":{
               "name":"Susket",
               "color":[
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
               ],
               "brand":"http://www.sarenza.com/i-love-shoes",
               "aggregateRating":{
                  "type":"http://schema.org/AggregateRating",
                  "properties":{
                     "description":"Soyez le premier \u00e0 donner votre avis"
                  }
               },
               "offers":{
                  "type":"http://schema.org/AggregateOffer",
                  "properties":{
                     "lowPrice":"59,00 \u20ac",
                     "price":"A partir de\r\n                  59,00 \u20ac",
                     "priceCurrency":"EUR",
                     "highPrice":"59,00 \u20ac",
                     "availability":"http://schema.org/InStock"
                  }
               },
               "size":[
                  "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "39 - Derni\u00e8re paire !",
                  "40",
                  "41",
                  "42 - Derni\u00e8re paire !"
               ],
               "image":[
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                  "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
               ],
               "description":""
            }
         }
      ]
   }
}

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

Versioning

Use bumpversion to conveniently change project version:

bumpversion patch  # 0.0.0 -> 0.0.1
bumpversion minor  # 0.0.1 -> 0.1.0
bumpversion major  # 0.1.0 -> 1.0.0
Release History

Release History

0.3.0a0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.2.0

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.1.0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.0.0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
extruct-0.2.0-py2.py3-none-any.whl (6.0 kB) Copy SHA256 Checksum SHA256 py2.py3 Wheel Sep 26, 2016
extruct-0.2.0.tar.gz (6.3 kB) Copy SHA256 Checksum SHA256 Source Sep 26, 2016

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting