Skip to main content

Extract embedded metadata from HTML markup

Project description

https://img.shields.io/travis/scrapinghub/extruct.svg

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct only supports W3C’s HTML Microdata and embedded JSON-LD.

Roadmap

Installation

pip install extruct

Usage

>>> from pprint import pprint
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pprint(data)
{'items': [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The house I found.',
                           'work': 'http://www.example.com/images/house.jpeg'},
            'type': 'http://n.whatwg.org/work'},
           {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The mailbox.',
                           'work': 'http://www.example.com/images/mailbox.jpeg'},
            'type': 'http://n.whatwg.org/work'}]}

REST API service

extruct also ships with a REST API service to test its output from URLs.

Dependencies

Usage

python -m extruct.service

launches an HTTP server listening on port 10005.

Methods supported

/extruct/<URL>
method = GET


/extruct/batch
method = POST
params:
    urls - a list of URLs separted by newlines
    urlsfile - a file with one URL per line

E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412

will output something like this:

{
   "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
   "status":"ok",
   "microdata":{
      "items":[
         {
            "type":"http://schema.org/Product",
            "properties":{
               "name":"Susket",
               "color":[
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
               ],
               "brand":"http://www.sarenza.com/i-love-shoes",
               "aggregateRating":{
                  "type":"http://schema.org/AggregateRating",
                  "properties":{
                     "description":"Soyez le premier \u00e0 donner votre avis"
                  }
               },
               "offers":{
                  "type":"http://schema.org/AggregateOffer",
                  "properties":{
                     "lowPrice":"59,00 \u20ac",
                     "price":"A partir de\r\n                  59,00 \u20ac",
                     "priceCurrency":"EUR",
                     "highPrice":"59,00 \u20ac",
                     "availability":"http://schema.org/InStock"
                  }
               },
               "size":[
                  "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "39 - Derni\u00e8re paire !",
                  "40",
                  "41",
                  "42 - Derni\u00e8re paire !"
               ],
               "image":[
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                  "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
               ],
               "description":""
            }
         }
      ]
   }
}

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

Versioning

Use bumpversion to conveniently change project version:

bumpversion patch  # 0.0.0 -> 0.0.1
bumpversion minor  # 0.0.1 -> 0.1.0
bumpversion major  # 0.1.0 -> 1.0.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

extruct-0.1.0.tar.gz (5.7 kB view details)

Uploaded Source

Built Distribution

extruct-0.1.0-py2.py3-none-any.whl (5.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file extruct-0.1.0.tar.gz.

File metadata

  • Download URL: extruct-0.1.0.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for extruct-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0748fc97e171c1cf87c398cfd8199248a610e8d424a6f9b651471361814ff16d
MD5 6893a75bfbf6b30c5093b78dca0ddcc5
BLAKE2b-256 9124e1b7893e1c3f95ff1e6e0010c73fb5fc367dfe9ea49ee146d44a9ef99117

See more details on using hashes here.

Provenance

File details

Details for the file extruct-0.1.0-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for extruct-0.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 d305b26b6ca627cd8ef12eda1f8cd6fb7867ef38be6bf7344fc1b6b2a27f7675
MD5 d9e2ee3f74466e2b762989e1aaba2462
BLAKE2b-256 20d7a90f860049238bdd20848fcd307d745301bdbb5b44bfc32d412b3394ecd6

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page