Extract embedded metadata from HTML markup
extruct is a library for extracting embedded metadata from HTML markup.
It also has a built-in HTTP server to test its output as JSON.
Currently, extruct only supports W3C’s HTML Microdata and embedded JSON-LD.
The microdata algorithm revisits a Scrapinghub blog post that shows how to use EXSLT extensions.
Roadmap
support for RDFa Lite (e.g. Facebook Open Graph protocol metadata)
Installation
pip install extruct
Usage
Microdata extraction
    >>> from pprint import pprint
    >>>
    >>> from extruct.w3cmicrodata import MicrodataExtractor
    >>>
    >>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
    >>> html = """<!DOCTYPE HTML>
    ... <html>
    ...  <head>
    ...   <title>Photo gallery</title>
    ...  </head>
    ...  <body>
    ...   <h1>My photos</h1>
    ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
    ...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
    ...    <figcaption itemprop="title">The house I found.</figcaption>
    ...   </figure>
    ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
    ...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
    ...    <figcaption itemprop="title">The mailbox.</figcaption>
    ...   </figure>
    ...   <footer>
    ...    <p id="licenses">All images licensed under the <a itemprop="license"
    ...    href="http://www.opensource.org/licenses/mit-license.php">MIT
    ...    license</a>.</p>
    ...   </footer>
    ...  </body>
    ... </html>"""
    >>>
    >>> mde = MicrodataExtractor()
    >>> data = mde.extract(html)
    >>> pprint(data)
    {'items': [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                               'title': u'The house I found.',
                               'work': 'http://www.example.com/images/house.jpeg'},
                'type': 'http://n.whatwg.org/work'},
               {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                               'title': u'The mailbox.',
                               'work': 'http://www.example.com/images/mailbox.jpeg'},
                'type': 'http://n.whatwg.org/work'}]}
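The extractor returns a plain dict with an `items` list, where each item carries a `type` and a `properties` mapping. As a minimal sketch of consuming that structure, the snippet below groups titles by license. The helper name `titles_by_license` is hypothetical, and the `data` literal is copied by hand from the output shown above rather than produced by calling extruct.

```python
from collections import defaultdict

# Hand-copied from the MicrodataExtractor output above (not produced here).
data = {
    "items": [
        {
            "type": "http://n.whatwg.org/work",
            "properties": {
                "license": "http://www.opensource.org/licenses/mit-license.php",
                "title": "The house I found.",
                "work": "http://www.example.com/images/house.jpeg",
            },
        },
        {
            "type": "http://n.whatwg.org/work",
            "properties": {
                "license": "http://www.opensource.org/licenses/mit-license.php",
                "title": "The mailbox.",
                "work": "http://www.example.com/images/mailbox.jpeg",
            },
        },
    ]
}


def titles_by_license(microdata):
    """Group item titles by license URL (hypothetical helper, not part of extruct)."""
    grouped = defaultdict(list)
    for item in microdata.get("items", []):
        props = item.get("properties", {})
        grouped[props.get("license")].append(props.get("title"))
    return dict(grouped)


print(titles_by_license(data))
```

Items that lack a `license` property would be grouped under the key `None` in this sketch.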
JSON-LD extraction
    >>> from pprint import pprint
    >>>
    >>> from extruct.jsonld import JsonLdExtractor
    >>>
    >>> html = """<!DOCTYPE HTML>
    ... <html>
    ...  <head>
    ...   <title>Some Person Page</title>
    ...  </head>
    ...  <body>
    ...   <h1>This guys</h1>
    ...   <script type="application/ld+json">
    ...   {
    ...     "@context": "http://schema.org",
    ...     "@type": "Person",
    ...     "name": "John Doe",
    ...     "jobTitle": "Graduate research assistant",
    ...     "affiliation": "University of Dreams",
    ...     "additionalName": "Johnny",
    ...     "url": "http://www.example.com",
    ...     "address": {
    ...       "@type": "PostalAddress",
    ...       "streetAddress": "1234 Peach Drive",
    ...       "addressLocality": "Wonderland",
    ...       "addressRegion": "Georgia"
    ...     }
    ...   }
    ...   </script>
    ...  </body>
    ... </html>"""
    >>>
    >>> jslde = JsonLdExtractor()
    >>>
    >>> data = jslde.extract(html)
    >>> pprint(data)
    {'items': [{u'@context': u'http://schema.org',
                u'@type': u'Person',
                u'additionalName': u'Johnny',
                u'address': {u'@type': u'PostalAddress',
                             u'addressLocality': u'Wonderland',
                             u'addressRegion': u'Georgia',
                             u'streetAddress': u'1234 Peach Drive'},
                u'affiliation': u'University of Dreams',
                u'jobTitle': u'Graduate research assistant',
                u'name': u'John Doe',
                u'url': u'http://www.example.com'}]}
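To illustrate what embedded JSON-LD extraction involves, here is a rough standard-library sketch for Python 3: find each `<script type="application/ld+json">` element and decode its body as JSON. This is not extruct's implementation. The names `JsonLdScriptParser` and `extract_jsonld` are made up for this example, and only the `{'items': [...]}` result shape is borrowed from the output above.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect and decode the bodies of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            text = "".join(self._chunks).strip()
            if text:
                self.items.append(json.loads(text))


def extract_jsonld(html):
    """Return {'items': [...]} with one decoded object per JSON-LD script."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    return {"items": parser.items}


html = """<html><body>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "John Doe"}
</script>
</body></html>"""

result = extract_jsonld(html)
print(result)
```

This sketch raises `json.JSONDecodeError` on malformed JSON. A production extractor would need to decide whether to skip or report such blocks.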
REST API service
extruct also ships with a REST API service to test its output from URLs.
Dependencies
Usage
python -m extruct.service
launches an HTTP server listening on port 10005.
Methods supported
    /extruct/<URL>
    method = GET

    /extruct/batch
    method = POST
    params:
        urls - a list of URLs separated by newlines
        urlsfile - a file with one URL per line
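As a hedged sketch of calling the batch method listed above from Python 3, the snippet below builds a POST request with the standard library but does not send it. It assumes the `urls` parameter is accepted as a form-encoded field, which the text above does not confirm. `build_batch_request` is a hypothetical helper, and the example.com URLs are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Host and port as stated for `python -m extruct.service`.
SERVICE = "http://localhost:10005"


def build_batch_request(urls, service=SERVICE):
    """Build (but do not send) a POST request to /extruct/batch.

    Assumption: `urls` is sent as a form-encoded field whose value is
    the list of URLs joined by newlines.
    """
    body = urlencode({"urls": "\n".join(urls)}).encode("ascii")
    return Request(service + "/extruct/batch", data=body, method="POST")


req = build_batch_request(
    ["http://www.example.com/a", "http://www.example.com/b"]
)
print(req.get_method(), req.full_url)
print(req.data)
```

With the service running, passing `req` to `urllib.request.urlopen` would send the request.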
For example, a GET request to http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412
will output something like this:
    {
      "url": "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
      "status": "ok",
      "microdata": {
        "items": [
          {
            "type": "http://schema.org/Product",
            "properties": {
              "name": "Susket",
              "color": [
                "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
              ],
              "brand": "http://www.sarenza.com/i-love-shoes",
              "aggregateRating": {
                "type": "http://schema.org/AggregateRating",
                "properties": {
                  "description": "Soyez le premier \u00e0 donner votre avis"
                }
              },
              "offers": {
                "type": "http://schema.org/AggregateOffer",
                "properties": {
                  "lowPrice": "59,00 \u20ac",
                  "price": "A partir de\r\n 59,00 \u20ac",
                  "priceCurrency": "EUR",
                  "highPrice": "59,00 \u20ac",
                  "availability": "http://schema.org/InStock"
                }
              },
              "size": [
                "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                "39 - Derni\u00e8re paire !",
                "40",
                "41",
                "42 - Derni\u00e8re paire !"
              ],
              "image": [
                "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
              ],
              "description": ""
            }
          }
        ]
      }
    }
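In the response above, property values can themselves be items (`aggregateRating` and `offers` each have their own `type` and `properties`). A minimal Python 3 sketch of walking such a response recursively follows. The helper `iter_items` is hypothetical, and `response_text` is a hand-trimmed copy of the sample output, not a live response.

```python
import json

# Hand-trimmed from the sample response above; not fetched from the service.
response_text = """
{
  "url": "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
  "status": "ok",
  "microdata": {
    "items": [
      {
        "type": "http://schema.org/Product",
        "properties": {
          "name": "Susket",
          "aggregateRating": {
            "type": "http://schema.org/AggregateRating",
            "properties": {"description": "Soyez le premier \\u00e0 donner votre avis"}
          },
          "offers": {
            "type": "http://schema.org/AggregateOffer",
            "properties": {
              "priceCurrency": "EUR",
              "availability": "http://schema.org/InStock"
            }
          }
        }
      }
    ]
  }
}
"""


def iter_items(node):
    """Yield every dict that looks like a microdata item, depth first."""
    if isinstance(node, dict):
        if "type" in node and "properties" in node:
            yield node
        for value in node.values():
            yield from iter_items(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_items(value)


payload = json.loads(response_text)
types = [item["type"] for item in iter_items(payload["microdata"])]
print(types)
```

The walk treats any dict with both `type` and `properties` keys as an item, which matches the sample output but is an assumption about the general response format.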
Development version
    mkvirtualenv extruct
    pip install -r requirements-dev.txt
Tests
Run tests in current environment:
py.test tests
Use tox to run tests with different Python versions:
tox
Versioning
Use bumpversion to conveniently change project version:
    bumpversion patch  # 0.0.0 -> 0.0.1
    bumpversion minor  # 0.0.1 -> 0.1.0
    bumpversion major  # 0.1.0 -> 1.0.0
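The version arithmetic in the comments above can be sketched in a few lines of Python 3: bumping a part increments it and resets every part to its right. The `bump` function is a hypothetical illustration of that rule only. It is not bumpversion's code and does not touch any files.

```python
def bump(version, part):
    """Return `version` with `part` incremented and lower parts reset to 0."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump("0.0.0", "patch"))
print(bump("0.0.1", "minor"))
print(bump("0.1.0", "major"))
```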
Built Distribution
Hashes for extruct-0.2.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e67e40011fd208a1ac056cbd1a7e591dc1e45dd059939ecb546a95ab42a15b19
MD5 | 46f048d619d918dd7f8900b482ebaab5
BLAKE2b-256 | b695886f7d638752513b86b6a8e9b1d615b870d208cd925cd3a5cdfee51da083