Extracts OpenGraph, TwitterCard and Schema properties from a webpage.

webpreview

For a given URL, webpreview extracts its title, description, and image url using Open Graph, Twitter Card, or Schema meta tags, or, as an alternative, parses it as a generic webpage.

Installation

pip install webpreview

Usage

Use the generic webpreview method (added in v1.7.0) to parse a page regardless of its format. This method fetches a page and tries to extract a title, a description, and a preview image from it.

It first attempts to parse the values from Open Graph properties, then it falls back to Twitter Card format, and then to Schema. If none of these methods succeed in extracting all three properties, then the web page's content is parsed using a generic HTML parser.
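The fallback order described above can be illustrated with a small standard-library sketch. This is not the package's implementation: `MetaCollector`, `extract_preview`, `SOURCES`, and `FIELDS` are hypothetical names, the tag keys are the conventional ones for each format, and the choice to merge partial results across formats is an assumption based on the example output further down.

```python
from html.parser import HTMLParser

# (meta attribute, title key, description key, image key), tried in this order.
SOURCES = [
    ("property", "og:title", "og:description", "og:image"),             # Open Graph
    ("name", "twitter:title", "twitter:description", "twitter:image"),  # Twitter Card
    ("itemprop", "name", "description", "image"),                       # Schema
]
FIELDS = ("title", "description", "image")


class MetaCollector(HTMLParser):
    """Hypothetical helper: collects <meta> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}       # (attribute, key) -> content
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            content = attrs.get("content")
            for key_attr in ("property", "name", "itemprop"):
                if content and attrs.get(key_attr):
                    # Keep the first occurrence of each key.
                    self.meta.setdefault((key_attr, attrs[key_attr]), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def extract_preview(html):
    """Fill title, description, and image, trying each format in order."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    result = dict.fromkeys(FIELDS)
    for attr, *keys in SOURCES:
        for field, key in zip(FIELDS, keys):
            if result[field] is None:
                result[field] = collector.meta.get((attr, key))
        if all(result.values()):
            return result  # all three found, no further fallback needed
    # Generic fallback (assumption): <title> and <meta name="description">.
    if result["title"] is None:
        result["title"] = collector.title
    if result["description"] is None:
        result["description"] = collector.meta.get(("name", "description"))
    return result


html = """<html><head>
<title>The Dormouse's story</title>
<meta property="og:description" content="A Mad Tea-Party story" />
</head><body></body></html>"""
preview = extract_preview(html)
print(preview)
```

Here the description comes from the Open Graph tag and the title from the generic `<title>` fallback, while the image stays `None` because no format supplies one.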

>>> from webpreview import webpreview

>>> p = webpreview("https://en.wikipedia.org/wiki/Enrico_Fermi")
>>> p.title
'Enrico Fermi - Wikipedia'
>>> p.description
'Italian-American physicist (1901–1954)'
>>> p.image
'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg'

# Access the parsed fields both as attributes and items
>>> p["url"] == p.url
True

# Check if all three of the title, description, and image are in the parsing result
>>> p.is_complete()
True

# Provide page content from somewhere else
>>> content = """
<html>
    <head>
        <title>The Dormouse's story</title>
        <meta property="og:description" content="A Mad Tea-Party story" />
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    </body>
</html>
"""

# This call makes no external requests; unlike the example above,
# it relies only on the supplied content
>>> webpreview("aa.com", content=content)
WebPreview(url="http://aa.com", title="The Dormouse's story", description="A Mad Tea-Party story")
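To make the result object's behaviour concrete, here is a minimal stand-in that supports both attribute and item access and an `is_complete()` check. It is only a sketch under assumptions: the `Preview` class below is hypothetical and is not the package's `WebPreview` class, whose internals are not shown in this README.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Preview:
    """Hypothetical stand-in for a parsing result (not the real WebPreview)."""

    url: str
    title: Optional[str] = None
    description: Optional[str] = None
    image: Optional[str] = None

    def __getitem__(self, key):
        # Item access mirrors attribute access, limited to the declared fields.
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)

    def is_complete(self):
        # Complete only when title, description, and image are all present.
        return all([self.title, self.description, self.image])


p = Preview(
    url="http://aa.com",
    title="The Dormouse's story",
    description="A Mad Tea-Party story",
)
print(p["url"] == p.url)
print(p.is_complete())
```

With no image set, `is_complete()` returns `False` for this stand-in, which matches the idea that all three fields are required.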

Using the command line

Installing webpreview via pip also installs the accompanying command-line tool.

$ webpreview https://en.wikipedia.org/wiki/Enrico_Fermi
title: Enrico Fermi - Wikipedia
description: Italian-American physicist (1901–1954)
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

$ webpreview https://github.com/ --absolute-url
title: GitHub: Where the world builds software
description: GitHub is where over 83 million developers shape the future of software, together.
image: https://github.githubassets.com/images/modules/site/social-cards/github-social.png
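If the command-line output needs to be consumed by another script, the `key: value` lines shown above can be turned into a dictionary. The sketch below is an assumption-laden convenience, not part of the package: `parse_cli_output` is a hypothetical helper, and it assumes every field is printed on a single line in the form shown in the sample.

```python
def parse_cli_output(text):
    """Parse 'key: value' lines into a dict (hypothetical helper)."""
    parsed_fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            parsed_fields[key.strip()] = value.strip()
    return parsed_fields


sample = """title: GitHub: Where the world builds software
description: GitHub is where over 83 million developers shape the future of software, together.
image: https://github.githubassets.com/images/modules/site/social-cards/github-social.png"""

parsed = parse_cli_output(sample)
print(parsed["title"])
```

Splitting on the first `": "` keeps the colon inside the GitHub title and the `https://` URL intact. A description spanning several lines would not be handled by this sketch.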

Using compatibility API

Before v1.7.0, the package exposed a different set of API methods. All of them are still supported and may continue to be used.

# WARNING:
# The API below is left for BACKWARD COMPATIBILITY ONLY.

from webpreview import web_preview
title, description, image = web_preview("aurl.com")

# specifying a timeout, which gets passed to requests.get()
title, description, image = web_preview("a_slow_url.com", timeout=1000)

# passing headers
headers = {'User-Agent': 'Mozilla/5.0'}
title, description, image = web_preview("a_slow_url.com", headers=headers)

# pass HTML content to avoid making another HTTP call to fetch it
content = """<html><head><title>Dummy HTML</title></head></html>"""
title, description, image = web_preview("aurl.com", content=content)

# specifying the parser
# by default webpreview uses 'html.parser'
title, description, image = web_preview("aurl.com", content=content, parser='lxml')

Run with Docker

The Docker image can be built and run similarly to the command-line tool. The default entry point is the webpreview command.

$ docker build -t webpreview .
$ docker run -it --rm webpreview "https://en.m.wikipedia.org/wiki/Enrico_Fermi"
title: Enrico Fermi - Wikipedia
description: Enrico Fermi (Italian: [enˈriːko ˈfermi]; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the "architect of the nuclear age"[1] and the "architect of the atomic bomb".
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

Note: the built Docker image is around 210 MB.

Testing

# Execute the tests
poetry run pytest webpreview

# OR execute until the first failed test
poetry run pytest webpreview -x

Setting up development environment

# Install the minimum supported version of Python
pyenv install 3.7.13

# Create a virtual environment
# By default, the project already contains a .python-version file that points
# to 3.7.13.
python -m venv .venv

# Install dependencies
# Poetry will automatically install them into the local .venv
poetry install

# If you see an error like this:
ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.

# Then do this:
.venv/bin/pip install --upgrade setuptools
