Extracts OpenGraph, TwitterCard and Schema properties from a webpage.

web2preview

For a given URL, web2preview extracts its title, description, and image URL using Open Graph, Twitter Card, or Schema meta tags or, failing that, parses it as a generic webpage.

This is a fork of the excellent webpreview library. It maintains full compatibility with the original while fixing several bugs, enhancing parsing, and adding new convenience APIs.

Main differences between web2preview and webpreview:

  • Enhanced parsing for generic web pages
  • No unnecessary GET request is made if the content of the page is supplied
  • Complete fallback mechanism which continues to parse until all methods are exhausted
  • Python type hints are added across the entire library (better editor support)
  • New dict-like WebPreview result object makes it easier to read parsing results
  • Command-line utility to extract title, description, and image from URL

Installation

pip install web2preview

Usage

Use the generic web2preview function to parse a page regardless of its type. It first tries to extract the values from Open Graph properties, then falls back to the Twitter Card format, then to Schema. If none of them yields all three of the title, description, and preview image, the webpage's content is parsed using a generic extractor.

>>> from web2preview import web2preview

>>> p = web2preview("https://en.wikipedia.org/wiki/Enrico_Fermi")
>>> p.title
'Enrico Fermi - Wikipedia'
>>> p.description
'Italian-American physicist (1901–1954)'
>>> p.image
'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg'

# Access the parsed fields both as attributes and items
>>> p["url"] == p.url
True

# Check if all three of the title, description, and image are in the parsing result
>>> p.is_complete()
True

# Provide page content from somewhere else
>>> content = """
<html>
    <head>
        <title>The Dormouse's story</title>
        <meta property="og:description" content="A Mad Tea-Party story" />
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    </body>
</html>
"""

# This function call won't make any external calls,
# only relying on the supplied content, unlike the example above
>>> web2preview("aa.com", content=content)
WebPreview(url="http://aa.com", title="The Dormouse's story", description="A Mad Tea-Party story")
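The fallback order described at the top of this section can be illustrated with a small standard-library sketch. This is not the library's implementation: `MetaCollector`, `SOURCES`, and `sketch_preview` are hypothetical names, the Schema lookup is simplified to `itemprop` meta tags, and the generic step only falls back to the `<title>` element and the plain `description` meta tag.

```python
from html.parser import HTMLParser

FIELDS = ("title", "description", "image")

# Hypothetical lookup tables: (attribute, value) pairs to try, in fallback order.
SOURCES = [
    {f: ("property", "og:" + f) for f in FIELDS},    # Open Graph
    {f: ("name", "twitter:" + f) for f in FIELDS},   # Twitter Card
    {"title": ("itemprop", "name"),                  # Schema (simplified)
     "description": ("itemprop", "description"),
     "image": ("itemprop", "image")},
]


class MetaCollector(HTMLParser):
    """Collect <meta> contents keyed by (attribute, value), plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("content"):
            for attr in ("property", "name", "itemprop"):
                if attrs.get(attr):
                    self.meta.setdefault((attr, attrs[attr]), attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        return "".join(self._title_parts).strip() or None


def sketch_preview(html):
    """Fill title, description, and image, trying each source until all are set."""
    parser = MetaCollector()
    parser.feed(html)
    result = dict.fromkeys(FIELDS)
    for source in SOURCES:
        for field in FIELDS:
            if result[field] is None:
                result[field] = parser.meta.get(source[field])
        if all(result.values()):
            return result  # all three fields found, no further fallback needed
    # Generic fallback (simplified): <title> element and <meta name="description">.
    if result["title"] is None:
        result["title"] = parser.title
    if result["description"] is None:
        result["description"] = parser.meta.get(("name", "description"))
    return result
```

Fed the Dormouse page from the example above, this sketch takes the description from the Open Graph tag and the title from the `<title>` element, and leaves the image unset.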

Using the command line

When web2preview is installed via pip, the accompanying command-line tool is installed alongside it.

$ web2preview https://en.wikipedia.org/wiki/Enrico_Fermi
title: Enrico Fermi - Wikipedia
description: Italian-American physicist (1901–1954)
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

$ web2preview https://github.com/ --absolute-url
title: GitHub: Where the world builds software
description: GitHub is where over 83 million developers shape the future of software, together.
image: https://github.githubassets.com/images/modules/site/social-cards/github-social.png

Note: For the original webpreview API, please check the official docs.
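If the command-line output needs to be consumed by another script, the `key: value` lines shown above can be parsed with a few lines of Python. This is a minimal sketch that assumes the output format is exactly as in the examples above; `parse_preview_output` and `run_cli` are hypothetical helper names, not part of the package.

```python
import subprocess


def parse_preview_output(text):
    """Turn 'key: value' lines into a dict, splitting on the first ': ' only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        # Keep only the three fields the tool is documented to print.
        if sep and key in ("title", "description", "image"):
            fields[key] = value.strip()
    return fields


def run_cli(url):
    """Run the installed web2preview command and parse what it prints.

    Assumes the command is on PATH and prints the format shown above.
    """
    proc = subprocess.run(
        ["web2preview", url], capture_output=True, text=True, check=True
    )
    return parse_preview_output(proc.stdout)
```

Splitting on the first `": "` only keeps values that themselves contain a colon intact, such as the GitHub title or an image URL.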

Run with Docker

The Docker image can be built and run similarly to the command line. The default entry point is the web2preview command-line tool.

$ docker build -t web2preview .
$ docker run -it --rm web2preview "https://en.m.wikipedia.org/wiki/Enrico_Fermi"
title: Enrico Fermi - Wikipedia
description: Enrico Fermi (Italian: [enˈriːko ˈfermi]; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the "architect of the nuclear age"[1] and the "architect of the atomic bomb".
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

Note: The built Docker image is around 210 MB.

Testing

# Execute the tests
poetry run pytest web2preview

# OR execute until the first failed test
poetry run pytest web2preview -x
