Collects metadata from a URL or from HTML content.

Html Meta Data Parse

About

html-meta-data-parse collects metadata from a URL or from HTML content.

Usage

Python Version: 3.8+

Setup

$ make .venv # sets up the virtual environment
$ make clean # removes the virtual environment folder

Pre-commit

pre-commit is installed automatically when the virtual environment is set up, and is used to enforce linting best practices.

$ make pre-commit

Test

$ make test

Install

pip install html-meta-data-parse

Example

from html_meta_data_parse import HtmlMetaDataParse
html_meta_data_parse = HtmlMetaDataParse()
html_meta_data_parse.get_meta_data_by_url('https://example.com/')

>>> html_meta_data_parse.get_meta_data_by_url("https://www.pcmag.com/news/cloudflare-mitigates-nearly-2-tbps-ddos-attack")
{
  'title': 'Cloudflare Mitigates Nearly 2 Tbps DDoS Attack',
  'image': 'https://i.pcmag.com/imagery/articles/00NczM1wpOM7qFzLIwNp6XG-1.fit_lim.size_1200x630.v1636923971.jpg',
  'content': 'The attack was reportedly launched from approximately 15,000 devices.',
  'type': 'article',
  'twitter_handle': '@pcmag',
  'site_name': 'PCMAG',
  'url': 'https://www.pcmag.com/news/cloudflare-mitigates-nearly-2-tbps-ddos-attack'
}

>>> html_meta_data_parse.get_meta_data_by_url("https://www.cnet.com/tech/mobile/how-the-covid-19-pandemic-shaped-samsungs-new-galaxy-phone-update-launching-today/")
{
  'author': 'https://www.facebook.com/cnet',
  'title': 'Samsung knows the pandemic changed tech, so Galaxy phones are changing too',
  'image': 'https://www.cnet.com/a/img/h15nl2OCT89fWO9h_-Jza3vf5w8=/0x0:4000x2667/1200x630/right/top/2021/01/20/249ee601-c66f-48c2-84c2-fbc7d1606c61/109-samsung-galaxy-s21-and-s21-ultra-comparison.jpg',
  'content': "The company's decisions were affected by our evolving relationship with our phones.",
  'type': 'article',
  'twitter_handle': '@CNET',
  'site_name': 'CNET',
  'url': 'https://www.cnet.com/tech/mobile/how-the-covid-19-pandemic-shaped-samsungs-new-galaxy-phone-update-launching-today/'
}
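
The two results above do not share the same keys: the CNET result has an 'author' entry and the PCMag result does not. A minimal sketch of reading such a result defensively follows. The summarize helper is hypothetical and not part of the library, and the dictionary is a shortened copy of the PCMag output shown above.

```python
def summarize(meta):
    """Build a one-line summary from a metadata dict, tolerating missing keys."""
    title = meta.get("title", "(untitled)")
    site = meta.get("site_name", "unknown site")
    author = meta.get("author")  # not every page provides an author
    line = f"{title} [{site}]"
    if author:
        line += f" by {author}"
    return line


# Shortened copy of the PCMag result shown above (no 'author' key).
pcmag = {
    "title": "Cloudflare Mitigates Nearly 2 Tbps DDoS Attack",
    "type": "article",
    "site_name": "PCMAG",
}
print(summarize(pcmag))
```

Using dict.get with a default keeps the caller from raising KeyError when a page omits a tag.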

# Parse HTML that you have already fetched
import requests
res = requests.get("https://example.com/")
html_meta_data_parse.get_meta_data_by_html(res.text)

# Alternatively, pass the URL and a proxy dictionary to the constructor
html_meta_data_parse = HtmlMetaDataParse(url="https://example.com/", proxy=<proxy_dict>)
html_meta_data_parse.get_meta_data_by_url()

Attributes

Functions

# url is required (the example above omits it only because the URL was passed to the constructor)
html_meta_data_parse.get_meta_data_by_url(url)

# html_text is required
html_meta_data_parse.get_meta_data_by_html(html_text=html_text)

Override Meta Keys

HtmlMetaDataParse uses a predefined set of keys to parse metadata from HTML content. You can override these keys by passing your own mapping as override_meta_keys.


html_meta_data_parse.get_meta_data_by_url(
    url,
    override_meta_keys,
)

html_meta_data_parse.get_meta_data_by_html(
    html_text,
    override_meta_keys,
)

# Sample meta keys mapping
meta_keys = {
        "author": {
            "name": [
                "author"
            ],
            "property": [
                "bt:author",
                "article:publisher",
                "dcterms.creator"
            ],
            "itemprop": [
                "author",
            ]

        },

        "title": {
            "name": [
                "title",
                "dcterms.title",
                "",
                "twitter:title"
            ],
            "property": [
                "og:title"
            ],
            "itemprop": [
                "title",
            ]
        },

        "image": {
            "name": [
                "image",
                "twitter:image",
                "thumbnail"
            ],
            "property": [
                "og:image"
            ],
            "itemprop": [
                "image",
            ]
        },

        "content": {
            "name": [
                "description",
                "twitter:description",
                "twitter:image:alt"
            ],
            "property": [
                "og:description",
                "og:image:alt"
            ],
            "itemprop": [
                "description",
            ]
        }
}
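
To illustrate how a mapping shaped like the sample above can drive extraction, here is a minimal standard-library sketch. It is not the library's implementation: MetaTagCollector and extract are hypothetical names, and the sketch assumes that the first meta tag whose name, property, or itemprop value appears in the mapping supplies the result.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the attributes of every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # attrs is a list of (name, value) pairs; value may be None
            self.tags.append({k: v for k, v in attrs if v is not None})


def extract(html_text, meta_keys):
    """Return {result_key: content} using the first matching <meta> tag per key."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    result = {}
    for result_key, attr_map in meta_keys.items():
        for tag in collector.tags:
            # A tag matches if any listed attribute carries an accepted value.
            matched = any(
                tag.get(attr) in accepted
                for attr, accepted in attr_map.items()
            )
            if matched and tag.get("content"):
                result[result_key] = tag["content"]
                break
    return result


sample_html = """
<html><head>
<meta property="og:title" content="Example Title">
<meta name="description" content="A short summary.">
</head></html>
"""
sample_keys = {
    "title": {"name": ["title"], "property": ["og:title"]},
    "content": {"name": ["description"], "property": ["og:description"]},
}
print(extract(sample_html, sample_keys))
# prints {'title': 'Example Title', 'content': 'A short summary.'}
```

In this sketch, adding an entry to one of the lists widens what counts as a match for that result key, which is the kind of adjustment an override mapping allows.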

Deploy

Increment the version in setup.py, then deploy to the target index:

$ make deploy STAGE=testpypi # test

$ make deploy STAGE=pypi # public

Authors

  • Immanuel George - Initial work
