Skip to main content

A module to parse metadata out of documents

Project description

MetadataParser is a python module for pulling metadata out of web documents.

It requires BeautifulSoup , and was largely based on Erik River’s opengraph module ( https://github.com/erikriver/opengraph ).

I needed something more aggressive than Erik’s module , so had to fork.

Installation

pip install metadata_parser

Features

  • it pulls as much metadata out of a document as possible

  • you can set a ‘strategy’ for finding metadata ( ie, only accept opengraph or page attributes )

Notes

  1. This requires BeautifulSoup 3 or 4. If it can import bs4 it does, otherwise it tries BeautifulSoup (3)

  2. For speed, it will instantiate a BeautifulSoup parser with lxml , and fall back to ‘none’ (the internal pure python) if it can’t load lxml

The default ‘strategy’ is to look in this order:

og,dc,meta,page og = OpenGraph dc = DublinCore meta = metadata page = page elements

You can specify a strategy as a comma-separated list of the above.

The only 2 page elements currently supported are:

<title>VALUE</title> -> metadata[‘page’][‘title’] <link rel=”canonical” href=”VALUE”> -> metadata[‘page’][‘link’]

Usage

From an URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print page.metadata
>>> print page.get_field('title')
>>> print page.get_field('title',strategy='og')
>>> print page.get_field('title',strategy='page,og,dc')

From HTML

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print page.metadata
>>> print page.get_field('title')
>>> print page.get_field('title',strategy='og')
>>> print page.get_field('title',strategy='page,og,dc')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metadata_parser-0.4.13.tar.gz (6.3 kB view details)

Uploaded Source

File details

Details for the file metadata_parser-0.4.13.tar.gz.

File metadata

File hashes

Hashes for metadata_parser-0.4.13.tar.gz
Algorithm Hash digest
SHA256 62b0ca29e5f7d83a79544834f276eef7357b67c46489391367a418844bab7e39
MD5 2aa9836c623a395f9a4ba9fcd7fbccf8
BLAKE2b-256 6628ab5cd10f9922118d32e1b89980a55858483f0232fc788cd84c6f1342a358

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page