Skip to main content
This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

Seamlessly extract the creation or modification date of web pages by scraping the HTML code or performing content guesses.

Project Description

Description

Seamless extraction of the creation or modification date of web pages. htmldate provides following ways to date documents, based on HTML parsing and scraping functions:

  1. Starting from the header of the page, it uses common patterns to identify date fields.
  2. If this is not successful, it scans the whole document looking for structural markers.
  3. If no date cue could be found, it finally runs a series of heuristics on the content.

Pull requests are welcome.

Usage

The module takes the HTML document as input (string format) and returns a date when a valid cue could be found in the document. The output string defaults to ISO 8601 YMD format.

According to the tests it should be compatible with all common versions of Python (2 & 3).

Install from package repository: pip install htmldate

Direct installation of the latest version over pip is possible (see build status):

pip install git+https://github.com/adbar/htmldate.git

Within Python

All the functions of the module are currently bundled in htmldate, the examples below use the external module requests.

In case the web page features clear metadata in the header, the extraction is straightforward:

>>> import requests
>>> import htmldate
>>> r = requests.get('r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')

>>> htmldate.find_date(r.text)
'2016-02-17'

A more advanced analysis of the document structure is sometimes needed:

>>> r = requests.get('http://blog.python.org/2016/12/python-360-is-now-available.html')
>>> core.find_date(r.text)
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'

In the worst case, the module resorts to a guess based on an extensive search, which can be deactivated:

>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
'2017-08-11'
>>> htmldate.find_date(r.text, False)
>>>

There are however pages for which no date can be found, ever:

>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>

Command-line

A basic command-line interface is included:

$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
2016-12-23

Usage:

$ htmldate --help
htmldate [-h] [-v] [-s]
optional arguments:
    -h, --help     show this help message and exit
    -v, --verbose  increase output verbosity
    -s, --safe     safe mode: markup search only

Additional information

Context

There are webpages for which neither the URL nor the server response provide a reliable way to date the document, i.e. find when it was first published and/or last modified.

This module is part of methods to derive metadata from web documents in order to build text corpora for (computational) linguistic analysis. For more information:

Kudos to…

Further analyses

If the date is nowhere to be found, it might be worth considering carbon dating the web page, however this is computationally expensive.

Contact

See my contact page for details.

Release History

Release History

This version
History Node

0.2.1

History Node

0.2.0

History Node

0.1.2

History Node

0.1.1

History Node

0.1.0

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
htmldate-0.2.1-py2.py3-none-any.whl (12.4 kB) Copy SHA256 Checksum SHA256 py2.py3 Wheel Sep 11, 2017
htmldate-0.2.1.tar.gz (353.3 kB) Copy SHA256 Checksum SHA256 Source Sep 11, 2017

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting