Seamlessly extract the creation or modification date of web pages by scraping the HTML code or performing content guesses.

Description

Seamless extraction of the creation or modification date of web pages. htmldate provides the following ways to date documents, based on HTML parsing and scraping functions:

  1. Starting from the header of the page, it uses common patterns to identify date fields.

  2. If this is not successful, it scans the whole document looking for structural markers.

  3. If no date cue could be found, it finally runs a series of heuristics on the content (text and markup).
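The three steps above can be pictured as a cascade of increasingly permissive searches. The following is a minimal sketch of that idea, not htmldate's actual code: the function name `sketch_find_date` and the three regular expressions are hypothetical simplifications that only recognize ISO-style dates.

```python
import re

ISO_DATE = r"(\d{4}-\d{2}-\d{2})"

# Step 1 (assumed pattern): date fields in <meta> tags of the header.
HEADER_PATTERN = re.compile(
    r'<meta[^>]+(?:name|property)="[^"]*(?:date|published_time|modified_time)[^"]*"'
    r'[^>]+content="' + ISO_DATE,
    re.IGNORECASE,
)

# Step 2 (assumed pattern): a structural marker, here a <time> element.
STRUCTURE_PATTERN = re.compile(r'<time[^>]+datetime="' + ISO_DATE, re.IGNORECASE)

# Step 3 (assumed pattern): last resort, any ISO-like date in the content.
CONTENT_PATTERN = re.compile(r"\b" + ISO_DATE + r"\b")


def sketch_find_date(html):
    """Try each strategy in order and return the first match, else None."""
    for pattern in (HEADER_PATTERN, STRUCTURE_PATTERN, CONTENT_PATTERN):
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None
```

The ordering is the point of the design: cheap and reliable cues are tried first, and the loosest heuristic only runs when everything else has failed.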

Usage

The module takes the HTML document as input (string format) and returns a date if a valid cue could be found in the document. The output string defaults to ISO 8601 YMD format.
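Since the result is a plain string, it may be convenient to turn it into a date object before further processing. The helper below is a small sketch for that purpose; `to_date` is a hypothetical name that is not part of htmldate, and the code assumes that a missing date is signalled by `None`.

```python
from datetime import datetime


def to_date(result, fmt="%Y-%m-%d"):
    """Convert an extracted date string into a datetime.date.

    `result` is the string returned by the extractor, or None when
    no date was found (assumption); None is passed through unchanged.
    """
    if result is None:
        return None
    return datetime.strptime(result, fmt).date()
```

If a different output format is requested from the module, the same format string has to be passed as `fmt`.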

According to the tests, it should be compatible with all common versions of Python (2 & 3).

Installation

Install from package repository: pip install htmldate

Direct installation of the latest version over pip is possible (see build status):

pip install git+https://github.com/adbar/htmldate.git

Command-line

A basic command-line interface is included:

$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
2016-12-23

Usage:

$ htmldate --help
htmldate [-h] [-v] [-s]
optional arguments:
    -h, --help     show this help message and exit
    -v, --verbose  increase output verbosity
    -s, --safe     safe mode: markup search only

Within Python

All the functions of the module are currently bundled in htmldate. The examples below use the external module requests.

In case the web page features clear metadata in the header, the extraction is straightforward:

>>> import requests
>>> import htmldate
>>> r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
>>> htmldate.find_date(r.text)
'2016-02-17'

A more advanced analysis of the document structure is sometimes needed:

>>> r = requests.get('http://blog.python.org/2016/12/python-360-is-now-available.html')
>>> htmldate.find_date(r.text)
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'

In the worst case, the module resorts to a guess based on an extensive search, which can be deactivated:

>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
'2017-08-11' # has been updated since
>>> htmldate.find_date(r.text, extensive_search=False)
>>>
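Because a guessed date is less trustworthy than one read from markup, it can be useful to record which mode produced the result. The wrapper below is only a sketch: `date_with_source` is a hypothetical helper, and it assumes that the `finder` argument behaves like `htmldate.find_date` in the example above, i.e. it accepts `extensive_search` and returns `None` when nothing is found.

```python
def date_with_source(htmlstring, finder):
    """Return (date, source), where source says how the date was obtained.

    `finder` is expected to behave like htmldate.find_date (assumption).
    """
    # First pass: markup only, no content guessing.
    result = finder(htmlstring, extensive_search=False)
    if result is not None:
        return result, "markup"
    # Second pass: allow the extensive search, i.e. a guess.
    result = finder(htmlstring, extensive_search=True)
    if result is not None:
        return result, "guess"
    return None, "none"
```

With the module installed, it would be called as `date_with_source(r.text, htmldate.find_date)`.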

It is also possible to use already parsed HTML (i.e. an LXML tree object):

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><span class="entry-date">July 12th, 2016</span></body></html>')
>>> htmldate.find_date(mytree)
'2016-07-12'

The output format can be set with any format string known to Python’s datetime module, the default being %Y-%m-%d:

>>> r = requests.get('https://www.gnu.org/licenses/gpl-3.0.en.html')
>>> htmldate.find_date(r.text)
'2016-11-18'
>>> htmldate.find_date(r.text, outputformat='%d %B %Y')
'18 November 2016'
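The format strings follow the usual strftime conventions, so they can be tried out with the standard library alone. The snippet below illustrates the two formats used above on a fixed date; it does not call htmldate, and the month name assumes the default (English) locale.

```python
from datetime import date

found = date(2016, 11, 18)

# Default format: ISO 8601 year-month-day.
default_output = found.strftime("%Y-%m-%d")

# Custom format: day, full month name, year.
custom_output = found.strftime("%d %B %Y")

print(default_output)
print(custom_output)
```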

There are, however, pages for which no date can be found, ever:

>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>

Tests

A series of webpages triggering different structural and content patterns is included for testing purposes:

$ python tests/unit_tests.py

Additional information

Context

There are web pages for which neither the URL nor the server response provides a reliable way to date the document, i.e. find when it was first published and/or last modified.

This module is part of a set of methods for deriving metadata from web documents in order to build text corpora for (computational) linguistic analysis. For more information:

Kudos to…

Further analyses

If the date is nowhere to be found, it might be worth considering carbon dating the web page; however, this is computationally expensive.

Pull requests are welcome.

Contact

See my contact page for details.
