Seamlessly extract the date of web pages based on the HTML code in order to determine the creation or modification date.
Project description
Description
Seamless extraction of the creation or modification date of web pages. htmldate provides following ways to date documents, based on HTML parsing and scraping functions:
Starting from the header of the page, it uses common patterns to identify date fields.
If this is not successful, it then scans the whole document.
If no date cue could be found, it finally run a series of heuristics on the content.
Pull requests are welcome.
Usage
The module takes the HTML document as input (string format) and returns a date when a valid cue could be found in the document. The output string defaults to ISO 8601 YMD format.
According to the tests it should be compatible with all common versions of Python (2 & 3).
Install from package repository: pip install htmldate
Direct installation of the latest version over pip is possible (see build status):
pip install git+https://github.com/adbar/htmldate.git
Within Python
All the functions of the module are currently bundled in htmldate, the examples below use the external module requests.
In case the web page features clear metadata, the extraction is straightforward:
>>> import requests
>>> import htmldate
>>> r = requests.get('r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
>>> htmldate.find_date(r.text)
'2016-02-17'
A more advanced analysis is sometimes needed:
>>> r = requests.get('http://blog.python.org/2016/12/python-360-is-now-available.html')
>>> core.find_date(r.text)
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'
In the worst case, the module resorts to a wild guess:
>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
'2016-05-01'
There are however pages for which no date can be found, ever:
>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>
Command-line
A basic command-line interface is included:
$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
2016-12-23
Additional information
Context
There are webpages for which neither the URL nor the server response provide a reliable way to date the document, i.e. find when it was written.
This module is part of methods to derive metadata from web documents in order to build text corpora for (computational) linguistic analysis. For more information:
Barbaresi, Adrien. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.
Kudos to…
dateparser (although it’s is still a bit slow)
A few patterns are derived from python-goose, metascraper, newspaper and articleDateExtractor. This module extends them significantly.
Contact
See my contact page for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for htmldate-0.1.2-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7c21523e60b27295c4ecfb91faa8b239fcd77ef5950287e70d1ec2093a8e5b84 |
|
MD5 | dfc5c760e5490de2df4d913fc265ce4e |
|
BLAKE2b-256 | 409644751b7039b9c31b66601f5bb674d04e467aff1c7de23de6cd8264e689e8 |