Extract publication dates from web pages
A library that extracts the publication date from a web page, along with a measure of that date's accuracy. It was produced as part of the mediacloud project to date online content accurately.
Quickstart
The date guesser works from both the url and the html of a page, and uses heuristics to decide which of the many candidate dates is most likely the best one.
from date_guesser import DateGuesser, Accuracy
guesser = DateGuesser()
# Uses url slugs when available
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                           html='<could be anything></could>')
# Returns a namedtuple with three fields
guess.date # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy # Accuracy.DATE
guess.method # 'Found /2017/10/13/ in url'
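As a rough illustration of the url-slug heuristic shown above, the sketch below looks for a `/YYYY/MM/DD/` segment with a regular expression. The pattern and the `date_from_url` helper are hypothetical names made up for this example; they are not the library's actual implementation, which may handle many more url shapes.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern (an assumption, not the library's): /YYYY/MM/DD/ in the url.
URL_DATE = re.compile(
    r"/(?P<year>(?:19|20)\d{2})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/"
)


def date_from_url(url):
    """Return a UTC datetime for a /YYYY/MM/DD/ slug, or None if absent or invalid."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    try:
        return datetime(
            int(match["year"]), int(match["month"]), int(match["day"]),
            tzinfo=timezone.utc,
        )
    except ValueError:
        # The digits matched the pattern but are not a real date (e.g. month 13).
        return None


full = date_from_url("https://www.nytimes.com/2017/10/13/some_news.html")
partial = date_from_url("https://www.nytimes.com/2017/10/some_news.html")
```

Here `full` is a datetime for 13 October 2017, while `partial` is `None` because the url carries only a year and a month. A fuller heuristic would presumably fall back to a month-level guess in that case.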
In case there are two trustworthy sources of dates, date_guesser prefers the more accurate one:
html = '''
<html><head>
<meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
</head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME # True
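To make the meta-tag case more concrete, here is a minimal sketch that reads a `datePublished` meta tag using only the standard library. The `PublishedMetaParser` class and `date_from_meta` function are hypothetical helpers for illustration; the library itself may rely on a different parser and recognize many more tags.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedMetaParser(HTMLParser):
    """Collect the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        attrs = dict(attrs)
        if (
            tag == "meta"
            and attrs.get("itemprop") == "datePublished"
            and self.published is None
        ):
            self.published = attrs.get("content")


def date_from_meta(html):
    """Return the parsed datePublished value, or None if missing or unparseable."""
    parser = PublishedMetaParser()
    parser.feed(html)
    if parser.published is None:
        return None
    try:
        # ISO 8601 strings with a UTC offset parse on Python 3.7+.
        return datetime.fromisoformat(parser.published)
    except ValueError:
        return None


sample = '''
<html><head>
<meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
</head></html>'''
published = date_from_meta(sample)
```

The resulting `published` value carries the time of day and the UTC offset from the tag, which is why a source like this can be more accurate than a url slug.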
But date_guesser is not led astray by more accurate, less trustworthy sources of information:
html = '''
<html><head>
<meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
</head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL # True
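The two preceding examples suggest a selection rule: trust comes first, and accuracy only breaks ties among equally trusted sources. The sketch below expresses that rule under stated assumptions. `Precision`, `Candidate`, and `pick_best` are hypothetical stand-ins invented for this example, and the numeric ordering of the levels is an assumption rather than something taken from the library's `Accuracy` enum.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class Precision(IntEnum):
    # Stand-in for the library's accuracy levels; the ordering is an assumption.
    NONE = 0
    PARTIAL = 1
    DATE = 2
    DATETIME = 3


Candidate = namedtuple("Candidate", ["date", "precision", "trusted", "method"])


def pick_best(candidates):
    """Prefer trusted candidates; among equally trusted ones, prefer higher precision."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.trusted, c.precision))


url_guess = Candidate(
    datetime(2017, 10, 15, tzinfo=timezone.utc),
    Precision.PARTIAL, True, "year and month in url",
)
image_guess = Candidate(
    datetime(2016, 7, 4, tzinfo=timezone.utc),
    Precision.DATE, False, "date in og:image path",
)
meta_guess = Candidate(
    datetime(2017, 10, 13, 4, 56, 54, tzinfo=timezone(timedelta(hours=-4))),
    Precision.DATETIME, True, "datePublished meta tag",
)

best_without_meta = pick_best([url_guess, image_guess])
best_with_meta = pick_best([url_guess, image_guess, meta_guess])
```

With only the url and the image path available, the trusted but partial url guess wins; once the trusted meta tag is present, its full datetime is preferred.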
Installation
The library is not yet available on PyPI, so for now it can only be installed from GitHub:
pip install git+https://github.com/mitmedialab/date_guesser
Performance
We benchmarked the accuracy against the wonderful newspaper library, using one hundred urls gathered from each of four very different topics in the mediacloud system. This includes blogs and news articles, as well as many urls that have no date (in which case a guess is marked correct only if it returns None).
Vaccines
Tolerance | date_guesser | newspaper
---|---|---
1 day | 57 | 48
7 days | 61 | 51
15 days | 66 | 53
Aadhar Card in India
Tolerance | date_guesser | newspaper
---|---|---
1 day | 73 | 44
7 days | 74 | 44
15 days | 74 | 44
Donald Trump in 2017
Tolerance | date_guesser | newspaper
---|---|---
1 day | 79 | 60
7 days | 83 | 61
15 days | 85 | 61
Recipes for desserts and chocolate
Tolerance | date_guesser | newspaper
---|---|---
1 day | 83 | 65
7 days | 85 | 69
15 days | 87 | 69
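For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way to score guesses at the 1, 7, and 15 day tolerances used in the tables. It assumes a guess counts as correct when it falls within the tolerance of the true date (boundary included), and that an undated page is scored correct only when the guess is `None`. The `is_correct` and `count_correct` helpers and the sample data are hypothetical, not the benchmark code actually used.

```python
from datetime import datetime, timedelta, timezone


def is_correct(guess, truth, tolerance_days):
    """Score one guess against the true date for a given tolerance in days."""
    if guess is None or truth is None:
        # An undated page is only correct when no date was guessed, and vice versa.
        return guess is None and truth is None
    # Both datetimes are assumed to be timezone-aware.
    return abs(guess - truth) <= timedelta(days=tolerance_days)


def count_correct(pairs, tolerance_days):
    """Count the (guess, truth) pairs scored correct at this tolerance."""
    return sum(is_correct(guess, truth, tolerance_days) for guess, truth in pairs)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample data, for illustration only.
pairs = [
    (utc(2017, 10, 13), utc(2017, 10, 13)),  # exact match
    (utc(2017, 10, 10), utc(2017, 10, 13)),  # three days off
    (None, None),                            # undated page, no guess
    (None, utc(2017, 10, 13)),               # missed a real date
    (utc(2017, 10, 13), None),               # guessed a date for an undated page
]
scores = {days: count_correct(pairs, days) for days in (1, 7, 15)}
```

Because each topic in the tables contains one hundred urls, counts produced this way can also be read directly as percentages.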