Extract publication dates from web pages

A library to extract a publication date from a web page, along with a measure of the accuracy. This was produced as a part of the mediacloud project, in order to accurately extract dates from content.

Installation

The library is available on PyPI, and may be installed with

pip install date_guesser

Quickstart

The date guesser uses both the URL and the HTML of a page, and applies heuristics to decide which of the many candidate dates is most likely the publication date.

from date_guesser import guess_date, Accuracy

# Uses url slugs when available
guess = guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                   html='<could be anything></could>')

#  Returns a Guess object with three properties
guess.date      # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy  # Accuracy.DATE
guess.method    # 'Found /2017/10/13/ in url'
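To make the URL heuristic concrete, here is a minimal standard-library sketch of how a /YYYY/MM/DD/ slug could be pulled out of a URL. It is an illustration only, not date_guesser's actual implementation; the names SLUG_RE and date_from_url_slug are hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: a /YYYY/MM/DD/ slug anywhere in the URL.
SLUG_RE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url_slug(url):
    """Return a UTC midnight datetime for a /YYYY/MM/DD/ slug, or None."""
    match = SLUG_RE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # The digits do not form a real calendar date, e.g. /2017/13/45/
        return None
```

A URL that carries only a year and a month does not match this pattern, so the sketch returns None for it rather than inventing a day.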

When there are two trustworthy sources of dates, date_guesser prefers the more accurate one:

html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
guess = guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                   html=html)
guess.date  # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME  # True
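The meta tag lookup above can be approximated with the standard library alone. The sketch below is an assumption about one way to do it, not the library's code: PublishedMetaParser and date_from_meta are hypothetical names, and it only looks at itemprop="datePublished" with an ISO 8601 value.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedMetaParser(HTMLParser):
    """Collect the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through this method.
        if tag != "meta" or self.published is not None:
            return
        attributes = dict(attrs)
        if attributes.get("itemprop") == "datePublished":
            self.published = attributes.get("content")


def date_from_meta(html):
    """Return a timezone-aware datetime from the meta tag, or None."""
    parser = PublishedMetaParser()
    parser.feed(html)
    if parser.published is None:
        return None
    try:
        return datetime.fromisoformat(parser.published)
    except ValueError:
        # The content attribute was missing or not ISO 8601.
        return None
```

Unlike the output shown above, this sketch returns a datetime.timezone offset rather than a dateutil tzoffset, but the UTC offset is the same.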

However, date_guesser is not led astray by sources that are more accurate but less trustworthy:

html = '''
    <html><head>
    <meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
    </head></html>'''
guess = guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                   html=html)
guess.date  # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL  # True
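The two preference rules illustrated above (the most accurate date wins, but only among trustworthy sources) can be sketched as a small selection function. Everything here is hypothetical: Precision, Candidate, and best_candidate are illustrative names, the boolean trusted flag is a simplifying assumption, and none of it is taken from date_guesser's internals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import List, Optional


class Precision(IntEnum):
    """Hypothetical ordering; not date_guesser's Accuracy enum."""
    NONE = 0
    PARTIAL = 1   # year and month only
    DATE = 2      # year, month, and day
    DATETIME = 3  # full timestamp


@dataclass
class Candidate:
    date: datetime
    precision: Precision
    trusted: bool  # assumption: trust is a simple yes/no flag
    method: str


def best_candidate(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the most precise candidate among the trusted ones, if any."""
    trusted = [c for c in candidates if c.trusted]
    if not trusted:
        return None
    return max(trusted, key=lambda c: c.precision)
```

With a trusted PARTIAL candidate from the URL and an untrusted DATE candidate from an image path, this sketch returns the PARTIAL one, mirroring the example above.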

Future Work

Languages

The code does quite poorly on non-English news sources. This page is Ukrainian and has a date on it that a non-Ukrainian speaker could identify, but it is not extracted:

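import requests
from date_guesser import guess_date, Accuracy

# Define the url first so it can be passed to both requests and guess_date
url = 'https://www.dw.com/uk/коментар-націоналізм-родом-зі-східної-європи/a-42081385'
guess = guess_date(url=url, html=requests.get(url).text)
guess.date  # None
guess.accuracy is Accuracy.NONE  # True
guess.method == 'Did not find anything'  # True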

Reckless Mode

We keep track of the accuracy of extracted dates, but not of our confidence that an extracted date is correct. Tracking that confidence may be a way to do more tuning for a particular use case. For example, one strategy we do not employ is a regex for all the date patterns we recognize, since that was far too error-prone. Such an approach might be preferable to returning None in certain cases.
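To illustrate why a blanket regex is tempting but error-prone, here is a small sketch of such a fallback. It is not part of date_guesser; reckless_guess and ISO_DATE_RE are hypothetical names, and the pattern covers only YYYY-MM-DD dates.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: only ISO-style YYYY-MM-DD dates are considered.
ISO_DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def reckless_guess(text):
    """Return the first valid YYYY-MM-DD date found in text, or None."""
    for match in ISO_DATE_RE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            return datetime(year, month, day, tzinfo=timezone.utc)
        except ValueError:
            # Skip digit runs that are not real calendar dates.
            continue
    return None
```

Given the text "Founded 1999-01-01. Published 2017-10-13.", this sketch returns the 1999 date, which is exactly the kind of mistake that makes the approach risky as a default.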

Performance

We benchmarked the accuracy against the wonderful newspaper library, using one hundred URLs gathered from each of four very different topics in the mediacloud system. This includes blogs and news articles, as well as many URLs that have no date (in which case a guess is marked correct only if it returns None).

Vaccines    date_guesser    newspaper
1 day       57              48
7 days      61              51
15 days     66              53

Aadhar Card in India    date_guesser    newspaper
1 day                   73              44
7 days                  74              44
15 days                 74              44

Donald Trump in 2017    date_guesser    newspaper
1 day                   79              60
7 days                  83              61
15 days                 85              61

Recipes for desserts and chocolate    date_guesser    newspaper
1 day                                 83              65
7 days                                85              69
15 days                               87              69
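The scoring rule described for the benchmark can be sketched as follows. This assumes the 1, 7, and 15 day rows are tolerances around the true publication date, which the text does not state explicitly; is_correct and count_correct are hypothetical names, not part of either library.

```python
from datetime import datetime, timedelta, timezone


def is_correct(guess, truth, tolerance_days):
    """Score one guess against the true date.

    Assumption: a guess counts as correct when it falls within
    tolerance_days of the truth. For an undated page (truth is None),
    only a guess of None counts. Both datetimes must be timezone-aware
    (or both naive) for the subtraction to work.
    """
    if guess is None or truth is None:
        return guess is None and truth is None
    return abs(guess - truth) <= timedelta(days=tolerance_days)


def count_correct(pairs, tolerance_days):
    """Count correct guesses in an iterable of (guess, truth) pairs."""
    return sum(is_correct(guess, truth, tolerance_days) for guess, truth in pairs)
```

Because each topic uses one hundred URLs, a count produced this way would read directly as a percentage.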
