
Extract publication dates from web pages

A library that extracts the publication date from a web page, along with a measure of that date's accuracy. It was produced as part of the mediacloud project to accurately extract dates from content.

Quickstart

The date guesser uses both the URL and the HTML of a page, and applies heuristics to decide which of the many possible dates is the best one.

from date_guesser import DateGuesser, Accuracy

guesser = DateGuesser()

# Uses url slugs when available
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                           html='<could be anything></could>')

# Returns a namedtuple with three fields
guess.date      # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy  # Accuracy.DATE
guess.method    # 'Found /2017/10/13/ in url'
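To illustrate the kind of URL heuristic described above, here is a minimal sketch that looks for a `/YYYY/MM/DD/` slug. It is not date_guesser's actual implementation: the pattern `URL_DATE` and the function `date_from_url` are hypothetical names chosen for this example.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: a /YYYY/MM/DD/ slug anywhere in the URL.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url(url):
    """Return a UTC datetime for the first /YYYY/MM/DD/ slug, or None."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # The digits matched but are not a real calendar date, e.g. /2017/13/45/.
        return None
```

A URL that contains only a year and a month, such as `https://www.nytimes.com/2017/10/some_news.html`, does not match this pattern, so the sketch returns `None` for it.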

When there are two trustworthy sources of dates, date_guesser prefers the more accurate one:

html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME  # True
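As a rough illustration of how a timestamp like the one above could be read from a page, the sketch below uses only the standard library to find a `<meta>` tag with `itemprop="datePublished"` and parse its ISO 8601 `content`. The class `PublishedMetaParser` and the function `date_from_meta` are hypothetical; date_guesser itself may parse HTML and dates quite differently.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedMetaParser(HTMLParser):
    """Remember the first <meta itemprop="datePublished"> timestamp seen."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        if tag != "meta" or self.published is not None:
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if attributes.get("itemprop") == "datePublished" and content:
            try:
                self.published = datetime.fromisoformat(content)
            except ValueError:
                # Not an ISO 8601 timestamp; leave the result unset.
                pass


def date_from_meta(html):
    """Return the datePublished timestamp found in html, or None."""
    parser = PublishedMetaParser()
    parser.feed(html)
    return parser.published
```

Because the timestamp carries a UTC offset, the parsed value is a timezone-aware `datetime` with time-of-day precision, which is what makes it more precise than a date taken from a URL slug.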

But date_guesser is not led astray by more accurate but less trustworthy sources of information:

html = '''
    <html><head>
    <meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL  # True
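The two preferences shown above (take the most accurate date, but only from trustworthy sources) can be sketched as a small selection step. Everything here is an assumption made for illustration: `Precision` is a stand-in for the library's `Accuracy`, and `Candidate`, its `trusted` flag, and `pick_best` are hypothetical names, not part of date_guesser.

```python
from collections import namedtuple
from datetime import datetime, timezone
from enum import IntEnum


class Precision(IntEnum):
    """Hypothetical ordering: a higher value means a more precise date."""
    PARTIAL = 1
    DATE = 2
    DATETIME = 3


# Hypothetical record for one candidate date found on a page.
Candidate = namedtuple("Candidate", ["date", "precision", "method", "trusted"])


def pick_best(candidates):
    """Return the most precise trusted candidate, or None if there is none."""
    trusted = [c for c in candidates if c.trusted]
    if not trusted:
        return None
    return max(trusted, key=lambda c: c.precision)


candidates = [
    # Year and month from the article URL: imprecise but trustworthy.
    Candidate(datetime(2017, 10, 15, tzinfo=timezone.utc),
              Precision.PARTIAL, "url slug /2017/10/", True),
    # Full date from an image path: precise but not about the article.
    Candidate(datetime(2016, 7, 4, tzinfo=timezone.utc),
              Precision.DATE, "og:image path", False),
]
best = pick_best(candidates)
print(best.method)  # url slug /2017/10/
```

Filtering on trust before comparing precision is what keeps the image path from winning here, even though it names a full date.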

Installation

To install the latest version directly from GitHub:

pip install git+https://github.com/mitmedialab/date_guesser

Performance

We benchmarked the accuracy against the wonderful newspaper library, using one hundred URLs gathered from each of four very different topics in the mediacloud system. The sample includes blogs and news articles, as well as many URLs that have no date (in which case a guess is marked correct only if it returns None).

Vaccines    date_guesser    newspaper
1 day       57              48
7 days      61              51
15 days     66              53

Aadhar Card in India    date_guesser    newspaper
1 day                   73              44
7 days                  74              44
15 days                 74              44

Donald Trump in 2017    date_guesser    newspaper
1 day                   79              60
7 days                  83              61
15 days                 85              61

Recipes for desserts and chocolate    date_guesser    newspaper
1 day                                 83              65
7 days                                85              69
15 days                               87              69
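The text does not spell out the exact scoring rule behind these numbers. The sketch below shows one plausible reading, under two assumptions: a guess counts as correct when it falls within the given number of days of the true date, and a page with no date counts as correct only when the guess is also `None`. The functions `is_correct` and `score` are hypothetical and are not part of date_guesser or its benchmark code.

```python
from datetime import datetime, timedelta, timezone


def is_correct(guess, truth, tolerance_days):
    """Assumed rule: match within tolerance, or both dates missing."""
    if guess is None or truth is None:
        return guess is None and truth is None
    # Both datetimes must be timezone-aware (or both naive) to be subtracted.
    return abs(guess - truth) <= timedelta(days=tolerance_days)


def score(pairs, tolerance_days):
    """Count how many (guess, truth) pairs are correct at this tolerance."""
    return sum(is_correct(guess, truth, tolerance_days) for guess, truth in pairs)
```

With one hundred URLs per topic, a count produced this way can also be read as a percentage.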

