date-guesser

Extract publication dates from web pages

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.6

Project description

A library that extracts a publication date from a web page, along with a measure of that date's accuracy. It was produced as part of the mediacloud project to extract dates accurately from web content.

Installation

The library is available on PyPI and may be installed with:

pip install date_guesser

Quickstart

The date guesser uses both the URL and the HTML of a page, and applies heuristics to decide which of the many candidate dates is the best one.

from date_guesser import guess_date, Accuracy

# Uses url slugs when available
guess = guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                   html='<could be anything></could>')

# Returns a Guess object with three properties
guess.date      # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy  # Accuracy.DATE
guess.method    # 'Found /2017/10/13/ in url'
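To make the URL heuristic concrete, here is a minimal sketch of how a /YYYY/MM/DD/ slug could be pulled out of a URL with the standard library. It is an illustration only, not date_guesser's own code; the names URL_DATE and date_from_url are hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: a /YYYY/MM/DD/ slug anywhere in the URL.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url(url):
    """Return a UTC datetime for a /YYYY/MM/DD/ slug, or None if there is none."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # Digits that look like a date but are not one, e.g. /2017/13/45/
        return None


print(date_from_url("https://www.nytimes.com/2017/10/13/some_news.html"))
# 2017-10-13 00:00:00+00:00
```

A URL that only carries a year and a month, such as /2017/10/some_news.html, does not match this pattern and yields None in the sketch.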

When there are two trustworthy sources of dates, date_guesser prefers the more accurate one:

html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
guess = guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                   html=html)
guess.date  # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME  # True
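For readers curious how a timestamp like the one above can be read from a meta tag, the following sketch uses only the standard library (Python 3.7 or later for datetime.fromisoformat). It assumes the tag carries itemprop="datePublished" with an ISO 8601 value; the class and function names are hypothetical, and the library itself may well work differently.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedMetaParser(HTMLParser):
    """Remember the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        attributes = dict(attrs)
        if (tag == "meta" and self.published is None
                and attributes.get("itemprop") == "datePublished"):
            self.published = attributes.get("content")


def published_datetime(html):
    """Return a timezone-aware datetime from the meta tag, or None."""
    parser = PublishedMetaParser()
    parser.feed(html)
    if parser.published is None:
        return None
    try:
        return datetime.fromisoformat(parser.published)
    except ValueError:
        # The attribute was present but not a valid ISO 8601 timestamp.
        return None


html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
print(published_datetime(html))  # 2017-10-13 04:56:54-04:00
```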

But date_guesser is not led astray by more accurate, less trustworthy sources of information:

html = '''
    <html><head>
    <meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
    </head></html>'''
guess = guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                   html=html)
guess.date  # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL  # True
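The two rules illustrated above (prefer the more accurate of two trustworthy sources, and ignore untrustworthy ones however precise they are) can be sketched as a filter followed by a ranking. Everything here is an assumption made for illustration: the Precision enum, its numeric ordering, the Candidate tuple, and the pick function are hypothetical and are not part of the date_guesser API.

```python
from enum import IntEnum
from typing import NamedTuple


class Precision(IntEnum):
    """Hypothetical ordering, from least to most precise."""
    NONE = 0
    PARTIAL = 1   # year and month only
    DATE = 2
    DATETIME = 3


class Candidate(NamedTuple):
    value: str
    precision: Precision
    trusted: bool


def pick(candidates):
    """Drop untrusted candidates first, then take the most precise one left."""
    trusted = [c for c in candidates if c.trusted]
    if not trusted:
        return None
    return max(trusted, key=lambda c: c.precision)


url_slug = Candidate("2017-10", Precision.PARTIAL, trusted=True)
meta_tag = Candidate("2017-10-13T04:56:54-04:00", Precision.DATETIME, trusted=True)
image_path = Candidate("2016-07-04", Precision.DATE, trusted=False)

print(pick([url_slug, meta_tag]).value)    # 2017-10-13T04:56:54-04:00
print(pick([url_slug, image_path]).value)  # 2017-10
```

Filtering on trust before ranking on precision is what keeps the image path, which is more precise but unrelated to publication, from winning over the URL slug.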

Future Work

Languages

The code does quite poorly on non-English news sources. The page below is Ukrainian and has a date on it that a non-Ukrainian speaker could identify, but it is not extracted:

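import requests
from date_guesser import guess_date, Accuracy

# Define the URL once so it can be used for both the fetch and the guess
url = 'https://www.dw.com/uk/коментар-націоналізм-родом-зі-східної-європи/a-42081385'
guess = guess_date(url=url, html=requests.get(url).text)
guess.date  # None
guess.accuracy is Accuracy.NONE  # True
guess.method == 'Did not find anything'  # True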
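One possible direction, shown only as a rough sketch, is a per-language table of month names. The example below maps Ukrainian genitive month names to month numbers and parses strings of the form "9 січня 2018". The sample sentence is invented for illustration, the function name is hypothetical, and nothing here is implemented in date_guesser or known to match the page above.

```python
import re
from datetime import datetime

# Ukrainian month names in the genitive case, as used in written dates.
UKRAINIAN_MONTHS = {
    "січня": 1, "лютого": 2, "березня": 3, "квітня": 4,
    "травня": 5, "червня": 6, "липня": 7, "серпня": 8,
    "вересня": 9, "жовтня": 10, "листопада": 11, "грудня": 12,
}

# Day, a word, then a four-digit year; \w matches Cyrillic letters in Python 3.
DAY_MONTH_YEAR = re.compile(r"(\d{1,2})\s+(\w+)\s+(\d{4})")


def parse_ukrainian_date(text):
    """Return the first 'day month year' date found in text, or None."""
    for day, month_name, year in DAY_MONTH_YEAR.findall(text):
        month = UKRAINIAN_MONTHS.get(month_name.lower())
        if month is None:
            continue
        try:
            return datetime(int(year), month, int(day))
        except ValueError:
            continue
    return None


# Invented sample text, not taken from any real page.
print(parse_ukrainian_date("Опубліковано 9 січня 2018 року"))  # 2018-01-09 00:00:00
```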

Reckless Mode

We keep track of the accuracy of extracted dates, but not of how confident we are that an extracted date is correct. Tracking confidence would allow more tuning for a particular use case. For example, one strategy we do not employ is a regex for all the date patterns we recognize, since that was far too error-prone. Such an approach might still be preferable to returning None in certain cases.
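As a sketch of what such a fallback might look like, the snippet below scans free text for a single pattern, YYYY-MM-DD, and returns every valid match. The pattern and the function name reckless_dates are hypothetical and are not part of date_guesser. The example also shows why the approach is error-prone: the text contains two plausible dates, and nothing in the regex says which one is the publication date.

```python
import re
from datetime import datetime

# Hypothetical fallback: only one of the many formats a real page might use.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def reckless_dates(text):
    """Return every valid YYYY-MM-DD date in text, in order of appearance."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(datetime(int(year), int(month), int(day)))
        except ValueError:
            # Looks like a date but is not one, e.g. a part number.
            continue
    return found


text = "Published 2017-10-13. Sale ends 2018-01-01, part no. 1234-56-78."
for candidate in reckless_dates(text):
    print(candidate.date())
# 2017-10-13
# 2018-01-01
```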

Performance

We benchmarked the accuracy against the wonderful newspaper library, using one hundred URLs gathered from each of four very different topics in the mediacloud system. The samples include blogs and news articles, as well as many URLs that have no date (in which case a guess is marked correct only if it is None).
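The scoring rule can be sketched as follows, on the assumption that the "1 days", "7 days" and "15 days" rows in the tables are tolerance windows around the true date. That reading, and the helper name is_correct, are assumptions; the benchmark script itself is not shown here.

```python
from datetime import datetime, timedelta


def is_correct(guess, truth, tolerance_days):
    """Score one guess against the known date.

    Pages with no date (truth is None) count as correct only when the
    guess is also None. Both datetimes must be naive, or both aware.
    """
    if guess is None or truth is None:
        return guess is None and truth is None
    return abs(guess - truth) <= timedelta(days=tolerance_days)


truth = datetime(2017, 10, 13)
print(is_correct(datetime(2017, 10, 15), truth, 1))  # False
print(is_correct(datetime(2017, 10, 15), truth, 7))  # True
print(is_correct(None, None, 1))                     # True
print(is_correct(datetime(2017, 10, 15), None, 15))  # False
```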

Vaccines

	date_guesser	newspaper
1 day	57	48
7 days	61	51
15 days	66	53

Aadhar Card in India

	date_guesser	newspaper
1 day	73	44
7 days	74	44
15 days	74	44

Donald Trump in 2017

	date_guesser	newspaper
1 day	79	60
7 days	83	61
15 days	85	61

Recipes for desserts and chocolate

	date_guesser	newspaper
1 day	83	65
7 days	85	69
15 days	87	69

Release history

- 2.1.4 (Aug 13, 2019)
- 2.1.3 (Aug 2, 2019)
- 2.1.2 (Aug 2, 2019)
- 2.1.1 (Jan 27, 2018)
- 2.1.0 (Jan 27, 2018)
- 2.0.0 (Jan 25, 2018)
- 1.1.0 (Jan 16, 2018)
- 1.0.0 (Jan 16, 2018)
- 0.0.1 (Jan 16, 2018)

Download files

Source Distribution

date_guesser-2.1.4.tar.gz (11.7 kB)

Uploaded Aug 13, 2019 Source

Built Distribution

date_guesser-2.1.4-py3-none-any.whl (10.3 kB)

Uploaded Aug 13, 2019 Python 3

File details

Details for the file date_guesser-2.1.4.tar.gz.

File metadata

Download URL: date_guesser-2.1.4.tar.gz
Upload date: Aug 13, 2019
Size: 11.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.3

File hashes

Hashes for date_guesser-2.1.4.tar.gz
Algorithm	Hash digest
SHA256	`4ad354f447a2c4f4bd65d1882baf9c0aad0bf84b5ec3324bf936c736d095bb93`
MD5	`a73b347409f6fb00fe338dbe8d77326a`
BLAKE2b-256	`ba3b1dc91e03e58697e0167145f7f738047105e6901d65072994eff0d8e1980a`

File details

Details for the file date_guesser-2.1.4-py3-none-any.whl.

File metadata

Download URL: date_guesser-2.1.4-py3-none-any.whl
Upload date: Aug 13, 2019
Size: 10.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.33.0 CPython/3.7.3

File hashes

Hashes for date_guesser-2.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18ae2bd52ba4201c093f26822d702c92b610212f5aa2aeb4bc381b96193599cf`
MD5	`dcc2caa8244a6b0cf621be46e51d9746`
BLAKE2b-256	`7340e7936042280e0c648acb84ced42b500f28865f1ffc81a842753a5ffd067b`