
Machine learning approach to identifying date formats

DateDetective

Introduction

DateDetective is a Python package that takes a machine learning approach to identifying the format of date strings. It is useful for applications such as web scraping, where dates appear in many different formats and 100% accuracy is not required.

Compatible date formats

DateDetective's model is trained to predict which combination of Python datetime module format codes makes up a given string representation of a date. Currently, the model can identify the following format codes:

| Format Code | Description | Examples |
|---|---|---|
| %d | Day of the month as a zero-padded decimal number | 01, 02, ..., 30, 31 |
| %B | Month as full text name | January, February, March, ..., December |
| %b | Month as abbreviated text name | Jan, Feb, Mar, ..., Dec |
| %m | Month as a zero-padded decimal number | 01, 02, 03, ..., 12 |
| %Y | Year with century as a decimal number | 1832, 1996, 2002, 2024 |
| %H | Hour as a zero-padded decimal number (24-hour clock) | 00, 01, 02, ..., 22, 23 |
| %I | Hour as a zero-padded decimal number (12-hour clock) | 01, 02, 03, ..., 10, 11, 12 |
| %M | Minute as a zero-padded decimal number | 00, 01, 02, ..., 58, 59 |
| %S | Second as a zero-padded decimal number | 00, 01, 02, ..., 58, 59 |
| %f | Microsecond as a decimal number, zero-padded to six digits | 000000, 000001, ..., 999999 |
| %p | AM or PM | AM, PM |
| %Z | Time zone name as text | UTC, GMT, EAT, EDT |
| %z | Time zone as UTC offset decimal number | +0000, -1200, +1000 |
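
To illustrate how these codes combine into a single format string, here is a minimal sketch that uses only the standard library datetime module (no DateDetective call is involved). The sample string and format are made up for illustration, and the month name and AM/PM marker assume the default English/C locale.

```python
from datetime import datetime, timezone

# Each code from the table above matches one field of the date string.
fmt = "%d %B %Y %I:%M:%S %p %z"
text = "30 December 2023 12:52:23 PM +0000"

# strptime reads the string according to the format codes.
parsed = datetime.strptime(text, fmt)
print(parsed)

# strftime writes the datetime back out with the same codes.
print(parsed.strftime(fmt))
```

A format string of this kind is what DateDetective's model tries to predict for a given date string.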

Installation

First, make sure that the PyTorch build best suited to your system is installed in the Python environment you are using. If possible, use a CUDA build, as this should increase date detection speed. PyTorch's website currently has an install command finder at: https://pytorch.org/get-started/locally/
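
If you are unsure whether your PyTorch install can see a GPU, a quick check along these lines may help. This is only a sketch: `cuda_available` is a hypothetical helper name, not part of DateDetective, and it returns False when PyTorch is not installed at all.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    # Hypothetical helper: avoids an ImportError when torch is missing.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print(cuda_available())
```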

Once PyTorch is installed, you can install DateDetective via pip.

pip install DateDetective

Usage

Import and Initialise

from datefinder import DateFinder
df = DateFinder()

By default DateDetective will use CUDA cores on your GPU (if available) for some of the calculations. If you do not want to use CUDA then initialise DateFinder like this:

df = DateFinder(useCuda=False)

Generate datetime module format string from date string

>>> df.get_format("30/12/2023 12:52:23")
'%d/%m/%Y %H:%M:%S'
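
One possible use of the returned format string, sketched here as an assumption about a typical workflow: detect the format once for a source, then reuse it with the standard library to parse further strings from that source. `detected_format` is hard-coded to the output shown above so the sketch runs without the model, and the second sample string is made up.

```python
from datetime import datetime

# In practice this value would come from df.get_format(); it is
# hard-coded here so the sketch runs without DateDetective installed.
detected_format = "%d/%m/%Y %H:%M:%S"

# Illustrative strings assumed to share the same format.
scraped = ["30/12/2023 12:52:23", "01/01/2024 00:00:00"]

# Parse each string with the detected format and normalise to ISO 8601.
iso = [datetime.strptime(s, detected_format).isoformat() for s in scraped]
print(iso)  # ['2023-12-30T12:52:23', '2024-01-01T00:00:00']
```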

Create a datetime object from a date string

>>> df.get_datetime("30/12/2023 12:52:23")
datetime.datetime(2023, 12, 30, 12, 52, 23)

Remember that although DateFinder is often accurate, its output is a model prediction and it will sometimes get a date wrong.
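
Because a prediction can be wrong, it may be worth guarding the parse step. The sketch below is one way to do that with the standard library; `parse_checked` is a hypothetical helper, not part of DateDetective, and it takes a format string such as the one returned by `get_format`. It only catches formats that fail to parse. A day/month swap on an ambiguous date like 01/02/2023 would still pass.

```python
from datetime import datetime
from typing import Optional


def parse_checked(date_string: str, predicted_format: str) -> Optional[datetime]:
    """Parse with a predicted format, returning None if the format does not fit."""
    try:
        return datetime.strptime(date_string, predicted_format)
    except ValueError:
        # The predicted format does not match the string at all.
        return None


# A fitting format parses normally.
print(parse_checked("30/12/2023 12:52:23", "%d/%m/%Y %H:%M:%S"))

# A wrong prediction (month and day swapped) is rejected because 30 is not a valid month.
print(parse_checked("30/12/2023 12:52:23", "%m/%d/%Y %H:%M:%S"))
```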

How it was trained

For more information, take a look at my repo dedicated to training the DateDetective models: DateDetective Training GitHub Repo

License

Apache License 2.0. See LICENSE file.

Contact

Rob Salethorn - rob@salethorn.com
Repo link - https://github.com/RSalethorn/DateDetective

Download files

Source Distribution

datedetective-1.0.tar.gz (2.3 MB view details)

Uploaded Source

Built Distribution

datedetective-1.0-py3-none-any.whl (2.3 MB view details)

Uploaded Python 3

File details

Details for the file datedetective-1.0.tar.gz.

File metadata

  • Download URL: datedetective-1.0.tar.gz
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for datedetective-1.0.tar.gz
Algorithm Hash digest
SHA256 dca10f34334ee8e6266b7bbe8be5308bde0288ed89103cb7eaadf3fc65756686
MD5 7eb507381b5fc17f4030019ab872818a
BLAKE2b-256 4d9590e9db0836a1eb2166028b519a1e4398d017f6674298be382aec613e1629

File details

Details for the file datedetective-1.0-py3-none-any.whl.

File metadata

  • Download URL: datedetective-1.0-py3-none-any.whl
  • Size: 2.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for datedetective-1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 95460e806a50126e4ea985ce1a718b8054dfbbf32c8cfdee9b6b34fc93ed61ea
MD5 411ac4ec9276dacc7c9e28d40f928940
BLAKE2b-256 cd7ce8ae9c3810afbe4cedf4188eae3e7d1e9a0434c6678275cba400417d447f
