
Fast, consensus-based date format inference

Reason this release was yanked:

Superseded by 0.1.4


fastdateinfer

Fast, consensus-based date format inference written in Rust with Python bindings.

License: MIT · Python 3.10+

Why?

The problem: Is 01/02/2025 January 2nd or February 1st?

| Library | Approach | Problem |
| --- | --- | --- |
| pandas | dayfirst=True hint | You must know the format |
| dateutil | Guess per-element | Inconsistent results |
| hidateinfer | Consensus voting | Correct, but slow |

The solution: If your data contains 15/03/2025, we know it's DD/MM/YYYY (15 can't be a month). Because a column normally uses a single format, that evidence applies to every date in it, which resolves ambiguous ones like 01/02/2025.
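The ambiguity is easy to reproduce with the standard library alone. The sketch below does not use fastdateinfer; `formats_that_parse` is a hypothetical helper introduced here for illustration. It shows that 01/02/2025 is accepted under both day-first and month-first formats, while 15/03/2025 is accepted under only one.

```python
from datetime import datetime

# The two competing interpretations of a slash-separated numeric date.
CANDIDATES = ["%d/%m/%Y", "%m/%d/%Y"]

def formats_that_parse(date_string, candidates=CANDIDATES):
    """Return the candidate formats that parse date_string without error."""
    matches = []
    for fmt in candidates:
        try:
            datetime.strptime(date_string, fmt)
        except ValueError:
            continue  # this format cannot explain the string
        matches.append(fmt)
    return matches

print(formats_that_parse("01/02/2025"))  # ['%d/%m/%Y', '%m/%d/%Y']
print(formats_that_parse("15/03/2025"))  # ['%d/%m/%Y']
```

A single unambiguous date like the second one is enough to rule out a format for the whole column.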

fastdateinfer implements this consensus algorithm in Rust and runs about 270x faster than hidateinfer on the 29,351-date benchmark below.

Installation

pip install fastdateinfer

Quick Start

import fastdateinfer

# Infer format from dates
result = fastdateinfer.infer(["15/03/2025", "01/02/2025", "28/12/2025"])
print(result.format)      # %d/%m/%Y
print(result.confidence)  # 1.0

# Just get the format string
fmt = fastdateinfer.infer_format(["2025-01-15", "2025-03-20"])
print(fmt)  # %Y-%m-%d

# Use with pandas
import pandas as pd
dates = ["15/03/2025", "01/02/2025", "28/12/2025"]
fmt = fastdateinfer.infer_format(dates)
parsed = pd.to_datetime(dates, format=fmt)  # returns a DatetimeIndex

Benchmarks

vs hidateinfer (Python)

Tested on 29,351 real-world dates across multiple formats:

| Library | Time | Speedup |
| --- | --- | --- |
| fastdateinfer | 22.5 ms | |
| hidateinfer | 6,075 ms | 270x slower |

vs pandas / polars

Comparison on synthetic data (DD/MM/YYYY format):

| Dates | fastdateinfer | pandas (explicit) | pandas (mixed) | Ratio |
| --- | --- | --- | --- | --- |
| 100 | 0.05 ms | 0.24 ms | 0.25 ms | 5x faster |
| 1,000 | 0.48 ms | 0.97 ms | 1.02 ms | 2x faster |
| 10,000 | 0.74 ms | 2.14 ms | 2.20 ms | 3x faster |
| 100,000 | 3.39 ms | 17.00 ms | 17.50 ms | 5x faster |

Note: the two sides measure different work. fastdateinfer only infers the format, while pandas parses every date against a known format. fastdateinfer's time stays low because it samples the input (consensus converges with ~1000 dates).
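To rerun a comparison like the one above on your own data, a small timing harness is enough. This is a minimal sketch using the standard library's `timeit`; `best_time_ms` is a hypothetical helper, the synthetic data only approximates the benchmark input, and `sorted` is a stand-in workload so the snippet runs without extra packages.

```python
import timeit

def best_time_ms(fn, *args, repeat=5, number=10):
    """Return the best per-call wall time in milliseconds over `repeat` runs."""
    timer = timeit.Timer(lambda: fn(*args))
    runs = timer.repeat(repeat=repeat, number=number)
    # Each entry in runs is the total time for `number` calls.
    return min(runs) / number * 1000.0

# Synthetic DD/MM/YYYY strings (an assumption, not the original benchmark data).
dates = [f"{(i % 28) + 1:02d}/{(i % 12) + 1:02d}/2025" for i in range(10_000)]

# Stand-in workload; swap in fastdateinfer.infer_format or pd.to_datetime.
elapsed = best_time_ms(sorted, dates)
print(f"{elapsed:.3f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.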

Scaling

| Dates | Time | Per-date |
| --- | --- | --- |
| 1,000 | 0.48 ms | 0.48 µs |
| 10,000 | 0.74 ms | 0.07 µs |
| 100,000 | 3.39 ms | 0.03 µs |
| 1,000,000 | ~35 ms | 0.03 µs |

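Per-date cost drops as input grows because only ~1000 dates are fully analyzed regardless of input size. Total time is not constant, though: between 100,000 and 1,000,000 dates it grows roughly in proportion to the input (about 0.03 µs per date).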
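The README does not describe how the sample is drawn, so the following is only a sketch of the general idea, assuming a uniform random sample. `sample_dates` is a hypothetical helper, not part of fastdateinfer.

```python
import random

def sample_dates(dates, k=1000, seed=0):
    """Return at most k dates chosen uniformly at random, kept in input order."""
    if len(dates) <= k:
        return list(dates)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    indices = sorted(rng.sample(range(len(dates)), k))
    return [dates[i] for i in indices]

# Synthetic input: 100,000 DD/MM/YYYY strings.
big = [f"{(i % 28) + 1:02d}/{(i % 12) + 1:02d}/2025" for i in range(100_000)]
subset = sample_dates(big)
print(len(subset))  # 1000
```

Whatever inference runs afterwards sees a bounded number of strings, so its cost no longer depends on the size of the full column.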

Supported Formats

| Format | Example | Output |
| --- | --- | --- |
| European | 15/03/2025 | %d/%m/%Y |
| American | 03/15/2025 | %m/%d/%Y |
| ISO 8601 | 2025-03-15 | %Y-%m-%d |
| ISO datetime | 2025-03-15T10:30:00 | %Y-%m-%dT%H:%M:%S |
| Month name | 15 Mar 2025 | %d %b %Y |
| Month name (full) | 15 March 2025 | %d %B %Y |
| Month first | Mar 15, 2025 | %b %d, %Y |
| 2-digit year | 15/03/25 | %d/%m/%y |
| With time | 15/03/25 10.30.00 | %d/%m/%y %H.%M.%S |
| Month-year only | March, 2025 | %B, %Y |
| Day-month only | 15/Mar | %d/%b |
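The output strings are ordinary strptime formats, so each row can be checked with the standard library. The sketch below feeds the examples from the table to `datetime.strptime`. It assumes an English (or C) locale for the month names, and it leaves out the day-month-only row because that format carries no year and strptime falls back to 1900.

```python
from datetime import datetime

# (example, format) pairs copied from the table above.
EXAMPLES = [
    ("15/03/2025", "%d/%m/%Y"),
    ("03/15/2025", "%m/%d/%Y"),
    ("2025-03-15", "%Y-%m-%d"),
    ("2025-03-15T10:30:00", "%Y-%m-%dT%H:%M:%S"),
    ("15 Mar 2025", "%d %b %Y"),
    ("15 March 2025", "%d %B %Y"),
    ("Mar 15, 2025", "%b %d, %Y"),
    ("15/03/25", "%d/%m/%y"),
    ("15/03/25 10.30.00", "%d/%m/%y %H.%M.%S"),
    ("March, 2025", "%B, %Y"),
]

# strptime raises ValueError if a string does not match its format.
parsed = {text: datetime.strptime(text, fmt) for text, fmt in EXAMPLES}

for text, value in parsed.items():
    print(f"{text:22} -> {value.isoformat()}")
```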

API Reference

infer(dates, prefer_dayfirst=True, min_confidence=0.0, strict=False)

Infer date format from a list of date strings.

Arguments:

  • dates: List of date strings
  • prefer_dayfirst: Use DD/MM for fully ambiguous dates (default: True)
  • min_confidence: Minimum confidence threshold (default: 0.0)
  • strict: Raise error if any date doesn't match (default: False)

Returns: InferResult with:

  • format: strptime format string
  • confidence: float between 0.0 and 1.0
  • token_types: list of resolved token types

result = fastdateinfer.infer(["01/02/2025", "03/04/2025"], prefer_dayfirst=False)
print(result.format)  # %m/%d/%Y (American format)

infer_format(dates, prefer_dayfirst=True)

Convenience function that returns only the format string.

fmt = fastdateinfer.infer_format(["2025-01-15", "2025-03-20"])
print(fmt)  # %Y-%m-%d

infer_batch(columns, prefer_dayfirst=True)

Infer formats for multiple columns at once.

results = fastdateinfer.infer_batch({
    "transaction_date": ["15/03/2025", "01/02/2025"],
    "created_at": ["2025-01-15T10:30:00", "2025-01-16T14:45:00"],
    "value_date": ["15-Mar-2025", "01-Feb-2025"]
})

for col, result in results.items():
    print(f"{col}: {result.format}")
# transaction_date: %d/%m/%Y
# created_at: %Y-%m-%dT%H:%M:%S
# value_date: %d-%b-%Y

How It Works

  1. Tokenize: Split "15/03/2025" into [15, /, 03, /, 2025]
  2. Constrain: 15 can only be Day (>12), 03 could be Day or Month, 2025 is Year
  3. Vote: Across all dates, count evidence for each position
  4. Resolve: Position 1 has strong Day evidence → Position 2 must be Month
  5. Format: Output %d/%m/%Y
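The five steps can be sketched in pure Python. This is an illustration of the idea, not the library's Rust implementation: all function names are hypothetical, and the sketch assumes a non-empty list of numeric dates that share one layout with exactly three numeric fields and a 4-digit year.

```python
import re
from collections import Counter

def tokenize(date_string):
    """Step 1: split '15/03/2025' into ['15', '/', '03', '/', '2025']."""
    return re.findall(r"\d+|\D+", date_string)

def candidates(token):
    """Step 2: return the roles a numeric token could play."""
    if len(token) == 4:
        return {"Y"}
    value = int(token)
    roles = set()
    if 1 <= value <= 12:
        roles.add("m")
    if 1 <= value <= 31:
        roles.add("d")
    return roles

def infer_numeric_format(dates, prefer_dayfirst=True):
    """Steps 3-5: vote per position, resolve day vs month, emit a format."""
    votes = {}  # token position -> Counter of unambiguous role evidence
    for date_string in dates:
        for pos, token in enumerate(tokenize(date_string)):
            if token.isdigit():
                roles = candidates(token)
                if len(roles) == 1:  # only unambiguous tokens cast a vote
                    votes.setdefault(pos, Counter()).update(roles)

    template = tokenize(dates[0])
    numeric_positions = [i for i, t in enumerate(template) if t.isdigit()]

    # Positions with the strongest evidence claim their role first.
    assigned = {}
    ranked = sorted(votes, key=lambda pos: -max(votes[pos].values()))
    for pos in ranked:
        for role, _count in votes[pos].most_common():
            if role not in assigned.values():
                assigned[pos] = role
                break

    # Leftover positions take the leftover roles; prefer_dayfirst only
    # matters when both day and month are still undecided.
    order = ("d", "m", "Y") if prefer_dayfirst else ("m", "d", "Y")
    remaining = [r for r in order if r not in assigned.values()]
    for pos in numeric_positions:
        if pos not in assigned:
            assigned[pos] = remaining.pop(0)

    return "".join(
        "%" + assigned[i] if i in assigned else tok
        for i, tok in enumerate(template)
    )

print(infer_numeric_format(["15/03/2025", "01/02/2025", "28/12/2025"]))  # %d/%m/%Y
print(infer_numeric_format(["01/02/2025", "03/04/2025"], prefer_dayfirst=False))  # %m/%d/%Y
```

In the first call, 15 and 28 vote Day for the first field, so the second field can only be Month. In the second call no token is unambiguous, so the `prefer_dayfirst` flag decides.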

The key insight: consensus converges quickly. Even with 1 million dates, we only need to analyze ~1000 to determine the format with high confidence.

Use Cases

CSV/Data Processing

import pandas as pd
import fastdateinfer

# Read raw data
df = pd.read_csv("data.csv")

# Detect format automatically
fmt = fastdateinfer.infer_format(df["date"].dropna().tolist())

# Parse with detected format
df["date"] = pd.to_datetime(df["date"], format=fmt)

Multi-format Data Pipeline

# Different columns may have different formats
results = fastdateinfer.infer_batch({
    col: df[col].dropna().astype(str).tolist()
    for col in ["date", "value_date", "created_at"]
})

for col, result in results.items():
    df[col] = pd.to_datetime(df[col], format=result.format)

Validation

# Ensure high confidence
result = fastdateinfer.infer(dates, min_confidence=0.9)
if result.confidence < 0.9:
    raise ValueError(f"Low confidence: {result.confidence}")
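An independent check is to count how many strings actually parse under the inferred format. The sketch below uses only `datetime.strptime`; `match_rate` is a hypothetical helper and is not how fastdateinfer computes its confidence value.

```python
from datetime import datetime

def match_rate(dates, fmt):
    """Return the fraction of dates that datetime.strptime accepts under fmt."""
    if not dates:
        return 0.0
    ok = 0
    for text in dates:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue  # does not match the format
        ok += 1
    return ok / len(dates)

# Two day-first dates and one ISO date that will not match.
sample = ["15/03/2025", "01/02/2025", "2025-12-28"]
print(match_rate(sample, "%d/%m/%Y"))  # 0.6666666666666666
```

A rate well below 1.0 suggests mixed formats or dirty values in the column, which is worth knowing before calling pd.to_datetime with a fixed format.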

Comparison

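| Feature | fastdateinfer | hidateinfer | pandas | dateutil |
| --- | --- | --- | --- | --- |
| Consensus-based | | | | |
| Speed (10k dates) | 0.74 ms | 200 ms | 2 ms* | N/A |
| Returns strptime format | | | | |
| Batch inference | | | | |
| Type hints | | | | |
| Pure Rust core | | | | |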

*pandas time is for parsing only (you must already know the format)

Building from Source

# Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
pip install maturin

# Clone and build
git clone https://github.com/coledrain/fastdateinfer
cd fastdateinfer
maturin develop --release

# Run tests
cargo test

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! Please open an issue or PR on GitHub.

Acknowledgments

  • Inspired by hidateinfer
  • Built with PyO3 for Python bindings
  • Built for high-volume data processing pipelines
