Freshness Detector 🧪

Detect stale AI training data before it causes hallucinations.

The Problem

91% of ML models degrade over time (MIT/Harvard Study, 2023).

One major cause: stale training data. Information that was accurate when captured becomes outdated, but models trained on it don't know this.

Examples:

Medical AI trained on 2022 COVID guidelines (outdated)
Code completion trained on Python 3.8 examples (old syntax)
News summarization trained on 2023 events (stale context)
Financial models trained on pre-2024 market data (irrelevant)

Current solutions:

❌ Retrain on fixed schedules (wasteful or too slow)
❌ Wait for performance degradation (reactive)
❌ Manual data audits (don't scale)

The Solution

Freshness Detector uses temporal decay modeling to calculate how "fresh" your training data is right now.

Key features:

🧮 Mathematical decay model - Exponential confidence degradation over time
📊 Multiple decay policies - Different rates for news, science, code, medical data, etc.
🔍 Dataset analysis - Scan entire datasets for stale entries
🐍 Python API + CLI - Use in notebooks or CI/CD pipelines
⚡ Lightweight - No dependencies beyond Python stdlib + dateutil

Installation

pip install freshness-detector

Quick Start

CLI Usage

Calculate freshness of a single data point:

freshness calculate --confidence 0.9 --timestamp "2024-01-01" --topic ai_training

Output:

🧪 Freshness Analysis
==================================================
Initial confidence: 90.0%
Capture timestamp:  2024-01-01
Age:                365.0 days
Topic type:         ai_training
Decay policy:       AI training data
Decay rate (λ):     0.0200 per day
Floor:              15.0%
==================================================
Current confidence: 15.0%
⚠️  WARNING: Data is STALE (< 30% confidence)
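
The numbers in this output can be checked by hand from the decay formula given under Decay Policies. The sketch below is a standalone re-implementation for illustration only: `decayed_confidence` is a hypothetical helper, not part of the package API, and it assumes the `ai_training` values shown above (λ = 0.02 per day, floor = 0.15).

```python
import math

def decayed_confidence(initial, lam_per_day, floor, age_days):
    """Exponential decay clamped at a floor: max(floor, C0 * exp(-lam * t))."""
    return max(floor, initial * math.exp(-lam_per_day * age_days))

# Values taken from the CLI output above (ai_training policy, 365 days old).
raw = 0.9 * math.exp(-0.02 * 365.0)            # unclamped value, far below the floor
current = decayed_confidence(0.9, 0.02, 0.15, 365.0)

print(f"Unclamped: {raw:.4%}")
print(f"Current confidence: {current:.1%}")    # Current confidence: 15.0%
```

After a year the unclamped exponential is well under 0.1%, so the reported 15.0% is the policy floor rather than a point on the decay curve.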

Check an entire dataset:

freshness check training_data.json --threshold 0.4 --verbose

List all decay policies:

freshness policies

Python API

Basic usage:

from freshness_detector import calculate_freshness

# Check if data is still fresh
confidence = calculate_freshness(
    initial_confidence=0.95,
    capture_timestamp="2024-06-01",
    topic_type="ai_training"
)

print(f"Current confidence: {confidence:.1%}")
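# Example output (depends on the date you run it; this value
# corresponds to roughly 37 days after capture):
# Current confidence: 45.2%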

if confidence < 0.5:
    print("⚠️  Time to retrain!")

Analyze a dataset:

from freshness_detector import check_dataset

results = check_dataset(
    "training_data.json",
    topic_type="ai_training",
    threshold=0.3
)

print(results["summary"])
print(f"Stale entries: {results['stale_entries']}")
print(f"Average confidence: {results['average_confidence']:.1%}")

Batch processing (in-memory):

from freshness_detector import batch_check

data = [
    {"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.9},
    {"text": "Example 2", "timestamp": "2023-01-01", "confidence": 0.85},
]

results = batch_check(data, threshold=0.5)
print(f"Stale entries: {results['stale_entries']}")
print(f"Stale indices: {results['stale_indices']}")
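
To make the `stale_entries` and `stale_indices` keys concrete, here is a standalone sketch of the kind of computation a batch check performs. It is not the package's implementation. Assumptions: one policy applies to every entry (the `ai_training` values, λ = 0.02 per day and floor = 0.15), and the reference date `now` is passed explicitly so the result is reproducible. `sketch_batch_check` is a hypothetical name.

```python
import math
from datetime import date

def sketch_batch_check(entries, threshold, now, lam_per_day=0.02, floor=0.15):
    """Flag entries whose decayed confidence is below `threshold` as of `now`."""
    stale_indices = []
    for i, entry in enumerate(entries):
        captured = date.fromisoformat(entry["timestamp"])
        age_days = max(0, (now - captured).days)    # ignore future timestamps
        initial = entry.get("confidence", 1.0)      # documented default
        current = max(floor, initial * math.exp(-lam_per_day * age_days))
        if current < threshold:
            stale_indices.append(i)
    return {"stale_entries": len(stale_indices), "stale_indices": stale_indices}

data = [
    {"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.9},
    {"text": "Example 2", "timestamp": "2023-01-01", "confidence": 0.85},
]

results = sketch_batch_check(data, threshold=0.5, now=date(2025, 1, 15))
print(results)  # {'stale_entries': 1, 'stale_indices': [1]}
```

With a reference date of 2025-01-15, the first entry is 14 days old and still at about 68% confidence, while the second is over two years old and sits at the 15% floor.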

Custom decay parameters:

from freshness_detector import calculate_freshness

# Use custom decay rate and floor
confidence = calculate_freshness(
    initial_confidence=0.9,
    capture_timestamp="2024-01-01",
    topic_type="ai_training",
    custom_lambda=0.03,  # Faster decay
    custom_floor=0.1     # Lower minimum
)

Decay Policies

Different types of information decay at different rates:

| Topic Type | Decay Rate (λ, per day) | Floor | Half-life | Description |
|---|---|---|---|---|
| `news` | 0.10 | 5% | ~7 days | News and current events |
| `social_media` | 0.15 | 2% | ~5 days | Social media trends |
| `financial` | 0.08 | 10% | ~9 days | Market data |
| `ai_training` | 0.02 | 15% | ~35 days | AI/ML best practices |
| `medical` | 0.015 | 25% | ~46 days | Medical guidelines |
| `code` | 0.005 | 20% | ~139 days | Code examples/APIs |
| `science` | 0.002 | 30% | ~347 days | Scientific facts |
| `legal` | 0.001 | 40% | ~693 days | Legal precedents |
| `history` | 0.0 | 100% | ∞ | Historical facts |
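
The half-life column follows from the decay rate as ln(2) / λ. The short sketch below recomputes it from the rates in the table; `RATES` and `half_life_days` are illustrative names, not part of the package.

```python
import math

# Decay rates (per day) copied from the table above.
RATES = {
    "news": 0.10, "social_media": 0.15, "financial": 0.08,
    "ai_training": 0.02, "medical": 0.015, "code": 0.005,
    "science": 0.002, "legal": 0.001, "history": 0.0,
}

def half_life_days(lam_per_day):
    """Days for the unclamped exponential to halve; infinite when there is no decay."""
    return math.inf if lam_per_day == 0 else math.log(2) / lam_per_day

for topic, lam in RATES.items():
    print(f"{topic:<13} {half_life_days(lam):7.1f} days")
```

The half-life describes the unclamped curve. When half of the initial confidence lies below the policy floor (for example `legal` with an initial confidence of 0.7 and a 40% floor), the clamped value stops at the floor before it ever halves.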

Formula: C(t) = max(floor, C₀ × e^(-λ × t))

Where:

C(t) = Current confidence
C₀ = Initial confidence
λ = Decay rate (lambda_per_day)
t = Time in days
floor = Minimum confidence threshold
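
Solving the formula for t gives the age at which confidence reaches a chosen level, which helps when reasoning about thresholds. The derivation below is a worked sketch: the symbol c (a target level with floor ≤ c ≤ C₀) is introduced here for illustration, and λ > 0 is assumed.

```latex
C_0 \, e^{-\lambda t} = c
\quad\Longrightarrow\quad
t = \frac{1}{\lambda} \ln\frac{C_0}{c}

% Example: ai_training (\lambda = 0.02 per day, floor = 0.15), C_0 = 0.9, c = floor
t_{\text{floor}} = \frac{\ln(0.9 / 0.15)}{0.02} = \frac{\ln 6}{0.02} \approx 89.6 \text{ days}
```

Under these assumptions, `ai_training` data captured at 90% confidence reaches its floor after about 90 days and stays there.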

Use Cases

1. ML Pipeline Integration

from freshness_detector import batch_check

# Before training (training_data is a list of dicts; see Dataset Format)
results = batch_check(training_data, threshold=0.5)

if results['stale_entries'] > len(training_data) * 0.1:
    print("⚠️  More than 10% of data is stale!")
    # Trigger data refresh pipeline

2. CI/CD Data Quality Checks

# In your CI pipeline
freshness check data/training_set.json --threshold 0.4
# Exit code 1 if stale entries found

3. Model Retraining Scheduler

from freshness_detector import calculate_freshness

last_training_date = "2024-06-01"
# Treat the model as fully fresh (confidence 1.0) on its last training date
current_conf = calculate_freshness(1.0, last_training_date, "ai_training")

if current_conf < 0.6:
    trigger_retraining()  # placeholder for your own retraining hook
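
Instead of polling, the decay formula can be inverted to estimate in advance when confidence will cross the retraining threshold. This is a minimal standalone sketch, assuming the `ai_training` rate from the Decay Policies table (λ = 0.02 per day) and a threshold above the policy floor. `next_retraining_date` is a hypothetical helper, not part of the package.

```python
import math
from datetime import date, timedelta

def next_retraining_date(last_training, threshold, initial=1.0, lam_per_day=0.02):
    """First whole-day date on which initial * exp(-lam * t) is at or below `threshold`."""
    if lam_per_day <= 0:
        raise ValueError("lam_per_day must be positive")
    if not 0 < threshold <= initial:
        raise ValueError("threshold must be in (0, initial]")
    days = math.log(initial / threshold) / lam_per_day
    return last_training + timedelta(days=math.ceil(days))

due = next_retraining_date(date(2024, 6, 1), threshold=0.6)
print(due)  # 2024-06-27
```

Here ln(1 / 0.6) / 0.02 is about 25.5 days, so a model trained on 2024-06-01 would cross the 0.6 threshold on its 26th day.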

4. Dataset Documentation

from freshness_detector import check_dataset

# Generate freshness report for dataset README
results = check_dataset("dataset.json")
print(results["summary"])
# Add to dataset card / model card

Dataset Format

JSON format:

[
  {
    "text": "Training example 1",
    "timestamp": "2025-01-01",
    "confidence": 0.95
  },
  {
    "text": "Training example 2",
    "timestamp": "2024-06-01",
    "confidence": 0.90
  }
]

JSONL format:

{"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.95}
{"text": "Example 2", "timestamp": "2024-06-01", "confidence": 0.90}

Supported timestamp fields:

timestamp
created_at
date
captured_at
updated_at

Confidence field (optional):

confidence (defaults to 1.0 if not present)
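
The format rules above can be sketched as a small loader. This is an illustrative standalone implementation, not the package's own parser. Assumptions: the timestamp fields are tried in the order listed (the source does not state a precedence), and a document is treated as a JSON array when it starts with `[`. `load_records` and `normalize` are hypothetical names.

```python
import json

TIMESTAMP_FIELDS = ("timestamp", "created_at", "date", "captured_at", "updated_at")

def normalize(record):
    """Return (timestamp_string, confidence) for one dataset record."""
    for field in TIMESTAMP_FIELDS:              # assumed precedence: listed order
        if field in record:
            return record[field], record.get("confidence", 1.0)
    raise KeyError(f"no timestamp field found; expected one of {TIMESTAMP_FIELDS}")

def load_records(text):
    """Parse a JSON array or a JSONL document into a list of dicts."""
    stripped = text.strip()
    if stripped.startswith("["):                # JSON array
        return json.loads(stripped)
    # JSONL: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

jsonl = (
    '{"text": "Example 1", "created_at": "2025-01-01"}\n'
    '{"text": "Example 2", "timestamp": "2024-06-01", "confidence": 0.90}'
)
for rec in load_records(jsonl):
    print(normalize(rec))
```

The first record has no `confidence` field, so it falls back to 1.0. Its timestamp is picked up from `created_at`.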

Research Background

This tool is based on research from Infrastructure Observatory on temporal integrity in AI systems.

Key insight: Information has a "half-life" - the time it takes for confidence to fall to half of its initial value (ln 2 / λ under exponential decay). By modeling this decay mathematically, we can estimate when training data becomes unreliable.

Academic foundation:

Exponential decay models (physics, chemistry)
Information theory (Shannon entropy)
Temporal data quality (data engineering)

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Submit a pull request

Development setup:

git clone https://github.com/onlyecho822-source/freshness-detector.git
cd freshness-detector
pip install -e ".[dev]"
pytest

License

MIT License - see LICENSE file for details.

Citation

If you use Freshness Detector in academic work, please cite:

@software{freshness_detector_2025,
  author = {Infrastructure Observatory},
  title = {Freshness Detector: Temporal Decay Modeling for AI Training Data},
  year = {2025},
  url = {https://github.com/onlyecho822-source/freshness-detector}
}

Support

🐛 Bug reports: GitHub Issues
💬 Discussions: GitHub Discussions
📧 Email: research@infrastructure-observatory.org

Roadmap

v0.2.0 (Q1 2026):

Pandas DataFrame support
Visualization tools (decay curves)
Custom policy builder
Integration with popular ML frameworks (HuggingFace, PyTorch)

v0.3.0 (Q2 2026):

Automated retraining recommendations
Cost-benefit analysis (retrain vs. accept degradation)
Multi-source data freshness aggregation

Future:

Real-time monitoring dashboard
Cloud service integration (S3, GCS, Azure)
LLM-specific decay models

Acknowledgments

Built with insights from:

Temporal integrity research in AI systems
Data quality engineering best practices
ML operations (MLOps) community feedback

Special thanks to:

MIT/Harvard research team for model degradation study
Data observability community for freshness metrics
Early adopters and contributors

Research by Infrastructure Observatory

Keeping AI models honest, one timestamp at a time.

This version: 0.1.0 (Apr 4, 2026)
