Freshness Detector 🧪
Detect stale AI training data before it causes hallucinations.
The Problem
91% of ML models degrade over time (MIT/Harvard Study, 2023).
One major cause: stale training data. Information that was accurate when captured becomes outdated, but models trained on it don't know this.
Examples:
- Medical AI trained on 2022 COVID guidelines (outdated)
- Code completion trained on Python 3.8 examples (old syntax)
- News summarization trained on 2023 events (stale context)
- Financial models trained on pre-2024 market data (irrelevant)
Current solutions:
- ❌ Retrain on fixed schedules (wasteful or too slow)
- ❌ Wait for performance degradation (reactive)
- ❌ Manual data audits (don't scale)
The Solution
Freshness Detector uses temporal decay modeling to calculate how "fresh" your training data is right now.
Key features:
- 🧮 Mathematical decay model - Exponential confidence degradation over time
- 📊 Multiple decay policies - Different rates for news, science, code, medical data, etc.
- 🔍 Dataset analysis - Scan entire datasets for stale entries
- 🐍 Python API + CLI - Use in notebooks or CI/CD pipelines
- ⚡ Lightweight - No dependencies beyond Python stdlib + dateutil
Installation
pip install freshness-detector
Quick Start
CLI Usage
Calculate freshness of a single data point:
freshness calculate --confidence 0.9 --timestamp "2024-01-01" --topic ai_training
Output:
🧪 Freshness Analysis
==================================================
Initial confidence: 90.0%
Capture timestamp: 2024-01-01
Age: 365.0 days
Topic type: ai_training
Decay policy: AI training data
Decay rate (λ): 0.0200 per day
Floor: 15.0%
==================================================
Current confidence: 15.0%
⚠️ WARNING: Data is STALE (< 30% confidence)
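The reported value sits exactly on the floor. As a worked check, plugging only the numbers printed above into the formula given under Decay Policies:

```latex
C(365) = \max\bigl(0.15,\; 0.9 \cdot e^{-0.02 \cdot 365}\bigr)
       = \max\bigl(0.15,\; 0.9 \cdot e^{-7.3}\bigr)
       \approx \max(0.15,\; 0.0006)
       = 0.15
```

With these parameters the unclamped term $0.9 \cdot e^{-0.02 t}$ reaches the 15% floor after roughly $\ln(6)/0.02 \approx 90$ days, so any older `ai_training` entry with this initial confidence reports the floor value.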
Check an entire dataset:
freshness check training_data.json --threshold 0.4 --verbose
List all decay policies:
freshness policies
Python API
Basic usage:
from freshness_detector import calculate_freshness
# Check if data is still fresh
confidence = calculate_freshness(
    initial_confidence=0.95,
    capture_timestamp="2024-06-01",
    topic_type="ai_training"
)
print(f"Current confidence: {confidence:.1%}")
# Example output about 37 days after capture (the value depends on the run date):
# Current confidence: 45.2%
if confidence < 0.5:
    print("⚠️ Time to retrain!")
Analyze a dataset:
from freshness_detector import check_dataset
results = check_dataset(
    "training_data.json",
    topic_type="ai_training",
    threshold=0.3
)
print(results["summary"])
print(f"Stale entries: {results['stale_entries']}")
print(f"Average confidence: {results['average_confidence']:.1%}")
Batch processing (in-memory):
from freshness_detector import batch_check
data = [
    {"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.9},
    {"text": "Example 2", "timestamp": "2023-01-01", "confidence": 0.85},
]
results = batch_check(data, threshold=0.5)
print(f"Stale entries: {results['stale_entries']}")
print(f"Stale indices: {results['stale_indices']}")
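For readers curious what such a scan involves, here is a minimal, self-contained sketch. It is not the library's implementation: the function name `scan`, its explicit `lam`, `floor`, and `now` parameters, and the use of `datetime.fromisoformat` are assumptions made for illustration. Only the result keys mirror those used in the examples above.

```python
import math
from datetime import datetime

def scan(entries, threshold, lam, floor, now):
    """Hypothetical stand-in for a batch scan: decay each entry, flag stale ones."""
    confidences = []
    for entry in entries:
        captured = datetime.fromisoformat(entry["timestamp"])
        # Age in days; timestamps in the future are treated as age 0.
        age_days = max((now - captured).total_seconds() / 86400, 0.0)
        c0 = entry.get("confidence", 1.0)
        confidences.append(max(floor, c0 * math.exp(-lam * age_days)))
    stale = [i for i, c in enumerate(confidences) if c < threshold]
    average = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "stale_entries": len(stale),
        "stale_indices": stale,
        "average_confidence": average,
    }

data = [
    {"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.9},
    {"text": "Example 2", "timestamp": "2023-01-01", "confidence": 0.85},
]
# Assumed parameters: the ai_training policy, evaluated on a fixed date.
report = scan(data, threshold=0.5, lam=0.02, floor=0.15, now=datetime(2025, 1, 11))
print(report["stale_indices"])  # [1]
```

Passing `now` explicitly keeps the result reproducible, which is convenient in tests. Whether the real `batch_check` accepts anything similar is not stated in this README.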
Custom decay parameters:
from freshness_detector import calculate_freshness
# Use custom decay rate and floor
confidence = calculate_freshness(
    initial_confidence=0.9,
    capture_timestamp="2024-01-01",
    topic_type="ai_training",
    custom_lambda=0.03,  # Faster decay
    custom_floor=0.1     # Lower minimum
)
Decay Policies
Different types of information decay at different rates:
| Topic Type | Decay Rate (λ, per day) | Floor | Half-life | Description |
|---|---|---|---|---|
| news | 0.10 | 5% | ~7 days | News and current events |
| social_media | 0.15 | 2% | ~5 days | Social media trends |
| financial | 0.08 | 10% | ~9 days | Market data |
| ai_training | 0.02 | 15% | ~35 days | AI/ML best practices |
| medical | 0.015 | 25% | ~46 days | Medical guidelines |
| code | 0.005 | 20% | ~139 days | Code examples/APIs |
| science | 0.002 | 30% | ~347 days | Scientific facts |
| legal | 0.001 | 40% | ~693 days | Legal precedents |
| history | 0.0 | 100% | ∞ | Historical facts |
Formula: C(t) = max(floor, C₀ × e^(-λ × t))
Where:
- C(t) = Current confidence
- C₀ = Initial confidence
- λ = Decay rate (lambda_per_day)
- t = Time in days
- floor = Minimum confidence threshold
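The formula above translates almost directly into code. The following is a minimal sketch of the arithmetic only, not the package's source; the function name `decayed_confidence` and its parameter names are chosen here for illustration.

```python
import math

def decayed_confidence(initial_confidence, age_days, lam, floor):
    """Evaluate C(t) = max(floor, C0 * exp(-lam * t)) with t in days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return max(floor, initial_confidence * math.exp(-lam * age_days))

# ai_training row of the table: lam = 0.02 per day, floor = 0.15
recent = decayed_confidence(0.9, 10, lam=0.02, floor=0.15)
old = decayed_confidence(0.9, 365, lam=0.02, floor=0.15)
print(f"{recent:.3f} {old:.3f}")  # 0.737 0.150
```

One consequence of taking the maximum with the floor: an entry whose initial confidence is already below the floor is reported at the floor, so the `history` policy (floor 100%) yields 100% regardless of the input.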
Use Cases
1. ML Pipeline Integration
from freshness_detector import batch_check
# Before training (training_data is a list of dicts, as in the batch processing example)
results = batch_check(training_data, threshold=0.5)
if results['stale_entries'] > len(training_data) * 0.1:
    print("⚠️ More than 10% of data is stale!")
    # Trigger data refresh pipeline
2. CI/CD Data Quality Checks
# In your CI pipeline
freshness check data/training_set.json --threshold 0.4
# Exit code 1 if stale entries found
3. Model Retraining Scheduler
from freshness_detector import calculate_freshness
last_training_date = "2024-06-01"
current_conf = calculate_freshness(1.0, last_training_date, "ai_training")
if current_conf < 0.6:
    trigger_retraining()  # placeholder: call your own retraining hook here
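Instead of polling, the decay formula can be inverted to estimate when the threshold will be reached: solving C₀ × e^(-λ × t) = threshold gives t = ln(C₀ / threshold) / λ. The sketch below assumes the exponential model from Decay Policies; `days_until_threshold` is a hypothetical helper, not part of the package.

```python
import math
from datetime import datetime, timedelta

def days_until_threshold(initial_confidence, threshold, lam, floor=0.0):
    """Days until the decayed confidence reaches `threshold` (math.inf if never)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if floor > threshold:
        return math.inf   # the floor keeps confidence above the threshold
    if initial_confidence <= threshold:
        return 0.0        # already at or below the threshold
    if lam <= 0:
        return math.inf   # no decay, e.g. the history policy
    return math.log(initial_confidence / threshold) / lam

# Same numbers as the scheduler example: C0 = 1.0, threshold 0.6, ai_training policy.
days = days_until_threshold(1.0, 0.6, lam=0.02, floor=0.15)
due = datetime(2024, 6, 1) + timedelta(days=days)
print(f"{days:.1f} days -> retrain around {due.date()}")
# 25.5 days -> retrain around 2024-06-26
```

A scheduler could use this date to plan a retraining job in advance rather than checking the confidence on every run.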
4. Dataset Documentation
from freshness_detector import check_dataset
# Generate freshness report for dataset README
results = check_dataset("dataset.json")
print(results["summary"])
# Add to dataset card / model card
Dataset Format
JSON format:
[
  {
    "text": "Training example 1",
    "timestamp": "2025-01-01",
    "confidence": 0.95
  },
  {
    "text": "Training example 2",
    "timestamp": "2024-06-01",
    "confidence": 0.90
  }
]
JSONL format:
{"text": "Example 1", "timestamp": "2025-01-01", "confidence": 0.95}
{"text": "Example 2", "timestamp": "2024-06-01", "confidence": 0.90}
Supported timestamp fields:
- timestamp
- created_at
- date
- captured_at
- updated_at
Confidence field (optional):
- confidence (defaults to 1.0 if not present)
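To make the format rules concrete, here is a hedged sketch of how a reader might load either format and pull out the two fields. It is an illustration only: the helper names, the order in which timestamp fields are tried, and the use of `datetime.fromisoformat` (the package itself depends on dateutil and may accept more formats) are all assumptions.

```python
import json
from datetime import datetime

# Field names from the list above; the lookup order is an assumption.
TIMESTAMP_FIELDS = ("timestamp", "created_at", "date", "captured_at", "updated_at")

def parse_records(text):
    """Parse a JSON array or JSONL text into a list of dicts."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    # JSONL: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

def extract_fields(record):
    """Return (timestamp, confidence); confidence defaults to 1.0."""
    for name in TIMESTAMP_FIELDS:
        if name in record:
            captured = datetime.fromisoformat(record[name])
            break
    else:
        raise KeyError("no supported timestamp field found")
    return captured, float(record.get("confidence", 1.0))

jsonl = (
    '{"text": "Example 1", "created_at": "2025-01-01"}\n'
    '{"text": "Example 2", "timestamp": "2024-06-01", "confidence": 0.90}\n'
)
for record in parse_records(jsonl):
    print(extract_fields(record))
```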
Research Background
This tool is based on research from Infrastructure Observatory on temporal integrity in AI systems.
Key insight: Information has a "half-life": the time it takes for confidence to drop to half of its initial value, which is ln(2) / λ under exponential decay. By modeling this decay mathematically, we can predict when training data becomes unreliable.
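Under the exponential model, the half-life follows from solving e^(-λ × t) = 1/2, which gives t = ln(2) / λ. The short sketch below recomputes the half-life column of the Decay Policies table from the listed decay rates; `half_life_days` is a name introduced here for illustration.

```python
import math

def half_life_days(lam):
    """Half-life in days for a per-day decay rate; infinite when there is no decay."""
    return math.inf if lam == 0 else math.log(2) / lam

# Decay rates as listed in the Decay Policies table.
rates = {"news": 0.10, "ai_training": 0.02, "code": 0.005, "legal": 0.001, "history": 0.0}
for topic, lam in rates.items():
    print(f"{topic}: {half_life_days(lam):.1f} days")
```

Note that the half-life describes the unclamped curve. Once a policy's floor is applied, confidence may stop falling before it has halved, for example when the initial confidence is less than twice the floor.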
Academic foundation:
- Exponential decay models (physics, chemistry)
- Information theory (Shannon entropy)
- Temporal data quality (data engineering)
Related work:
- MIT/Harvard Study: 91% of ML models degrade over time
- Training on Yesterday's Truth: The Hidden Cost of Stale Data
- Data Freshness in Data Observability
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
Development setup:
git clone https://github.com/onlyecho822-source/freshness-detector.git
cd freshness-detector
pip install -e ".[dev]"
pytest
License
MIT License - see LICENSE file for details.
Citation
If you use Freshness Detector in academic work, please cite:
@software{freshness_detector_2025,
author = {Infrastructure Observatory},
title = {Freshness Detector: Temporal Decay Modeling for AI Training Data},
year = {2025},
url = {https://github.com/onlyecho822-source/freshness-detector}
}
Support
- 🐛 Bug reports: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Email: research@infrastructure-observatory.org
Roadmap
v0.2.0 (Q1 2026):
- Pandas DataFrame support
- Visualization tools (decay curves)
- Custom policy builder
- Integration with popular ML frameworks (HuggingFace, PyTorch)
v0.3.0 (Q2 2026):
- Automated retraining recommendations
- Cost-benefit analysis (retrain vs. accept degradation)
- Multi-source data freshness aggregation
Future:
- Real-time monitoring dashboard
- Cloud service integration (S3, GCS, Azure)
- LLM-specific decay models
Acknowledgments
Built with insights from:
- Temporal integrity research in AI systems
- Data quality engineering best practices
- ML operations (MLOps) community feedback
Special thanks to:
- MIT/Harvard research team for model degradation study
- Data observability community for freshness metrics
- Early adopters and contributors
Research by Infrastructure Observatory
Keeping AI models honest, one timestamp at a time.