Chronicle — Real‑time Event Clustering and Timeline Builder
Intelligent news aggregation powered by semantic embeddings, MinHash deduplication, and adaptive clustering algorithms.
Chronicle is a production-ready event detection system that transforms noisy real-time news streams into coherent, clustered timelines. Built for scale and accuracy, it combines modern NLP techniques with robust ML pipelines to automatically discover trending topics and extract signal from noise.
Architecture & Features
🧠 Advanced NLP Pipeline
- Semantic Embeddings: Sentence-Transformers (all-MiniLM-L6-v2) for high-quality vector representations
- Intelligent Fallback: Graceful degradation to TF-IDF when GPU resources are unavailable
- MinHash LSH Deduplication: Sub-linear per-query candidate lookup for near-duplicate detection (85% similarity threshold)
- Extractive Summarization: TF-IDF weighted sentence extraction for multi-document summaries
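The MinHash LSH path above can be sketched in pure Python. This is a toy illustration of the technique, not Chronicle's actual code (which would more likely use a library such as datasketch); the function names and the 8-band layout are assumptions chosen to approximate the stated 0.85 threshold.

```python
import hashlib
from collections import defaultdict

def shingles(text: str, n: int = 4) -> set[str]:
    """Character n-gram shingles (4-grams, per the pipeline description)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(shingle_set: set[str], num_perm: int = 128) -> list[int]:
    """For each of num_perm seeded hash functions, keep the minimum hash
    over all shingles; the fraction of equal rows estimates Jaccard similarity."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return sig

def lsh_candidates(signatures: dict[str, list[int]], bands: int = 8) -> set[tuple]:
    """Split each signature into bands; documents sharing any whole band become
    candidate pairs. 8 bands x 16 rows gives a ~0.88 similarity threshold,
    close to the 0.85 used by the pipeline."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return {tuple(sorted(p)) for ids in buckets.values()
            for p in zip(ids, ids[1:])}
```

Only candidate pairs surfaced by a shared band need an exact similarity check, which is what keeps the overall pass cheap compared with all-pairs comparison.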
🔬 Adaptive Clustering
- HDBSCAN: Density-based clustering with automatic outlier detection and probabilistic membership scores
- Agglomerative Fallback: Distance-threshold clustering with cosine similarity for deterministic results
- Dynamic Event Formation: Minimum cluster size validation ensures signal over noise
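The distance-threshold fallback can be sketched deterministically with a union-find over a cosine-distance matrix. This is a minimal single-linkage sketch (the actual fallback's linkage criterion may differ), assuming only numpy:

```python
import numpy as np

def cosine_threshold_clusters(embeddings: np.ndarray, threshold: float = 0.6):
    """Group documents whose pairwise cosine distance falls below `threshold`,
    using union-find so results are deterministic for a given input order."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T                      # cosine distance matrix
    parent = list(range(len(X)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    # Union every pair (i < j) closer than the threshold.
    for i, j in zip(*np.where(np.triu(dist < threshold, k=1))):
        parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(len(X))])
    return [sorted(np.flatnonzero(roots == r).tolist())
            for r in sorted(set(roots.tolist()))]
```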
🏗️ Production-Ready Design
- Async I/O: Non-blocking HTTP client for concurrent article fetching
- Readability Extraction: DOM-based content extraction with BeautifulSoup + lxml parsing
- SQLite Storage: Zero-config persistence with indexed queries for sub-millisecond lookups
- FastAPI: High-performance async REST API with automatic OpenAPI documentation
- Docker Compose: Single-command deployment with isolated collector and API services
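The zero-config SQLite pattern looks roughly like the sketch below. The table and column names here are illustrative assumptions, not Chronicle's actual schema; the point is that an index on the query column keeps recent-document lookups fast with no server to run.

```python
import sqlite3

# In-memory for the sketch; a real deployment would open data/chronicle.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (          -- hypothetical schema
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE,
        title TEXT,
        body TEXT,
        fetched_at REAL
    );
    CREATE INDEX idx_docs_fetched ON documents(fetched_at);
""")
conn.execute(
    "INSERT INTO documents (url, title, body, fetched_at) VALUES (?, ?, ?, ?)",
    ("https://example.com/a", "Example", "article text", 1700000000.0))
conn.commit()

# Indexed range scan over fetched_at.
row = conn.execute(
    "SELECT title FROM documents WHERE fetched_at > ? ORDER BY fetched_at DESC",
    (0,)).fetchone()
```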
Installation
From PyPI
pip install chronicle-events
From GitHub
pip install git+https://github.com/dukeblue1994-glitch/chronicle.git
For Development
git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
pip install -e ".[dev]"
Quick Start
Using as a Library
from chronicle.nlp import encode
from chronicle.cluster import deduplicate, cluster_embeddings
from chronicle.timeline import summarize
# Your documents
docs = ["doc 1 text", "doc 2 text", ...]
# Deduplicate
rep_indices = deduplicate(docs, threshold=0.85)
# Embed and cluster
embeddings = encode(docs)
labels, probs = cluster_embeddings(embeddings, min_cluster_size=3)
Running the Full Application
Option 1: Docker Compose (Recommended)
docker-compose up
# API available at http://localhost:8000/docs
Option 2: Command Line Tools
# Install package
pip install chronicle-events
# Start the collector (pulls HN every 60s)
chronicle-collector
# In another terminal, start the API
chronicle-api
# Or run clustering manually
chronicle-cluster
Option 3: From Source
git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Start the collector
python apps/collector/run.py
# In another terminal, start the API
python apps/api/main.py
# Open API docs at http://127.0.0.1:8000/docs
API Usage
Endpoints
- GET /events — Current event clusters ranked by size and confidence, with extractive summaries
- GET /events/{cluster_id} — Detailed event view with all associated documents and metadata
- GET /health — System health check
Example Response
{
"cluster_id": "ev-a3f5c9d2e1b8f4a6",
"n_docs": 7,
"score": 0.92,
"summary": "New AI model achieves breakthrough performance...",
"sample": [
{"title": "GPT-5 Released", "url": "https://..."},
{"title": "OpenAI Announces Major Update", "url": "https://..."}
]
}
Technical Implementation
Data Pipeline
- Ingestion: Async fetcher polls Hacker News API every 60s (top 60 stories)
- Extraction: Readability algorithm extracts clean article text from HTML
- Deduplication: MinHash LSH identifies and filters near-duplicates in O(n) time
- Embedding: Documents encoded to 384-dimensional semantic vectors
- Clustering: HDBSCAN groups semantically similar documents with confidence scores
- Summarization: TF-IDF ranks sentences across cluster for representative summary
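The summarization step can be sketched as a self-contained TF-IDF sentence ranker. This is an illustrative toy under the description above, not Chronicle's implementation: score each sentence by the average TF-IDF weight of its terms, then keep the top-k in original order.

```python
import math
import re
from collections import Counter

def extractive_summary(sentences: list[str], k: int = 2) -> list[str]:
    """Rank sentences by mean TF-IDF weight of their terms; return the
    top-k sentences in their original order."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * math.log(n / df[t]) for t in tf) / (len(toks) or 1)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Rare, cluster-specific terms get high IDF, so the sentence that carries the distinctive content of the cluster outranks boilerplate repeated across documents.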
Algorithm Details
- MinHash: 128 permutations, 4-gram shingling, 0.85 Jaccard similarity threshold
- HDBSCAN: Min cluster size 3, Euclidean metric, probability-based membership
- Embeddings: Normalized L2 vectors for cosine similarity clustering
- Summarization: Top-k sentence selection by aggregate TF-IDF scores
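The last two bullets interact: for L2-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity (||a − b||² = 2(1 − cos θ)), which is why clustering with a Euclidean metric on normalized embeddings effectively clusters by cosine similarity. A quick numerical check, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # 384-dim, like the embeddings
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # L2-normalize

euclid_sq = float(np.sum((a - b) ** 2))
cos_sim = float(a @ b)
# Identity for unit vectors: ||a - b||^2 == 2 * (1 - cos)
assert abs(euclid_sq - 2 * (1 - cos_sim)) < 1e-9
```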
Notes
- Storage: SQLite at data/chronicle.db for zero-config portability
- Embeddings: Sentence-Transformers preferred; automatic TF-IDF fallback (4096 features, bigrams)
- Clustering: HDBSCAN when available; Agglomerative with 0.6 cosine distance threshold as fallback
- Performance: Batch processing of 400 recent documents with incremental clustering
Data Source
- Hacker News: Top stories API with full article text extraction when available
Configuration
Chronicle can be configured via environment variables. Copy .env.example to .env and customize:
# Key configuration options
CHRONICLE_COLLECTOR_INTERVAL=60 # Fetch interval in seconds
CHRONICLE_CLUSTER_MIN_SIZE=3 # Minimum documents per cluster
CHRONICLE_DEDUP_THRESHOLD=0.85 # Similarity threshold for deduplication
CHRONICLE_LOG_LEVEL=INFO # Logging level
See .env.example for all available options.
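A common pattern for consuming these variables is a small typed-default reader; this sketch is illustrative and the helper name is hypothetical, but the variable names and defaults match the configuration above.

```python
import os

def env(name: str, default, cast=str):
    """Read an environment variable, casting it if set, else return the default."""
    raw = os.getenv(name)
    return cast(raw) if raw is not None else default

# Defaults mirror the documented configuration options.
INTERVAL = env("CHRONICLE_COLLECTOR_INTERVAL", 60, int)
MIN_SIZE = env("CHRONICLE_CLUSTER_MIN_SIZE", 3, int)
DEDUP_THRESHOLD = env("CHRONICLE_DEDUP_THRESHOLD", 0.85, float)
LOG_LEVEL = env("CHRONICLE_LOG_LEVEL", "INFO")
```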
Development
Setup
git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
Running Tests
pytest # Run all tests
pytest --cov # With coverage
pytest tests/test_api.py # Specific test file
Code Quality
black chronicle apps tests # Format code
ruff check chronicle apps # Lint code
mypy chronicle apps # Type check
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Quick Contribution Guide
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with tests
- Run quality checks (pytest && black . && ruff check .)
- Commit (git commit -m 'Add amazing feature')
- Push and create a Pull Request
Roadmap
- Additional data sources (Reddit, Twitter, RSS)
- Real-time WebSocket API for live updates
- Event evolution tracking over time
- Advanced trend detection
- Grafana dashboards for monitoring
- Multi-language support
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use Chronicle in your research or project, please cite:
@software{chronicle2025,
title = {Chronicle: Real-time Event Clustering and Timeline Builder},
author = {Anderson, Nick},
year = {2025},
url = {https://github.com/dukeblue1994-glitch/chronicle}
}
Acknowledgments
- Sentence-Transformers for embeddings
- HDBSCAN for density-based clustering
- MinHash LSH for efficient deduplication
- FastAPI for the API framework
- Hacker News for the data source
Built with ❤️ for discovering what's trending in tech
File details
Details for the file chronicle_events-0.1.0.tar.gz.
File metadata
- Download URL: chronicle_events-0.1.0.tar.gz
- Size: 24.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63fe28e0d6850b280936a881e43af9b9f87120eb78a73177258f8c7967a53a20 |
| MD5 | 429bb857c4c733229c5d8ecb038b7fc6 |
| BLAKE2b-256 | 67d4c1db6903539d57e415ad255ad84ac87de28d154855ed55d44de24df3012e |
File details
Details for the file chronicle_events-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chronicle_events-0.1.0-py3-none-any.whl
- Size: 20.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dd7b8f502f876c23ce52903e97273a34b45412f28d313b13b46753c34befc24 |
| MD5 | 62702469bf2b9749ec2ee2d5c32d0bf4 |
| BLAKE2b-256 | ab8d9c997936a75a66d61c996fd226e727ba0381fb32b7e1e9edc7dcd8993447 |