
Chronicle — Real‑time Event Clustering and Timeline Builder


Intelligent news aggregation powered by semantic embeddings, MinHash deduplication, and adaptive clustering algorithms.

Chronicle is a production-ready event detection system that transforms noisy real-time news streams into coherent, clustered timelines. Built for scale and accuracy, it combines modern NLP techniques with robust ML pipelines to automatically discover trending topics and extract signal from noise.

Architecture & Features

🧠 Advanced NLP Pipeline

  • Semantic Embeddings: Sentence-Transformers (all-MiniLM-L6-v2) for high-quality vector representations
  • Intelligent Fallback: Graceful degradation to TF-IDF when GPU resources are unavailable
  • MinHash LSH Deduplication: Sub-linear time complexity for near-duplicate detection (85% similarity threshold)
  • Extractive Summarization: TF-IDF weighted sentence extraction for multi-document summaries
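The similarity test behind the deduplication step can be sketched in plain Python. The helpers below (`shingles`, `jaccard`, `near_duplicates`) are illustrative, not Chronicle's API: they compute exact Jaccard similarity over 4-gram shingles with an O(n²) pairwise scan, which is what MinHash signatures (128 permutations) plus an LSH index approximate in sub-linear time:

```python
def shingles(text: str, k: int = 4) -> set:
    """Character k-grams of the lowercased text (k=4, as in the pipeline)."""
    t = text.lower()
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity; MinHash estimates this from hash signatures."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.85):
    """O(n^2) pairwise check; LSH replaces this with sub-linear candidate lookup."""
    sigs = [shingles(d) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sigs[i], sigs[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Two articles that differ only in trailing punctuation share nearly all shingles and exceed the 0.85 threshold, while unrelated stories score near zero.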

🔬 Adaptive Clustering

  • HDBSCAN: Density-based clustering with automatic outlier detection and probabilistic membership scores
  • Agglomerative Fallback: Distance-threshold clustering with cosine similarity for deterministic results
  • Dynamic Event Formation: Minimum cluster size validation ensures signal over noise
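The fallback path can be illustrated with a minimal sketch, assuming only NumPy. `threshold_cluster` below is a hypothetical, greedy stand-in for the agglomerative fallback, not Chronicle's implementation: it joins each point to the first cluster whose seed lies within the cosine-distance threshold, then relabels undersized clusters as noise, mirroring the minimum-cluster-size validation:

```python
import numpy as np

def threshold_cluster(embeddings: np.ndarray, max_cosine_dist: float = 0.6,
                      min_cluster_size: int = 3) -> np.ndarray:
    """Greedy sketch of distance-threshold clustering on cosine similarity.
    Clusters smaller than min_cluster_size are marked as noise (-1)."""
    # L2-normalize so dot products are cosine similarities
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = -np.ones(len(x), dtype=int)
    seeds = []  # index of the first member of each cluster
    for i in range(len(x)):
        for c, s in enumerate(seeds):
            if 1.0 - float(x[i] @ x[s]) <= max_cosine_dist:
                labels[i] = c
                break
        else:
            labels[i] = len(seeds)
            seeds.append(i)
    # enforce minimum cluster size: small clusters become noise
    for c in range(len(seeds)):
        members = np.where(labels == c)[0]
        if len(members) < min_cluster_size:
            labels[members] = -1
    return labels
```

HDBSCAN additionally returns per-point membership probabilities; this sketch only reproduces the hard labels and noise handling.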

🏗️ Production-Ready Design

  • Async I/O: Non-blocking HTTP client for concurrent article fetching
  • Readability Extraction: DOM-based content extraction with BeautifulSoup + lxml parsing
  • SQLite Storage: Zero-config persistence with indexed queries for sub-millisecond lookups
  • FastAPI: High-performance async REST API with automatic OpenAPI documentation
  • Docker Compose: Single-command deployment with isolated collector and API services
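The SQLite design can be pictured with a small, self-contained sketch using the standard-library `sqlite3` module. The schema below is hypothetical (Chronicle's actual tables may differ); the point is that an index on the cluster column is what makes per-cluster lookups fast:

```python
import sqlite3

# Hypothetical schema sketch; the real app persists to data/chronicle.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id         INTEGER PRIMARY KEY,
    url        TEXT UNIQUE,
    title      TEXT,
    body       TEXT,
    fetched_at REAL
);
CREATE TABLE cluster_members (
    doc_id     INTEGER REFERENCES documents(id),
    cluster_id TEXT,
    prob       REAL
);
-- The index turns per-cluster scans into indexed lookups
CREATE INDEX idx_members_cluster ON cluster_members(cluster_id);
""")
conn.execute("INSERT INTO documents VALUES (1, 'https://example.com/a', 'A', 'body text', 0)")
conn.execute("INSERT INTO cluster_members VALUES (1, 'ev-123', 0.9)")
rows = conn.execute(
    "SELECT d.title FROM documents d "
    "JOIN cluster_members m ON d.id = m.doc_id WHERE m.cluster_id = ?",
    ("ev-123",),
).fetchall()
```

Zero configuration is the draw here: no server process, a single file on disk, and parameterized queries out of the box.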

Installation

From PyPI (coming soon)

pip install chronicle-events

From GitHub

pip install git+https://github.com/dukeblue1994-glitch/chronicle.git

For Development

git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
pip install -e ".[dev]"

Quick Start

Using as a Library

from chronicle.nlp import encode
from chronicle.cluster import deduplicate, cluster_embeddings
from chronicle.timeline import summarize

# Your documents
docs = ["doc 1 text", "doc 2 text", ...]

# Deduplicate, keeping one representative per near-duplicate group
rep_indices = deduplicate(docs, threshold=0.85)
unique_docs = [docs[i] for i in rep_indices]

# Embed and cluster the deduplicated documents
embeddings = encode(unique_docs)
labels, probs = cluster_embeddings(embeddings, min_cluster_size=3)

Running the Full Application

Option 1: Docker Compose (Recommended)

docker-compose up
# API available at http://localhost:8000/docs

Option 2: Command Line Tools

# Install package
pip install chronicle-events

# Start the collector (pulls HN every 60s)
chronicle-collector

# In another terminal, start the API
chronicle-api

# Or run clustering manually
chronicle-cluster

Option 3: From Source

git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Start the collector
python apps/collector/run.py

# In another terminal, start the API
python apps/api/main.py

# Open API docs at http://127.0.0.1:8000/docs

API Usage

Endpoints

  • GET /events — Current event clusters ranked by size and confidence, with extractive summaries
  • GET /events/{cluster_id} — Detailed event view with all associated documents and metadata
  • GET /health — System health check

Example Response

{
  "cluster_id": "ev-a3f5c9d2e1b8f4a6",
  "n_docs": 7,
  "score": 0.92,
  "summary": "New AI model achieves breakthrough performance...",
  "sample": [
    {"title": "GPT-5 Released", "url": "https://..."},
    {"title": "OpenAI Announces Major Update", "url": "https://..."}
  ]
}
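Consuming such a payload needs nothing beyond the standard library. This sketch assumes GET /events returns a JSON list of objects shaped like the example above; the second, low-scoring cluster is invented for illustration, and the ranking key mirrors the documented "size and confidence" ordering:

```python
import json

# Assumed /events payload: a list of cluster objects like the example above.
# The second entry is hypothetical, added only to make the ranking visible.
payload = """[
  {"cluster_id": "ev-a3f5c9d2e1b8f4a6", "n_docs": 7, "score": 0.92,
   "summary": "New AI model achieves breakthrough performance...",
   "sample": [{"title": "GPT-5 Released", "url": "https://..."}]},
  {"cluster_id": "ev-hypothetical0001", "n_docs": 2, "score": 0.41,
   "summary": "Minor story...", "sample": []}
]"""

events = json.loads(payload)
# Rank by cluster size, then confidence score
top = sorted(events, key=lambda e: (e["n_docs"], e["score"]), reverse=True)
headline = top[0]["summary"]
```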

Technical Implementation

Data Pipeline

  1. Ingestion: Async fetcher polls Hacker News API every 60s (top 60 stories)
  2. Extraction: Readability algorithm extracts clean article text from HTML
  3. Deduplication: MinHash LSH identifies and filters near-duplicates in O(n) time
  4. Embedding: Documents encoded to 384-dimensional semantic vectors
  5. Clustering: HDBSCAN groups semantically similar documents with confidence scores
  6. Summarization: TF-IDF ranks sentences across cluster for representative summary

Algorithm Details

  • MinHash: 128 permutations, 4-gram shingling, 0.85 Jaccard similarity threshold
  • HDBSCAN: Min cluster size 3, Euclidean metric, probability-based membership
  • Embeddings: Normalized L2 vectors for cosine similarity clustering
  • Summarization: Top-k sentence selection by aggregate TF-IDF scores
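The summarization step can be sketched in pure Python. `extractive_summary` below is a simplified illustration of TF-IDF sentence ranking, not Chronicle's exact scoring: it splits the cluster's documents into sentences, weights each word by its inverse sentence frequency, and returns the top-k sentences by aggregate score:

```python
import math
import re
from collections import Counter

def extractive_summary(docs, k=1):
    """Rank sentences by the sum of their words' TF-IDF weights
    (IDF computed over sentences) and return the top-k."""
    sentences = [s.strip() for d in docs
                 for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    # document frequency of each word, counted over sentences
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in df}
    def score(toks):
        tf = Counter(toks)
        return sum(tf[w] * idf[w] for w in tf)
    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in ranked[:k]]
```

Sentences repeated across documents in a cluster (boilerplate, shared ledes) accumulate low IDF weights, so the distinctive, information-dense sentence wins.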

Notes

  • Storage: SQLite at data/chronicle.db for zero-config portability
  • Embeddings: Sentence-Transformers preferred; automatic TF-IDF fallback (4096 features, bigrams)
  • Clustering: HDBSCAN when available; Agglomerative with 0.6 cosine distance threshold as fallback
  • Performance: Batch processing of 400 recent documents with incremental clustering

Data Source

  • Hacker News: Top stories API with full article text extraction when available

Configuration

Chronicle can be configured via environment variables. Copy .env.example to .env and customize:

# Key configuration options
CHRONICLE_COLLECTOR_INTERVAL=60  # Fetch interval in seconds
CHRONICLE_CLUSTER_MIN_SIZE=3     # Minimum documents per cluster
CHRONICLE_DEDUP_THRESHOLD=0.85   # Similarity threshold for deduplication
CHRONICLE_LOG_LEVEL=INFO         # Logging level

See .env.example for all available options.
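Reading these variables follows the usual environment-with-defaults pattern. The `Settings` class below is illustrative, not Chronicle's actual config object; the variable names and defaults match the options listed above:

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)

# Hypothetical settings sketch mirroring the documented defaults
@dataclass
class Settings:
    collector_interval: int = field(
        default_factory=lambda: int(_env("CHRONICLE_COLLECTOR_INTERVAL", "60")))
    cluster_min_size: int = field(
        default_factory=lambda: int(_env("CHRONICLE_CLUSTER_MIN_SIZE", "3")))
    dedup_threshold: float = field(
        default_factory=lambda: float(_env("CHRONICLE_DEDUP_THRESHOLD", "0.85")))
    log_level: str = field(
        default_factory=lambda: _env("CHRONICLE_LOG_LEVEL", "INFO"))
```

Using `default_factory` means the environment is read when a `Settings` instance is created, so a `.env` file loaded at startup takes effect without reimporting the module.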

Development

Setup

git clone https://github.com/dukeblue1994-glitch/chronicle.git
cd chronicle
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Running Tests

pytest                    # Run all tests
pytest --cov             # With coverage
pytest tests/test_api.py # Specific test file

Code Quality

black chronicle apps tests  # Format code
ruff check chronicle apps   # Lint code
mypy chronicle apps         # Type check

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Quick Contribution Guide

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run quality checks (pytest && black . && ruff check .)
  5. Commit (git commit -m 'Add amazing feature')
  6. Push and create a Pull Request

Roadmap

  • Additional data sources (Reddit, Twitter, RSS)
  • Real-time WebSocket API for live updates
  • Event evolution tracking over time
  • Advanced trend detection
  • Grafana dashboards for monitoring
  • Multi-language support

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Chronicle in your research or project, please cite:

@software{chronicle2025,
  title = {Chronicle: Real-time Event Clustering and Timeline Builder},
  author = {Anderson, Nick},
  year = {2025},
  url = {https://github.com/dukeblue1994-glitch/chronicle}
}

Acknowledgments

  • Sentence-Transformers for embeddings
  • HDBSCAN for density-based clustering
  • MinHash LSH for efficient deduplication
  • FastAPI for the API framework
  • Hacker News for the data source

Built with ❤️ for discovering what's trending in tech
