Telegram Log Service — receive ML training logs via HTTP and send real-time alerts through a Telegram bot.
telegram-log-service
telegram-log-service — a server that receives ML training logs via HTTP and sends real-time alerts through a Telegram bot. Designed to work with messenger-logger-callback.
Architecture
Training Script                     Telegram Log Service                Telegram
┌──────────────────┐  HTTP POST     ┌─────────────────────┐             ┌──────────┐
│ MessengerLogger  │ ─────────────> │ /api/logs handler   │             │          │
│ or Callback      │  /api/logs     │         ↓           │   Bot API   │ Telegram │
│ + heartbeat      │                │ global_state        │ ──────────> │ Users    │
└──────────────────┘                │         ↓           │             │          │
                                    │ alerting → bot      │             └──────────┘
                                    │ staleness_checker   │
                                    └─────────────────────┘
Flow:
- Training scripts send JSON events (logs, status updates, heartbeats) to POST /api/logs.
- The web handler updates in-memory run state and triggers alerts when appropriate.
- The Telegram bot sends alerts to subscribed users and responds to commands.
- A background staleness checker detects crashed/stalled runs.
Prerequisites
- Python 3.8+
- A Telegram bot token (create one via @BotFather)
Installation
From source (pip)
git clone https://github.com/Riko0/telegram_log_service.git
cd telegram_log_service
pip install .
Configure
cp .env.example .env
# Edit .env and fill in your TELEGRAM_BOT_TOKEN and ADMIN_TELEGRAM_NAME
Run
After installing, the telegram-log-service command is available system-wide:
telegram-log-service
Or using the Python module:
python -m telegram_log_service
Docker
# From the telegram_log_service directory:
chmod +x deploy/docker/build_docker.sh deploy/scripts/startup.sh
./deploy/docker/build_docker.sh
The Docker image installs the package via pip install . and runs telegram-log-service as the entry point. Pass your .env file via --env-file.
Configuration
All settings are read from environment variables (or a .env file). See .env.example for a complete template.
| Variable | Required | Default | Description |
|---|---|---|---|
| TELEGRAM_BOT_TOKEN | Yes | (none) | Telegram bot token from BotFather |
| WEB_SERVER_HOST | No | 0.0.0.0 | Bind address for the HTTP server |
| WEB_SERVER_PORT | No | 5000 | Port for the HTTP server |
| WEB_AUTH_TOKEN | No | (none) | If set, /api/logs requires Authorization: Bearer <token> |
| STALL_ALERT_THRESHOLD_SECONDS | No | 1800 | Seconds without logs before a run is considered stalled |
| STALLED_RUN_AUTO_REMOVE_THRESHOLD_SECONDS | No | 3600 | Seconds before a stalled run is auto-removed |
| HEARTBEAT_STALL_THRESHOLD_SECONDS | No | 300 | Stall threshold for runs sending heartbeats (shorter) |
| BEST_METRIC_ALERT_COOLDOWN_SECONDS | No | 300 | Minimum seconds between best-metric alerts per run |
| ADMIN_TELEGRAM_NAME | No | (none) | Telegram username (without @) for admin commands |
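To make the defaults in the table concrete, here is a minimal sketch of reading these variables in Python. The `Settings` dataclass and `load_settings` function are hypothetical names used for illustration; the service's actual configuration code may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the table above."""
    telegram_bot_token: str
    web_server_host: str = "0.0.0.0"
    web_server_port: int = 5000
    web_auth_token: Optional[str] = None
    stall_alert_threshold_seconds: int = 1800
    stalled_run_auto_remove_threshold_seconds: int = 3600
    heartbeat_stall_threshold_seconds: int = 300
    best_metric_alert_cooldown_seconds: int = 300
    admin_telegram_name: Optional[str] = None


def _int(env: Mapping[str, str], name: str, default: int) -> int:
    # Fall back to the documented default when the variable is unset or empty.
    raw = env.get(name, "")
    return int(raw) if raw.strip() else default


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    token = env.get("TELEGRAM_BOT_TOKEN")
    if not token:
        # TELEGRAM_BOT_TOKEN is the only variable marked as required.
        raise ValueError("TELEGRAM_BOT_TOKEN is required")
    return Settings(
        telegram_bot_token=token,
        web_server_host=env.get("WEB_SERVER_HOST") or "0.0.0.0",
        web_server_port=_int(env, "WEB_SERVER_PORT", 5000),
        web_auth_token=env.get("WEB_AUTH_TOKEN") or None,
        stall_alert_threshold_seconds=_int(
            env, "STALL_ALERT_THRESHOLD_SECONDS", 1800),
        stalled_run_auto_remove_threshold_seconds=_int(
            env, "STALLED_RUN_AUTO_REMOVE_THRESHOLD_SECONDS", 3600),
        heartbeat_stall_threshold_seconds=_int(
            env, "HEARTBEAT_STALL_THRESHOLD_SECONDS", 300),
        best_metric_alert_cooldown_seconds=_int(
            env, "BEST_METRIC_ALERT_COOLDOWN_SECONDS", 300),
        admin_telegram_name=env.get("ADMIN_TELEGRAM_NAME") or None,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test; loading values from a .env file is left out here.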
API
POST /api/logs
Receives training events. Requires Authorization: Bearer <token> header if WEB_AUTH_TOKEN is set.
Required fields:
| Field | Type | Description |
|---|---|---|
| project_name | string | Project identifier |
| run_id | string | Unique run identifier |
| event_type | string | One of: training_started, trainer_log, epoch_ended, training_finished, custom_log, heartbeat |
| timestamp | string | ISO 8601 timestamp |
Optional fields:
| Field | Type | Description |
|---|---|---|
| author_username | string | Who started the run |
| trainer_state | object | Training state (global_step, epoch, is_training, best_metric, etc.) |
| logs | object | Metric key-value pairs (for trainer_log) |
| custom_data | object | Arbitrary data (for custom_log) |
| clearml_link | string | URL to ClearML dashboard for this run |
Any other top-level keys are stored as run metadata.
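For readers who want to send events without the client library, the following is a minimal sketch that builds a payload with the required fields and prepares the POST request using only the standard library. The helper names `build_event` and `build_request` are hypothetical, and the base URL in the usage note assumes the default port from the configuration table.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Event types listed in the required-fields table above.
EVENT_TYPES = {
    "training_started", "trainer_log", "epoch_ended",
    "training_finished", "custom_log", "heartbeat",
}


def build_event(project_name: str, run_id: str, event_type: str,
                **optional: Any) -> Dict[str, Any]:
    """Assemble an event dict; extra keyword arguments become optional fields."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type!r}")
    event: Dict[str, Any] = {
        "project_name": project_name,
        "run_id": run_id,
        "event_type": event_type,
        # ISO 8601 timestamp in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    event.update(optional)
    return event


def build_request(base_url: str, event: Dict[str, Any],
                  auth_token: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) the POST /api/logs request."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        # Only needed when the server has WEB_AUTH_TOKEN set.
        headers["Authorization"] = f"Bearer {auth_token}"
    return urllib.request.Request(
        f"{base_url}/api/logs",
        data=json.dumps(event).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To actually send an event, pass the request to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_request("http://localhost:5000", build_event("demo", "run-1", "trainer_log", logs={"loss": 0.42})))`. The response format of /api/logs is not documented here, so the sketch does not parse it.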
GET /health
Returns server status:
{"status": "ok", "active_runs": 3}
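As an illustration of consuming this endpoint, here is a small sketch of a health probe. The function names are hypothetical, and treating any status other than "ok" as unhealthy is an assumption; the source only shows the "ok" response.

```python
import json
import urllib.request


def parse_health(body: bytes) -> int:
    """Return the number of active runs from a /health response body."""
    payload = json.loads(body)
    if payload.get("status") != "ok":
        # Assumption: anything other than "ok" means the server is unhealthy.
        raise RuntimeError(f"unexpected health status: {payload.get('status')!r}")
    return int(payload["active_runs"])


def fetch_active_runs(base_url: str, timeout: float = 5.0) -> int:
    """Query GET /health on a running server and return its active run count."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read())
```

Keeping the parsing separate from the network call makes the probe easy to check against a recorded response.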
Bot Commands
User Commands
| Command | Description |
|---|---|
| /start | Register with the bot, auto-subscribe to all runs |
| /help | Show available commands |
| /status | List all active training runs |
| /status <project> <run_id> | Get status of a specific run |
| /full_status | Detailed status for all runs |
| /full_status <project> <run_id> | Detailed status for a specific run |
| /subscribe | Subscribe to all current and future runs |
| /subscribe <project> <run_id> | Subscribe to a specific run |
| /unsubscribe | Unsubscribe from all alerts |
| /unsubscribe <project> <run_id> | Unsubscribe from a specific run |
| /list_subscriptions | List your current subscriptions |
Admin Commands
| Command | Description |
|---|---|
| /add_user <username> | Add a user to the whitelist |
| /remove_user <username> | Remove a user from the whitelist |
| /list_users | List all whitelisted users |
| /remove_run <project> <run_id> | Manually remove a training run |
Alerts
The bot sends alerts to subscribed users when:
| Alert | When |
|---|---|
| Training Started | A new run sends its first training_started event |
| Training Finished | A run sends training_finished |
| Training Stalled | No logs/heartbeats received beyond the threshold |
| Training Resumed | A stalled run starts sending logs again |
| Best Metric Changed | best_metric improves (with cooldown to avoid spam) |
| Run Removed | A stalled run is auto-removed after prolonged inactivity |
If ClearML is detected, alerts include a direct link to the ClearML dashboard.
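The cooldown on Best Metric Changed alerts can be pictured as a per-run rate limit. The following is a minimal sketch of that idea, not the service's actual code: `BestMetricAlertGate` is a hypothetical class, and deciding whether best_metric has actually improved is assumed to happen before the gate is consulted.

```python
from typing import Dict, Tuple


class BestMetricAlertGate:
    """Allow at most one best-metric alert per run within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        # 300 matches the documented BEST_METRIC_ALERT_COOLDOWN_SECONDS default.
        self.cooldown_seconds = cooldown_seconds
        self._last_alert: Dict[Tuple[str, str], float] = {}

    def should_alert(self, project: str, run_id: str, now: float) -> bool:
        """Return True and record the time if an alert may be sent now."""
        key = (project, run_id)
        last = self._last_alert.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown for this run
        self._last_alert[key] = now
        return True
```

In use, `now` would typically come from `time.time()` or `time.monotonic()`; taking it as a parameter keeps the gate deterministic. Each run has its own window, so a busy run does not suppress alerts for others.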
Heartbeat
When the client library sends heartbeat events (about every 60 seconds by default), the server applies a shorter stall threshold (HEARTBEAT_STALL_THRESHOLD_SECONDS, default 300s) to detect crashes faster. Runs without heartbeats use the standard STALL_ALERT_THRESHOLD_SECONDS (default 1800s). This is fully backwards-compatible: old clients work the same as before.
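The threshold selection described above can be sketched as follows. `RunActivity`, `stall_threshold`, and `is_stalled` are hypothetical names, and the strict "greater than" comparison is an assumption about how "beyond the threshold" is evaluated.

```python
from dataclasses import dataclass


@dataclass
class RunActivity:
    """Hypothetical per-run record used by a staleness check."""
    last_event_at: float          # time of the last log or heartbeat, in seconds
    uses_heartbeat: bool = False  # True once the run has sent a heartbeat event


def stall_threshold(run: RunActivity,
                    heartbeat_threshold: float = 300.0,
                    default_threshold: float = 1800.0) -> float:
    """Pick the shorter threshold for runs that send heartbeats."""
    return heartbeat_threshold if run.uses_heartbeat else default_threshold


def is_stalled(run: RunActivity, now: float,
               heartbeat_threshold: float = 300.0,
               default_threshold: float = 1800.0) -> bool:
    """A run counts as stalled once its silence exceeds the chosen threshold."""
    silence = now - run.last_event_at
    return silence > stall_threshold(run, heartbeat_threshold, default_threshold)
```

With the default values, a heartbeat-enabled run that goes quiet is flagged after a little over 5 minutes, while a run from an older client is flagged after 30 minutes.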
Data Persistence
- Whitelist, subscribers, and user info are saved to JSON files and survive restarts.
- Training run data is saved to training_data.json on every meaningful event (not heartbeats) and restored on startup.
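The save-on-meaningful-event behavior can be sketched as below. This is an illustrative assumption rather than the service's implementation: the function names, the `project/run_id` key format, and the atomic write through a temporary file are all choices made for the example.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_runs(runs: Dict[str, Any], path: str) -> None:
    """Write run data to `path` via a temp file so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(runs, fh)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_runs(path: str) -> Dict[str, Any]:
    """Restore run data on startup; a missing file means no saved runs."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def on_event(runs: Dict[str, Any], event: Dict[str, Any], path: str) -> None:
    """Update in-memory state, persisting only for non-heartbeat events."""
    key = f'{event["project_name"]}/{event["run_id"]}'
    runs.setdefault(key, {})["last_event_type"] = event["event_type"]
    if event["event_type"] != "heartbeat":
        save_runs(runs, path)
```

Skipping heartbeats avoids rewriting the file every minute per run, at the cost of losing the most recent heartbeat time if the server restarts.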
Related Projects
- messenger-logger-callback — the client library that sends training logs to this service.
pip install messenger-logger-callback
License
MIT