# tg-parser
Parse Telegram Desktop JSON exports for LLM processing.
Transform messy chat exports into clean, structured data ready for summarization, analysis, and artifact extraction with Claude or other LLMs.
## Features

### Implemented ✅ (v1.2.0)
- 🗂️ All chat types: Personal, groups, supergroups, forum topics, channels
- 🔍 Powerful filtering: 9 filter types (date, sender, content, topic, attachments, reactions, etc.)
- ✂️ Smart chunking: 3 strategies (fixed, topic, hybrid) for LLM context limits
- 🚀 Streaming: ijson-based reader for files >50MB with auto-detection
- 📝 Multiple formats: Markdown (LLM-optimized), JSON, KB-template, CSV
- 🔌 MCP integration: 6 tools for Claude Desktop/Code
- 📊 Statistics: Message counts, top senders, topics breakdown, mention analysis
- 🎯 tiktoken integration: Accurate token counting (with SimpleTokenCounter fallback)
- 📄 split-topics command: Split forum chats by topic into separate files
- ✅ Type-safe: pyright strict mode, 413 comprehensive tests
- 🔧 mcp-config command: Auto-configure Claude Desktop/Code MCP integration
- 🆕 Config file support: TOML configuration with `config` command group
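The tiktoken-with-fallback behavior noted above can be sketched roughly as follows. Names here are illustrative, not tg-parser's actual API; the heuristic mirrors what a SimpleTokenCounter-style fallback typically does:

```python
def count_tokens(text: str) -> int:
    """Prefer tiktoken when installed; fall back to a rough estimate."""
    try:
        import tiktoken  # optional dependency

        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        # Fallback heuristic: roughly 4 characters per token for English text.
        return max(1, len(text) // 4)
```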
## Installation

```bash
# From PyPI (recommended)
pip install tg-parser

# With uv
uv tool install tg-parser

# With all extras (MCP, tiktoken, streaming)
pip install "tg-parser[all]"

# From source
git clone https://github.com/mdemyanov/tg-parser.git
cd tg-parser
uv sync --all-extras
```
## Quick Start

### 1. Export from Telegram Desktop
- Open Telegram Desktop
- Go to chat → ⋮ menu → Export chat history
- Select JSON format, uncheck media if not needed
- Export
### 2. Parse the export

```bash
# Basic parsing
tg-parser parse ./ChatExport/result.json -o ./output/

# Last 7 days only
tg-parser parse ./export.json --last-days 7

# Filter by sender
tg-parser parse ./export.json --senders "Иван Петров,Мария"

# Split forum by topics
tg-parser parse ./forum_export.json --split-topics

# Chunk for LLM context limits
tg-parser chunk ./export.json -s hybrid --max-tokens 8000

# Analyze mentions
tg-parser mentions ./export.json --format json

# Large files with streaming
tg-parser parse ./massive_export.json --streaming

# Get statistics
tg-parser stats ./export.json
```
### 3. Use with Claude
The output is optimized for LLM processing:
```markdown
# Chat: Development Team
**Period:** 2025-01-13 — 2025-01-19
**Participants:** Иван, Мария, Алексей

---

## 2025-01-15

### 10:30 — Иван Петров
Colleagues, we need to discuss the architecture of the new module.

### 10:35 — Мария Сидорова
@Алексей, please prepare the diagram by tomorrow.
```
## CLI Reference

### tg-parser parse
Main parsing command with filters.
```bash
tg-parser parse <input> [OPTIONS]

# Date filters
--date-from DATE         # Start date (YYYY-MM-DD)
--date-to DATE           # End date
--last-days N            # Last N days
--last-hours N           # Last N hours

# Sender filters
--senders TEXT           # Include senders (comma-separated)
--exclude-senders TEXT   # Exclude senders

# Topic filters (for forum groups)
--topics TEXT            # Include topics
--exclude-topics TEXT    # Exclude topics

# Content filters
--mentions TEXT          # Messages mentioning users
--contains REGEX         # Search pattern
--min-length N           # Minimum text length

# Type filters
--has-attachment         # Only with attachments
--has-reactions          # Only with reactions
--exclude-forwards       # Exclude forwarded
--include-service        # Include service messages

# Output
-o, --output PATH        # Output directory
-f, --format FORMAT      # markdown|json|csv
--split-topics           # Separate file per topic
```
### tg-parser chunk
Split parsed output for LLM context limits.
```bash
tg-parser chunk <input> [OPTIONS]

-s, --strategy STRATEGY   # fixed|conversation|topic|daily
--max-tokens N            # Max tokens per chunk (default: 3000)
--time-gap N              # Minutes gap to split (default: 30)
--preserve-threads        # Don't break reply chains
```
### tg-parser stats
Chat statistics overview.
```bash
tg-parser stats <input> [OPTIONS]

--format FORMAT    # table|json|markdown
--top-senders N    # Show top N senders
--by-topic         # Group by topic
--by-day           # Daily breakdown
```
## MCP Server
Use tg-parser directly in Claude Desktop or Claude Code.
### Setup

```bash
# Auto-configure (recommended)
tg-parser mcp-config --apply
```

Or manually add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "tg-parser": {
      "command": "uvx",
      "args": ["tg-parser", "mcp"]
    }
  }
}
```
### tg-parser mcp-config
Generate or apply MCP configuration for Claude Desktop/Code.
```bash
tg-parser mcp-config [OPTIONS]

# Print config to stdout (default)
tg-parser mcp-config

# Apply to Claude Desktop config
tg-parser mcp-config --apply

# Dry run - show what would be applied
tg-parser mcp-config --apply --dry-run

# Apply to Claude Code instead
tg-parser mcp-config --apply --target code

# Use 'uv run' instead of 'uvx'
tg-parser mcp-config --use-uv-run
```

Options:

```
--apply                  Apply config to Claude config file
--dry-run                Show what would be written without applying
--no-backup              Skip creating backup before modifying
--target [desktop|code]  Target application (default: desktop)
--use-uv-run             Use 'uv run' instead of 'uvx' for non-venv installs
-v, --verbose            Verbose output
```
### Available Tools

| Tool | Description | Status |
|---|---|---|
| `parse_telegram_export` | Parse JSON export with filters | ✅ |
| `chunk_telegram_export` | Split messages for LLM context | ✅ |
| `get_chat_statistics` | Get chat statistics (JSON) | ✅ |
| `list_chat_participants` | List participants with message counts | ✅ |
| `list_chat_topics` | List forum topics with message counts | ✅ |
| `list_mentioned_users` | Analyze @mentions frequency | ✅ |
### Example Usage in Claude

```
User: Parse my team chat from last week and summarize key decisions

Claude: I'll parse the export and prepare it for analysis.
[Uses parse_telegram_export tool with date_from filter]

Based on the parsed chat, here are the key decisions...
```
## Python API
```python
from tg_parser import parse_chat, ChatFilter
from tg_parser.domain.value_objects import FilterSpecification, DateRange
from datetime import datetime, timedelta

# Simple parsing
chat = parse_chat("./export.json")
print(f"Loaded {len(chat.messages)} messages")

# With filters
filter_spec = FilterSpecification(
    date_range=DateRange(
        start=datetime.now() - timedelta(days=7)
    ),
    senders=frozenset(["Иван Петров"]),
    exclude_service=True,
)
chat = parse_chat("./export.json", filter_spec=filter_spec)

# Access data
for topic in chat.topics.values():
    msgs = chat.messages_by_topic(topic.id)
    print(f"{topic.title}: {len(msgs)} messages")

# Chunking
from tg_parser.application.services.chunker import ConversationChunker

chunker = ConversationChunker(max_tokens=3000)
chunks = chunker.chunk(chat.messages)
```
## Output Formats

### Markdown (default)
Clean, human-readable format optimized for LLM comprehension.
### JSON
Structured format for programmatic processing:
```json
{
  "meta": {
    "chat_name": "Team Chat",
    "chat_type": "supergroup_forum",
    "statistics": {
      "total_messages": 127,
      "tokens_estimate": 15000
    }
  },
  "messages": [
    {
      "id": 1234,
      "timestamp": "2025-01-15T10:30:00Z",
      "author": "Иван Петров",
      "text": "...",
      "topic": "architecture"
    }
  ]
}
```
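Because the JSON output has a stable shape, it is easy to post-process. A minimal sketch, assuming the `meta`/`messages` structure shown above (`messages_per_author` is a hypothetical helper, not part of tg-parser):

```python
import json
from collections import Counter

def messages_per_author(raw: str) -> dict[str, int]:
    """Count messages per author in a tg-parser JSON export."""
    data = json.loads(raw)
    return dict(Counter(m["author"] for m in data["messages"]))

sample = """
{
  "meta": {"chat_name": "Team Chat"},
  "messages": [
    {"id": 1, "author": "Ivan", "text": "hi"},
    {"id": 2, "author": "Maria", "text": "hello"},
    {"id": 3, "author": "Ivan", "text": "meeting at 10"}
  ]
}
"""
print(messages_per_author(sample))  # {'Ivan': 2, 'Maria': 1}
```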
### CSV
Tabular format for spreadsheet analysis.
## Chunking Strategies

| Strategy | Description | Best For |
|---|---|---|
| `conversation` | Split by time gaps + size | General use (recommended) |
| `fixed` | Fixed token count | Simple cases |
| `topic` | One chunk per topic | Forum groups |
| `daily` | One chunk per day | Long time periods |
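The `conversation` strategy can be illustrated with a simplified sketch: start a new chunk whenever the time gap between messages or the token budget is exceeded. This is an approximation for illustration, not tg-parser's actual `ConversationChunker`:

```python
from datetime import datetime, timedelta

def chunk_by_conversation(
    messages: list[tuple[datetime, str]],
    max_tokens: int = 3000,
    time_gap: timedelta = timedelta(minutes=30),
) -> list[list[tuple[datetime, str]]]:
    """Group (timestamp, text) messages into conversation chunks."""
    est = lambda text: max(1, len(text) // 4)  # rough token estimate
    chunks: list[list[tuple[datetime, str]]] = []
    current: list[tuple[datetime, str]] = []
    tokens = 0
    prev_ts: datetime | None = None
    for ts, text in messages:
        gap_exceeded = prev_ts is not None and ts - prev_ts > time_gap
        if current and (gap_exceeded or tokens + est(text) > max_tokens):
            chunks.append(current)  # close the current conversation
            current, tokens = [], 0
        current.append((ts, text))
        tokens += est(text)
        prev_ts = ts
    if current:
        chunks.append(current)
    return chunks
```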
## Configuration
tg-parser supports TOML configuration files for setting default options.
### Config File Locations (priority order)

1. `--config PATH` CLI flag
2. `TG_PARSER_CONFIG` environment variable
3. `./tg-parser.toml` (current directory)
4. `./.tg-parser.toml` (current directory, hidden)
5. `~/tg-parser.toml` (home directory)
6. `~/.tg-parser.toml` (home directory, hidden)
7. `~/.config/tg-parser/config.toml` (XDG standard)
### Managing Config

```bash
# Create example config in current directory
tg-parser config init

# Create in specific location
tg-parser config init -o ~/.tg-parser.toml

# Show current effective config
tg-parser config show -v

# Show all search locations
tg-parser config path

# Use custom config for a command
tg-parser --config myconfig.toml parse export.json
```
### Config File Format

Create `~/.config/tg-parser/config.toml`:

```toml
[default]
output_format = "markdown"  # markdown, kb, json, csv
output_dir = "~/Documents/tg-exports"

[filtering]
exclude_service = true
exclude_empty = true
exclude_forwards = false
min_message_length = 0

[chunking]
strategy = "fixed"  # fixed, topic, hybrid
max_tokens = 8000

[output.markdown]
include_extraction_guide = false
no_frontmatter = false

[mentions]
min_count = 1
output_format = "table"  # table, json

[stats]
top_senders = 10
```
CLI arguments always override config file values.
## Development

```bash
# Clone and setup
git clone https://github.com/example/tg-parser
cd tg-parser
uv sync --all-extras

# Run tests
uv run pytest

# Type check
uv run pyright

# Lint and format
uv run ruff check --fix
uv run ruff format

# Run CLI in dev mode
uv run tg-parser parse ./test.json
```
## Architecture

Clean Architecture with clear separation:

```
presentation/  →  application/  →  domain/  ←  infrastructure/
  (CLI, MCP)      (use cases)     (entities)     (adapters)
```
## Documentation
- CLAUDE.md — AI assistant system prompt and development methodology
- docs/ARCHITECTURE.md — Clean Architecture layers, domain model, design decisions
- docs/DEVELOPMENT.md — Development guide, common tasks, testing guidelines
- docs/TELEGRAM_FORMAT.md — Telegram JSON export format specification
- PRD.md — Product requirements, roadmap, implementation status
- CHANGELOG.md — Version history and release notes
## Development Status
Current Version: 1.2.0 (Stable)
| Component | Status | Details |
|---|---|---|
| Core parsing | ✅ Complete | All chat types, topics, reactions |
| Filtering | ✅ Complete | 9 filter types |
| Chunking | ✅ Complete | 3 strategies (fixed, topic, hybrid) |
| Streaming | ✅ Complete | ijson reader, auto-detection >50MB |
| CLI | ✅ Complete | 7 commands: parse, stats, chunk, mentions, split-topics, mcp-config, config |
| MCP Server | ✅ Complete | 6 tools for Claude integration |
| Writers | ✅ Complete | Markdown, JSON, KB-template, CSV |
| Config | ✅ Complete | TOML config files, config command group |
| Tests | ✅ Complete | 413 tests, pyright strict |
| PyPI | ✅ Published | v1.2.0 available |
| CI/CD | ✅ Automated | GitHub Actions for testing & releases |
## Roadmap
- v1.0.0: ✅ RELEASED - Production stable, PyPI published, CI/CD automated
- v1.1.0: ✅ RELEASED - CSV output, split-topics command, tiktoken integration
- v1.2.0: ✅ RELEASED - TOML config file support, `config` command group
See PRD.md for detailed roadmap.
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Make changes with tests
4. Ensure `uv run pytest` and `uv run pyright` pass
5. Submit a PR
## License
MIT License - see LICENSE for details.
## Acknowledgments
- Telegram Desktop for export functionality
- Typer for CLI framework
- MCP for Claude integration