
tg-parser


Parse Telegram Desktop JSON exports for LLM processing.

Transform messy chat exports into clean, structured data ready for summarization, analysis, and artifact extraction with Claude or other LLMs.

Features

Implemented ✅ (v1.0.0)

  • 🗂️ All chat types: Personal, groups, supergroups, forum topics, channels
  • 🔍 Powerful filtering: 9 filter types (date, sender, content, topic, attachments, reactions, etc.)
  • ✂️ Smart chunking: 4 strategies (fixed, conversation, topic, daily) for LLM context limits
  • 🚀 Streaming: ijson-based reader for files >50MB with auto-detection
  • 📝 Multiple formats: Markdown (LLM-optimized), JSON, KB-template with YAML frontmatter
  • 🔌 MCP integration: 6 tools for Claude Desktop/Code
  • 📊 Statistics: Message counts, top senders, topics breakdown, mention analysis
  • Type-safe: pyright strict mode, 261 comprehensive tests

Coming Soon 🚧

  • 📄 CSV export: Tabular output format (P2)
  • 🔧 Config files: TOML configuration support (P3)
  • 🎯 tiktoken: Accurate token counting (P2)

Installation

# From PyPI (recommended)
pip install tg-parser

# With uv
uv tool install tg-parser

# With all extras (MCP, tiktoken, streaming)
pip install "tg-parser[all]"

# From source
git clone https://github.com/mdemyanov/tg-parser.git
cd tg-parser
uv sync --all-extras

Quick Start

1. Export from Telegram Desktop

  1. Open Telegram Desktop
  2. Go to chat → ⋮ menu → Export chat history
  3. Select JSON format, uncheck media if not needed
  4. Export

2. Parse the export

# Basic parsing
tg-parser parse ./ChatExport/result.json -o ./output/

# Last 7 days only
tg-parser parse ./export.json --last-days 7

# Filter by sender
tg-parser parse ./export.json --senders "Иван Петров,Мария"

# Split forum by topics
tg-parser parse ./forum_export.json --split-topics

# Chunk for LLM context limits
tg-parser chunk ./export.json -s conversation --max-tokens 8000

# Analyze mentions
tg-parser mentions ./export.json --format json

# Large files with streaming
tg-parser parse ./massive_export.json --streaming

# Get statistics
tg-parser stats ./export.json

3. Use with Claude

The output is optimized for LLM processing:

# Chat: Команда разработки
**Period:** 2025-01-13 — 2025-01-19  
**Participants:** Иван, Мария, Алексей

---

## 2025-01-15

### 10:30 — Иван Петров
Colleagues, we need to discuss the architecture of the new module.

### 10:35 — Мария Сидорова
@Алексей, please prepare the diagram by tomorrow.

CLI Reference

tg-parser parse

Main parsing command with filters.

tg-parser parse <input> [OPTIONS]

# Date filters
--date-from DATE        # Start date (YYYY-MM-DD)
--date-to DATE          # End date
--last-days N           # Last N days
--last-hours N          # Last N hours

# Sender filters
--senders TEXT          # Include senders (comma-separated)
--exclude-senders TEXT  # Exclude senders

# Topic filters (for forum groups)
--topics TEXT           # Include topics
--exclude-topics TEXT   # Exclude topics

# Content filters
--mentions TEXT         # Messages mentioning users
--contains REGEX        # Search pattern
--min-length N          # Minimum text length

# Type filters
--has-attachment        # Only with attachments
--has-reactions         # Only with reactions
--exclude-forwards      # Exclude forwarded
--include-service       # Include service messages

# Output
-o, --output PATH       # Output directory
-f, --format FORMAT     # markdown|json|csv
--split-topics          # Separate file per topic
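
All filters combine with AND semantics. As an illustration of that behavior (a minimal sketch over hypothetical plain-dict messages, not the library's internal types):

```python
from datetime import datetime

# Hypothetical message records; the real parser uses typed entities.
messages = [
    {"author": "Alice", "date": datetime(2025, 1, 10, 9, 0), "text": "kickoff"},
    {"author": "Bob", "date": datetime(2025, 1, 18, 14, 0), "text": "status update"},
    {"author": "Alice", "date": datetime(2025, 1, 19, 10, 0), "text": "review notes"},
]

def apply_filters(msgs, date_from=None, senders=None, min_length=0):
    """Keep messages matching all given filters (AND semantics)."""
    out = []
    for m in msgs:
        if date_from and m["date"] < date_from:
            continue
        if senders and m["author"] not in senders:
            continue
        if len(m["text"]) < min_length:
            continue
        out.append(m)
    return out

recent_by_alice = apply_filters(
    messages, date_from=datetime(2025, 1, 15), senders={"Alice"}
)
print([m["text"] for m in recent_by_alice])  # ['review notes']
```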

tg-parser chunk

Split parsed output for LLM context limits.

tg-parser chunk <input> [OPTIONS]

-s, --strategy STRATEGY  # fixed|conversation|topic|daily
--max-tokens N           # Max tokens per chunk (default: 3000)
--time-gap N             # Minutes gap to split (default: 30)
--preserve-threads       # Don't break reply chains

tg-parser stats

Chat statistics overview.

tg-parser stats <input> [OPTIONS]

--format FORMAT          # table|json|markdown
--top-senders N          # Show top N senders
--by-topic               # Group by topic
--by-day                 # Daily breakdown
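
Conceptually, the top-senders breakdown is a frequency count over message authors. A minimal sketch with hypothetical plain-dict messages (not the tool's actual implementation):

```python
from collections import Counter

# Hypothetical parsed messages.
messages = [
    {"author": "Ivan"}, {"author": "Maria"},
    {"author": "Ivan"}, {"author": "Ivan"}, {"author": "Maria"},
]

# Equivalent of --top-senders 2: count per author, keep the two largest.
top_senders = Counter(m["author"] for m in messages).most_common(2)
print(top_senders)  # [('Ivan', 3), ('Maria', 2)]
```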

MCP Server

Use tg-parser directly in Claude Desktop or Claude Code.

Setup

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "tg-parser": {
      "command": "uvx",
      "args": ["tg-parser", "mcp"]
    }
  }
}

Available Tools

| Tool | Description |
| --- | --- |
| `parse_telegram_export` | Parse JSON export with filters |
| `chunk_telegram_export` | Split messages for LLM context |
| `get_chat_statistics` | Get chat statistics (JSON) |
| `list_chat_participants` | List participants with message counts |
| `list_chat_topics` | List forum topics with message counts |
| `list_mentioned_users` | Analyze @mentions frequency |
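
At its core, the mention analysis behind `list_mentioned_users` amounts to counting `@username` tokens. A rough stand-alone sketch (not the server's actual code; the regex is a simplification):

```python
import re
from collections import Counter

# Simplified: real Telegram usernames allow a narrower character set.
MENTION = re.compile(r"@(\w+)")

texts = [
    "@alexey please prepare the diagram",
    "ping @alexey and @maria",
]

mention_counts = Counter(
    name for text in texts for name in MENTION.findall(text)
)
print(mention_counts.most_common())  # [('alexey', 2), ('maria', 1)]
```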

Example Usage in Claude

User: Parse my team chat from last week and summarize key decisions

Claude: I'll parse the export and prepare it for analysis.
[Uses parse_telegram_export tool with date_from filter]

Based on the parsed chat, here are the key decisions...

Python API

from tg_parser import parse_chat
from tg_parser.domain.value_objects import FilterSpecification, DateRange
from datetime import datetime, timedelta

# Simple parsing
chat = parse_chat("./export.json")
print(f"Loaded {len(chat.messages)} messages")

# With filters
filter_spec = FilterSpecification(
    date_range=DateRange(
        start=datetime.now() - timedelta(days=7)
    ),
    senders=frozenset(["Иван Петров"]),
    exclude_service=True,
)
chat = parse_chat("./export.json", filter_spec=filter_spec)

# Access data
for topic in chat.topics.values():
    msgs = chat.messages_by_topic(topic.id)
    print(f"{topic.title}: {len(msgs)} messages")

# Chunking
from tg_parser.application.services.chunker import ConversationChunker

chunker = ConversationChunker(max_tokens=3000)
chunks = chunker.chunk(chat.messages)

Output Formats

Markdown (default)

Clean, human-readable format optimized for LLM comprehension.

JSON

Structured format for programmatic processing:

{
  "meta": {
    "chat_name": "Team Chat",
    "chat_type": "supergroup_forum",
    "statistics": {
      "total_messages": 127,
      "tokens_estimate": 15000
    }
  },
  "messages": [
    {
      "id": 1234,
      "timestamp": "2025-01-15T10:30:00Z",
      "author": "Иван Петров",
      "text": "...",
      "topic": "architecture"
    }
  ]
}
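
Because the JSON output is plain data, post-processing needs nothing beyond the standard library. For example, grouping authors by topic (field names taken from the sample above, parsed here from an inline string):

```python
import json

# Inline sample matching the schema shown above.
doc = json.loads("""
{
  "meta": {"chat_name": "Team Chat", "statistics": {"total_messages": 2}},
  "messages": [
    {"id": 1, "author": "Ivan", "text": "hello", "topic": "general"},
    {"id": 2, "author": "Maria", "text": "hi", "topic": "general"}
  ]
}
""")

by_topic = {}
for m in doc["messages"]:
    by_topic.setdefault(m["topic"], []).append(m["author"])
print(by_topic)  # {'general': ['Ivan', 'Maria']}
```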

CSV

Tabular format for spreadsheet analysis.

Chunking Strategies

| Strategy | Description | Best for |
| --- | --- | --- |
| `conversation` | Split by time gaps + size | General use (recommended) |
| `fixed` | Fixed token count | Simple cases |
| `topic` | One chunk per topic | Forum groups |
| `daily` | One chunk per day | Long time periods |
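
The conversation strategy (start a new chunk on a long silence or when the token budget is exceeded) can be sketched roughly as below. The token estimate and data shapes are illustrative assumptions; the real chunker's API differs:

```python
from datetime import datetime, timedelta

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def chunk_by_conversation(messages, max_tokens=3000, time_gap=timedelta(minutes=30)):
    """Start a new chunk on a long silence or when the token budget is hit."""
    chunks, current, budget = [], [], 0
    prev_time = None
    for m in messages:
        gap_exceeded = prev_time is not None and m["date"] - prev_time > time_gap
        tokens = estimate_tokens(m["text"])
        if current and (gap_exceeded or budget + tokens > max_tokens):
            chunks.append(current)
            current, budget = [], 0
        current.append(m)
        budget += tokens
        prev_time = m["date"]
    if current:
        chunks.append(current)
    return chunks

msgs = [
    {"date": datetime(2025, 1, 15, 10, 0), "text": "morning standup"},
    {"date": datetime(2025, 1, 15, 10, 5), "text": "notes"},
    {"date": datetime(2025, 1, 15, 14, 0), "text": "afternoon thread"},  # >30 min gap
]
print([len(c) for c in chunk_by_conversation(msgs)])  # [2, 1]
```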

Configuration

Create ~/.config/tg-parser/config.toml:

[default]
output_format = "markdown"
output_dir = "~/Documents/tg-exports"

[filtering]
exclude_service = true
min_message_length = 0

[chunking]
strategy = "conversation"
max_tokens = 3000
time_gap_minutes = 30

[token_counter]
backend = "tiktoken"  # or "simple" for no deps

Development

# Clone and setup
git clone https://github.com/mdemyanov/tg-parser.git
cd tg-parser
uv sync --all-extras

# Run tests
uv run pytest

# Type check
uv run pyright

# Lint and format
uv run ruff check --fix
uv run ruff format

# Run CLI in dev mode
uv run tg-parser parse ./test.json

Architecture

Clean Architecture with clear separation:

presentation/  →  application/  →  domain/  ←  infrastructure/
   (CLI, MCP)     (use cases)    (entities)    (adapters)

Documentation

Development Status

Current Version: 1.0.0 (Stable)

| Component | Status | Details |
| --- | --- | --- |
| Core parsing | ✅ Complete | All chat types, topics, reactions |
| Filtering | ✅ Complete | 9 filter types |
| Chunking | ✅ Complete | 4 strategies (fixed, conversation, topic, daily) |
| Streaming | ✅ Complete | ijson reader, auto-detection >50MB |
| CLI | ✅ Complete | 4 commands: parse, stats, chunk, mentions |
| MCP server | ✅ Complete | 6 tools for Claude integration |
| Writers | ✅ Complete | Markdown, JSON, KB-template |
| Tests | ✅ Complete | 261 tests, pyright strict |
| PyPI | ✅ Published | v1.0.0 available |
| CI/CD | ✅ Automated | GitHub Actions for testing & releases |

Roadmap

  • v1.0.0: ✅ RELEASED - Production stable, PyPI published, CI/CD automated
  • v1.1.0: CSV output, split-topics command, tiktoken integration

See PRD.md for detailed roadmap.

Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Make changes with tests
  4. Ensure uv run pytest and uv run pyright pass
  5. Submit PR

License

MIT License - see LICENSE for details.

Acknowledgments
