
tg-parser


Parse Telegram Desktop JSON exports for LLM processing.

Transform messy chat exports into clean, structured data ready for summarization, analysis, and artifact extraction with Claude or other LLMs.

Features

Implemented ✅ (v1.2.0)

  • 🗂️ All chat types: Personal, groups, supergroups, forum topics, channels
  • 🔍 Powerful filtering: 9 filter types (date, sender, content, topic, attachments, reactions, etc.)
  • ✂️ Smart chunking: 3 strategies (fixed, topic, hybrid) for LLM context limits
  • 🚀 Streaming: ijson-based reader for files >50MB with auto-detection
  • 📝 Multiple formats: Markdown (LLM-optimized), JSON, KB-template, CSV
  • 🔌 MCP integration: 6 tools for Claude Desktop/Code
  • 📊 Statistics: Message counts, top senders, topics breakdown, mention analysis
  • 🎯 tiktoken integration: Accurate token counting (with SimpleTokenCounter fallback)
  • 📄 split-topics command: Split forum chats by topic into separate files
  • Type-safe: pyright strict mode, 413 comprehensive tests
  • 🔧 mcp-config command: Auto-configure Claude Desktop/Code MCP integration
  • 🆕 Config file support: TOML configuration with config command group
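
The tiktoken integration with a fallback, as listed above, can be sketched roughly like this. This is an illustrative assumption, not tg-parser's actual `SimpleTokenCounter`; the ~4-characters-per-token heuristic is a common rule of thumb, not the library's documented behavior:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when available, else approximate."""
    try:
        import tiktoken  # optional extra: pip install "tg-parser[all]"
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except Exception:  # tiktoken missing or encoding unavailable
        # Crude heuristic: roughly 4 characters per token for English text
        return max(1, len(text) // 4)
```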

Installation

# From PyPI (recommended)
pip install tg-parser

# With uv
uv tool install tg-parser

# With all extras (MCP, tiktoken, streaming)
pip install "tg-parser[all]"

# From source
git clone https://github.com/mdemyanov/tg-parser.git
cd tg-parser
uv sync --all-extras

Quick Start

1. Export from Telegram Desktop

  1. Open Telegram Desktop
  2. Go to chat → ⋮ menu → Export chat history
  3. Select JSON format, uncheck media if not needed
  4. Export

2. Parse the export

# Basic parsing
tg-parser parse ./ChatExport/result.json -o ./output/

# Last 7 days only
tg-parser parse ./export.json --last-days 7

# Filter by sender
tg-parser parse ./export.json --senders "Иван Петров,Мария"

# Split forum by topics
tg-parser parse ./forum_export.json --split-topics

# Chunk for LLM context limits
tg-parser chunk ./export.json -s hybrid --max-tokens 8000

# Analyze mentions
tg-parser mentions ./export.json --format json

# Large files with streaming
tg-parser parse ./massive_export.json --streaming

# Get statistics
tg-parser stats ./export.json
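
The `--streaming` auto-detection for large files could look roughly like this. The 50 MB threshold comes from the feature list above; the helper name and signature are hypothetical, not tg-parser's real API:

```python
import os

# Hypothetical sketch: exports over ~50 MB would be routed to an
# ijson-based streaming reader instead of loading the whole file.
STREAMING_THRESHOLD = 50 * 1024 * 1024  # 50 MB, per the feature list

def should_stream(path: str, force: bool = False) -> bool:
    """Return True when the export should be read incrementally."""
    return force or os.path.getsize(path) >= STREAMING_THRESHOLD
```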

3. Use with Claude

The output is optimized for LLM processing:

# Chat: Development Team
**Period:** 2025-01-13 — 2025-01-19  
**Participants:** Иван, Мария, Алексей

---

## 2025-01-15

### 10:30 — Иван Петров
Colleagues, we need to discuss the architecture of the new module.

### 10:35 — Мария Сидорова
@Алексей, please prepare the diagram by tomorrow.

CLI Reference

tg-parser parse

Main parsing command with filters.

tg-parser parse <input> [OPTIONS]

# Date filters
--date-from DATE        # Start date (YYYY-MM-DD)
--date-to DATE          # End date
--last-days N           # Last N days
--last-hours N          # Last N hours

# Sender filters
--senders TEXT          # Include senders (comma-separated)
--exclude-senders TEXT  # Exclude senders

# Topic filters (for forum groups)
--topics TEXT           # Include topics
--exclude-topics TEXT   # Exclude topics

# Content filters
--mentions TEXT         # Messages mentioning users
--contains REGEX        # Search pattern
--min-length N          # Minimum text length

# Type filters
--has-attachment        # Only with attachments
--has-reactions         # Only with reactions
--exclude-forwards      # Exclude forwarded
--include-service       # Include service messages

# Output
-o, --output PATH       # Output directory
-f, --format FORMAT     # markdown|json|csv
--split-topics          # Separate file per topic

tg-parser chunk

Split parsed output for LLM context limits.

tg-parser chunk <input> [OPTIONS]

-s, --strategy STRATEGY  # fixed|conversation|topic|daily
--max-tokens N           # Max tokens per chunk (default: 3000)
--time-gap N             # Minutes gap to split (default: 30)
--preserve-threads       # Don't break reply chains

tg-parser stats

Chat statistics overview.

tg-parser stats <input> [OPTIONS]

--format FORMAT          # table|json|markdown
--top-senders N          # Show top N senders
--by-topic               # Group by topic
--by-day                 # Daily breakdown

MCP Server

Use tg-parser directly in Claude Desktop or Claude Code.

Setup

# Auto-configure (recommended)
tg-parser mcp-config --apply

# Or manually add to claude_desktop_config.json:
{
  "mcpServers": {
    "tg-parser": {
      "command": "uvx",
      "args": ["tg-parser", "mcp"]
    }
  }
}

tg-parser mcp-config

Generate or apply MCP configuration for Claude Desktop/Code.

tg-parser mcp-config [OPTIONS]

# Print config to stdout (default)
tg-parser mcp-config

# Apply to Claude Desktop config
tg-parser mcp-config --apply

# Dry run - show what would be applied
tg-parser mcp-config --apply --dry-run

# Apply to Claude Code instead
tg-parser mcp-config --apply --target code

# Use 'uv run' instead of 'uvx'
tg-parser mcp-config --use-uv-run

Options:
  --apply               Apply config to Claude config file
  --dry-run             Show what would be written without applying
  --no-backup           Skip creating backup before modifying
  --target [desktop|code]  Target application (default: desktop)
  --use-uv-run          Use 'uv run' instead of 'uvx' for non-venv installs
  -v, --verbose         Verbose output

Available Tools

| Tool | Description |
|------|-------------|
| `parse_telegram_export` | Parse JSON export with filters |
| `chunk_telegram_export` | Split messages for LLM context |
| `get_chat_statistics` | Get chat statistics (JSON) |
| `list_chat_participants` | List participants with message counts |
| `list_chat_topics` | List forum topics with message counts |
| `list_mentioned_users` | Analyze @mentions frequency |
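
A minimal regex-based sketch of what mention-frequency analysis like `list_mentioned_users` involves (a simplification, not the actual implementation; display-name mentions without `@` are not handled):

```python
import re
from collections import Counter

# Count @username occurrences across message texts. \w+ covers
# letters, digits, and underscores (including Cyrillic in Python).
MENTION_RE = re.compile(r"@(\w+)")

def count_mentions(texts: list[str]) -> Counter[str]:
    counts: Counter[str] = Counter()
    for text in texts:
        counts.update(MENTION_RE.findall(text))
    return counts
```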

Example Usage in Claude

User: Parse my team chat from last week and summarize key decisions

Claude: I'll parse the export and prepare it for analysis.
[Uses parse_telegram_export tool with date_from filter]

Based on the parsed chat, here are the key decisions...

Python API

from tg_parser import parse_chat, ChatFilter
from tg_parser.domain.value_objects import FilterSpecification, DateRange
from datetime import datetime, timedelta

# Simple parsing
chat = parse_chat("./export.json")
print(f"Loaded {len(chat.messages)} messages")

# With filters
filter_spec = FilterSpecification(
    date_range=DateRange(
        start=datetime.now() - timedelta(days=7)
    ),
    senders=frozenset(["Иван Петров"]),
    exclude_service=True,
)
chat = parse_chat("./export.json", filter_spec=filter_spec)

# Access data
for topic in chat.topics.values():
    msgs = chat.messages_by_topic(topic.id)
    print(f"{topic.title}: {len(msgs)} messages")

# Chunking
from tg_parser.application.services.chunker import ConversationChunker

chunker = ConversationChunker(max_tokens=3000)
chunks = chunker.chunk(chat.messages)

Output Formats

Markdown (default)

Clean, human-readable format optimized for LLM comprehension.

JSON

Structured format for programmatic processing:

{
  "meta": {
    "chat_name": "Team Chat",
    "chat_type": "supergroup_forum",
    "statistics": {
      "total_messages": 127,
      "tokens_estimate": 15000
    }
  },
  "messages": [
    {
      "id": 1234,
      "timestamp": "2025-01-15T10:30:00Z",
      "author": "Иван Петров",
      "text": "...",
      "topic": "architecture"
    }
  ]
}
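
The structure above is straightforward to consume programmatically. For example, counting messages per author (field names taken from the sample; the helper itself is hypothetical):

```python
import json

# Load a tg-parser JSON export and tally messages per author.
def messages_per_author(json_text: str) -> dict[str, int]:
    data = json.loads(json_text)
    counts: dict[str, int] = {}
    for msg in data["messages"]:
        counts[msg["author"]] = counts.get(msg["author"], 0) + 1
    return counts
```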

CSV

Tabular format for spreadsheet analysis.

Chunking Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| `conversation` | Split by time gaps + size | General use (recommended) |
| `fixed` | Fixed token count | Simple cases |
| `topic` | One chunk per topic | Forum groups |
| `daily` | One chunk per day | Long time periods |
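
The conversation strategy (time gaps + size) can be sketched as below. This is a simplification with assumed field names and a crude token estimate; the real chunker's boundary logic, token accounting, and thread preservation are richer:

```python
from datetime import datetime, timedelta

# Start a new chunk when the time gap exceeds a threshold or the
# chunk would exceed the token budget. Token cost is a rough len//4
# estimate, not real tokenization.
def chunk_conversation(messages: list[dict], max_tokens: int = 3000,
                       time_gap_min: int = 30) -> list[list[dict]]:
    chunks: list[list[dict]] = []
    current: list[dict] = []
    tokens = 0
    prev_ts: datetime | None = None
    for msg in messages:
        cost = max(1, len(msg.get("text", "")) // 4)
        gap = (prev_ts is not None and
               msg["timestamp"] - prev_ts > timedelta(minutes=time_gap_min))
        if current and (gap or tokens + cost > max_tokens):
            chunks.append(current)
            current, tokens = [], 0
        current.append(msg)
        tokens += cost
        prev_ts = msg["timestamp"]
    if current:
        chunks.append(current)
    return chunks
```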

Configuration

tg-parser supports TOML configuration files for setting default options.

Config File Locations (priority order)

  1. --config PATH CLI flag
  2. TG_PARSER_CONFIG environment variable
  3. ./tg-parser.toml (current directory)
  4. ./.tg-parser.toml (current directory, hidden)
  5. ~/tg-parser.toml (home directory)
  6. ~/.tg-parser.toml (home directory, hidden)
  7. ~/.config/tg-parser/config.toml (XDG standard)

Managing Config

# Create example config in current directory
tg-parser config init

# Create in specific location
tg-parser config init -o ~/.tg-parser.toml

# Show current effective config
tg-parser config show -v

# Show all search locations
tg-parser config path

# Use custom config for a command
tg-parser --config myconfig.toml parse export.json

Config File Format

Create ~/.config/tg-parser/config.toml:

[default]
output_format = "markdown"   # markdown, kb, json, csv
output_dir = "~/Documents/tg-exports"

[filtering]
exclude_service = true
exclude_empty = true
exclude_forwards = false
min_message_length = 0

[chunking]
strategy = "fixed"           # fixed, topic, hybrid
max_tokens = 8000

[output.markdown]
include_extraction_guide = false
no_frontmatter = false

[mentions]
min_count = 1
output_format = "table"      # table, json

[stats]
top_senders = 10

CLI arguments always override config file values.
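
That precedence rule can be pictured as a simple merge in which only CLI options the user actually passed win over config file defaults (a hypothetical helper, not tg-parser's code):

```python
# Config file values supply defaults; any CLI option that was
# explicitly passed (i.e. not None) overrides them.
def effective_options(config: dict, cli: dict) -> dict:
    merged = dict(config)
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged
```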

Development

# Clone and setup
git clone https://github.com/mdemyanov/tg-parser.git
cd tg-parser
uv sync --all-extras

# Run tests
uv run pytest

# Type check
uv run pyright

# Lint and format
uv run ruff check --fix
uv run ruff format

# Run CLI in dev mode
uv run tg-parser parse ./test.json

Architecture

Clean Architecture with clear separation:

presentation/  →  application/  →  domain/  ←  infrastructure/
   (CLI, MCP)     (use cases)    (entities)    (adapters)
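
In miniature, the dependency rule above (outer layers point inward; infrastructure adapts to domain ports) might look like this. All names here are hypothetical illustrations, not tg-parser's actual classes:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)          # domain/: entity
class Message:
    id: int
    text: str

class ChatReader(Protocol):      # domain/: port
    def read(self) -> list[Message]: ...

class InMemoryReader:            # infrastructure/: adapter
    def __init__(self, messages: list[Message]) -> None:
        self._messages = messages
    def read(self) -> list[Message]:
        return list(self._messages)

def count_messages(reader: ChatReader) -> int:  # application/: use case
    return len(reader.read())
```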

Documentation

Development Status

Current Version: 1.2.0 (Stable)

| Component | Status | Details |
|-----------|--------|---------|
| Core parsing | ✅ Complete | All chat types, topics, reactions |
| Filtering | ✅ Complete | 9 filter types |
| Chunking | ✅ Complete | 3 strategies (fixed, topic, hybrid) |
| Streaming | ✅ Complete | ijson reader, auto-detection >50MB |
| CLI | ✅ Complete | 7 commands: parse, stats, chunk, mentions, split-topics, mcp-config, config |
| MCP Server | ✅ Complete | 6 tools for Claude integration |
| Writers | ✅ Complete | Markdown, JSON, KB-template, CSV |
| Config | ✅ Complete | TOML config files, config command group |
| Tests | ✅ Complete | 413 tests, pyright strict |
| PyPI | ✅ Published | v1.2.0 available |
| CI/CD | ✅ Automated | GitHub Actions for testing & releases |

Roadmap

  • v1.0.0: ✅ RELEASED - Production stable, PyPI published, CI/CD automated
  • v1.1.0: ✅ RELEASED - CSV output, split-topics command, tiktoken integration
  • v1.2.0: ✅ RELEASED - TOML config file support, config command group

See PRD.md for detailed roadmap.

Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Make changes with tests
  4. Ensure uv run pytest and uv run pyright pass
  5. Submit PR

License

MIT License - see LICENSE for details.

Acknowledgments
