A tool to process Telegram chat exports into an AI-friendly format, inspired by Repomix.
Project description
💬 TGMix
TGMix is a powerful tool that processes your Telegram chat export into an AI-friendly dataset. Perfect for feeding the full context of long and complex conversations to Large Language Models (LLMs) like Claude, Gemini, GPT-4o, and more. Inspired by Repomix.
🛠️ Beta Version Note
Please note that TGMix is currently in a beta phase. This means that while the core features are functional, you may encounter occasional bugs or unexpected behavior with certain edge cases.
Your feedback is invaluable during this stage. If you find any issues, please report them on GitHub Issues.
Core features
- Save Costs & Time: Processing a chat history that's up to 3x smaller in token count directly translates to lower API costs for paid models (like GPT-4o and Claude 3 Opus) and significantly faster response times.
- Fit More Data into Context: The token reduction is crucial for fitting large chat histories into a model's limited context window — something that is frequently impossible with raw Telegram exports.
- Higher Quality Analysis: By stitching fragmented messages, you provide the LLM with a more natural and complete context. This prevents misinterpretations and leads to more accurate and insightful summaries, analyses, or role-playing sessions.
- Efficient Output Format: TGMix uses the TOON format instead of JSON. This provides a 30-60% token reduction(as documentation says) over standard JSON, directly cutting API costs and speeding up LLM processing. Benchmarks also show that models retrieve data more accurately from TOON.
- Flexible Anonymization: Automatically mask sensitive data like phone numbers, emails, author names, and custom patterns to protect privacy before sharing chat data.
- Data for RAG & Fine-Tuning: The clean, structured output is a perfect dataset for advanced applications. Use it to build a knowledge base for Retrieval-Augmented Generation (RAG) or to fine-tune a custom model on a specific person's conversational style.
Roadmap
The development of TGMix is planned in stages. Here is what's available now and what to expect in the future.
Current Version
- Significant Token Reduction: By simplifying the structure and removing redundant metadata from the original Telegram export, TGMix reduces the final token count by up to 3 times.
- Message Stitching: Automatically combines messages sent by the same user in quick succession into a single, coherent entry.
- Media Marking: Uses MarkMyMedia-LLM to automatically add filenames to media like videos and voice messages, improving context for AI analysis.
- AI-Ready Toon Output: Produces a single, clean
tgmix_output.toon.txtfile in the TOON (Token-Oriented Object Notation) format, which is more token-efficient than JSON and specifically designed for LLMs. - Advanced Anonymization: A flexible system for masking sensitive data.
Planned for Future Releases
- Advanced Media Processing: Optional conversion of voice/video messages into text via transcription.
- Official Package Manager Support: Easy installation via AUR.
Requirements
- Python 3.10+
Installation
Via PyPI
pip install -U tgmix
From GitHub (For development)
Install tgmix directly from this repository:
pip install -U git+https://github.com/damnkrat/tgmix.git
How to Use
Step 1: Export Your Telegram Chat
- Open Telegram Desktop.
- Go to the chat you want to export.
- Click the three dots (⋮) in the top-right corner and select Export chat history.
- Crucially, in the export settings:
- Set the format to "Machine-readable JSON".
- Choose a date range and media settings as desired.
- Let the export process complete. You will get a folder containing a
result.jsonfile and media subfolders.
Step 2: Run TGMix
-
Navigate to your exported chat directory in your terminal.
cd path/to/your/telegram_export
-
(Optional) Create a local configuration file.
tgmix --initThis will create a
tgmix_config.jsonfile. You can edit it if your export has non-standard file names. -
Run the processor.
tgmix
To enable anonymization, add the
-aflag:tgmix -a
Step 3: Use the Output
Once finished, you will find:
tgmix_output.toon.txt: The final, processed file in Toon format, ready for your LLM.tgmix_media/: A new folder containing all processed and copied media files.
Configuration
You can control TGMix's behavior by editing the tgmix_config.json file. Create one in your project directory by running tgmix --init.
Anonymization Settings
The core of the new functionality lies in the anonymization settings. Here is an example of a fully configured setup:
{
"export_json_file": "result.json",
"media_output_dir": "tgmix_media",
"final_output_file": "tgmix_output.toon.txt",
"anonymize": true,
"default_phone_region": "RU",
"mask_presets": {
"phone": "[PHONE]",
"email": "[EMAIL]",
"authors": "[AUTHOR]"
},
"mask_literals": {
"Project Capture": "[SECRET_PROJECT]"
},
"mask_regex": {
"\\b\\d{16}\\b": "[CARD_NUMBER]"
}
}
"anonymize":trueorfalse. The master switch for the anonymization feature."default_phone_region": A two-letter country code (e.g., "RU", "US", "GB") to help detect local phone numbers that don't have a+prefix. This is a tradeoff for accuracy."mask_presets": A dictionary of built-in, ready-to-use masking rules. The key is the preset name, and the value is the placeholder text that will replace the found data. Available presets are:"phone": Finds and replaces phone numbers. It intelligently detects both international formats (e.g.,+1-541-754-3010) and local formats based on the"default_phone_region"(e.g.,8 (999) 123-45-67)."email": Finds and replaces email addresses using a robust, tested regular expression."authors": This is a special preset that anonymizes the authors themselves. It modifies theauthor_mapin the final output, replacing the author's name with the provided template, and a unique number (e.g.,[AUTHOR_1],[AUTHOR_2]).
"mask_literals": A dictionary for replacing exact, case-insensitive phrases. The key is the phrase to find, and the value is its replacement. Perfect for redacting names, project titles, or other specific keywords."mask_regex": A dictionary for replacing content based on regular expressions. The key is the regex pattern, and the value is its replacement. This gives you the power to mask any custom data format, like ID numbers, bank accounts, or tracking codes.Important: In JSON format, the backslash
\is an escape character. Therefore, you must escape your backslashes in regex patterns. For example, to match a digit (\d), you must write it as\\din thetgmix_config.jsonfile.
Command-Line Overrides and Examples
While tgmix_config.json is great for setting up your default anonymization rules, you can easily override or supplement them for a specific run using command-line flags. CLI flags always take precedence over the configuration file.
Command-Line Usage
You can control TGMix directly from the command line. Flags and options provided here will always take precedence over the settings in tgmix_config.json.
Options
path(Positional Argument) The path to the directory containing the Telegram export. If omitted, TGMix will process the current directory.--initCreates atgmix_config.jsonconfiguration file in the current directory from a built-in template.--versionDisplays the installed version of TGMix and exits.-a,--anonymizeEnables the anonymization feature for the current run.--no-statsDisables the computation and printing of processing statistics at the end of the run.--no-mark-mediaDisables media marking. Files will be copied to the output directory without changes.--mask-preset <preset1> <preset2> ...Overrides the list of active presets from the config file. Only the presets you list here will be used.--mask-literal "phrase:replacement"Overrides themask_literalsdictionary from your config file. You can provide multiple rules.--mask-regex "pattern:replacement"Overrides themask_regexdictionary from your config file. You can provide multiple rules.
Example 1: Enabling Anonymization with Default Rules
If your "anonymize" key is set to false in the config, you can easily enable it for a single run:
tgmix --anonymize
- What it does: This command enables the anonymization feature and uses all the rules exactly as they are defined in your
tgmix_config.jsonfile.
Example 2: Using Only Specific Presets
Imagine you only want to anonymize authors for a particular analysis, ignoring other presets from your config.
tgmix --anonymize --mask-preset authors
- What it does: This command overrides the
mask_presetsfrom your config. It will only anonymize author names in theauthor_map, ignoring thephoneandemailpresets even if they are present in the JSON file. - Note: Authors are not anonymized in message text. The reason is author name can be, for example, just "a". It will just break most messages in the export. To anonymize it in message text, you need to add it to
mask_literalsormask_regex.
Example 3: Adding a One-Time Redaction Rule
You need to redact a new project name just once, without permanently editing your config file.
tgmix --anonymize --mask-literal "Project Capture:[SECRET_PROJECT]"
- What it does: This command uses all the default rules from your config (
presets,regex) but overrides themask_literalsdictionary, adding a new, temporary rule to mask "Project Hydra".
Example 4: The Power User Command
This command demonstrates full control from the CLI, overriding all rule types for a highly specific task.
tgmix --anonymize \
--mask-preset email phone \
--mask-literal "damnkrat:[USER_A]" "secret password:[REDACTED]" \
--mask-regex "\b\d{4}-\d{4}\b:[ID_CODE]"
- What it does:
--anonymize: Enables the feature.--mask-preset email phone: Uses only email and phone presets.--mask-literal ...: Overrides all literals from the config and uses only the two provided here.--mask-regex ...: Overrides all regex rules from the config and uses only the one for a custom ID code.
License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tgmix-0.9.1.tar.gz.
File metadata
- Download URL: tgmix-0.9.1.tar.gz
- Upload date:
- Size: 58.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
92db074237455ac8b59b8045cc8d093aa621f3ed8ca1c8afe69e356b449167fa
|
|
| MD5 |
02705b58705cffaba879711a0c92f695
|
|
| BLAKE2b-256 |
fa85db1553d06d49eb5ec863837fe85e7d8fb3104bda23d0ae35e61b5ef26a47
|
Provenance
The following attestation bundles were made for tgmix-0.9.1.tar.gz:
Publisher:
python-publish.yml on damnkrat/tgmix
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tgmix-0.9.1.tar.gz -
Subject digest:
92db074237455ac8b59b8045cc8d093aa621f3ed8ca1c8afe69e356b449167fa - Sigstore transparency entry: 685038759
- Sigstore integration time:
-
Permalink:
damnkrat/tgmix@8dc751e06e58f1be6e1715ac42f0f18eb0d32574 -
Branch / Tag:
refs/tags/0.9.1 - Owner: https://github.com/damnkrat
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@8dc751e06e58f1be6e1715ac42f0f18eb0d32574 -
Trigger Event:
release
-
Statement type:
File details
Details for the file tgmix-0.9.1-py3-none-any.whl.
File metadata
- Download URL: tgmix-0.9.1-py3-none-any.whl
- Upload date:
- Size: 43.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ef2f98f523f8874fd5c2d735a12c52ada33a4b82727975a24a5bfcb8a0faaa25
|
|
| MD5 |
74a24b4b1858195f04140d10598fc35f
|
|
| BLAKE2b-256 |
770a6d110e87a72c01da5b76eea5a06ea9cb45ef5e67f14f87cec653e118df37
|
Provenance
The following attestation bundles were made for tgmix-0.9.1-py3-none-any.whl:
Publisher:
python-publish.yml on damnkrat/tgmix
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tgmix-0.9.1-py3-none-any.whl -
Subject digest:
ef2f98f523f8874fd5c2d735a12c52ada33a4b82727975a24a5bfcb8a0faaa25 - Sigstore transparency entry: 685038763
- Sigstore integration time:
-
Permalink:
damnkrat/tgmix@8dc751e06e58f1be6e1715ac42f0f18eb0d32574 -
Branch / Tag:
refs/tags/0.9.1 - Owner: https://github.com/damnkrat
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@8dc751e06e58f1be6e1715ac42f0f18eb0d32574 -
Trigger Event:
release
-
Statement type: