Kubera is a tool for anonymizing and extracting traces from ChatGPT, Claude, and Claude Code usage data.
Kubera: AI Usage Data Extraction and Analysis
Kubera is a comprehensive tool for extracting, anonymizing, and analyzing usage traces from AI platforms including ChatGPT, Claude (web), and Claude Code. It provides standardized trace extraction, token counting using configurable tokenizers, and statistical analysis across different AI platforms.
Features
- 📊 Multi-platform support: Extract data from ChatGPT, Claude web, and Claude Code
- 🔢 Accurate token counting: Uses DeepSeek V3 tokenizer by default (configurable)
- 📈 Comprehensive analytics: Detailed statistics on usage patterns, token distribution, and conversation flows
- 🔄 Standardized output: Consistent CSV trace format across all platforms
- 📋 Rich statistics: JSON output with detailed breakdowns and human-readable summaries
Installation
Option 1: Using uvx (Recommended)
Run without installing:
# ChatGPT trace extraction
uvx --from kubera kubera-chatgpt-extract-trace --input-file conversations.json
# ChatGPT statistics
uvx --from kubera kubera-chatgpt-extract-stats --input-file chatgpt_trace.csv
# Claude Code trace extraction
uvx --from kubera kubera-claude-code-extract-trace --input-file usage.jsonl
# Claude Code statistics
uvx --from kubera kubera-claude-code-extract-stats --input-file claude_code_trace.csv
# Claude Web trace extraction
uvx --from kubera kubera-claude-web-extract-trace --input-file conversations.json
# Claude Web statistics
uvx --from kubera kubera-claude-web-extract-stats --input-file claude_web_trace.csv
Option 2: Traditional Installation
# Clone the repository
git clone https://github.com/project-vajra/kubera.git
cd kubera
# Install dependencies
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"
Supported Platforms
1. ChatGPT
Extract conversation data from ChatGPT exports including message chains, tokens, and timestamps.
2. Claude Web
Analyze Claude web conversations with response timing analysis and token breakdowns.
3. Claude Code
Extract usage statistics from Claude Code JSONL files with cache efficiency metrics.
Quick Start
1. Export Your Data
ChatGPT Export
- Go to ChatGPT Settings
- Click the "Export data" button
- Download and extract the ZIP file to get conversations.json
Claude Web Export
- Go to Claude Settings
- Click the "Export data" button
- Download and extract to get conversations.json
Claude Code Data
Claude Code automatically stores usage data in ~/.claude/projects/ as JSONL files.
2. Extract Traces
# ChatGPT
python kubera/chatgpt/extract_trace.py --input-file path/to/chatgpt/conversations.json
# Claude Web
python kubera/claude_web/extract_trace.py --input-file path/to/claude_web/conversations.json
# Claude Code
python kubera/claude_code/extract_trace.py --claude-dir ~/.claude
3. Generate Statistics
# ChatGPT analysis
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv
# Claude Web analysis
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv
# Claude Code analysis
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv
Detailed Usage
ChatGPT Data Extraction
Trace Extraction
python kubera/chatgpt/extract_trace.py \
--input-file raw_data/chatgpt/conversations.json \
--output-file data/chatgpt_trace.csv \
--tokenizer deepseek-ai/DeepSeek-V3
Output CSV Fields:
- session_uuid: Conversation ID
- message_uuid: Unique message identifier
- parent_uuid: Parent message ID (conversation threading)
- role: Message sender (user/assistant/system)
- timestamp: Message creation time
- tokens: Token count using specified tokenizer
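Because each row carries a parent_uuid, a conversation thread can be reconstructed by following parent links back to the root. A minimal sketch with Python's csv module, using made-up rows in the schema above (this illustrates the threading idea, not kubera's own code):

```python
import csv
import io

# Two illustrative trace rows in the CSV schema above (values are made up).
trace = io.StringIO(
    "session_uuid,message_uuid,parent_uuid,role,timestamp,tokens\n"
    "s1,m1,,user,2024-05-01T10:00:00,12\n"
    "s1,m2,m1,assistant,2024-05-01T10:00:05,40\n"
)
rows = list(csv.DictReader(trace))

# Walk parent_uuid links from the last message back to the root,
# then reverse to get the thread in chronological order.
by_uuid = {r["message_uuid"]: r for r in rows}
chain = []
cur = rows[-1]
while cur:
    chain.append(cur["message_uuid"])
    cur = by_uuid.get(cur["parent_uuid"])
thread = list(reversed(chain))
print(thread)  # ['m1', 'm2']
```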
Statistics Generation
python kubera/chatgpt/extract_stats.py \
--input-file data/chatgpt_trace.csv \
--output-file data/stats/chatgpt_stats.json
Statistics Include:
- Overall message/conversation/token counts
- Role distribution (user vs assistant messages)
- Token analysis (averages, distribution by role)
- Conversation patterns (length distribution, duration)
- Conversation token breakdown by role
Claude Web Data Extraction
Trace Extraction
python kubera/claude_web/extract_trace.py \
--input-file raw_data/claude_web/conversations.json \
--output-file data/claude_web_trace.csv \
--tokenizer deepseek-ai/DeepSeek-V3
Output CSV Fields:
- session_uuid: Conversation UUID
- message_uuid: Message UUID
- parent_uuid: Parent message (empty for Claude web format)
- role: Sender (human/assistant)
- start_timestamp: Message start time
- stop_timestamp: Message completion time
- tokens: Token count using specified tokenizer
Statistics Generation
python kubera/claude_web/extract_stats.py \
--input-file data/claude_web_trace.csv \
--output-file data/stats/claude_web_stats.json
Statistics Include:
- All ChatGPT statistics plus:
- Response timing analysis (start/stop timestamps)
- Response time distribution
- Average response times
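The timing statistics come down to simple timestamp arithmetic on the start/stop columns. A sketch of the calculation, assuming ISO 8601 timestamps (the sample values are made up):

```python
from datetime import datetime

# Hypothetical (start_timestamp, stop_timestamp) pairs from a trace.
pairs = [
    ("2024-05-01T10:00:00", "2024-05-01T10:00:42"),
    ("2024-05-01T10:05:00", "2024-05-01T10:05:18"),
]

# Response time per message is stop minus start, in seconds.
secs = [
    (datetime.fromisoformat(stop) - datetime.fromisoformat(start)).total_seconds()
    for start, stop in pairs
]
avg_response = sum(secs) / len(secs)
print(avg_response)  # 30.0
```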
Claude Code Data Extraction
Trace Extraction
python kubera/claude_code/extract_trace.py \
--claude-dir ~/.claude \
--output-file data/claude_code_trace.csv
Output CSV Fields:
- timestamp: Request timestamp
- parentUuid: Parent message UUID
- sessionId: Session identifier
- uuid: Message UUID
- input_tokens: Base input tokens
- cache_creation_input_tokens: Cache creation tokens
- cache_read_input_tokens: Cache read tokens
- output_tokens: Response tokens
- total_input_tokens: Sum of all input token types
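The total_input_tokens column is the sum of the three input-token fields. A minimal sketch of that arithmetic on a single hypothetical usage record (the JSONL nesting shown here is an assumption for illustration, not a documented layout):

```python
import json

# One made-up Claude Code usage record, mirroring the token fields
# that end up in the trace CSV. The exact JSONL structure is assumed.
line = (
    '{"timestamp": "2024-05-01T12:00:00Z", "sessionId": "s-1", '
    '"uuid": "m-1", "parentUuid": null, '
    '"usage": {"input_tokens": 12, "cache_creation_input_tokens": 100, '
    '"cache_read_input_tokens": 900, "output_tokens": 50}}'
)
record = json.loads(line)
usage = record["usage"]

# total_input_tokens = base input + cache creation + cache read
total_input = (
    usage["input_tokens"]
    + usage["cache_creation_input_tokens"]
    + usage["cache_read_input_tokens"]
)
print(total_input)  # 1012
```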
Statistics Generation
python kubera/claude_code/extract_stats.py \
--input-file data/claude_code_trace.csv \
--output-file data/stats/claude_code_stats.json
Statistics Include:
- Overall request/token statistics
- Cache efficiency metrics
- Session statistics (average requests, tokens, duration)
- Token breakdown by type (input, cache, output)
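Cache efficiency boils down to the fraction of input tokens served from the prompt cache rather than reprocessed. The metric names in kubera's JSON output may differ; this just illustrates the math on made-up rows:

```python
# Hypothetical per-request token counts from a Claude Code trace.
rows = [
    {"input_tokens": 10, "cache_read_input_tokens": 900, "cache_creation_input_tokens": 100},
    {"input_tokens": 20, "cache_read_input_tokens": 1800, "cache_creation_input_tokens": 0},
]

# Cache hit rate: cache-read tokens over all input tokens.
total_input = sum(
    r["input_tokens"] + r["cache_read_input_tokens"] + r["cache_creation_input_tokens"]
    for r in rows
)
cache_read = sum(r["cache_read_input_tokens"] for r in rows)
hit_rate = round(cache_read / total_input, 3)
print(hit_rate)  # 0.954
```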
Configuration Options
Tokenizer Selection
All extraction scripts support configurable tokenizers:
# Use DeepSeek V3 (default)
--tokenizer deepseek-ai/DeepSeek-V3
# Use Llama 3
--tokenizer meta-llama/Meta-Llama-3-8B
# Use GPT-4 tokenizer
--tokenizer gpt-4
# Any HuggingFace tokenizer
--tokenizer <model-name>
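Under the hood, counting tokens with a configurable HuggingFace tokenizer can be sketched as below. This is an illustration, not kubera's internal code; it assumes the `transformers` package is installed, and the first call downloads and caches the tokenizer files:

```python
def count_tokens(text: str, tokenizer_name: str = "deepseek-ai/DeepSeek-V3") -> int:
    """Count tokens in `text` using a HuggingFace tokenizer (sketch)."""
    # Lazy import so the function only needs `transformers` when called.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
    return len(tokenizer.encode(text))
```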
Output Customization
# Custom output locations
--output-file /path/to/custom/output.csv
--output-file /path/to/custom/stats.json
# For Claude Code, custom source directory
--claude-dir /custom/claude/directory
Output Formats
Trace CSV Format
Standardized CSV format across all platforms with platform-specific fields:
- Common: session_uuid, message_uuid, role, tokens
- ChatGPT: timestamp, parent_uuid
- Claude Web: start_timestamp, stop_timestamp, parent_uuid (empty)
- Claude Code: timestamp, parentUuid, sessionId, input/output token breakdown
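Because the common columns are shared across platforms, the same downstream snippet works on any trace. For example, a per-role token sum over the common fields (sample rows are made up):

```python
import csv
import io
from collections import defaultdict

# Illustrative trace rows using only the common columns.
trace = io.StringIO(
    "session_uuid,message_uuid,role,tokens\n"
    "s1,m1,user,10\n"
    "s1,m2,assistant,30\n"
    "s2,m3,user,5\n"
)

# Aggregate token counts by role.
tokens_by_role = defaultdict(int)
for row in csv.DictReader(trace):
    tokens_by_role[row["role"]] += int(row["tokens"])
print(dict(tokens_by_role))  # {'user': 15, 'assistant': 30}
```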
Statistics JSON Format
Comprehensive JSON with nested statistics:
{
"overall": {
"total_messages": 1250,
"total_conversations": 45,
"total_tokens": 125000,
"role_distribution": {"user": 625, "assistant": 625}
},
"conversations": {
"conv-uuid-1": {
"messages": 10,
"total_tokens": 2500,
"tokens_by_role": {"user": 1000, "assistant": 1500},
"duration_minutes": 15.5
}
},
"token_analysis": {...},
"conversation_patterns": {...}
}
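The nested layout makes the stats easy to query directly, e.g. to find the heaviest conversation by token count. A sketch against a trimmed-down version of the JSON above:

```python
import json

# A trimmed-down stats document in the format shown above.
stats = json.loads("""{
  "overall": {"total_messages": 1250, "total_tokens": 125000},
  "conversations": {
    "conv-uuid-1": {"messages": 10, "total_tokens": 2500},
    "conv-uuid-2": {"messages": 4, "total_tokens": 800}
  }
}""")

# Rank conversations by total tokens and take the largest.
top_uuid, top_stats = max(
    stats["conversations"].items(), key=lambda kv: kv[1]["total_tokens"]
)
print(top_uuid)  # conv-uuid-1
```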
Examples
Complete Workflow Example
# 1. Extract ChatGPT data
python kubera/chatgpt/extract_trace.py \
--input-file raw_data/chatgpt/conversations.json \
--output-file data/chatgpt_trace.csv
# 2. Generate statistics
python kubera/chatgpt/extract_stats.py \
--input-file data/chatgpt_trace.csv \
--output-file data/stats/chatgpt_stats.json
# 3. View results
cat data/stats/chatgpt_stats.json
Batch Processing Multiple Platforms
#!/bin/bash
# Extract traces from all platforms
python kubera/chatgpt/extract_trace.py --input-file raw_data/chatgpt/conversations.json
python kubera/claude_web/extract_trace.py --input-file raw_data/claude_web/conversations.json
python kubera/claude_code/extract_trace.py
# Generate statistics for all platforms
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv
echo "Analysis complete! Check data/stats/ for results."
Data Privacy
Kubera processes data locally and does not send any information to external servers. The tokenizers are downloaded once and cached locally. All analysis is performed on your machine.
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Support
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Issues
- 📚 Documentation: GitHub Wiki
Roadmap
- Support for additional AI platforms (Anthropic API, OpenAI API)
- Advanced anonymization techniques
- Interactive visualization dashboard
- Automated trend analysis and insights
- Integration with popular data science tools
Built by the Vajra Team for AI usage analytics and research.
File details
Details for the file kubera-0.0.1.tar.gz.
File metadata
- Download URL: kubera-0.0.1.tar.gz
- Size: 29.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 186d4a9d9ffefeb8984be2d87aef3e8a1626826fb925e9206868144c5fa29ccb |
| MD5 | ae60fb0a9c3f714e6feadcb92c4be93d |
| BLAKE2b-256 | ac945ebec3b7d6bf979aea42be594448ad6322d78768f4c6e0e0f3048d6c3ff4 |
Provenance
The following attestation bundles were made for kubera-0.0.1.tar.gz:
Publisher: publish_release.yml on project-vajra/kubera
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kubera-0.0.1.tar.gz
- Subject digest: 186d4a9d9ffefeb8984be2d87aef3e8a1626826fb925e9206868144c5fa29ccb
- Sigstore transparency entry: 492431418
- Permalink: project-vajra/kubera@f2a02b4517169bac733caa62b2cda0aa354289b7
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/project-vajra
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_release.yml@f2a02b4517169bac733caa62b2cda0aa354289b7
- Trigger Event: release
File details
Details for the file kubera-0.0.1-py3-none-any.whl.
File metadata
- Download URL: kubera-0.0.1-py3-none-any.whl
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 058d320162fdcf4db8becd287b33d09cea842422fa529ecbb6af0ec45db41bb0 |
| MD5 | cca6fae4c13ba7bde16ad0739977dd2a |
| BLAKE2b-256 | 624ecc5695a411d1a72167aa7497d4cf14c07a2bfebcd477d3f64e51aeb37725 |
Provenance
The following attestation bundles were made for kubera-0.0.1-py3-none-any.whl:
Publisher: publish_release.yml on project-vajra/kubera
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kubera-0.0.1-py3-none-any.whl
- Subject digest: 058d320162fdcf4db8becd287b33d09cea842422fa529ecbb6af0ec45db41bb0
- Sigstore transparency entry: 492431432
- Permalink: project-vajra/kubera@f2a02b4517169bac733caa62b2cda0aa354289b7
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/project-vajra
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_release.yml@f2a02b4517169bac733caa62b2cda0aa354289b7
- Trigger Event: release