Kubera is a tool for anonymizing and extracting usage traces from AI platforms such as ChatGPT and Claude

Kubera: AI Usage Data Extraction and Analysis

Kubera is a comprehensive tool for extracting, anonymizing, and analyzing usage traces from AI platforms including ChatGPT, Claude (web), and Claude Code. It provides standardized trace extraction, token counting using configurable tokenizers, and statistical analysis across different AI platforms.

Features

  • 📊 Multi-platform support: Extract data from ChatGPT, Claude web, and Claude Code
  • 🔢 Accurate token counting: Uses DeepSeek V3 tokenizer by default (configurable)
  • 📈 Comprehensive analytics: Detailed statistics on usage patterns, token distribution, and conversation flows
  • 🔄 Standardized output: Consistent CSV trace format across all platforms
  • 📋 Rich statistics: JSON output with detailed breakdowns and human-readable summaries

Installation

Option 1: Using uvx (Recommended)

Run without installing:

# ChatGPT trace extraction
uvx --from kubera kubera-chatgpt-extract-trace --input-file conversations.json

# ChatGPT statistics
uvx --from kubera kubera-chatgpt-extract-stats --input-file chatgpt_trace.csv

# Claude Code trace extraction
uvx --from kubera kubera-claude-code-extract-trace --input-file usage.jsonl

# Claude Code statistics
uvx --from kubera kubera-claude-code-extract-stats --input-file claude_code_trace.csv

# Claude Web trace extraction
uvx --from kubera kubera-claude-web-extract-trace --input-file conversations.json

# Claude Web statistics
uvx --from kubera kubera-claude-web-extract-stats --input-file claude_web_trace.csv

Option 2: Traditional Installation

# Clone the repository
git clone https://github.com/project-vajra/kubera.git
cd kubera

# Install dependencies
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Supported Platforms

1. ChatGPT

Extract conversation data from ChatGPT exports including message chains, tokens, and timestamps.

2. Claude Web

Analyze Claude web conversations with response timing analysis and token breakdowns.

3. Claude Code

Extract usage statistics from Claude Code JSONL files with cache efficiency metrics.

Quick Start

1. Export Your Data

ChatGPT Export

  1. Go to ChatGPT Settings
  2. Click the "Export data" button
  3. Download and extract the ZIP file to get conversations.json

Claude Web Export

  1. Go to Claude Settings
  2. Click the "Export data" button
  3. Download and extract to get conversations.json

Claude Code Data

Claude Code automatically stores usage data in ~/.claude/projects/ as JSONL files.
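Each line in those JSONL files is one request record. A minimal parsing sketch, using a made-up record whose field names follow the trace fields Kubera emits (the actual on-disk schema may differ between Claude Code versions):

```python
import json

# Hypothetical JSONL record shaped like a Claude Code usage entry;
# the real schema may differ between Claude Code versions.
line = json.dumps({
    "timestamp": "2024-06-01T12:00:00Z",
    "sessionId": "abc123",
    "parentUuid": None,
    "uuid": "m1",
    "message": {"usage": {
        "input_tokens": 12,
        "cache_creation_input_tokens": 340,
        "cache_read_input_tokens": 1020,
        "output_tokens": 87,
    }},
})

record = json.loads(line)
usage = record["message"]["usage"]

# Sum the three input token types, as in the total_input_tokens CSV field.
total_input_tokens = (usage["input_tokens"]
                      + usage["cache_creation_input_tokens"]
                      + usage["cache_read_input_tokens"])
print(total_input_tokens)  # 1372
```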

2. Extract Traces

# ChatGPT
python kubera/chatgpt/extract_trace.py --input-file path/to/chatgpt/conversations.json

# Claude Web  
python kubera/claude_web/extract_trace.py --input-file path/to/claude_web/conversations.json

# Claude Code
python kubera/claude_code/extract_trace.py --claude-dir ~/.claude

3. Generate Statistics

# ChatGPT analysis
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv

# Claude Web analysis
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv

# Claude Code analysis
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv

Detailed Usage

ChatGPT Data Extraction

Trace Extraction

python kubera/chatgpt/extract_trace.py \
  --input-file raw_data/chatgpt/conversations.json \
  --output-file data/chatgpt_trace.csv \
  --tokenizer deepseek-ai/DeepSeek-V3

Output CSV Fields:

  • session_uuid: Conversation ID
  • message_uuid: Unique message identifier
  • parent_uuid: Parent message ID (conversation threading)
  • role: Message sender (user/assistant/system)
  • timestamp: Message creation time
  • tokens: Token count using specified tokenizer
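A trace in this format is easy to post-process with the standard library. For example, a small sketch (over an inline sample, not real export data) that totals tokens by role:

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample in the chatgpt_trace.csv format described above.
sample = """session_uuid,message_uuid,parent_uuid,role,timestamp,tokens
s1,m1,,user,2024-06-01T12:00:00,10
s1,m2,m1,assistant,2024-06-01T12:00:05,42
s1,m3,m2,user,2024-06-01T12:01:00,7
"""

tokens_by_role = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    tokens_by_role[row["role"]] += int(row["tokens"])

print(dict(tokens_by_role))  # {'user': 17, 'assistant': 42}
```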

Statistics Generation

python kubera/chatgpt/extract_stats.py \
  --input-file data/chatgpt_trace.csv \
  --output-file data/stats/chatgpt_stats.json

Statistics Include:

  • Overall message/conversation/token counts
  • Role distribution (user vs assistant messages)
  • Token analysis (averages, distribution by role)
  • Conversation patterns (length distribution, duration)
  • Conversation token breakdown by role

Claude Web Data Extraction

Trace Extraction

python kubera/claude_web/extract_trace.py \
  --input-file raw_data/claude_web/conversations.json \
  --output-file data/claude_web_trace.csv \
  --tokenizer deepseek-ai/DeepSeek-V3

Output CSV Fields:

  • session_uuid: Conversation UUID
  • message_uuid: Message UUID
  • parent_uuid: Parent message (empty for Claude web format)
  • role: Sender (human/assistant)
  • start_timestamp: Message start time
  • stop_timestamp: Message completion time
  • tokens: Token count using specified tokenizer
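Because Claude web rows carry both a start and a stop timestamp, per-message response time can be recovered directly from the trace. A minimal sketch with a made-up row:

```python
from datetime import datetime

# Hypothetical row in the claude_web trace format; assistant response
# time is stop_timestamp minus start_timestamp.
row = {
    "role": "assistant",
    "start_timestamp": "2024-06-01T12:00:05+00:00",
    "stop_timestamp": "2024-06-01T12:00:17+00:00",
}

start = datetime.fromisoformat(row["start_timestamp"])
stop = datetime.fromisoformat(row["stop_timestamp"])
response_seconds = (stop - start).total_seconds()
print(response_seconds)  # 12.0
```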

Statistics Generation

python kubera/claude_web/extract_stats.py \
  --input-file data/claude_web_trace.csv \
  --output-file data/stats/claude_web_stats.json

Statistics Include:

  • All ChatGPT statistics plus:
  • Response timing analysis (start/stop timestamps)
  • Response time distribution
  • Average response times

Claude Code Data Extraction

Trace Extraction

python kubera/claude_code/extract_trace.py \
  --claude-dir ~/.claude \
  --output-file data/claude_code_trace.csv

Output CSV Fields:

  • timestamp: Request timestamp
  • parentUuid: Parent message UUID
  • sessionId: Session identifier
  • uuid: Message UUID
  • input_tokens: Base input tokens
  • cache_creation_input_tokens: Cache creation tokens
  • cache_read_input_tokens: Cache read tokens
  • output_tokens: Response tokens
  • total_input_tokens: Sum of all input token types
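As a rough illustration of how a cache-efficiency figure can fall out of this breakdown, here is one plausible definition, the share of input tokens served from cache (the metric Kubera actually reports may be computed differently):

```python
# Example token counts for a single request (made-up values).
input_tokens = 12
cache_creation_input_tokens = 340
cache_read_input_tokens = 1020
output_tokens = 87

# total_input_tokens is the sum of all input token types, as in the CSV.
total_input_tokens = (input_tokens
                      + cache_creation_input_tokens
                      + cache_read_input_tokens)

# Share of input tokens read from cache rather than reprocessed.
cache_hit_ratio = cache_read_input_tokens / total_input_tokens
print(f"{cache_hit_ratio:.1%}")  # 74.3%
```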

Statistics Generation

python kubera/claude_code/extract_stats.py \
  --input-file data/claude_code_trace.csv \
  --output-file data/stats/claude_code_stats.json

Statistics Include:

  • Overall request/token statistics
  • Cache efficiency metrics
  • Session statistics (average requests, tokens, duration)
  • Token breakdown by type (input, cache, output)

Configuration Options

Tokenizer Selection

All extraction scripts support configurable tokenizers:

# Use DeepSeek V3 (default)
--tokenizer deepseek-ai/DeepSeek-V3

# Use Llama 3
--tokenizer meta-llama/Meta-Llama-3-8B

# Use GPT-4 tokenizer  
--tokenizer gpt-4

# Any HuggingFace tokenizer
--tokenizer <model-name>

Output Customization

# Custom output locations
--output-file /path/to/custom/output.csv
--output-file /path/to/custom/stats.json

# For Claude Code, custom source directory
--claude-dir /custom/claude/directory

Output Formats

Trace CSV Format

Standardized CSV format across all platforms with platform-specific fields:

  • Common: session_uuid, message_uuid, role, tokens
  • ChatGPT: timestamp, parent_uuid
  • Claude Web: start_timestamp, stop_timestamp, parent_uuid (empty)
  • Claude Code: timestamp, parentUuid, sessionId, input/output token breakdown

Statistics JSON Format

Comprehensive JSON with nested statistics:

{
  "overall": {
    "total_messages": 1250,
    "total_conversations": 45,
    "total_tokens": 125000,
    "role_distribution": {"user": 625, "assistant": 625}
  },
  "conversations": {
    "conv-uuid-1": {
      "messages": 10,
      "total_tokens": 2500,
      "tokens_by_role": {"user": 1000, "assistant": 1500},
      "duration_minutes": 15.5
    }
  },
  "token_analysis": {...},
  "conversation_patterns": {...}
}
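The nested layout makes downstream analysis straightforward. For instance, deriving an average-tokens-per-message figure from the `overall` block (using the example values above):

```python
import json

# Stats JSON trimmed to the `overall` block, with the example values above.
stats = json.loads("""{
  "overall": {
    "total_messages": 1250,
    "total_conversations": 45,
    "total_tokens": 125000,
    "role_distribution": {"user": 625, "assistant": 625}
  }
}""")

overall = stats["overall"]
avg_tokens_per_message = overall["total_tokens"] / overall["total_messages"]
print(avg_tokens_per_message)  # 100.0
```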

Examples

Complete Workflow Example

# 1. Extract ChatGPT data
python kubera/chatgpt/extract_trace.py \
  --input-file raw_data/chatgpt/conversations.json \
  --output-file data/chatgpt_trace.csv

# 2. Generate statistics
python kubera/chatgpt/extract_stats.py \
  --input-file data/chatgpt_trace.csv \
  --output-file data/stats/chatgpt_stats.json

# 3. View results
cat data/stats/chatgpt_stats.json

Batch Processing Multiple Platforms

#!/bin/bash

# Extract traces from all platforms
python kubera/chatgpt/extract_trace.py --input-file raw_data/chatgpt/conversations.json
python kubera/claude_web/extract_trace.py --input-file raw_data/claude_web/conversations.json  
python kubera/claude_code/extract_trace.py

# Generate statistics for all platforms
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv

echo "Analysis complete! Check data/stats/ for results."

Data Privacy

Kubera processes data locally and does not send any information to external servers. The tokenizers are downloaded once and cached locally. All analysis is performed on your machine.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Roadmap

  • Support for additional AI platforms (Anthropic API, OpenAI API)
  • Advanced anonymization techniques
  • Interactive visualization dashboard
  • Automated trend analysis and insights
  • Integration with popular data science tools

Built by the Vajra Team for AI usage analytics and research.



