Kubera is a tool for anonymizing and extracting usage traces from AI platforms such as ChatGPT and Claude

Kubera: AI Usage Data Extraction and Analysis

Kubera is a comprehensive tool for extracting, anonymizing, and analyzing usage traces from AI platforms including ChatGPT, Claude (web), and Claude Code. It provides standardized trace extraction, token counting using configurable tokenizers, and statistical analysis across different AI platforms.

Features

  • 📊 Multi-platform support: Extract data from ChatGPT, Claude web, and Claude Code
  • 🔢 Accurate token counting: Uses DeepSeek V3 tokenizer by default (configurable)
  • 📈 Comprehensive analytics: Detailed statistics on usage patterns, token distribution, and conversation flows
  • 🔄 Standardized output: Consistent CSV trace format across all platforms
  • 📋 Rich statistics: JSON output with detailed breakdowns and human-readable summaries

Installation

Option 1: Using uvx (Recommended)

Run without installing:

# ChatGPT trace extraction
uvx --from kubera kubera-chatgpt-extract-trace --input-file conversations.json

# ChatGPT statistics
uvx --from kubera kubera-chatgpt-extract-stats --input-file chatgpt_trace.csv

# Claude Code trace extraction
uvx --from kubera kubera-claude-code-extract-trace --input-file usage.jsonl

# Claude Code statistics
uvx --from kubera kubera-claude-code-extract-stats --input-file claude_code_trace.csv

# Claude Web trace extraction
uvx --from kubera kubera-claude-web-extract-trace --input-file conversations.json

# Claude Web statistics
uvx --from kubera kubera-claude-web-extract-stats --input-file claude_web_trace.csv

Option 2: Traditional Installation

# Clone the repository
git clone https://github.com/project-vajra/kubera.git
cd kubera

# Install dependencies
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Supported Platforms

1. ChatGPT

Extract conversation data from ChatGPT exports including message chains, tokens, and timestamps.

2. Claude Web

Analyze Claude web conversations with response timing analysis and token breakdowns.

3. Claude Code

Extract usage statistics from Claude Code JSONL files with cache efficiency metrics.

Quick Start

1. Export Your Data

ChatGPT Export

  1. Go to ChatGPT Settings
  2. Click the "Export data" button
  3. Download and extract the ZIP file to get conversations.json

Claude Web Export

  1. Go to Claude Settings
  2. Click the "Export data" button
  3. Download and extract to get conversations.json

Claude Code Data

Claude Code automatically stores usage data in ~/.claude/projects/ as JSONL files.
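Each line in those JSONL files is one request record. A minimal parsing sketch, using a made-up record whose field names follow the trace fields Kubera emits (the actual on-disk schema may differ between Claude Code versions):

```python
import json

# Hypothetical JSONL record shaped like a Claude Code usage entry;
# the real schema may differ between Claude Code versions.
line = json.dumps({
    "timestamp": "2024-06-01T12:00:00Z",
    "sessionId": "abc123",
    "parentUuid": None,
    "uuid": "m1",
    "message": {"usage": {
        "input_tokens": 12,
        "cache_creation_input_tokens": 340,
        "cache_read_input_tokens": 1020,
        "output_tokens": 87,
    }},
})

record = json.loads(line)
usage = record["message"]["usage"]

# Sum the three input token types, as in the total_input_tokens CSV field.
total_input_tokens = (usage["input_tokens"]
                      + usage["cache_creation_input_tokens"]
                      + usage["cache_read_input_tokens"])
print(total_input_tokens)  # 1372
```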

2. Extract Traces

# ChatGPT
python kubera/chatgpt/extract_trace.py --input-file path/to/chatgpt/conversations.json

# Claude Web  
python kubera/claude_web/extract_trace.py --input-file path/to/claude_web/conversations.json

# Claude Code
python kubera/claude_code/extract_trace.py --claude-dir ~/.claude

3. Generate Statistics

# ChatGPT analysis
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv

# Claude Web analysis
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv

# Claude Code analysis
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv

Detailed Usage

ChatGPT Data Extraction

Trace Extraction

python kubera/chatgpt/extract_trace.py \
  --input-file raw_data/chatgpt/conversations.json \
  --output-file data/chatgpt_trace.csv \
  --tokenizer deepseek-ai/DeepSeek-V3

Output CSV Fields:

  • session_uuid: Conversation ID
  • message_uuid: Unique message identifier
  • parent_uuid: Parent message ID (conversation threading)
  • role: Message sender (user/assistant/system)
  • timestamp: Message creation time
  • tokens: Token count using specified tokenizer
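A trace in this format is easy to post-process with the standard library. For example, a small sketch (over an inline sample, not real export data) that totals tokens by role:

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample in the chatgpt_trace.csv format described above.
sample = """session_uuid,message_uuid,parent_uuid,role,timestamp,tokens
s1,m1,,user,2024-06-01T12:00:00,10
s1,m2,m1,assistant,2024-06-01T12:00:05,42
s1,m3,m2,user,2024-06-01T12:01:00,7
"""

tokens_by_role = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    tokens_by_role[row["role"]] += int(row["tokens"])

print(dict(tokens_by_role))  # {'user': 17, 'assistant': 42}
```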

Statistics Generation

python kubera/chatgpt/extract_stats.py \
  --input-file data/chatgpt_trace.csv \
  --output-file data/stats/chatgpt_stats.json

Statistics Include:

  • Overall message/conversation/token counts
  • Role distribution (user vs assistant messages)
  • Token analysis (averages, distribution by role)
  • Conversation patterns (length distribution, duration)
  • Conversation token breakdown by role

Claude Web Data Extraction

Trace Extraction

python kubera/claude_web/extract_trace.py \
  --input-file raw_data/claude_web/conversations.json \
  --output-file data/claude_web_trace.csv \
  --tokenizer deepseek-ai/DeepSeek-V3

Output CSV Fields:

  • session_uuid: Conversation UUID
  • message_uuid: Message UUID
  • parent_uuid: Parent message (empty for Claude web format)
  • role: Sender (human/assistant)
  • start_timestamp: Message start time
  • stop_timestamp: Message completion time
  • tokens: Token count using specified tokenizer
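Because Claude web rows carry both a start and a stop timestamp, per-message response time can be recovered directly from the trace. A minimal sketch with a made-up row:

```python
from datetime import datetime

# Hypothetical row in the claude_web trace format; assistant response
# time is stop_timestamp minus start_timestamp.
row = {
    "role": "assistant",
    "start_timestamp": "2024-06-01T12:00:05+00:00",
    "stop_timestamp": "2024-06-01T12:00:17+00:00",
}

start = datetime.fromisoformat(row["start_timestamp"])
stop = datetime.fromisoformat(row["stop_timestamp"])
response_seconds = (stop - start).total_seconds()
print(response_seconds)  # 12.0
```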

Statistics Generation

python kubera/claude_web/extract_stats.py \
  --input-file data/claude_web_trace.csv \
  --output-file data/stats/claude_web_stats.json

Statistics Include:

  • All ChatGPT statistics plus:
  • Response timing analysis (start/stop timestamps)
  • Response time distribution
  • Average response times

Claude Code Data Extraction

Trace Extraction

python kubera/claude_code/extract_trace.py \
  --claude-dir ~/.claude \
  --output-file data/claude_code_trace.csv

Output CSV Fields:

  • timestamp: Request timestamp
  • parentUuid: Parent message UUID
  • sessionId: Session identifier
  • uuid: Message UUID
  • input_tokens: Base input tokens
  • cache_creation_input_tokens: Cache creation tokens
  • cache_read_input_tokens: Cache read tokens
  • output_tokens: Response tokens
  • total_input_tokens: Sum of all input token types
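As a rough illustration of how a cache-efficiency figure can fall out of this breakdown, here is one plausible definition, the share of input tokens served from cache (the metric Kubera actually reports may be computed differently):

```python
# Example token counts for a single request (made-up values).
input_tokens = 12
cache_creation_input_tokens = 340
cache_read_input_tokens = 1020
output_tokens = 87

# total_input_tokens is the sum of all input token types, as in the CSV.
total_input_tokens = (input_tokens
                      + cache_creation_input_tokens
                      + cache_read_input_tokens)

# Share of input tokens read from cache rather than reprocessed.
cache_hit_ratio = cache_read_input_tokens / total_input_tokens
print(f"{cache_hit_ratio:.1%}")  # 74.3%
```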

Statistics Generation

python kubera/claude_code/extract_stats.py \
  --input-file data/claude_code_trace.csv \
  --output-file data/stats/claude_code_stats.json

Statistics Include:

  • Overall request/token statistics
  • Cache efficiency metrics
  • Session statistics (average requests, tokens, duration)
  • Token breakdown by type (input, cache, output)

Configuration Options

Tokenizer Selection

All extraction scripts support configurable tokenizers:

# Use DeepSeek V3 (default)
--tokenizer deepseek-ai/DeepSeek-V3

# Use Llama 3
--tokenizer meta-llama/Meta-Llama-3-8B

# Use GPT-4 tokenizer  
--tokenizer gpt-4

# Any HuggingFace tokenizer
--tokenizer <model-name>

Output Customization

# Custom output locations
--output-file /path/to/custom/output.csv
--output-file /path/to/custom/stats.json

# For Claude Code, custom source directory
--claude-dir /custom/claude/directory

Output Formats

Trace CSV Format

Standardized CSV format across all platforms with platform-specific fields:

  • Common: session_uuid, message_uuid, role, tokens
  • ChatGPT: timestamp, parent_uuid
  • Claude Web: start_timestamp, stop_timestamp, parent_uuid (empty)
  • Claude Code: timestamp, parentUuid, sessionId, input/output token breakdown

Statistics JSON Format

Comprehensive JSON with nested statistics:

{
  "overall": {
    "total_messages": 1250,
    "total_conversations": 45,
    "total_tokens": 125000,
    "role_distribution": {"user": 625, "assistant": 625}
  },
  "conversations": {
    "conv-uuid-1": {
      "messages": 10,
      "total_tokens": 2500,
      "tokens_by_role": {"user": 1000, "assistant": 1500},
      "duration_minutes": 15.5
    }
  },
  "token_analysis": {...},
  "conversation_patterns": {...}
}
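The nested layout makes downstream analysis straightforward. For instance, deriving an average-tokens-per-message figure from the `overall` block (using the example values above):

```python
import json

# Stats JSON trimmed to the `overall` block, with the example values above.
stats = json.loads("""{
  "overall": {
    "total_messages": 1250,
    "total_conversations": 45,
    "total_tokens": 125000,
    "role_distribution": {"user": 625, "assistant": 625}
  }
}""")

overall = stats["overall"]
avg_tokens_per_message = overall["total_tokens"] / overall["total_messages"]
print(avg_tokens_per_message)  # 100.0
```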

Examples

Complete Workflow Example

# 1. Extract ChatGPT data
python kubera/chatgpt/extract_trace.py \
  --input-file raw_data/chatgpt/conversations.json \
  --output-file data/chatgpt_trace.csv

# 2. Generate statistics
python kubera/chatgpt/extract_stats.py \
  --input-file data/chatgpt_trace.csv \
  --output-file data/stats/chatgpt_stats.json

# 3. View results
cat data/stats/chatgpt_stats.json

Batch Processing Multiple Platforms

#!/bin/bash

# Extract traces from all platforms
python kubera/chatgpt/extract_trace.py --input-file raw_data/chatgpt/conversations.json
python kubera/claude_web/extract_trace.py --input-file raw_data/claude_web/conversations.json  
python kubera/claude_code/extract_trace.py

# Generate statistics for all platforms
python kubera/chatgpt/extract_stats.py --input-file data/chatgpt_trace.csv
python kubera/claude_web/extract_stats.py --input-file data/claude_web_trace.csv
python kubera/claude_code/extract_stats.py --input-file data/claude_code_trace.csv

echo "Analysis complete! Check data/stats/ for results."

Data Privacy

Kubera processes data locally and does not send any information to external servers. The tokenizers are downloaded once and cached locally. All analysis is performed on your machine.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Roadmap

  • Support for additional AI platforms (Anthropic API, OpenAI API)
  • Advanced anonymization techniques
  • Interactive visualization dashboard
  • Automated trend analysis and insights
  • Integration with popular data science tools

Built by the Vajra Team for AI usage analytics and research.



