Extract and sanitize Claude Code conversation data for training datasets

🤖 Claude Collector

One command to extract all your Claude Code conversations for training datasets.

Quick Start

Install and run with uvx (recommended)

uvx claude-collector

That's it! The tool will:

  • ✅ Auto-find your Claude Code data (~/.claude/projects)
  • ✅ Extract all conversations
  • ✅ Sanitize PII (emails, API keys, paths)
  • ✅ Count total tokens
  • ✅ Save as training-ready JSONL

Or install globally

uv tool install claude-collector
claude-collector

What It Does

Scans your Claude Code session files and:

  1. Finds all conversation data in ~/.claude/projects
  2. Extracts user/assistant message pairs
  3. Sanitizes sensitive information:
    • Emails → [EMAIL]
    • API keys → [API_KEY]
    • File paths → /Users/[USER]/...
    • IP addresses → [IP]
    • OAuth tokens → [REDACTED]
  4. Counts actual token usage
  5. Saves as clean JSONL dataset
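
The substitutions in step 3 can be pictured as a series of regular-expression replacements. The sketch below is illustrative only: the `sanitize` function, the patterns, and the `sk-` key prefix are assumptions made for this example, not the tool's actual rules, and OAuth tokens are left out because their shape is not described here.

```python
import re

# Hypothetical patterns for illustration; the real tool's rules may differ.
PATTERNS = [
    # Email addresses
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # API keys, assuming an "sk-" prefix followed by 16+ key characters
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b"), "[API_KEY]"),
    # IPv4 addresses (no range check on each octet)
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    # Home directories: keep the /Users or /home root, mask the username
    (re.compile(r"/(Users|home)/[^/\s]+"), r"/\1/[USER]"),
]


def sanitize(text: str) -> str:
    """Apply each replacement in order and return the masked text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(sanitize("Contact jane@example.com about /Users/jane/src/app.py"))
```

Pattern order matters in a scheme like this: broader patterns should run after narrower ones so they do not consume text that a more specific rule would have labeled.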

Example Output

🤖 Claude Collector v0.1.0
Extract & sanitize Claude Code conversations

✓ Found Claude data: /Users/z/.claude/projects

📂 Processing 1394 files...

╭─────────────────────┬──────────────╮
│ Metric              │ Value        │
├─────────────────────┼──────────────┤
│ Files scanned       │ 1,394        │
│ Files with data     │ 1,273        │
│ Total messages      │ 46,029       │
│ Training examples   │ 3,653        │
│ Total tokens        │ 4.04B        │
╰─────────────────────┴──────────────╯

✅ Dataset saved!
   File: claude_dataset_20251113.jsonl
   Size: 19.13 MB
   Examples: 3,653

🎉 Ready for training!

Options

# Dry run (see stats without saving)
uvx claude-collector --dry-run

# Custom output location
uvx claude-collector --output ~/my-dataset.jsonl

# Specify input directory
uvx claude-collector --input ~/.config/claude/projects

# Filter by minimum tokens
uvx claude-collector --min-tokens 1000

# Skip sanitization (NOT recommended for sharing!)
uvx claude-collector --no-sanitize

Use Cases

1. Create Training Dataset

uvx claude-collector --output training-data.jsonl

2. Audit Your Usage

uvx claude-collector --dry-run

Prints the statistics, including total tokens, without saving a dataset file.

3. Collect from Multiple Machines

On each computer:

# Machine 1
uvx claude-collector --output machine1-data.jsonl

# Machine 2  
uvx claude-collector --output machine2-data.jsonl

# Combine later
cat machine1-data.jsonl machine2-data.jsonl > combined-dataset.jsonl

4. Add to Existing Dataset

uvx claude-collector --output new-sessions.jsonl
cat existing-dataset.jsonl new-sessions.jsonl > updated-dataset.jsonl
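
Plain `cat` keeps every line, so a session present in both files ends up in the result twice. A minimal sketch of a merge that skips repeated records follows; `merge_jsonl` is a hypothetical helper, not part of claude-collector, and it treats two records as duplicates only when their JSON is identical after sorting keys.

```python
import hashlib
import json


def merge_jsonl(inputs, output):
    """Concatenate JSONL files, skipping records already seen.

    Returns the number of records written to `output`.
    """
    seen = set()
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines
                    record = json.loads(line)
                    # Canonical form so key order does not affect the hash
                    canonical = json.dumps(record, sort_keys=True)
                    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                    if digest in seen:
                        continue
                    seen.add(digest)
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written


# Example call, using the file names from the commands above:
# merge_jsonl(["existing-dataset.jsonl", "new-sessions.jsonl"], "updated-dataset.jsonl")
```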

Output Format

Each line is a JSON object:

{
  "messages": [
    {"role": "user", "content": "How do I..."},
    {"role": "assistant", "content": "You can..."}
  ],
  "metadata": {
    "timestamp": "2025-11-13T...",
    "tokens": {
      "input_tokens": 100,
      "output_tokens": 200,
      "cache_creation_input_tokens": 5000,
      "cache_read_input_tokens": 1000
    }
  }
}

Perfect for:

  • Fine-tuning LLMs
  • Training coding assistants
  • Building instruction datasets
  • Analysis and research
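
Given the record layout shown above, the four counters under `metadata.tokens` can be added up per example. The helper below is a sketch: `total_tokens` is a hypothetical name, and treating a missing or null counter as zero is an assumption rather than documented behavior.

```python
import json

# The four counters that appear under metadata.tokens in the sample record.
TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def total_tokens(example: dict) -> int:
    """Sum the token counters of one record, treating missing values as 0."""
    tokens = example.get("metadata", {}).get("tokens") or {}
    return sum(int(tokens.get(field) or 0) for field in TOKEN_FIELDS)


record = json.loads(
    '{"messages": [], "metadata": {"tokens": {'
    '"input_tokens": 100, "output_tokens": 200, '
    '"cache_creation_input_tokens": 5000, "cache_read_input_tokens": 1000}}}'
)
print(total_tokens(record))  # 6300
```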

Finding Claude Data

Default Locations

~/.claude/projects/              # Primary
~/.config/claude/projects/       # Alternative

Check All Users

ls -la /Users/*/.claude/projects     # macOS
ls -la /home/*/.claude/projects      # Linux

Find Anywhere

find ~ -name "*.jsonl" -path "*/.claude/*" 2>/dev/null
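
The same search can be sketched in Python. The two directories come from the default locations listed above; `find_session_files` is a hypothetical helper, and it assumes session files carry the `.jsonl` extension, as the `find` command does.

```python
from pathlib import Path

# The two default locations listed above.
CANDIDATE_DIRS = (
    Path.home() / ".claude" / "projects",
    Path.home() / ".config" / "claude" / "projects",
)


def find_session_files(dirs=CANDIDATE_DIRS):
    """Return every .jsonl file under the candidate directories that exist."""
    found = []
    for base in dirs:
        if Path(base).is_dir():
            found.extend(sorted(Path(base).rglob("*.jsonl")))
    return found


if __name__ == "__main__":
    files = find_session_files()
    print(f"{len(files)} session file(s) found")
```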

Privacy & Security

⚠️ Important: Claude Code logs contain sensitive data!

The tool sanitizes:

  • ✅ Email addresses
  • ✅ API keys and tokens
  • ✅ File paths (username removed)
  • ✅ IP addresses
  • ✅ OAuth credentials
  • ✅ Passwords

Still check before sharing:

  • Project names (if sensitive)
  • Company-specific terminology
  • Proprietary code patterns

For maximum privacy, review the output file before uploading anywhere.
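
One way to support that review, sketched under assumptions, is to scan the output for terms you already know are sensitive, such as project or company names. `find_sensitive_lines` and the example terms are hypothetical; the check is a plain case-insensitive substring match and does not replace reading the file.

```python
def find_sensitive_lines(lines, terms):
    """Return (line_number, term) pairs for every term found in a line."""
    lowered = [t.lower() for t in terms if t]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        for term in lowered:
            if term in text:
                hits.append((number, term))
    return hits


# Usage sketch; the file name and terms are placeholders.
# with open("claude_dataset.jsonl", encoding="utf-8") as f:
#     for number, term in find_sensitive_lines(f, ["acme-internal", "project-x"]):
#         print(f"line {number}: contains {term!r}")
```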

Requirements

  • Python 3.8+
  • Claude Code installed (for data to exist)

Installation Methods

1. uvx (easiest, no install)

uvx claude-collector

2. uv tool (global install)

uv tool install claude-collector
claude-collector

3. pip

pip install claude-collector
claude-collector

4. From source

git clone https://github.com/hanzoai/claude-collector
cd claude-collector
uv pip install -e .
claude-collector

Troubleshooting

"No Claude Code data found"

  • Make sure Claude Code is installed
  • Check you've had at least one session
  • Try specifying the path explicitly: --input ~/.claude/projects

"Only found a few conversations"

  • This is normal if you're new to Claude Code
  • Each session creates one file
  • More usage = more data

"Tokens show 0"

  • Some messages don't have usage tracking
  • This is normal for system messages
  • Real conversations will have token counts
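
To see how many examples in a saved dataset lack usage data, the records can be partitioned by whether any token counter is non-zero. This is a sketch: `split_by_usage` is a hypothetical helper, and it assumes the `metadata.tokens` layout shown under Output Format.

```python
import json


def split_by_usage(lines):
    """Partition JSONL lines into (with_usage, without_usage) record lists."""
    with_usage, without_usage = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        tokens = record.get("metadata", {}).get("tokens") or {}
        # Only numeric counters contribute to the total
        total = sum(v for v in tokens.values() if isinstance(v, (int, float)))
        (with_usage if total > 0 else without_usage).append(record)
    return with_usage, without_usage


# Usage sketch; the file name is a placeholder.
# with open("claude_dataset.jsonl", encoding="utf-8") as f:
#     tracked, untracked = split_by_usage(f)
#     print(f"{len(untracked)} example(s) have no token usage recorded")
```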

Advanced: Custom Processing

import json

# Read the dataset, one JSON record per line
with open('claude_dataset.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        example = json.loads(line)

        # Access messages (the first user/assistant pair)
        user_msg = example['messages'][0]['content']
        assistant_msg = example['messages'][1]['content']

        # Access metadata
        tokens = example['metadata']['tokens']
        timestamp = example['metadata']['timestamp']

        # Your custom processing here

License

MIT - Free to use for any purpose

Credits

Built by Hanzo AI for the AI development community.


Found a bug? Open an issue: https://github.com/hanzoai/claude-collector/issues

Want to contribute? PRs welcome!
