Extract and sanitize Claude Code conversation data for training datasets
Claude Collector
One command to extract all your Claude Code conversations for training datasets.
Quick Start
Install and run with uvx (recommended)
uvx claude-collector
That's it! The tool will:
- Auto-find your Claude Code data (~/.claude/projects)
- Extract all conversations
- Sanitize PII (emails, API keys, paths)
- Count total tokens
- Save as training-ready JSONL
Or install globally
uv tool install claude-collector
claude-collector
What It Does
Scans your Claude Code session files and:
- Finds all conversation data in ~/.claude/projects
- Extracts user/assistant message pairs
- Sanitizes sensitive information:
  - Emails → [EMAIL]
  - API keys → [API_KEY]
  - File paths → /Users/[USER]/...
  - IP addresses → [IP]
  - OAuth tokens → [REDACTED]
- Counts actual token usage
- Saves as clean JSONL dataset
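The sanitization step above can be pictured as a series of regular-expression substitutions. The sketch below is illustrative only: the patterns, the `PATTERNS` list, and the `sanitize` function are assumptions made for this example, not the tool's actual rules, which may be broader or stricter.

```python
import re

# Hypothetical patterns for illustration; the real tool's rules may differ.
# Each entry pairs a compiled pattern with the placeholder that replaces it.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # Assumes keys that look like "sk-" followed by 20+ key characters.
    (re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"), "[API_KEY]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    # Replace only the username segment of macOS and Linux home paths.
    (re.compile(r"/(Users|home)/[^/\s]+"), r"/\1/[USER]"),
]


def sanitize(text: str) -> str:
    """Apply every substitution in order and return the redacted text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Calling `sanitize("/Users/alice/code/app.py")` with these assumed patterns keeps the path structure while replacing the username segment.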
Example Output
Claude Collector v0.1.0
Extract & sanitize Claude Code conversations
Found Claude data: /Users/z/.claude/projects
Processing 1394 files...
╭───────────────────┬──────────╮
│ Metric            │ Value    │
├───────────────────┼──────────┤
│ Files scanned     │ 1,394    │
│ Files with data   │ 1,273    │
│ Total messages    │ 46,029   │
│ Training examples │ 3,653    │
│ Total tokens      │ 4.04B    │
╰───────────────────┴──────────╯
Dataset saved!
File: claude_dataset_20251113.jsonl
Size: 19.13 MB
Examples: 3,653
Ready for training!
Options
# Dry run (see stats without saving)
uvx claude-collector --dry-run
# Custom output location
uvx claude-collector --output ~/my-dataset.jsonl
# Specify input directory
uvx claude-collector --input ~/.config/claude/projects
# Filter by minimum tokens
uvx claude-collector --min-tokens 1000
# Skip sanitization (NOT recommended for sharing!)
uvx claude-collector --no-sanitize
Use Cases
1. Create Training Dataset
uvx claude-collector --output training-data.jsonl
2. Audit Your Usage
uvx claude-collector --dry-run
Prints the statistics, including total tokens, without saving a dataset file.
3. Collect from Multiple Machines
On each computer:
# Machine 1
uvx claude-collector --output machine1-data.jsonl
# Machine 2
uvx claude-collector --output machine2-data.jsonl
# Combine later
cat machine1-data.jsonl machine2-data.jsonl > combined-dataset.jsonl
4. Add to Existing Dataset
uvx claude-collector --output new-sessions.jsonl
cat existing-dataset.jsonl new-sessions.jsonl > updated-dataset.jsonl
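Plain `cat` keeps every line, so a session collected twice ends up in the combined file twice. If that matters, a merge that drops exact duplicate records might look like the sketch below. `merge_jsonl` is a hypothetical helper, not part of claude-collector, and "duplicate" here is assumed to mean records whose JSON content is identical regardless of key order.

```python
import json
from typing import Iterable


def merge_jsonl(inputs: Iterable[str], output: str) -> int:
    """Concatenate JSONL files, skipping blank lines and exact duplicate records.

    Returns the number of records written to `output`.
    """
    seen = set()
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        for name in inputs:
            with open(name, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    # Canonical form so that key order does not hide a duplicate.
                    key = json.dumps(json.loads(line), sort_keys=True)
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(line + "\n")
                    written += 1
    return written
```

For example, `merge_jsonl(["machine1-data.jsonl", "machine2-data.jsonl"], "combined-dataset.jsonl")` would write the first copy of each record and skip later repeats.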
Output Format
Each line is a JSON object:
{
"messages": [
{"role": "user", "content": "How do I..."},
{"role": "assistant", "content": "You can..."}
],
"metadata": {
"timestamp": "2025-11-13T...",
"tokens": {
"input_tokens": 100,
"output_tokens": 200,
"cache_creation_input_tokens": 5000,
"cache_read_input_tokens": 1000
}
}
}
Perfect for:
- Fine-tuning LLMs
- Training coding assistants
- Building instruction datasets
- Analysis and research
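Before feeding the file to a training pipeline, it can help to confirm that each line matches the record layout shown above. The check below is a minimal sketch. `is_valid_example` is a hypothetical helper, and it assumes that roles are limited to `user` and `assistant` and that `content` is always a string, which the real output may not guarantee.

```python
import json


def is_valid_example(line: str) -> bool:
    """Return True if a JSONL line has a non-empty messages list and a metadata object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        # Assumption: only these two roles appear in the dataset.
        if message.get("role") not in {"user", "assistant"}:
            return False
        # Assumption: content is plain text, not a list of content blocks.
        if not isinstance(message.get("content"), str):
            return False
    return isinstance(record.get("metadata"), dict)
```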
Finding Claude Data
Default Locations
~/.claude/projects/ # Primary
~/.config/claude/projects/ # Alternative
Check All Users
ls -la /Users/*/.claude/projects # macOS
ls -la /home/*/.claude/projects # Linux
Find Anywhere
find ~ -name "*.jsonl" -path "*/.claude/*" 2>/dev/null
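The same lookup can be done from Python. This is a rough sketch, not the tool's own discovery logic: `CANDIDATES` and `find_session_files` are names invented here, and the only locations checked are the two defaults listed above.

```python
from pathlib import Path
from typing import List, Optional

# The two default locations listed above, in priority order.
CANDIDATES = [
    Path.home() / ".claude" / "projects",
    Path.home() / ".config" / "claude" / "projects",
]


def find_session_files(root: Optional[Path] = None) -> List[Path]:
    """Return every .jsonl file under `root`.

    When `root` is omitted, the first default directory that exists is used.
    An empty list means no data directory was found.
    """
    if root is None:
        root = next((p for p in CANDIDATES if p.is_dir()), None)
    if root is None or not root.is_dir():
        return []
    return sorted(root.rglob("*.jsonl"))
```

An empty result from `find_session_files()` corresponds to the "No Claude Code data found" case; passing an explicit directory mirrors the `--input` option.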
Privacy & Security
Important: Claude Code logs contain sensitive data!
The tool sanitizes:
- Email addresses
- API keys and tokens
- File paths (username removed)
- IP addresses
- OAuth credentials
- Passwords
Still check before sharing:
- Project names (if sensitive)
- Company-specific terminology
- Proprietary code patterns
For maximum privacy, review the output file before uploading anywhere.
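Part of that review can be automated for terms you already know are sensitive, such as project or company names. The sketch below is one possible approach. `find_sensitive_terms` is a hypothetical helper, the term list is yours to supply, and matching is a simple case-insensitive substring test over message content.

```python
import json
from typing import Dict, Iterable, List


def find_sensitive_terms(lines: Iterable[str], terms: Iterable[str]) -> Dict[str, List[int]]:
    """Map each matched term to the 1-based line numbers of records that mention it."""
    lowered = [t.lower() for t in terms]
    hits: Dict[str, List[int]] = {t: [] for t in lowered}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        # Join all string message contents into one lowercase haystack.
        text = " ".join(
            m["content"]
            for m in record.get("messages", [])
            if isinstance(m, dict) and isinstance(m.get("content"), str)
        ).lower()
        for term in lowered:
            if term in text:
                hits[term].append(number)
    # Report only the terms that actually appeared.
    return {term: numbers for term, numbers in hits.items() if numbers}
```

Passing an open file handle as `lines` works because iterating over a file yields one line at a time. A clean result is not proof that the file is safe to share, so a manual read is still worthwhile.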
Requirements
- Python 3.8+
- Claude Code installed (for data to exist)
Installation Methods
1. uvx (easiest, no install)
uvx claude-collector
2. uv tool (global install)
uv tool install claude-collector
claude-collector
3. pip
pip install claude-collector
claude-collector
4. From source
git clone https://github.com/hanzoai/claude-collector
cd claude-collector
uv pip install -e .
claude-collector
Troubleshooting
"No Claude Code data found"
- Make sure Claude Code is installed
- Check you've had at least one session
- Try specifying the path explicitly: --input ~/.claude/projects
"Only found a few conversations"
- This is normal if you're new to Claude Code
- Each session creates one file
- More usage = more data
"Tokens show 0"
- Some messages don't have usage tracking
- This is normal for system messages
- Real conversations will have token counts
Advanced: Custom Processing
import json

# Read the dataset: one JSON object per line
with open('claude_dataset.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        example = json.loads(line)
        # Access messages
        user_msg = example['messages'][0]['content']
        assistant_msg = example['messages'][1]['content']
        # Access metadata
        tokens = example['metadata']['tokens']
        timestamp = example['metadata']['timestamp']
        # Your custom processing here
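As one example of custom processing, the per-record token counts can be summed into dataset-wide totals. This is a sketch under the assumption that every record follows the Output Format section. `total_tokens` is a hypothetical helper, and records without a `tokens` object are simply skipped.

```python
import json
from collections import Counter
from typing import Iterable


def total_tokens(lines: Iterable[str]) -> Counter:
    """Sum each token field (input_tokens, output_tokens, ...) across all records."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        # Records without usage tracking contribute nothing.
        tokens = json.loads(line).get("metadata", {}).get("tokens") or {}
        for field, value in tokens.items():
            if isinstance(value, int):
                totals[field] += value
    return totals
```

With the dataset open as `f`, `total_tokens(f)` returns a `Counter` keyed by field name, and `sum(total_tokens(f).values())` gives a single grand total.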
License
MIT - Free to use for any purpose
Credits
Built by Hanzo AI for the AI development community.
Found a bug? Open an issue: https://github.com/hanzoai/claude-collector/issues
Want to contribute? PRs welcome!