Export your Claude Code conversations to Hugging Face as structured training data
DataClaw
Turn your Claude Code conversation history into structured training data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset — so the hours you spend coding with Claude can help make future models better.
All DataClaw datasets are tagged dataclaw; browse them all at huggingface.co/datasets?other=dataclaw.
Getting started
Paste this into Claude Code (or any coding agent):
Help me export my Claude Code conversation history to Hugging Face using DataClaw.
Install it, set up the skill, then walk me through the process.
STEP 1 — INSTALL
pip install dataclaw
If that fails: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw
If that also fails, ask the user where the source is.
STEP 2 — INSTALL SKILL
dataclaw update-skill claude
STEP 3 — START
dataclaw prep
Every dataclaw command outputs next_steps in its JSON — follow them through the entire flow.
IMPORTANT: Never run bare `huggingface-cli login` — always use --token.
IMPORTANT: Always export with --no-push first and review for PII before publishing.
Manual usage (without an agent)
Quick start
pip install dataclaw
huggingface-cli login --token YOUR_TOKEN
# See your projects
dataclaw prep
# Configure
dataclaw config --repo username/dataclaw-username
dataclaw config --exclude "personal-stuff,scratch"
dataclaw config --redact-usernames "my_github_handle,my_discord_name"
dataclaw config --redact "my-domain.com,my-secret-project"
# Export locally first
dataclaw export --no-push
# Review the JSONL, then push
dataclaw export
Commands
| Command | Description |
|---|---|
| `dataclaw prep` | Discover projects, check HF auth, output JSON (read-only) |
| `dataclaw list` | List all projects with exclusion status |
| `dataclaw config` | Show current config |
| `dataclaw config --repo user/dataclaw-user` | Set HF repo |
| `dataclaw config --exclude "a,b"` | Add excluded projects (appends) |
| `dataclaw config --redact "str1,str2"` | Add strings to always redact (appends) |
| `dataclaw config --redact-usernames "u1,u2"` | Add usernames to anonymize (appends) |
| `dataclaw export` | Export and push |
| `dataclaw export --no-push` | Export locally only (review first) |
| `dataclaw export --all-projects` | Include everything (ignore exclusions) |
| `dataclaw export --no-thinking` | Exclude extended thinking blocks |
| `dataclaw update-skill claude` | Install/update the dataclaw skill for Claude Code |
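The getting-started prompt notes that every dataclaw command reports `next_steps` in its JSON output. A script can read that field the same way an agent would. The following is a minimal sketch, not part of DataClaw: it assumes the command prints a single JSON document to stdout, and the helper names `run_dataclaw` and `next_steps` are hypothetical. Only the `next_steps` key comes from this README; its exact type is an assumption, so the helper accepts either a list or a single value.

```python
import json
import subprocess


def run_dataclaw(*args: str) -> dict:
    """Run a dataclaw subcommand and parse stdout as JSON.

    Assumption: stdout holds exactly one JSON document.
    """
    result = subprocess.run(
        ["dataclaw", *args], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def next_steps(payload: dict) -> list:
    """Return the next_steps entry as a list, or an empty list if absent."""
    steps = payload.get("next_steps", [])
    return steps if isinstance(steps, list) else [steps]
```

For example, `next_steps(run_dataclaw("prep"))` would return whatever follow-up actions the `prep` command suggests, if the assumptions above hold.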
What gets exported
| Data | Included | Notes |
|---|---|---|
| User messages | Yes | Full text (including voice transcripts) |
| Assistant responses | Yes | Full text output |
| Extended thinking | Yes | Claude's reasoning (opt out with --no-thinking) |
| Tool calls | Yes | Tool name + summarized input |
| Tool results | No | Not stored in Claude Code's logs |
| Token usage | Yes | Input/output tokens per session |
| Model & metadata | Yes | Model name, git branch, timestamps |
Privacy & Redaction
DataClaw applies multiple layers of protection:
- Path anonymization — File paths stripped to project-relative
- Username hashing — Your macOS username + any configured usernames replaced with stable hashes
- Secret detection — Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more
- Entropy analysis — Long high-entropy strings in quotes are flagged as potential secrets
- Email redaction — Personal email addresses removed
- Custom redaction — You can configure additional strings and usernames to redact
- Tool input pre-redaction — Secrets in tool inputs are redacted BEFORE truncation to prevent partial leaks
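Two of the layers above, entropy analysis and redacting before truncation, can be illustrated with a short sketch. This is not DataClaw's implementation: the thresholds, the `[REDACTED]` placeholder, and every function name here are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds; the values DataClaw uses are not documented here.
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 4.0

# A quoted run of non-space characters, e.g. "abc123" or 'abc123'.
QUOTED = re.compile(r"""(["'])([^"'\s]+)\1""")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact_high_entropy(text: str) -> str:
    """Replace long, high-entropy quoted strings with a placeholder."""
    def _sub(match: re.Match) -> str:
        quote, body = match.group(1), match.group(2)
        if len(body) >= MIN_LENGTH and shannon_entropy(body) >= MIN_ENTROPY_BITS:
            return f"{quote}[REDACTED]{quote}"
        return match.group(0)

    return QUOTED.sub(_sub, text)


def summarize_tool_input(raw: str, limit: int = 40) -> str:
    """Redact first, then truncate, so no fragment of a secret survives."""
    return redact_high_entropy(raw)[:limit]
```

The ordering in `summarize_tool_input` is the point: truncating first could cut a secret in half, leaving a fragment that is too short to be detected but still sensitive.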
This is NOT foolproof. Always review your exported data before publishing. Automated redaction cannot catch everything — especially service-specific identifiers, third-party PII, or secrets in unusual formats.
To help improve redaction, report issues: https://github.com/banodoco/dataclaw/issues
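As an aid to the manual review recommended above, a small script can flag text that still looks sensitive after export. This is a hedged sketch rather than a DataClaw feature: the patterns are illustrative and far from exhaustive, the helper names are hypothetical, and the field names (`messages`, `content`, `thinking`, `tool_uses`, `input`) are taken from the data schema in this README.

```python
import json
import re
from typing import Iterable, Iterator

# Illustrative patterns only; extend them for the services you use.
REVIEW_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "home_path": re.compile(r"/(?:Users|home)/[\w.-]+"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def message_texts(message: dict) -> Iterator[str]:
    """Yield the free-text fields of one message."""
    for field in ("content", "thinking"):
        if isinstance(message.get(field), str):
            yield message[field]
    for tool_use in message.get("tool_uses") or []:
        if isinstance(tool_use.get("input"), str):
            yield tool_use["input"]


def find_leftovers(jsonl_lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (session_id, pattern_name, matched_text) for each suspicious hit."""
    for line in jsonl_lines:
        if not line.strip():
            continue
        session = json.loads(line)
        for message in session.get("messages") or []:
            for text in message_texts(message):
                for name, pattern in REVIEW_PATTERNS.items():
                    for match in pattern.finditer(text):
                        yield session.get("session_id", "?"), name, match.group(0)
```

Pass it an open file handle for the exported JSONL and inspect each hit before pushing. An empty result does not mean the export is clean; it only means these particular patterns found nothing.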
Data schema
Each line in conversations.jsonl is one session:
{
"session_id": "abc-123",
"project": "my-project",
"model": "claude-opus-4-6",
"git_branch": "main",
"start_time": "2025-06-15T10:00:00+00:00",
"end_time": "2025-06-15T10:30:00+00:00",
"messages": [
{"role": "user", "content": "Fix the login bug", "timestamp": "..."},
{
"role": "assistant",
"content": "I'll investigate the login flow.",
"thinking": "The user wants me to look at...",
"tool_uses": [{"tool": "Read", "input": "src/auth.py"}],
"timestamp": "..."
}
],
"stats": {
"user_messages": 5, "assistant_messages": 8,
"tool_uses": 20, "input_tokens": 50000, "output_tokens": 3000
}
}
Each HF repo also includes a metadata.json with aggregate stats.
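To show how this schema can be consumed, here is a minimal sketch that reads sessions from JSONL lines and pairs each user message with the assistant reply that directly follows it. The function names are hypothetical, and the pairing rule is just one simple way to turn a session into prompt/response examples.

```python
import json
from typing import Iterable, Iterator


def iter_sessions(jsonl_lines: Iterable[str]) -> Iterator[dict]:
    """Parse one session per non-empty line."""
    for line in jsonl_lines:
        if line.strip():
            yield json.loads(line)


def user_assistant_pairs(session: dict) -> list:
    """Pair each user message with the assistant message right after it."""
    messages = session.get("messages") or []
    pairs = []
    for current, following in zip(messages, messages[1:]):
        if current.get("role") == "user" and following.get("role") == "assistant":
            pairs.append((current.get("content", ""), following.get("content", "")))
    return pairs
```

Applied to the example session above, `user_assistant_pairs` would return a single pair: the "Fix the login bug" request and the assistant's first reply.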
Finding datasets on Hugging Face
All repos are named {username}/dataclaw-{username} and tagged dataclaw.
- Browse all: huggingface.co/datasets?other=dataclaw
- Load one:
    from datasets import load_dataset
    ds = load_dataset("alice/dataclaw-alice", split="train")
- Combine several:
    from datasets import load_dataset, concatenate_datasets
    repos = ["alice/dataclaw-alice", "bob/dataclaw-bob"]
    ds = concatenate_datasets([load_dataset(r, split="train") for r in repos])
The auto-generated HF README includes:
- Model distribution (which Claude models, how many sessions each)
- Total token counts
- Project count
- Last updated timestamp
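Comparable aggregates can be recomputed from the session records themselves, which is handy after combining several repos. The sketch below is an assumption-laden illustration, not the code that generates the README or metadata.json: `summarize` is a hypothetical helper, and it relies only on the `model`, `project`, and `stats` fields shown in the data schema.

```python
from collections import Counter
from typing import Iterable


def summarize(sessions: Iterable[dict]) -> dict:
    """Compute per-model session counts, token totals, and project count."""
    sessions = list(sessions)
    models = Counter(s.get("model") or "unknown" for s in sessions)
    input_tokens = 0
    output_tokens = 0
    for s in sessions:
        stats = s.get("stats") or {}  # tolerate a missing or null stats field
        input_tokens += stats.get("input_tokens") or 0
        output_tokens += stats.get("output_tokens") or 0
    return {
        "sessions_per_model": dict(models),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "project_count": len({s.get("project") for s in sessions}),
    }
```

Because the function only needs an iterable of dict-like rows, it should also accept a dataset loaded as shown above, provided the rows keep the schema's field names.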
License
MIT