Export your Claude Code conversations to Hugging Face as structured training data

DataClaw

Turn your Claude Code conversation history into structured training data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset — so the hours you spend coding with Claude can help make future models better.

All DataClaw datasets are tagged dataclaw. Browse them all at huggingface.co/datasets?other=dataclaw.

Getting started

Paste this into Claude Code (or any coding agent):

Help me export my Claude Code conversation history to Hugging Face using DataClaw.
Install it, set up the skill, then walk me through the process.

STEP 1 — INSTALL
  pip install dataclaw
  If that fails: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw
  If that also fails, ask the user where the source is.

STEP 2 — INSTALL SKILL
  dataclaw update-skill claude

STEP 3 — START
  dataclaw prep
  Every dataclaw command outputs next_steps in its JSON — follow them through the entire flow.

IMPORTANT: Never run bare `huggingface-cli login` — always use --token.
IMPORTANT: Always export with --no-push first and review for PII before publishing.

Manual usage (without an agent)

Quick start

pip install dataclaw
huggingface-cli login --token YOUR_TOKEN

# See your projects
dataclaw prep

# Configure
dataclaw config --repo username/dataclaw-username
dataclaw config --exclude "personal-stuff,scratch"
dataclaw config --redact-usernames "my_github_handle,my_discord_name"
dataclaw config --redact "my-domain.com,my-secret-project"

# Export locally first
dataclaw export --no-push

# Review the JSONL, then push
dataclaw export
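
The "review the JSONL" step can be partly scripted. The sketch below walks every string in each exported session and prints anything that looks like an email address or a long token, so you know where to look by hand. It is not part of DataClaw: the two patterns, the function names, and the file path in the usage note are assumptions made for illustration, and a clean run does not prove the export is safe.

```python
import json
import re
from pathlib import Path

# Illustrative patterns only; these are not DataClaw's own detection rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]{32,}")


def iter_strings(value):
    """Yield every string nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def find_suspects(session):
    """Return (kind, match) pairs in one session that deserve a manual look."""
    hits = []
    for text in iter_strings(session):
        hits += [("email", m) for m in EMAIL_RE.findall(text)]
        hits += [("long-token", m) for m in TOKEN_RE.findall(text)]
    return hits


def review_file(path):
    """Print suspects for a JSONL export that holds one session per line."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            for kind, match in find_suspects(json.loads(line)):
                print(f"line {line_no}: {kind}: {match}")
```

Call `review_file("conversations.jsonl")` with the path that your local export actually wrote; the path shown here is a placeholder.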

Commands

Command                                     Description
dataclaw prep                               Discover projects, check HF auth, output JSON (read-only)
dataclaw list                               List all projects with exclusion status
dataclaw config                             Show current config
dataclaw config --repo user/dataclaw-user   Set HF repo
dataclaw config --exclude "a,b"             Add excluded projects (appends)
dataclaw config --redact "str1,str2"        Add strings to always redact (appends)
dataclaw config --redact-usernames "u1,u2"  Add usernames to anonymize (appends)
dataclaw export                             Export and push
dataclaw export --no-push                   Export locally only (review first)
dataclaw export --all-projects              Include everything (ignore exclusions)
dataclaw export --no-thinking               Exclude extended thinking blocks
dataclaw update-skill claude                Install/update the dataclaw skill for Claude Code

What gets exported

Data                 Included  Notes
User messages        Yes       Full text (including voice transcripts)
Assistant responses  Yes       Full text output
Extended thinking    Yes       Claude's reasoning (opt out with --no-thinking)
Tool calls           Yes       Tool name + summarized input
Tool results         No        Not stored in Claude Code's logs
Token usage          Yes       Input/output tokens per session
Model & metadata     Yes       Model name, git branch, timestamps

Privacy & Redaction

DataClaw applies multiple layers of protection:

  1. Path anonymization — File paths stripped to project-relative
  2. Username hashing — Your macOS username + any configured usernames replaced with stable hashes
  3. Secret detection — Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more
  4. Entropy analysis — Long high-entropy strings in quotes are flagged as potential secrets
  5. Email redaction — Personal email addresses removed
  6. Custom redaction — You can configure additional strings and usernames to redact
  7. Tool input pre-redaction — Secrets in tool inputs are redacted BEFORE truncation to prevent partial leaks
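
Layers 4 and 7 are easier to picture with a small example. The sketch below flags long quoted strings whose character distribution has high Shannon entropy, and shows why redaction has to run before truncation: a truncated secret loses its closing quote and slips past the pattern. The thresholds, the regex, the placeholder text, and the function names are all hypothetical; DataClaw's actual rules may differ.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds and names, chosen for illustration only.
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 4.0
QUOTED_RE = re.compile(r"""(["'])([^"'\s]+)\1""")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact_high_entropy(text):
    """Replace long, high-entropy quoted strings with a placeholder."""
    def replace(match):
        quote, body = match.group(1), match.group(2)
        if len(body) >= MIN_LENGTH and shannon_entropy(body) >= MIN_ENTROPY_BITS:
            return f"{quote}[REDACTED]{quote}"
        return match.group(0)

    return QUOTED_RE.sub(replace, text)


def summarize_tool_input(text, limit=40):
    """Redact first, then truncate, so no fragment of a secret survives."""
    return redact_high_entropy(text)[:limit]
```

Reversing the order, as in `redact_high_entropy(text[:limit])`, can leave the first characters of a secret in the output, because the cut string no longer matches the quoted-string pattern.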

This is NOT foolproof. Always review your exported data before publishing. Automated redaction cannot catch everything — especially service-specific identifiers, third-party PII, or secrets in unusual formats.

To help improve redaction, report issues: https://github.com/banodoco/dataclaw/issues

Data schema

Each line in conversations.jsonl is one session:

{
  "session_id": "abc-123",
  "project": "my-project",
  "model": "claude-opus-4-6",
  "git_branch": "main",
  "start_time": "2025-06-15T10:00:00+00:00",
  "end_time": "2025-06-15T10:30:00+00:00",
  "messages": [
    {"role": "user", "content": "Fix the login bug", "timestamp": "..."},
    {
      "role": "assistant",
      "content": "I'll investigate the login flow.",
      "thinking": "The user wants me to look at...",
      "tool_uses": [{"tool": "Read", "input": "src/auth.py"}],
      "timestamp": "..."
    }
  ],
  "stats": {
    "user_messages": 5, "assistant_messages": 8,
    "tool_uses": 20, "input_tokens": 50000, "output_tokens": 3000
  }
}

Each HF repo also includes a metadata.json with aggregate stats.
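
As a rough illustration of working with this schema, the sketch below reads a JSONL file and flattens one session into simple per-message dicts. The helper names (`load_sessions`, `to_chat_turns`) and the output shape are assumptions made for this example, not something DataClaw provides; the field names come from the sample record above.

```python
import json

# A trimmed copy of the sample record from the schema above.
EXAMPLE_SESSION = {
    "session_id": "abc-123",
    "model": "claude-opus-4-6",
    "messages": [
        {"role": "user", "content": "Fix the login bug", "timestamp": "..."},
        {
            "role": "assistant",
            "content": "I'll investigate the login flow.",
            "thinking": "The user wants me to look at...",
            "tool_uses": [{"tool": "Read", "input": "src/auth.py"}],
            "timestamp": "...",
        },
    ],
}


def load_sessions(path):
    """Read a JSONL export: one JSON object (one session) per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def to_chat_turns(session, keep_thinking=True):
    """Flatten one session into a list of simple per-message dicts."""
    turns = []
    for message in session.get("messages", []):
        turn = {"role": message["role"], "content": message.get("content", "")}
        # "thinking" is absent on user messages and on --no-thinking exports.
        if keep_thinking and message.get("thinking"):
            turn["thinking"] = message["thinking"]
        tools = [use["tool"] for use in message.get("tool_uses", [])]
        if tools:
            turn["tools"] = tools
        turns.append(turn)
    return turns
```

Using `.get()` for the optional keys keeps the helper tolerant of messages that carry no thinking or tool calls.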

Finding datasets on Hugging Face

All repos are named {username}/dataclaw-{username} and tagged dataclaw.

  • Browse all: huggingface.co/datasets?other=dataclaw
  • Load one:
    from datasets import load_dataset
    ds = load_dataset("alice/dataclaw-alice", split="train")
    
  • Combine several:
    from datasets import load_dataset, concatenate_datasets
    repos = ["alice/dataclaw-alice", "bob/dataclaw-bob"]
    ds = concatenate_datasets([load_dataset(r, split="train") for r in repos])
    

The auto-generated HF README includes:

  • Model distribution (which Claude models, how many sessions each)
  • Total token counts
  • Project count
  • Last updated timestamp
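
If you want the same kind of summary for sessions you have already loaded, it can be recomputed from the per-session fields. This is a minimal sketch under assumptions: the `summarize` function and its output keys are hypothetical and do not mirror the layout of DataClaw's metadata.json or generated README.

```python
from collections import Counter


def summarize(sessions):
    """Aggregate model distribution, project count, and token totals."""
    models = Counter(s.get("model", "unknown") for s in sessions)
    projects = {s["project"] for s in sessions if s.get("project")}
    input_tokens = sum(s.get("stats", {}).get("input_tokens", 0) for s in sessions)
    output_tokens = sum(s.get("stats", {}).get("output_tokens", 0) for s in sessions)
    return {
        "sessions_per_model": dict(models),
        "project_count": len(projects),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```

The input is any iterable of session dicts, for example the rows of a dataset loaded with `load_dataset` and converted to a list.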

License

MIT
