Export your Claude Code conversations to Hugging Face as structured training data

DataClaw

Turn your Claude Code conversation history into structured training data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset — so the hours you spend coding with Claude can help make future models better.

All DataClaw datasets are tagged dataclaw; browse them at huggingface.co/datasets?other=dataclaw.

Getting started

Paste this into Claude Code (or any coding agent):

Help me export my Claude Code conversation history to Hugging Face using DataClaw.
Install it, set up the skill, then walk me through the process.

STEP 1 — INSTALL
  pip install dataclaw
  If that fails: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw
  If that also fails, ask the user where the source is.

STEP 2 — INSTALL SKILL
  dataclaw update-skill claude

STEP 3 — START
  dataclaw prep
  Every dataclaw command outputs next_steps in its JSON — follow them through the entire flow.

IMPORTANT: Never run bare `huggingface-cli login` — always use --token.
IMPORTANT: Always export with --no-push first and review for PII before publishing.

Manual usage (without an agent)

Quick start

pip install dataclaw
huggingface-cli login --token YOUR_TOKEN

# See your projects
dataclaw prep

# Configure
dataclaw config --repo username/dataclaw-username
dataclaw config --exclude "personal-stuff,scratch"
dataclaw config --redact-usernames "my_github_handle,my_discord_name"
dataclaw config --redact "my-domain.com,my-secret-project"

# Export locally first
dataclaw export --no-push

# Review the JSONL, then push
dataclaw export

Commands

| Command | Description |
| --- | --- |
| `dataclaw prep` | Discover projects, check HF auth, output JSON (read-only) |
| `dataclaw list` | List all projects with exclusion status |
| `dataclaw config` | Show current config |
| `dataclaw config --repo user/dataclaw-user` | Set HF repo |
| `dataclaw config --exclude "a,b"` | Add excluded projects (appends) |
| `dataclaw config --redact "str1,str2"` | Add strings to always redact (appends) |
| `dataclaw config --redact-usernames "u1,u2"` | Add usernames to anonymize (appends) |
| `dataclaw export` | Export and push |
| `dataclaw export --no-push` | Export locally only (review first) |
| `dataclaw export --all-projects` | Include everything (ignore exclusions) |
| `dataclaw export --no-thinking` | Exclude extended thinking blocks |
| `dataclaw update-skill claude` | Install/update the dataclaw skill for Claude Code |

What gets exported

| Data | Included | Notes |
| --- | --- | --- |
| User messages | Yes | Full text (including voice transcripts) |
| Assistant responses | Yes | Full text output |
| Extended thinking | Yes | Claude's reasoning (opt out with `--no-thinking`) |
| Tool calls | Yes | Tool name + summarized input |
| Tool results | No | Not stored in Claude Code's logs |
| Token usage | Yes | Input/output tokens per session |
| Model & metadata | Yes | Model name, git branch, timestamps |

Privacy & Redaction

DataClaw applies multiple layers of protection:

Path anonymization — File paths stripped to project-relative
Username hashing — Your macOS username + any configured usernames replaced with stable hashes
Secret detection — Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more
Entropy analysis — Long high-entropy strings in quotes are flagged as potential secrets
Email redaction — Personal email addresses removed
Custom redaction — You can configure additional strings and usernames to redact
Tool input pre-redaction — Secrets in tool inputs are redacted BEFORE truncation to prevent partial leaks
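
To make the entropy layer concrete, here is a minimal sketch of how long, random-looking quoted strings could be flagged. It is not DataClaw's actual implementation: the function names, the regex, and both thresholds (`MIN_LENGTH`, `MIN_ENTROPY_BITS`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds; DataClaw's real values are not documented here.
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 4.0

# Matches a run of non-space characters wrapped in matching quotes.
QUOTED = re.compile(r"""(["'])([^"'\s]+)\1""")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def flag_high_entropy(text: str) -> list[str]:
    """Return quoted strings that are both long and high in entropy."""
    hits = []
    for match in QUOTED.finditer(text):
        candidate = match.group(2)
        if len(candidate) >= MIN_LENGTH and shannon_entropy(candidate) >= MIN_ENTROPY_BITS:
            hits.append(candidate)
    return hits
```

A check like this trades recall for precision: short secrets and secrets outside quotes slip through, which is one reason regex patterns and manual review are still needed alongside it.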

This is NOT foolproof. Always review your exported data before publishing. Automated redaction cannot catch everything — especially service-specific identifiers, third-party PII, or secrets in unusual formats.
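
As one way to support that manual review, the sketch below scans exported JSONL text for strings that still look like email addresses. It assumes the session layout described under "Data schema"; the function name `find_emails` and the email regex are illustrative and are not part of DataClaw.

```python
import json
import re

# Deliberately loose pattern: false positives are cheaper than missed PII.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(jsonl_text: str) -> list[tuple[str, str]]:
    """Return (session_id, email) pairs found in message content or thinking."""
    hits = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        session = json.loads(line)
        session_id = session.get("session_id", "?")
        for message in session.get("messages", []):
            for field in ("content", "thinking"):
                value = message.get(field)
                if isinstance(value, str):
                    for email in EMAIL.findall(value):
                        hits.append((session_id, email))
    return hits
```

To use it, read the exported file into a string (its location depends on where `dataclaw export --no-push` wrote it) and pass the text to `find_emails`. An empty result only means this one pattern found nothing, not that the data is clean.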

To help improve redaction, report issues: https://github.com/banodoco/dataclaw/issues

Data schema

Each line in conversations.jsonl is one session:

{
  "session_id": "abc-123",
  "project": "my-project",
  "model": "claude-opus-4-6",
  "git_branch": "main",
  "start_time": "2025-06-15T10:00:00+00:00",
  "end_time": "2025-06-15T10:30:00+00:00",
  "messages": [
    {"role": "user", "content": "Fix the login bug", "timestamp": "..."},
    {
      "role": "assistant",
      "content": "I'll investigate the login flow.",
      "thinking": "The user wants me to look at...",
      "tool_uses": [{"tool": "Read", "input": "src/auth.py"}],
      "timestamp": "..."
    }
  ],
  "stats": {
    "user_messages": 5, "assistant_messages": 8,
    "tool_uses": 20, "input_tokens": 50000, "output_tokens": 3000
  }
}

Each HF repo also includes a metadata.json with aggregate stats.
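
Because every line is a self-contained JSON object, the file can be processed with the standard library alone. The following sketch computes a few aggregate numbers from the fields shown above. The function name `summarize_sessions` and the shape of the returned dict are assumptions for illustration, and they do not claim to match the layout of metadata.json.

```python
import json
from collections import Counter


def summarize_sessions(jsonl_text: str) -> dict:
    """Aggregate session count, model distribution, and token totals."""
    sessions = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    models = Counter(s.get("model", "unknown") for s in sessions)
    input_tokens = sum(s.get("stats", {}).get("input_tokens", 0) for s in sessions)
    output_tokens = sum(s.get("stats", {}).get("output_tokens", 0) for s in sessions)
    return {
        "sessions": len(sessions),
        "models": dict(models),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```

Missing `model` or `stats` fields fall back to defaults rather than raising, so a partially populated session does not abort the whole summary.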

Finding datasets on Hugging Face

All repos are named {username}/dataclaw-{username} and tagged dataclaw.

Browse all: huggingface.co/datasets?other=dataclaw

Load one:

from datasets import load_dataset
ds = load_dataset("alice/dataclaw-alice", split="train")

Combine several:

from datasets import load_dataset, concatenate_datasets
repos = ["alice/dataclaw-alice", "bob/dataclaw-bob"]
ds = concatenate_datasets([load_dataset(r, split="train") for r in repos])
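
Once loaded, sessions usually need reshaping before training. As a rough sketch, the helper below turns one session into prompt/response pairs by matching each user message with the first assistant message that follows it. The name `to_pairs` and the output keys are hypothetical, and the code assumes each row exposes `messages` as a list of dicts in the layout shown under "Data schema".

```python
def to_pairs(session: dict) -> list[dict]:
    """Pair each user message with the next assistant message after it."""
    pairs = []
    pending_prompt = None
    for message in session.get("messages", []):
        role = message.get("role")
        if role == "user":
            # A newer user message replaces one that never got a reply.
            pending_prompt = message.get("content", "")
        elif role == "assistant" and pending_prompt is not None:
            pairs.append({"prompt": pending_prompt, "response": message.get("content", "")})
            pending_prompt = None
    return pairs
```

Later assistant messages in the same turn (for example, follow-ups after tool calls) are dropped by this pairing rule. Whether that is acceptable depends on the training setup.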

The auto-generated HF README includes:

Model distribution (which Claude models, how many sessions each)
Total token counts
Project count
Last updated timestamp

License

MIT

Release history

0.4.1: Apr 18, 2026
0.4.0: Apr 2, 2026
0.3.2: Feb 26, 2026
0.3.1: Feb 26, 2026
0.3.0: Feb 26, 2026
0.2.1: Feb 24, 2026
0.2.0: Feb 24, 2026
0.1.0: Feb 24, 2026 (this version)

Download files

Source distribution: dataclaw-0.1.0.tar.gz (31.0 kB), uploaded Feb 24, 2026. Tags: Source.

Built distribution: dataclaw-0.1.0-py3-none-any.whl (21.3 kB), uploaded Feb 24, 2026. Tags: Python 3.

Both files were uploaded via twine/6.2.0 CPython/3.11.11, without Trusted Publishing.

Hashes for dataclaw-0.1.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c4219c0c4fb92a21488f79931a8a5b7f24ffa3401aa261b1aaa9cdff92c298fa` |
| MD5 | `9f0d8557a2cadacc7472aa983c4d166c` |
| BLAKE2b-256 | `c247a18bd1a9c005df8fb309695d941d4c08e6789ef6bd6dcd4470f496749393` |

Hashes for dataclaw-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `b9ebd372c7841e59b7c6fce0aed151b2da8565e47a251886d0db4c44600ed56c` |
| MD5 | `d0ee9258622b071feac51ce416f34dd5` |
| BLAKE2b-256 | `7c9710433919b0ef0f6975c65c3ef64be91e34b7664fbcacb9f0dac0ffee57d2` |
