Export your coding agent conversations to Hugging Face as structured data

DataClaw

This is a performance art project. Anthropic built their models on the world's freely shared information, then introduced increasingly dystopian data policies to stop anyone else from doing the same with their data - pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.

Turn your Claude Code, Codex, and other coding-agent conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.

Every export is tagged dataclaw on Hugging Face. Together, they may someday form a growing distributed dataset of real-world human-AI coding collaboration.

Give this to your agent

Paste this into Claude Code, Codex, or any coding agent:

Help me export my Claude Code, Codex, and other coding-agent conversation history to Hugging Face using DataClaw.
Install it, then walk me through the process.

STEP 1 - INSTALL
  pip install -U dataclaw
  If that fails: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install -U /tmp/dataclaw
  If that also fails, ask the user where the source is.

STEP 2 - INSTALL SKILL
  Skill support is currently only available for Claude Code.
  dataclaw update-skill claude
  For other agentic tools, skip this step and do not improvise a custom flow - follow the instructions in DataClaw's output on each step, especially next_steps and next_command.

STEP 3 - PREP
  dataclaw prep
  Every dataclaw command outputs next_steps in its JSON - follow them through the entire flow.

STEP 3A - CHOOSE SOURCE SCOPE (REQUIRED BEFORE EXPORT)
  Ask the user explicitly which source scope to export: a supported source key such as claude or codex, or all.
  dataclaw config --source all
  Do not export until source scope is explicitly confirmed.

STEP 3B - CHOOSE PROJECT SCOPE (REQUIRED BEFORE EXPORT)
  dataclaw list --source all
  Send the FULL project/folder list to the user in a message (name, source, sessions, size, excluded).
  Ask which projects to exclude.
  dataclaw config --exclude "project1,project2" OR dataclaw config --confirm-projects
  Do not export until folder selection is explicitly confirmed.

STEP 3C - SET REDACTED STRINGS
  Ask the user what additional strings should always be redacted, such as company names, client names, domains, internal URLs, or secrets that regex might miss.
  dataclaw config --redact "string1,string2"
  dataclaw config --redact-usernames "user1,user2"
  Only add these after explicit user confirmation.

STEP 4 - EXPORT LOCALLY
  dataclaw export --no-push --output dataclaw_export.jsonl

STEP 5 - REVIEW AND CONFIRM (REQUIRED BEFORE PUSH)
  Review PII findings and apply excludes/redactions as needed.
  Full name is requested for an exact-name privacy scan against the export.
  If the user declines sharing full name, use --skip-full-name-scan and attest the skip reason.
  dataclaw confirm --full-name "THEIR FULL NAME" --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."

STEP 6 - PUBLISH (ONLY AFTER EXPLICIT USER APPROVAL)
  dataclaw export --publish-attestation "User explicitly approved publishing to Hugging Face."
  Never publish unless the user explicitly says yes.

IF ANY COMMAND FAILS DUE TO A SKIPPED STEP:
  Restate the 6-step checklist above and resume from the blocked step (do not skip ahead).

IMPORTANT: Never run bare `hf auth login` when automating this with an agent - always use `--token`.
IMPORTANT: Always export with --no-push first and review for PII before publishing.
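
The prompt above depends on each command's JSON output carrying next_steps and next_command. As a rough sketch of how a wrapper script might read those two fields, assuming the output is a single JSON object, that next_steps is a list of strings, and that next_command is a string (the sample payload, including its "stage" key, is invented for illustration and is not real DataClaw output):

```python
import json


def read_next_actions(stdout_text):
    """Extract follow-up guidance from one command's JSON output.

    Assumption: the output is one JSON object. The shapes of next_steps
    (list of strings) and next_command (string) are guesses, not a
    documented contract.
    """
    payload = json.loads(stdout_text)
    steps = payload.get("next_steps") or []
    if isinstance(steps, str):
        # Tolerate a single string in place of a list.
        steps = [steps]
    command = payload.get("next_command")
    return list(steps), command


# Hypothetical payload, made up for this example.
sample = (
    '{"stage": "prep", '
    '"next_steps": ["Choose a source scope"], '
    '"next_command": "dataclaw config --source all"}'
)
steps, command = read_next_actions(sample)
print(steps)
print(command)
```

In practice the text would come from the captured stdout of a dataclaw command; a missing field simply yields an empty list or None here rather than an error.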

Manual usage (without an agent)

# STEP 1 - INSTALL
pip install -U dataclaw
hf auth login --token YOUR_TOKEN

# STEP 3 - PREP
dataclaw prep
dataclaw config --repo username/my-personal-codex-data

# STEP 3A - CHOOSE SOURCE SCOPE
dataclaw config --source all  # REQUIRED: choose a supported source key or all

# STEP 3B - CHOOSE PROJECT SCOPE
dataclaw list --source all  # Present full list and confirm folder scope before export
dataclaw config --exclude "personal-stuff,scratch"  # or: dataclaw config --confirm-projects

# STEP 3C - SET REDACTED STRINGS
dataclaw config --redact-usernames "my_github_handle,my_discord_name"
dataclaw config --redact "my-domain.com,my-secret-project"

# STEP 4 - EXPORT LOCALLY
dataclaw export --no-push

# STEP 5 - REVIEW AND CONFIRM
dataclaw confirm \
  --full-name "YOUR FULL NAME" \
  --attest-full-name "Asked for full name and scanned export for YOUR FULL NAME." \
  --attest-sensitive "Asked about company/client/internal names and private URLs; none found or redactions updated." \
  --attest-manual-scan "Manually scanned 20 sessions across beginning/middle/end and reviewed findings."

# Or: if user declines sharing full name
dataclaw confirm \
  --skip-full-name-scan \
  --attest-full-name "User declined to share full name; skipped exact-name scan." \
  --attest-sensitive "Asked about company/client/internal names and private URLs; none found or redactions updated." \
  --attest-manual-scan "Manually scanned 20 sessions across beginning/middle/end and reviewed findings."

# STEP 6 - PUBLISH
dataclaw export --publish-attestation "User explicitly approved publishing to Hugging Face."

Step 2 (INSTALL SKILL) is omitted in manual usage.

Commands

Command	Description
`dataclaw status`	Show current stage and next steps
`dataclaw prep`	Discover projects, check HF auth, output JSON
`dataclaw prep --source <source|all>`	Prep with an explicit source scope
`dataclaw list`	List all projects with exclusion status
`dataclaw list --source <source|all>`	List projects for a specific source scope
`dataclaw config`	Show current config
`dataclaw config --repo user/my-personal-codex-data`	Set HF repo
`dataclaw config --source <source|all>`	REQUIRED source scope selection (examples include `claude`, `codex`, and others)
`dataclaw config --exclude "a,b"`	Add excluded projects (appends)
`dataclaw config --redact "str1,str2"`	Add strings to always redact (appends)
`dataclaw config --redact-usernames "u1,u2"`	Add usernames to anonymize (appends)
`dataclaw config --confirm-projects`	Mark project selection as confirmed
`dataclaw export --no-push`	Export locally only (always do this first)
`dataclaw export --source <source|all> --no-push`	Export a chosen source scope locally
`dataclaw confirm --full-name "NAME" --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."`	Scan for PII, run exact-name privacy check, verify review attestations, unlock pushing
`dataclaw confirm --skip-full-name-scan --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."`	Skip exact-name scan when user declines sharing full name (requires skip attestation)
`dataclaw export --publish-attestation "..."`	Export and push (requires `dataclaw confirm` first)
`dataclaw export --all-projects`	Include everything (ignore exclusions)
`dataclaw export --no-thinking`	Exclude extended thinking blocks
`dataclaw jsonl-to-yaml [input.jsonl]`	Convert an export JSONL file to human-readable YAML
`dataclaw diff-jsonl --old old.jsonl --new new.jsonl`	Structurally diff two export JSONL files and write YAML
`dataclaw update-skill claude`	Install/update the dataclaw skill for Claude Code

Set DATACLAW_WORKERS to control the worker count used by parallel operations such as export, confirm, and diff-jsonl.
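
If you drive DataClaw from a script, one way to set the variable is to pass a modified environment to the child process. The helper below is a hypothetical sketch; it assumes DATACLAW_WORKERS takes a positive integer, which the text above does not spell out.

```python
import os


def env_with_workers(workers, base_env=None):
    """Return a copy of the environment with DATACLAW_WORKERS set.

    Assumption: the variable expects a positive integer.
    """
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["DATACLAW_WORKERS"] = str(workers)
    return env


env = env_with_workers(4, base_env={"PATH": "/usr/bin"})
print(env["DATACLAW_WORKERS"])

# The result can be handed to a subprocess call, for example:
#   subprocess.run(["dataclaw", "export", "--no-push"], env=env_with_workers(4))
```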

What gets exported

User messages - Including voice transcripts and images
Assistant responses
Assistant thinking - Opt out with --no-thinking
Tool calls - Tool name, inputs, outputs
Token usage - Input/output tokens per session
Metadata - Model name, git branch, timestamps

Privacy & Redaction

DataClaw applies multiple layers of protection:

Username redaction - Your OS username and any configured usernames are replaced with stable hashes
Secret redaction - Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more
Entropy analysis - Long high-entropy strings in quotes are flagged as potential secrets
Email redaction - Regex pattern catches email addresses
Custom redaction - You can configure additional strings to redact
Tool call redaction - Tool inputs and outputs are redacted with the same standard as regular messages
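
To make three of these layers concrete, here is a minimal sketch of stable username hashing, email redaction, and entropy-based flagging of quoted strings. It is not DataClaw's implementation: the regexes, the placeholder strings, the 20-character minimum, and the 4.0-bit threshold are all assumptions chosen for illustration.

```python
import hashlib
import math
import re
from collections import Counter

# Illustrative patterns only; a real redactor would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
QUOTED_RE = re.compile(r"[\"']([^\"'\s]{20,})[\"']")


def stable_alias(username):
    """Map a username to the same short hash-based alias every time."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()[:8]
    return f"user_{digest}"


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact(text, usernames, entropy_threshold=4.0):
    """Apply username, email, and quoted high-entropy redaction in that order."""
    for name in usernames:
        text = re.sub(re.escape(name), stable_alias(name), text)
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)

    def mask(match):
        token = match.group(1)
        if shannon_entropy(token) >= entropy_threshold:
            return match.group(0).replace(token, "[REDACTED_SECRET]")
        return match.group(0)

    return QUOTED_RE.sub(mask, text)


print(redact('alice wrote key = "aB3xQ9zL0pW7mK2vR8tY" to bob@example.com', ["alice"]))
```

Because the alias is derived from a hash of the name, the same user maps to the same token across sessions, which keeps conversations linkable without exposing the original username.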

This is NOT foolproof. Always review your exported data before publishing. Automated redaction cannot catch everything - especially service-specific identifiers, third-party PII, or secrets in unusual formats.

We recommend converting the exported JSONL into human-readable YAML with dataclaw jsonl-to-yaml, then scanning it with tools such as trufflehog and gitleaks. You can also compare the exported JSONL with a previous baseline using dataclaw diff-jsonl.
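
As a complement to those tools, a short script can check the export for specific strings you care about, such as your full name or a client name. This is a hypothetical helper, not part of DataClaw; it assumes one JSON object per line with an optional session_id field, as in the schema described in this README.

```python
import json


def find_string_hits(jsonl_lines, needle):
    """Return (line_number, session_id) for each session containing needle.

    The match is case-insensitive and runs over the re-serialized session,
    so needles containing quotes, backslashes, or newlines may be missed.
    """
    needle_lower = needle.lower()
    hits = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue
        session = json.loads(line)
        haystack = json.dumps(session, ensure_ascii=False).lower()
        if needle_lower in haystack:
            hits.append((line_number, session.get("session_id")))
    return hits


# Typical use against a local export (file name taken from the steps above):
#   with open("dataclaw_export.jsonl", encoding="utf-8") as fh:
#       print(find_string_hits(fh, "YOUR FULL NAME"))
```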

To help improve redaction, report issues: https://github.com/banodoco/dataclaw/issues

Data schema

Each line in conversations.jsonl is one session:

{
  "session_id": "abc-123",
  "project": "my-project",
  "model": "claude-opus-4-6",
  "git_branch": "main",
  "start_time": "2025-06-15T10:00:00+00:00",
  "end_time": "2025-06-15T10:30:00+00:00",
  "messages": [
    {
      "role": "user",
      "content": "Fix the login bug",
      "content_parts": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
      ],
      "timestamp": "..."
    },
    {
      "role": "assistant",
      "content": "I'll investigate the login flow.",
      "thinking": "The user wants me to look at...",
      "tool_uses": [
        {
          "tool": "bash",
          "input": {"command": "grep -r 'login' src/"},
          "output": {
            "text": "src/auth.py:42: def login(user, password):",
            "raw": {"stderr": "", "interrupted": false}
          },
          "status": "success"
        }
      ],
      "timestamp": "..."
    }
  ],
  "stats": {
    "user_messages": 5, "assistant_messages": 8,
    "tool_uses": 20, "input_tokens": 50000, "output_tokens": 3000
  }
}

messages[].content_parts is optional and preserves structured user content such as attachments when the source provides them. The canonical human-readable user text remains in messages[].content.

tool_uses[].output.raw is optional and preserves extra structured tool-result fields when the source provides them. The canonical human-readable result text remains in tool_uses[].output.text.
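
A reader of this format therefore has to treat content_parts, tool_uses, and output.raw as optional. The sketch below walks one session record shaped like the example above and tallies messages, tool calls, and image parts; the function name and the returned keys are made up for illustration.

```python
import json
from collections import Counter


def summarize_session(session):
    """Count messages, tool calls per tool, and image parts in one session."""
    tool_counts = Counter()
    image_parts = 0
    messages = session.get("messages") or []
    for message in messages:
        # content_parts is optional; treat a missing field as empty.
        for part in message.get("content_parts") or []:
            if part.get("type") == "image":
                image_parts += 1
        # tool_uses is only present on messages that called tools.
        for use in message.get("tool_uses") or []:
            tool_counts[use.get("tool", "unknown")] += 1
    return {
        "session_id": session.get("session_id"),
        "messages": len(messages),
        "tool_counts": dict(tool_counts),
        "image_parts": image_parts,
    }


line = json.dumps({
    "session_id": "abc-123",
    "messages": [
        {"role": "user", "content": "Fix the login bug",
         "content_parts": [{"type": "image", "source": {"type": "base64"}}]},
        {"role": "assistant", "content": "I'll investigate the login flow.",
         "tool_uses": [{"tool": "bash", "input": {}, "output": {"text": ""}}]},
    ],
})
summary = summarize_session(json.loads(line))
print(summary)
```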

Each HF repo also includes a metadata.json with aggregate stats.

Finding datasets on Hugging Face

All repos are tagged dataclaw.

Browse all: huggingface.co/datasets?other=dataclaw

Load one:

from datasets import load_dataset
ds = load_dataset("alice/my-personal-codex-data", split="train")

Combine several:

from datasets import load_dataset, concatenate_datasets
repos = ["alice/my-personal-codex-data", "bob/my-personal-codex-data"]
ds = concatenate_datasets([load_dataset(r, split="train") for r in repos])

The auto-generated HF README includes:

Model distribution (which models, how many sessions each)
Total token counts
Project count
Last updated timestamp
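
If you want similar aggregates for your own analysis, they can be recomputed from the session records. The following is a sketch under the schema shown earlier, not the code DataClaw uses to build its README; the output key names are invented, and "example-model" is a placeholder model name.

```python
from collections import Counter


def aggregate_stats(sessions):
    """Compute sessions per model, token totals, and distinct project count."""
    models = Counter()
    projects = set()
    input_tokens = 0
    output_tokens = 0
    for session in sessions:
        models[session.get("model") or "unknown"] += 1
        if session.get("project"):
            projects.add(session["project"])
        stats = session.get("stats") or {}
        input_tokens += stats.get("input_tokens", 0)
        output_tokens += stats.get("output_tokens", 0)
    return {
        "sessions_per_model": dict(models),
        "total_input_tokens": input_tokens,
        "total_output_tokens": output_tokens,
        "project_count": len(projects),
    }


sessions = [
    {"model": "claude-opus-4-6", "project": "my-project",
     "stats": {"input_tokens": 50000, "output_tokens": 3000}},
    {"model": "claude-opus-4-6", "project": "other-project",
     "stats": {"input_tokens": 1000, "output_tokens": 200}},
    {"model": "example-model", "project": "my-project", "stats": {}},
]
totals = aggregate_stats(sessions)
print(totals)
```

The same function works on rows loaded with load_dataset, since each row is one session.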

Contributing

Missing data: If you find data that was not exported, please open an issue. You can also ask your coding agent to analyze the missing data, add export support for it in this repo, and open a PR.

Better schema: If you need to clean the data and want to propose a better schema, feel free to open an issue.

New provider: If you use a coding agent that is not yet supported, you can ask it to read this repo and export its data as a new provider. Use the Claude Code and Codex parsers as examples because they are the best maintained. When the work is done, ask the following questions:

Did you follow the schema above? Currently you are free to add custom fields in messages[].content_parts and tool_uses[].output.raw.
Did you export all data, especially:
- tool call inputs and outputs
- long inputs and outputs that may be saved somewhere else
- binary content such as images (may be encoded as base64). The anonymizer is not applied to binary content
- subagents
Does the coding agent automatically delete old sessions? If so, how can that be prevented?
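
Part of the first question can be checked mechanically. The validator below is a hypothetical sketch: which fields count as required is an assumption drawn from the example record in the Data schema section, not a rule stated by the project.

```python
# Assumed required fields, based only on the example session record.
REQUIRED_SESSION_KEYS = ("session_id", "project", "model", "messages")
REQUIRED_TOOL_USE_KEYS = ("tool", "input", "output")


def validate_session(session):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_SESSION_KEYS:
        if key not in session:
            problems.append(f"missing session field: {key}")
    for i, message in enumerate(session.get("messages") or []):
        role = message.get("role")
        if role not in ("user", "assistant"):
            problems.append(f"messages[{i}]: unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}]: content should be a string")
        for j, use in enumerate(message.get("tool_uses") or []):
            for key in REQUIRED_TOOL_USE_KEYS:
                if key not in use:
                    problems.append(f"messages[{i}].tool_uses[{j}]: missing {key}")
    return problems


good = {
    "session_id": "abc-123", "project": "my-project", "model": "claude-opus-4-6",
    "messages": [{"role": "user", "content": "Fix the login bug"}],
}
bad = {
    "session_id": "x",
    "messages": [{"role": "system", "content": None, "tool_uses": [{"tool": "bash"}]}],
}
print(validate_session(good))
print(validate_session(bad))
```

Running a new parser's output through a check like this catches dropped tool inputs or outputs early, though it cannot tell whether data stored elsewhere (long outputs, subagent logs) was missed.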

License

MIT

Release history

Version	Release date
0.4.1 (this version)	Apr 18, 2026
0.4.0	Apr 2, 2026
0.3.2	Feb 26, 2026
0.3.1	Feb 26, 2026
0.3.0	Feb 26, 2026
0.2.1	Feb 24, 2026
0.2.0	Feb 24, 2026
0.1.0	Feb 24, 2026

Download files

Source Distribution

dataclaw-0.4.1.tar.gz (107.8 kB)

Uploaded Apr 18, 2026 Source

Built Distribution

dataclaw-0.4.1-py3-none-any.whl (84.1 kB)

Uploaded Apr 18, 2026 Python 3

File details

Details for the file dataclaw-0.4.1.tar.gz.

File metadata

Download URL: dataclaw-0.4.1.tar.gz
Upload date: Apr 18, 2026
Size: 107.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataclaw-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`419f274cc34a6ad46489fb67a4c81bbf448760642d961f899bfc75f6f97de840`
MD5	`58265cb3c657f15989d191f3a1d25b78`
BLAKE2b-256	`b8376b0154abc400011ef0d086e153aa2f4c68c76d3b703ee305add1f1be406d`
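
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a generic sketch; the helper names are made up, and the local file path is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example use with the digest published for the source distribution:
#   matches_published_hash(
#       "dataclaw-0.4.1.tar.gz",
#       "419f274cc34a6ad46489fb67a4c81bbf448760642d961f899bfc75f6f97de840",
#   )
```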

Provenance

The following attestation bundles were made for dataclaw-0.4.1.tar.gz:

Publisher: publish.yml on peteromallet/dataclaw

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataclaw-0.4.1.tar.gz
- Subject digest: 419f274cc34a6ad46489fb67a4c81bbf448760642d961f899bfc75f6f97de840
- Sigstore transparency entry: 1338665503
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: peteromallet/dataclaw@150995378c5e8e942f91108c42936d22846867f5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/peteromallet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@150995378c5e8e942f91108c42936d22846867f5
- Trigger Event: push

File details

Details for the file dataclaw-0.4.1-py3-none-any.whl.

File metadata

Download URL: dataclaw-0.4.1-py3-none-any.whl
Upload date: Apr 18, 2026
Size: 84.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataclaw-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2ff0b77378402057051168729a250b995f9642a0db239b48839ebccdb7e86eb8`
MD5	`7ef80eb0eac91ffc7878d0c5088176c1`
BLAKE2b-256	`8fde7d0ebb8f6bb321297445bf3826efc0bba1d461fcbfcffc6da4661b6800fa`

Provenance

The following attestation bundles were made for dataclaw-0.4.1-py3-none-any.whl:

Publisher: publish.yml on peteromallet/dataclaw

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataclaw-0.4.1-py3-none-any.whl
- Subject digest: 2ff0b77378402057051168729a250b995f9642a0db239b48839ebccdb7e86eb8
- Sigstore transparency entry: 1338665527
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: peteromallet/dataclaw@150995378c5e8e942f91108c42936d22846867f5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/peteromallet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@150995378c5e8e942f91108c42936d22846867f5
- Trigger Event: push
