Skip to main content

A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.

Project description

Logo

A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.

Claude Code stores every session as a JSONL file on disk under ~/.claude/projects/<encoded-project-path>/<session-uuid>.jsonl. Each line is a JSON record containing the full message history — user prompts, assistant responses, tool calls, tool results, and extended thinking blocks. This module parses that format and flattens it into the simple [{"role": ..., "content": ...}] list that tokenizer.apply_chat_template() consumes directly.

Why this exists

The conversion pipeline has two stages that are often conflated:

Claude Code JSONL  →  [claude_converter]  →  messages[]  →  apply_chat_template()  →  tokens
                         (this module)                        (transformers handles this)

apply_chat_template() handles stage 2 (turning a messages list into model-specific token sequences), but it knows nothing about Claude Code's JSONL format. This module handles stage 1.

Installation

uv pip install claude-converter

Quick start

from claude_converter import session_to_messages

messages = session_to_messages("path/to/session.jsonl")

messages is ready for apply_chat_template().

API reference

load_session(path)

Loads a Claude Code .jsonl file and returns the raw list of record dicts — one per line.

from claude_converter import load_session

records = load_session("session.jsonl")
# [{"type": "user", "uuid": "...", "message": {...}, ...}, ...]

Raises FileNotFoundError if the file doesn't exist, ValueError if the extension is not .jsonl or .json, or if the file contains no valid records.

session_to_messages(path, output=None)

Loads a session and converts it to a messages list in one call. Optionally saves the result to a JSON file.

from claude_converter import session_to_messages

# In-memory only
messages = session_to_messages("session.jsonl")

# Save to disk too
messages = session_to_messages("session.jsonl", output="messages.json")

Returns list[{"role": str, "content": str}]. Content blocks (tool_use, tool_result, thinking, text) are flattened to plain text using XML-style tags so the conversation structure is preserved.

records_to_messages(records)

Converts an already-loaded list of records into the messages format. Useful when you want to load once and run multiple transformations without re-reading the file.

from claude_converter import load_session, records_to_messages

records  = load_session("session.jsonl")
messages = records_to_messages(records)

inspect_session(path, show_flow=False, show_blocks=False, show_raw=False)

Prints a color-coded report of a session: record type counts, content block breakdown by role, and token usage totals. The conversation flow is optional and off by default.

from claude_converter import inspect_session

# Stats only (default)
inspect_session("session.jsonl")

# Stats + timestamped conversation flow
inspect_session("session.jsonl", show_flow=True)

# Flow + content of every block inline
inspect_session("session.jsonl", show_flow=True, show_blocks=True)

# Full report: flow, blocks, and raw record structure examples
inspect_session("session.jsonl", show_flow=True, show_blocks=True, show_raw=True)

Example output (default):

Logo

Using the output with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages

MODEL_ID = "your-model-id"

messages  = session_to_messages("session.jsonl", output="messages.json")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

tokens = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

Fine-tuning local models

Claude Code sessions are a natural source of training data: they capture real coding conversations (tool calls, reasoning traces, multi-turn edits) in a format that maps directly to the messages list expected by every major fine-tuning framework.

Preprocessing recommendations

Before training, filter the messages list for your target model:

  • Strip tool blocks: most local models don't understand <tool_use> / <tool_result> tags. Remove or replace them unless you are training a tool-calling model.
  • Strip thinking blocks: <thinking> blocks leak chain-of-thought that may not generalize. Keep them only if you are distilling reasoning behavior.
  • Drop short or empty turns: single-word assistant replies and empty user turns add noise.
def clean_messages(messages, keep_tool_calls=False, keep_thinking=False):
    import re
    cleaned = []
    for msg in messages:
        content = msg["content"]
        if not keep_thinking:
            content = re.sub(r"<thinking>.*?</thinking>", "", content, flags=re.DOTALL)
        if not keep_tool_calls:
            content = re.sub(r"<tool_use.*?>.*?</tool_use>", "", content, flags=re.DOTALL)
            content = re.sub(r"<tool_result>.*?</tool_result>", "", content, flags=re.DOTALL)
        content = content.strip()
        if content:
            cleaned.append({"role": msg["role"], "content": content})
    return cleaned

TRL / SFTTrainer example

from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages
import glob

MODEL_ID = "your-model-id"

all_messages = []
for path in glob.glob("~/.claude/projects/**/*.jsonl", recursive=True):
    msgs = session_to_messages(path)
    msgs = clean_messages(msgs)
    if len(msgs) >= 2:
        all_messages.append({"messages": msgs})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model     = AutoModelForCausalLM.from_pretrained(MODEL_ID)
dataset   = Dataset.from_list(all_messages)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=4096),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

Axolotl / LLaMA-Factory

Both frameworks accept a sharegpt format, which is structurally identical to the messages list this module produces. Save the cleaned messages list with output="messages.json" and point your framework config at that file.

Content block mapping

Claude Code sessions contain several block types that have no direct equivalent in the Transformers messages format. This module flattens them as follows:

Claude Code block Flattened as
text plain text, as-is
thinking <thinking>...</thinking>
tool_use <tool_use name='...'>{input JSON}</tool_use>
tool_result <tool_result>...</tool_result>
system records skipped (not included in messages output)

Records with an empty content string after flattening are also skipped.

Where Claude Code stores sessions

~/.claude/projects/
└── <url-encoded-project-path>/
    └── <session-uuid>.jsonl

The project path is URL-encoded: /home/user/myapp becomes -home-user-myapp. Each session is a separate file, append-only, one JSON object per line.

Limitations

  • Graph structure: Claude Code sessions are directed acyclic graphs linked by parentUuid. This module reads lines in file order (linear), which is correct for the vast majority of sessions. Branched or multi-agent sessions may need custom traversal.
  • Tool call fidelity: Most local models don't natively understand <tool_use> or <tool_result> tags. For inference-only use cases, consider stripping those blocks before passing to apply_chat_template().
  • No streaming: The module loads the full file into memory. For very large sessions, use load_session() and process records in batches.
  • Data quality for fine-tuning: Sessions include failed attempts, retries, and exploratory tool calls. Blindly training on raw sessions will teach the model bad habits. Filter aggressively: keep only sessions where the final assistant turn solves the stated problem, and prefer sessions with a high ratio of text blocks to tool_use blocks unless tool-calling is your training target.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

claude_converter-1.2.0.tar.gz (13.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

claude_converter-1.2.0-py3-none-any.whl (13.1 kB view details)

Uploaded Python 3

File details

Details for the file claude_converter-1.2.0.tar.gz.

File metadata

  • Download URL: claude_converter-1.2.0.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for claude_converter-1.2.0.tar.gz
Algorithm Hash digest
SHA256 c0e4f717a308d3a1d0c430e409eb6dcacedaba0644562de1b59b405d86abee75
MD5 626e373aaf60e5625b6fa0035f57f928
BLAKE2b-256 ce63089862c83908a462c061c288c9bee5466452322635a6eec7ff7bebe4c6c9

See more details on using hashes here.

File details

Details for the file claude_converter-1.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for claude_converter-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 12a95ca457ce4ee6c8710cca50475962928fb960353294f9bb9f2c3a093a0cf9
MD5 65956bca09bd6bf80fa8c2203f0ef84f
BLAKE2b-256 7b737ba1589f59a95423639c8ebcb25cbf061508910b06336df960c69f53c694

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page