claude-converter

A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.

These details have not been verified by PyPI

Project links

Project description

A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.

Claude Code stores every session as a JSONL file on disk under ~/.claude/projects/<encoded-project-path>/<session-uuid>.jsonl. Each line is a JSON record containing the full message history — user prompts, assistant responses, tool calls, tool results, and extended thinking blocks. This module parses that format and flattens it into the simple [{"role": ..., "content": ...}] list that tokenizer.apply_chat_template() consumes directly.

Why this exists

The conversion pipeline has two stages that are often conflated:

Claude Code JSONL  →  [claude_converter]  →  messages[]  →  apply_chat_template()  →  tokens
                         (this module)                        (transformers handles this)

apply_chat_template() handles stage 2 (turning a messages list into model-specific token sequences), but it knows nothing about Claude Code's JSONL format. This module handles stage 1.

Installation

uv pip install claude-converter

Quick start

from claude_converter import session_to_messages

messages = session_to_messages("path/to/session.jsonl")

messages is ready for apply_chat_template().

API reference

`load_session(path)`

Loads a Claude Code .jsonl file and returns the raw list of record dicts — one per line.

from claude_converter import load_session

records = load_session("session.jsonl")
# [{"type": "user", "uuid": "...", "message": {...}, ...}, ...]

Raises FileNotFoundError if the file doesn't exist, ValueError if the extension is not .jsonl or .json, or if the file contains no valid records.

`session_to_messages(path, output=None)`

Loads a session and converts it to a messages list in one call. Optionally saves the result to a JSON file.

from claude_converter import session_to_messages

# In-memory only
messages = session_to_messages("session.jsonl")

# Save to disk too
messages = session_to_messages("session.jsonl", output="messages.json")

Returns list[{"role": str, "content": str}]. Content blocks (tool_use, tool_result, thinking, text) are flattened to plain text using XML-style tags so the conversation structure is preserved.

`records_to_messages(records)`

Converts an already-loaded list of records into the messages format. Useful when you want to load once and run multiple transformations without re-reading the file.

from claude_converter import load_session, records_to_messages

records  = load_session("session.jsonl")
messages = records_to_messages(records)

`inspect_session(path, show_flow=False, show_blocks=False, show_raw=False)`

Prints a color-coded report of a session: record type counts, content block breakdown by role, and token usage totals. The conversation flow is optional and off by default.

from claude_converter import inspect_session

# Stats only (default)
inspect_session("session.jsonl")

# Stats + timestamped conversation flow
inspect_session("session.jsonl", show_flow=True)

# Flow + content of every block inline
inspect_session("session.jsonl", show_flow=True, show_blocks=True)

# Full report: flow, blocks, and raw record structure examples
inspect_session("session.jsonl", show_flow=True, show_blocks=True, show_raw=True)

Example output (default):

Using the output with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages

MODEL_ID = "your-model-id"

messages  = session_to_messages("session.jsonl", output="messages.json")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

tokens = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

Fine-tuning local models

Claude Code sessions are a natural source of training data: they capture real coding conversations (tool calls, reasoning traces, multi-turn edits) in a format that maps directly to the messages list expected by every major fine-tuning framework.

Preprocessing recommendations

Before training, filter the messages list for your target model:

Strip tool blocks: most local models don't understand <tool_use> / <tool_result> tags. Remove or replace them unless you are training a tool-calling model.
Strip thinking blocks: <thinking> blocks leak chain-of-thought that may not generalize. Keep them only if you are distilling reasoning behavior.
Drop short or empty turns: single-word assistant replies and empty user turns add noise.

def clean_messages(messages, keep_tool_calls=False, keep_thinking=False):
    import re
    cleaned = []
    for msg in messages:
        content = msg["content"]
        if not keep_thinking:
            content = re.sub(r"<thinking>.*?</thinking>", "", content, flags=re.DOTALL)
        if not keep_tool_calls:
            content = re.sub(r"<tool_use.*?>.*?</tool_use>", "", content, flags=re.DOTALL)
            content = re.sub(r"<tool_result>.*?</tool_result>", "", content, flags=re.DOTALL)
        content = content.strip()
        if content:
            cleaned.append({"role": msg["role"], "content": content})
    return cleaned

TRL / SFTTrainer example

from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages
import glob

MODEL_ID = "your-model-id"

all_messages = []
for path in glob.glob("~/.claude/projects/**/*.jsonl", recursive=True):
    msgs = session_to_messages(path)
    msgs = clean_messages(msgs)
    if len(msgs) >= 2:
        all_messages.append({"messages": msgs})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model     = AutoModelForCausalLM.from_pretrained(MODEL_ID)
dataset   = Dataset.from_list(all_messages)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=4096),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

Axolotl / LLaMA-Factory

Both frameworks accept a sharegpt format, which is structurally identical to the messages list this module produces. Save the cleaned messages list with output="messages.json" and point your framework config at that file.

Content block mapping

Claude Code sessions contain several block types that have no direct equivalent in the Transformers messages format. This module flattens them as follows:

Claude Code block	Flattened as
`text`	plain text, as-is
`thinking`	`<thinking>...</thinking>`
`tool_use`	`<tool_use name='...'>{input JSON}</tool_use>`
`tool_result`	`<tool_result>...</tool_result>`
`system` records	skipped (not included in messages output)

Records with an empty content string after flattening are also skipped.

Where Claude Code stores sessions

~/.claude/projects/
└── <url-encoded-project-path>/
    └── <session-uuid>.jsonl

The project path is URL-encoded: /home/user/myapp becomes -home-user-myapp. Each session is a separate file, append-only, one JSON object per line.

Limitations

Graph structure: Claude Code sessions are directed acyclic graphs linked by parentUuid. This module reads lines in file order (linear), which is correct for the vast majority of sessions. Branched or multi-agent sessions may need custom traversal.
Tool call fidelity: Most local models don't natively understand <tool_use> or <tool_result> tags. For inference-only use cases, consider stripping those blocks before passing to apply_chat_template().
No streaming: The module loads the full file into memory. For very large sessions, use load_session() and process records in batches.
Data quality for fine-tuning: Sessions include failed attempts, retries, and exploratory tool calls. Blindly training on raw sessions will teach the model bad habits. Filter aggressively: keep only sessions where the final assistant turn solves the stated problem, and prefer sessions with a high ratio of text blocks to tool_use blocks unless tool-calling is your training target.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.2.0

Jun 27, 2026

1.0.0

Jun 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

claude_converter-1.2.0.tar.gz (13.1 kB view details)

Uploaded Jun 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

claude_converter-1.2.0-py3-none-any.whl (13.1 kB view details)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file claude_converter-1.2.0.tar.gz.

File metadata

Download URL: claude_converter-1.2.0.tar.gz
Upload date: Jun 27, 2026
Size: 13.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for claude_converter-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c0e4f717a308d3a1d0c430e409eb6dcacedaba0644562de1b59b405d86abee75`
MD5	`626e373aaf60e5625b6fa0035f57f928`
BLAKE2b-256	`ce63089862c83908a462c061c288c9bee5466452322635a6eec7ff7bebe4c6c9`

See more details on using hashes here.

File details

Details for the file claude_converter-1.2.0-py3-none-any.whl.

File metadata

Download URL: claude_converter-1.2.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 13.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for claude_converter-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`12a95ca457ce4ee6c8710cca50475962928fb960353294f9bb9f2c3a093a0cf9`
MD5	`65956bca09bd6bf80fa8c2203f0ef84f`
BLAKE2b-256	`7b737ba1589f59a95423639c8ebcb25cbf061508910b06336df960c69f53c694`

See more details on using hashes here.

claude-converter 1.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Why this exists

Installation

Quick start

API reference

load_session(path)

session_to_messages(path, output=None)

records_to_messages(records)

inspect_session(path, show_flow=False, show_blocks=False, show_raw=False)

Using the output with Transformers

Fine-tuning local models

Preprocessing recommendations

TRL / SFTTrainer example

Axolotl / LLaMA-Factory

Content block mapping

Where Claude Code stores sessions

Limitations

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`load_session(path)`

`session_to_messages(path, output=None)`

`records_to_messages(records)`

`inspect_session(path, show_flow=False, show_blocks=False, show_raw=False)`