A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.
Project description
A zero-dependency Python module for inspecting and converting Claude Code session files (.jsonl) into the messages format expected by Hugging Face Transformers.
Claude Code stores every session as a JSONL file on disk under ~/.claude/projects/<encoded-project-path>/<session-uuid>.jsonl. Each line is a JSON record containing the full message history — user prompts, assistant responses, tool calls, tool results, and extended thinking blocks. This module parses that format and flattens it into the simple [{"role": ..., "content": ...}] list that tokenizer.apply_chat_template() consumes directly.
Why this exists
The conversion pipeline has two stages that are often conflated:
Claude Code JSONL → [claude_converter] → messages[] → apply_chat_template() → tokens
(this module) (transformers handles this)
apply_chat_template() handles stage 2 (turning a messages list into model-specific token sequences), but it knows nothing about Claude Code's JSONL format. This module handles stage 1.
Installation
uv pip install claude-converter
Quick start
from claude_converter import session_to_messages
messages = session_to_messages("path/to/session.jsonl")
messages is ready for apply_chat_template().
API reference
load_session(path)
Loads a Claude Code .jsonl file and returns the raw list of record dicts — one per line.
from claude_converter import load_session
records = load_session("session.jsonl")
# [{"type": "user", "uuid": "...", "message": {...}, ...}, ...]
Raises FileNotFoundError if the file doesn't exist, ValueError if the extension is not .jsonl or .json, or if the file contains no valid records.
session_to_messages(path, output=None)
Loads a session and converts it to a messages list in one call. Optionally saves the result to a JSON file.
from claude_converter import session_to_messages
# In-memory only
messages = session_to_messages("session.jsonl")
# Save to disk too
messages = session_to_messages("session.jsonl", output="messages.json")
Returns list[{"role": str, "content": str}]. Content blocks (tool_use, tool_result, thinking, text) are flattened to plain text using XML-style tags so the conversation structure is preserved.
records_to_messages(records)
Converts an already-loaded list of records into the messages format. Useful when you want to load once and run multiple transformations without re-reading the file.
from claude_converter import load_session, records_to_messages
records = load_session("session.jsonl")
messages = records_to_messages(records)
inspect_session(path, show_flow=False, show_blocks=False, show_raw=False)
Prints a color-coded report of a session: record type counts, content block breakdown by role, and token usage totals. The conversation flow is optional and off by default.
from claude_converter import inspect_session
# Stats only (default)
inspect_session("session.jsonl")
# Stats + timestamped conversation flow
inspect_session("session.jsonl", show_flow=True)
# Flow + content of every block inline
inspect_session("session.jsonl", show_flow=True, show_blocks=True)
# Full report: flow, blocks, and raw record structure examples
inspect_session("session.jsonl", show_flow=True, show_blocks=True, show_raw=True)
Example output (default):
Using the output with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages
MODEL_ID = "your-model-id"
messages = session_to_messages("session.jsonl", output="messages.json")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokens = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
)
Fine-tuning local models
Claude Code sessions are a natural source of training data: they capture
real coding conversations (tool calls, reasoning traces, multi-turn edits)
in a format that maps directly to the messages list expected by every major
fine-tuning framework.
Preprocessing recommendations
Before training, filter the messages list for your target model:
- Strip tool blocks: most local models don't understand
<tool_use>/<tool_result>tags. Remove or replace them unless you are training a tool-calling model. - Strip thinking blocks:
<thinking>blocks leak chain-of-thought that may not generalize. Keep them only if you are distilling reasoning behavior. - Drop short or empty turns: single-word assistant replies and empty user turns add noise.
def clean_messages(messages, keep_tool_calls=False, keep_thinking=False):
import re
cleaned = []
for msg in messages:
content = msg["content"]
if not keep_thinking:
content = re.sub(r"<thinking>.*?</thinking>", "", content, flags=re.DOTALL)
if not keep_tool_calls:
content = re.sub(r"<tool_use.*?>.*?</tool_use>", "", content, flags=re.DOTALL)
content = re.sub(r"<tool_result>.*?</tool_result>", "", content, flags=re.DOTALL)
content = content.strip()
if content:
cleaned.append({"role": msg["role"], "content": content})
return cleaned
TRL / SFTTrainer example
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from claude_converter import session_to_messages
import glob
MODEL_ID = "your-model-id"
all_messages = []
for path in glob.glob("~/.claude/projects/**/*.jsonl", recursive=True):
msgs = session_to_messages(path)
msgs = clean_messages(msgs)
if len(msgs) >= 2:
all_messages.append({"messages": msgs})
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
dataset = Dataset.from_list(all_messages)
trainer = SFTTrainer(
model=model,
args=SFTConfig(output_dir="./output", max_seq_length=4096),
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
Axolotl / LLaMA-Factory
Both frameworks accept a sharegpt format, which is structurally identical to
the messages list this module produces. Save the cleaned messages list with
output="messages.json" and point your framework config at that file.
Content block mapping
Claude Code sessions contain several block types that have no direct equivalent in the Transformers messages format. This module flattens them as follows:
| Claude Code block | Flattened as |
|---|---|
text |
plain text, as-is |
thinking |
<thinking>...</thinking> |
tool_use |
<tool_use name='...'>{input JSON}</tool_use> |
tool_result |
<tool_result>...</tool_result> |
system records |
skipped (not included in messages output) |
Records with an empty content string after flattening are also skipped.
Where Claude Code stores sessions
~/.claude/projects/
└── <url-encoded-project-path>/
└── <session-uuid>.jsonl
The project path is URL-encoded: /home/user/myapp becomes -home-user-myapp. Each session is a separate file, append-only, one JSON object per line.
Limitations
- Graph structure: Claude Code sessions are directed acyclic graphs linked by
parentUuid. This module reads lines in file order (linear), which is correct for the vast majority of sessions. Branched or multi-agent sessions may need custom traversal. - Tool call fidelity: Most local models don't natively understand
<tool_use>or<tool_result>tags. For inference-only use cases, consider stripping those blocks before passing toapply_chat_template(). - No streaming: The module loads the full file into memory. For very large sessions, use
load_session()and process records in batches. - Data quality for fine-tuning: Sessions include failed attempts, retries, and exploratory tool calls. Blindly training on raw sessions will teach the model bad habits. Filter aggressively: keep only sessions where the final assistant turn solves the stated problem, and prefer sessions with a high ratio of
textblocks totool_useblocks unless tool-calling is your training target.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file claude_converter-1.2.0.tar.gz.
File metadata
- Download URL: claude_converter-1.2.0.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c0e4f717a308d3a1d0c430e409eb6dcacedaba0644562de1b59b405d86abee75
|
|
| MD5 |
626e373aaf60e5625b6fa0035f57f928
|
|
| BLAKE2b-256 |
ce63089862c83908a462c061c288c9bee5466452322635a6eec7ff7bebe4c6c9
|
File details
Details for the file claude_converter-1.2.0-py3-none-any.whl.
File metadata
- Download URL: claude_converter-1.2.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
12a95ca457ce4ee6c8710cca50475962928fb960353294f9bb9f2c3a093a0cf9
|
|
| MD5 |
65956bca09bd6bf80fa8c2203f0ef84f
|
|
| BLAKE2b-256 |
7b737ba1589f59a95423639c8ebcb25cbf061508910b06336df960c69f53c694
|