A lightweight framework for safely enabling LLMs to analyze pandas/Polars data without exposing raw data or blindly executing generated code.
safedata-guard
A lightweight framework for safely letting LLMs analyze pandas/Polars data without exposing raw data or blindly running the code they generate.
Most "chat with your data" tools send your whole table to the model and run whatever code it writes, unchecked. safedata-guard changes both halves: it sends a compact, quality-aware summary instead of raw rows, and it runs the model's code behind guardrails on a copy of your data.
What it does
1. Summarises your data before it reaches the model. Instead of pushing 100,000 rows into a prompt, it sends the columns, their types, a few sample values, basic stats, and warnings about common data traps:
- numbers stored as text ("$500", "1,000")
- the same category written several ways ("North", "north ", "NORTH")
- dates stored as text, or Excel serial dates stored as plain numbers (45292)
- ID columns that are not actually unique
- columns that are completely empty, or mostly empty
- columns that hold the same value in every row
- duplicated column names
- negative values in columns whose names imply they should not have any
2. Runs the model's code with guardrails. Before running, an AST-based screen refuses imports outside a small analysis set (pandas, numpy, math, statistics, datetime, re), introspection/dunder tricks, dangerous builtins, and file/data readers and writers. The code then runs on a copy of your data in a separate process with a timeout. The model may add or transform columns freely (it only touches the copy), but afterwards the guardrail checks that it did not silently drop rows or return an empty result, and feeds any error back so the model can fix its own code.
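To make the screening step concrete, here is a minimal sketch of an AST-based check built on Python's standard `ast` module. It is not safedata-guard's actual implementation: the function name `screen`, the `BLOCKED_NAMES` set, and the reason strings are assumptions made for illustration, and the check for file readers and writers is left out. Only the import allow-list is taken from the description above.

```python
import ast

# Allow-list taken from the description above.
ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics", "datetime", "re"}
# Assumed deny-list, for illustration only.
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr", "setattr"}


def screen(code):
    """Return (safe, reason) for a code string without executing it."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                return False, f"import not allowed: {module}"
        # Refuse attribute access such as obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False, f"dunder attribute access: {node.attr}"
        # Refuse direct references to dangerous builtins.
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            return False, f"blocked builtin: {node.id}"
    return True, "ok"


print(screen("import os"))                    # (False, 'import not allowed: os')
print(screen("result = df['amount'].sum()"))  # (True, 'ok')
```

Because the code is only parsed, never executed, a check like this is cheap to run before every attempt. It works on syntax alone, which is one reason it cannot replace OS-level isolation.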
Scope: please read this honestly
This is defense in depth for cooperative or semi-trusted model output: it stops the destructive accidents an honest model makes and the obvious escape attempts. It is not a security sandbox for deliberately malicious code. In-process Python sandboxes have a long history of clever escapes, and on Windows a child process still shares your filesystem permissions, so the subprocess gives you timeout and crash isolation, not a filesystem jail. To run code from an untrusted source, put safedata-guard inside OS-level isolation (a container, a locked-down user, or a VM). It also cannot prove the model's maths is correct, which no tool can do in general.
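The difference between timeout isolation and a filesystem jail is easier to see in code. The sketch below runs a code string in a child Python process using the standard `subprocess` module. It is an illustration under stated assumptions, not how safedata-guard launches its worker: `run_with_timeout` is a hypothetical helper, and it passes no DataFrame to the child.

```python
import subprocess
import sys


def run_with_timeout(code, timeout=10.0):
    """Run code in a child Python process; return (ok, output_or_reason)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    if proc.returncode != 0:
        # A crash in the child does not take the parent down with it.
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


print(run_with_timeout("print(1 + 1)"))  # (True, '2')
print(run_with_timeout("while True: pass", timeout=1.0))
```

A runaway loop or a crash stays contained in the child, but the child still runs as the same user with the same file permissions as the parent. That is why untrusted code needs a container, a locked-down user, or a VM around it.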
Install
pip install safedata-guard
pip install "safedata-guard[polars]" # optional, for Polars support
Pass a pandas or Polars DataFrame anywhere; the library detects the type and applies the same summary and safety checks to both.
Quick example
import safedata
import pandas as pd
df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})
def my_model(prompt):
    # Replace this with a call to your own model: take the prompt text,
    # return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"
agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer) # 300.0
print(out.blocked) # False if the code passed the safety checks
print(out.attempts) # list of code attempts that were made
Connecting a real model
Real models return messy text: code wrapped in Markdown fences, chatter like
"Here is the code:", and occasional failures. safedata.wrap() takes any
function that sends text to a model and returns text, pulls the bare code out of
the reply, and turns failures into a clear ModelError instead of a crash. Any
text-in/text-out function works, hosted or local, so you are not tied to one
provider.
import safedata
def my_call(prompt):
    return some_model_that_takes_text_and_returns_text(prompt)
agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)
The library stays safe with any model; a stronger model just means good code on the first try and fewer retries.
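For a sense of what pulling bare code out of a messy reply involves, here is a small sketch using the standard `re` module. It is not the library's `extract_code`: the helper name `extract_code_sketch` and the regular expression are assumptions, and it only handles the first fenced block in a reply.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

# Match an opening fence with an optional language tag, then capture lazily
# up to the next fence.
_FENCED = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code_sketch(reply):
    """Return the first fenced block's contents, or the stripped reply if none."""
    match = _FENCED.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


reply = (
    "Here is the code:\n"
    + FENCE + "python\nresult = df['amount'].sum()\n" + FENCE
    + "\nHope it helps!"
)
print(extract_code_sketch(reply))  # result = df['amount'].sum()
```

Falling back to the whole reply when no fence is found keeps the helper usable with models that return bare code.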
Command line
After installing you get a safedata command that summarises a file (quality
warnings and token estimate) without writing any Python. It only reads and
summarises; it never executes code.
safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5
Supported: .csv, .tsv, .xlsx, .xls, .parquet, .json. --report
writes the HTML report; --no-redact shows raw samples instead of masking PII.
You can also run it as python -m safedata check sales.csv if the command is
not on your PATH.
PII masking
The summary sends a few real sample values to the model, and those can contain personal data. By default safedata-guard masks obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary leaves your machine and notes which columns were masked.
safedata.summarize(df) # PII masked by default
safedata.summarize(df, redact_pii=False) # raw samples, if you are sure
This is best-effort, regex-based redaction, not a compliance guarantee. It catches common patterns and will miss unusual formats, names, addresses, and free text. For regulated data, keep it out of third-party LLMs by policy rather than relying on a regex. Treat masking as a seatbelt, not a vault.
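The limits of regex-based masking show up quickly in a small example. The sketch below is illustrative only: the patterns, the placeholder format, and the `mask_pii` helper are assumptions, not safedata-guard's own rules.

```python
import re

# Assumed patterns for illustration; deliberately simple and incomplete.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace matches with a placeholder; return (masked_text, kinds_found)."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<{kind}>", text)
        if count:
            found.append(kind)
    return text, found


print(mask_pii("Contact jane@example.com from 10.0.0.1"))
# ('Contact <email> from <ipv4>', ['email', 'ipv4'])
print(mask_pii("Jane Doe, 12 High Street"))
# ('Jane Doe, 12 High Street', [])
```

The second call passes through unchanged: a name and a street address match no pattern, which is exactly the kind of miss described above.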
Use the parts on their own
print(safedata.summarize(df)) # quality-aware summary
result = safedata.run_safely(code_string, df) # run code through the guardrails
verdict = safedata.check_code(code_string) # is this code safe? (does NOT run it)
safedata.report(df, "report.html") # HTML quality report
print(safedata.token_savings(df)) # estimated token (cost) saving
check_code(code) returns a result with .safe (bool) and .reason, using the
same screen as run_safely but without executing anything, so you can use it as
a guardrail inside your own agent loop.
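One way to structure such a loop is sketched below. It is deliberately generic: `guarded_loop` and its `model`, `check`, and `run` parameters are hypothetical names, the stand-ins at the bottom are stubs, and treating `max_retries` as the number of attempts after the first is an assumption.

```python
def guarded_loop(model, check, run, prompt, max_retries=3):
    """Ask the model for code, screen it, run it, and feed failures back."""
    attempts = []
    feedback = ""
    for _ in range(max_retries + 1):  # first try plus max_retries corrections
        code = model(prompt + feedback)
        attempts.append(code)
        safe, reason = check(code)
        if not safe:
            # Never run blocked code; tell the model why and try again.
            feedback = f"\nPrevious code was blocked: {reason}"
            continue
        try:
            return run(code), attempts
        except Exception as exc:
            feedback = f"\nPrevious code failed: {exc}"
    return None, attempts


# Stub model: a blocked attempt first, then an acceptable one.
replies = iter(["import os", "result = 1 + 2"])
answer, attempts = guarded_loop(
    model=lambda prompt: next(replies),
    check=lambda code: (not code.startswith("import os"), "import not allowed"),
    run=lambda code: 3,  # stand-in for a guarded runner
    prompt="What is 1 + 2?",
)
print(answer, len(attempts))  # 3 2
```

With safedata-guard, `check` could wrap `safedata.check_code` (reading `.safe` and `.reason`) and `run` could wrap `safedata.run_safely`, as documented above.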
Token saving
Sending a whole table costs tokens for every row; the summary is far smaller.
print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).
Every agent.ask(...) result also carries a .tokens estimate. All token
figures are estimates (each provider counts differently) but show the scale of
the saving.
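The arithmetic behind such an estimate is simple. The sketch below assumes the rule of thumb of roughly four characters per token and rounds up; the function names are hypothetical and the result keys merely mirror those listed for `token_stats` in the function reference, so this is not the library's own calculation.

```python
def estimate_tokens_sketch(text, chars_per_token=4):
    """Rough token count: characters divided by chars_per_token, rounded up."""
    return -(-len(text) // chars_per_token)


def savings(raw_text, summary_text):
    """Compare the estimated cost of sending raw data versus a summary."""
    raw = estimate_tokens_sketch(raw_text)
    summary = estimate_tokens_sketch(summary_text)
    saved = raw - summary
    percent = 100.0 * saved / raw if raw else 0.0
    return {
        "summary_tokens": summary,
        "raw_tokens": raw,
        "saved_tokens": saved,
        "saved_percent": round(percent, 2),
    }


print(savings("x" * 4000, "y" * 400))
# {'summary_tokens': 100, 'raw_tokens': 1000, 'saved_tokens': 900, 'saved_percent': 90.0}
```

Real tokenizers vary by provider, so treat the result as an order-of-magnitude figure.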
Function reference
Asking questions
- safedata.Agent(model, max_retries=3) builds an agent. model takes a prompt and returns code (usually made with wrap); max_retries is how many times the model may correct itself after a block.
- agent.ask(df, question, verbose=False) runs the full loop and returns a result with .answer, .blocked, .reason, .attempts, and .tokens.
Connecting a model
- safedata.wrap(call, clean=...) turns any text-in/text-out function into a model the agent can use, cleaning up messy replies and raising ModelError on failure.
- safedata.extract_code(text) pulls bare Python code out of a reply (handles Markdown fences and chatter). wrap uses it by default.
- safedata.ModelError is raised when a wrapped model call fails.
Looking at the data
- safedata.summarize(df, redact_pii=True) returns the text summary with trap warnings. This is what gets sent to the model.
- safedata.report(df, path=None) writes an HTML quality report to path, or returns the HTML string if no path is given.
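As a rough picture of how trap warnings like these could be produced, the sketch below implements three of the checks with the standard library only. It is not safedata-guard's implementation: the function names and the numeric pattern are assumptions, and it works on plain Python lists (for example, the output of `Series.tolist()`) rather than on DataFrames.

```python
import re
from collections import defaultdict

# Assumed pattern: optional currency symbol and sign, digits with optional
# thousands separators, optional decimal part.
_NUMERIC_TEXT = re.compile(r"^\s*[$€£]?-?\d[\d,]*(\.\d+)?\s*$")


def looks_numeric_text(values):
    """True if there are string values and all of them look like numbers."""
    strings = [v for v in values if isinstance(v, str) and v.strip()]
    return bool(strings) and all(_NUMERIC_TEXT.match(v) is not None for v in strings)


def category_variants(values):
    """Group spellings that differ only by case or surrounding whitespace."""
    groups = defaultdict(set)
    for v in values:
        if isinstance(v, str):
            groups[v.strip().lower()].add(v)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


def has_duplicate_ids(values):
    """True if an ID-like column contains repeated values."""
    return len(set(values)) != len(values)


print(looks_numeric_text(["$500", "1,000"]))                      # True
print(category_variants(["North", "north ", "NORTH", "South"]))
print(has_duplicate_ids([101, 102, 101]))                         # True
```

Each check returns plain data, so a summary builder can turn the results into one-line warnings without ever including more than a few sample values.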
Running code safely
- safedata.run_safely(code, df, result_var="result", isolate=True, timeout=10.0) runs code against a copy of df, blocks unsafe operations, checks nothing was damaged, and returns the result variable. Raises SafetyError if unsafe.
- safedata.check_code(code) screens code without running it; returns a CodeCheck (.safe, .reason).
- safedata.SafetyError is raised when code is blocked.
Token estimates
- safedata.token_savings(df) returns a readable sentence.
- safedata.token_stats(df) returns summary_tokens, raw_tokens, saved_tokens, saved_percent.
- safedata.estimate_tokens(text) estimates tokens for any text (~4 chars each).
License
MIT
File details
Details for the file safedata_guard-1.0.4-py3-none-any.whl.
File metadata
- Download URL: safedata_guard-1.0.4-py3-none-any.whl
- Upload date:
- Size: 31.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72b3c2303c6b253ceca1fe60da6d55ccc8d18bf435be52caec134ee20baf610a |
| MD5 | a8191ca1f0094c774cfd8fc983b255f8 |
| BLAKE2b-256 | f68ff3bc4b0c88218e24bd226ffd424b97d2bbabab7c1aba59cee1ae586d37f4 |