safedata-guard
A lightweight framework for safely enabling LLMs to analyze pandas/Polars data without exposing raw data or blindly executing generated code.
Most "chat with your data" tools send your whole dataset to the model and then run whatever code it writes, unchecked. safedata-guard sits between the AI and your data and changes both halves of that: it sends a compact, quality-aware summary instead of raw rows (cheaper, and it keeps sensitive values out of the prompt), and it runs the AI's code behind guardrails on a copy of your data.
At a glance, it gives you:
- a quality-aware summary of any DataFrame, with warnings about common data traps, used as a cheap prompt instead of raw rows
- best-effort PII masking of sample values before they reach the model
- a guardrail layer that screens and runs AI-written code on a copy, with a timeout, and feeds errors back so the model can correct itself
- check_code() to screen code without running it, for use in your own agent
- works with pandas or Polars, and a safedata command-line tool
- honest about its limits: defense in depth, not a security sandbox
What it does
1. It summarises your data before sending it to an AI.
Instead of pushing 100,000 rows into a prompt (slow and expensive), it sends a short summary: the columns, their types, a few sample values, and basic stats. The summary also flags common data problems that trip up analysis, such as:
- numbers stored as text (like "$500" or "1,000")
- the same category written several ways (like "California" and "CA")
- ID columns that are not actually unique
- columns that are completely empty
- columns that are mostly empty
- Excel dates stored as plain numbers (like 45292)
- negative values in columns that should not have them
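A few of the checks above can be sketched in plain Python. This is an illustrative sketch only, not safedata-guard's implementation: the helper names looks_like_number_text and find_traps, the dict-of-lists input, and the 50% "mostly empty" threshold are all assumptions made for the example.

```python
import re

# Hypothetical helpers for illustration; the library's real checks may differ.
PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def looks_like_number_text(value):
    """True for strings such as "$500" or "1,000" that hide a number."""
    if not isinstance(value, str):
        return False
    cleaned = value.strip().lstrip("$€£").replace(",", "")
    return PLAIN_NUMBER.fullmatch(cleaned) is not None


def find_traps(columns, id_columns=()):
    """columns maps a column name to a list of values; None means missing."""
    warnings = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            warnings.append(f"{name}: column is completely empty")
            continue
        missing_share = 1 - len(present) / len(values)
        if missing_share > 0.5:  # assumed threshold for "mostly empty"
            warnings.append(f"{name}: mostly empty ({missing_share:.0%} missing)")
        if all(looks_like_number_text(v) for v in present):
            warnings.append(f"{name}: numbers stored as text")
        if name in id_columns and len(set(present)) < len(present):
            warnings.append(f"{name}: ID column has duplicate values")
    return warnings


table = {
    "price": ["$500", "1,000", "42", "7.5"],
    "order_id": [1, 2, 2, 3],
    "notes": [None, None, None, None],
    "coupon": ["A", None, None, None],
}
trap_warnings = find_traps(table, id_columns=("order_id",))
```

Each warning is a short sentence, which is the kind of text that can be appended to a summary so the model knows to clean a column before doing arithmetic on it.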
2. It runs the AI's code with guardrails.
When the AI writes code to answer a question, safedata-guard runs it on a copy of your data, in a separate process with a timeout. Before running, an AST-based screen refuses unsafe imports, introspection/dunder tricks, dangerous builtins, and data/file readers and writers (a small set of analysis imports like pandas, numpy, and datetime is allowed). The AI may add or transform columns freely, since it works on the copy. After the code runs, safedata-guard checks that it did not silently drop rows and that the result is not silently empty. If something looks wrong, it sends the error back so the AI can fix its own code.
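To make the idea of an AST-based screen concrete, here is a minimal sketch using Python's standard ast module. It is not the library's screen: the function name screen and the exact contents of BLOCKED_BUILTINS are assumptions, and only the three allowed imports come from the description above.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "datetime"}
# Assumed list for illustration; the library's own list is not documented here.
BLOCKED_BUILTINS = {"eval", "exec", "open", "compile", "__import__", "getattr", "setattr"}


def screen(code):
    """Return the reasons to refuse the code; an empty list means it passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of {alias.name} is not allowed")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level or root not in ALLOWED_IMPORTS:
                problems.append(f"import from {node.module} is not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute {node.attr} is not allowed")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_BUILTINS:
            problems.append(f"use of {node.id} is not allowed")
    return problems


safe = screen("import pandas as pd\nresult = df['amount'].sum()")
unsafe = screen("import os\nos.remove('data.csv')")
```

The screen inspects the parsed syntax tree and never executes the code, which is why a check of this kind can run before anything touches the data.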
Scope: please read this honestly. This is defense in depth for cooperative / semi-trusted model output: it stops the destructive accidents an honest model makes, and the obvious escape attempts. It is not a security sandbox for deliberately malicious untrusted code. In-process Python "sandboxes" have a long history of clever escapes, and on Windows a child process still shares your filesystem permissions, so the subprocess gives you timeout and crash isolation, not a filesystem jail. If you need to run code from an untrusted source, run safedata-guard inside OS-level isolation (a container, a locked-down user account, or a VM). It also does not prove the AI's maths is correct, which no tool can do in general.
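The sketch below illustrates what "timeout and crash isolation" means in practice, using only the standard library. The function name run_with_timeout is hypothetical and this is not how safedata-guard itself launches code; it also leaves out passing the DataFrame copy to the child.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_seconds=5.0):
    """Run code in a fresh Python process; return (ok, output_or_reason)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child is killed; the parent keeps running.
        return False, f"timed out after {timeout_seconds} seconds"
    if completed.returncode != 0:
        # A crash in the child does not take the parent down with it.
        error_lines = completed.stderr.strip().splitlines()
        reason = error_lines[-1] if error_lines else "child process failed"
        return False, reason
    return True, completed.stdout.strip()


ok_result = run_with_timeout("print(1 + 1)")
crash_result = run_with_timeout("raise ValueError('boom')")
```

Note that the child still runs as the same user with the same file permissions, which is exactly the limit described above: this isolates hangs and crashes, not filesystem access.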
Install
pip install safedata-guard
Using Polars instead of pandas
safedata-guard works with either pandas or Polars DataFrames. The safety screen, the copy-and-isolate execution, and the data-trap summary all handle both. To use Polars, install the extra:
pip install "safedata-guard[polars]"
Then pass a Polars frame anywhere you would pass a pandas one; the library
detects the type. The safety screen blocks Polars' file writers and readers
(write_csv, write_parquet, lazy sink_*, read_*, scan_*) the same way
it blocks the pandas equivalents. The scope note above applies identically: it
is defense in depth for cooperative model output, not a sandbox for malicious
code.
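One way to picture a reader/writer check is a scan for method calls whose names start with the prefixes listed above. This is a rough sketch under that assumption; find_io_calls and BLOCKED_PREFIXES are hypothetical names, and the library may match these calls differently.

```python
import ast

# Prefixes taken from the list above; matching by prefix is an assumption.
BLOCKED_PREFIXES = ("write_", "sink_", "read_", "scan_")


def find_io_calls(code):
    """Return the names of method calls that look like file readers or writers."""
    names = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr.startswith(BLOCKED_PREFIXES):
                names.append(node.func.attr)
    return names


writer_hits = find_io_calls("df.write_csv('out.csv')")
clean_hits = find_io_calls("result = df.filter(pl.col('a') > 1).select('a')")
```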
PII masking in the summary
The summary sends a few real sample values to the LLM so it can write correct code. Those samples can contain personal data. By default, safedata-guard masks obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary leaves your machine, and notes which columns were masked.
safedata.summarize(df) # PII masked by default
safedata.summarize(df, redact_pii=False) # raw samples, if you are sure
This is best-effort, regex-based redaction, not a compliance guarantee. It catches common well-formed patterns and will miss unusual formats, names, addresses, and free-text. If you handle regulated data, keep it out of third-party LLMs by policy; do not rely on a regex. Masking is on because leaking less is better than leaking more, but treat it as a seatbelt, not a vault.
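To show why regex redaction is best-effort, here is a small sketch with three simplified patterns. The patterns, the placeholder format, and the name mask_pii are assumptions for illustration and are not the library's actual rules.

```python
import re

# Simplified, assumed patterns; real-world formats vary far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace matches with a placeholder; return (masked_text, labels_found)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<{label}>", text)
        if count:
            found.append(label)
    return text, found


masked = mask_pii("Contact jane@example.com from 10.0.0.1")
missed = mask_pii("Jane Doe, 12 Main Street")
```

The second call returns its input unchanged: a name and a street address match none of the patterns, which is the kind of miss the paragraph above warns about.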
Command line: check a file in one line
After installing, you get a safedata command. Point it at a data file to see
the quality summary, data-trap warnings, and token-saving estimate without
writing any Python:
safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5
Supported file types: .csv, .tsv, .xlsx, .xls, .parquet, .json.
The command only reads and summarises the file; it never executes code. --report
also writes the HTML quality report, and --no-redact shows raw sample values
instead of masking detected PII.
Quick example
This runs end to end today. The my_model function below returns the code as a
string. In a real project you would replace its body with a call to a model of
your choice.
import safedata
import pandas as pd
df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model.
    # It should take the prompt text and return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"
agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer) # 300.0
print(out.blocked) # False if the code passed the safety checks
print(out.attempts) # the list of code attempts that were made
Connecting a real model with wrap()
Real models are messy. They wrap code in Markdown code fences, add
sentences like "Here is the code:", and sometimes fail because of a bad key or
no internet. safedata.wrap() takes any function that sends text to a model and
returns the model's text, and it handles the messy parts for you: pulling the
bare code out of the reply and turning failures into a clear message instead of
a crash.
You write a small function that calls your model. It does not matter which model it is: a hosted one like Claude or GPT-4, a model running on your own machine, or your own custom function. As long as it takes text and returns text, it works.
import safedata
# Example shape of your own model call. Replace the body with your model.
def my_call(prompt):
    reply = some_model_that_takes_text_and_returns_text(prompt)
    return reply
agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)
Because wrap works with any text-in, text-out function, the library is not
tied to a single provider. You can point it at whatever model you already have.
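The two jobs described above, pulling bare code out of a chatty reply and turning failures into a clear error, might look roughly like this. It is a sketch under assumptions, not the library's code: pull_code and wrap_sketch are hypothetical stand-ins, and RuntimeError is used here in place of the library's own error type.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
FENCED_BLOCK = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def pull_code(reply):
    """Return the first fenced block if there is one, else the reply itself."""
    match = FENCED_BLOCK.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def wrap_sketch(call):
    """Wrap a text-in, text-out function so it returns bare code or raises."""
    def model(prompt):
        try:
            reply = call(prompt)
        except Exception as exc:
            raise RuntimeError(f"model call failed: {exc}") from exc
        if not isinstance(reply, str) or not reply.strip():
            raise RuntimeError("model returned no usable text")
        return pull_code(reply)
    return model


chatty_reply = (
    "Here is the code:\n" + FENCE + "python\nresult = 1 + 1\n" + FENCE + "\nHope that helps!"
)
cleaned = wrap_sketch(lambda prompt: chatty_reply)("What is 1 + 1?")
```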
A note on quality: the library connects to any model, but the quality of the answers depends on the model. A strong model writes good code on the first try. A weaker model may write code that gets blocked by the safety checks and then retried. The library stays safe either way, but a better model means fewer retries.
Use the parts on their own
print(safedata.summarize(df)) # the data summary with warnings
result = safedata.run_safely(code_string, df) # run code through the safety layer
verdict = safedata.check_code(code_string) # is this code safe? (does NOT run it)
safedata.report(df, "report.html") # write an HTML quality report
print(safedata.token_savings(df)) # estimated token (cost) saving
Token and cost saving
Sending a whole table to a model can cost a lot, because every row becomes
tokens that you pay for. safedata sends a short summary instead, which is far
smaller. safedata.token_savings(df) shows the estimated saving in plain words:
print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).
Every agent.ask(...) result also carries a tokens estimate you can inspect:
out = agent.ask(df, "What were total sales in 2025?")
print(out.tokens) # {'summary_tokens': ..., 'raw_tokens': ..., 'saved_percent': ...}
These numbers are estimates. Each model provider counts tokens with its own method, so exact figures vary, but the estimate shows the scale of the saving.
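A back-of-the-envelope version of such an estimate can be written in a few lines, assuming a rough rule of about four characters per token. The function names rough_tokens and savings are hypothetical, and the library's own calculation and rounding may differ.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb; real tokenizers vary


def rough_tokens(text):
    """Estimate the token count of a piece of text from its length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def savings(summary_text, raw_text):
    """Compare the estimated cost of sending a summary versus the raw data."""
    summary_tokens = rough_tokens(summary_text)
    raw_tokens = rough_tokens(raw_text)
    saved_tokens = raw_tokens - summary_tokens
    saved_percent = 100 * saved_tokens / raw_tokens if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved_tokens,
        "saved_percent": round(saved_percent, 2),
    }


example = savings("a" * 40, "b" * 4000)
```

Because the estimate depends only on character counts, it is cheap to compute on every call but should be read as an order of magnitude, not a bill.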
HTML report
safedata.report(df, "report.html") writes a simple web page that lists each
column with a red, amber or green status, the problems it found, and suggested
fixes. It is meant to be readable by someone who does not write code. Call it
without a path to get the HTML back as a string instead.
How the question-answering loop works
- The data is summarised, including the trap warnings.
- Your model writes code based on the summary and the question.
- The code is screened for safety, then runs on a copy of the data.
- If it is blocked or fails, the error is sent back and the model tries again, up to max_retries times.
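The loop above can be sketched as a small retry function. This is a simplified illustration, not the library's Agent: ask_with_retries, the prompt wording, and the dictionary result are assumptions, and the model and the safe runner are passed in as plain functions so the sketch stays self-contained.

```python
def ask_with_retries(model, run, summary, question, max_retries=3):
    """Ask the model for code, run it, and feed errors back until it works."""
    prompt = f"{summary}\n\nQuestion: {question}"
    attempts = []
    reason = None
    # One first try plus up to max_retries corrections.
    for _ in range(max_retries + 1):
        code = model(prompt)
        attempts.append(code)
        try:
            answer = run(code)
        except Exception as exc:
            reason = str(exc)
            prompt = f"{prompt}\n\nYour last code failed: {reason}\nPlease fix it."
            continue
        return {"answer": answer, "blocked": False, "reason": None, "attempts": attempts}
    return {"answer": None, "blocked": True, "reason": reason, "attempts": attempts}


# Stand-ins for a real model and a real safety layer.
replies = iter(["import os", "result = 42"])


def fake_model(prompt):
    return next(replies)


def fake_run(code):
    if "import os" in code:
        raise ValueError("import of os is not allowed")
    return 42


out = ask_with_retries(fake_model, fake_run, "columns: amount", "What is the total?")
```

Here the first attempt is refused, the refusal text is appended to the prompt, and the second attempt succeeds, so the result records both attempts.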
About the model
The library does not include a model. You supply one through a single function, so you can use a local model or a hosted one without changing anything else in your code.
Full function reference
Everything the library makes available:
Asking questions
- safedata.Agent(model, max_retries=3) builds an agent. model is a function that takes a prompt and returns code (usually made with wrap). max_retries is how many times the model may correct itself after a block.
- agent.ask(df, question, verbose=False) runs the full loop and returns a result object. Set verbose=True to print each code attempt as it happens.
- The result object has: .answer (the result), .blocked (True if it could not be completed safely), .reason (why it was blocked, if so), .attempts (the list of code attempts), and .tokens (the token estimate for the call).
Connecting a model
- safedata.wrap(call, clean=...) turns any text-in, text-out function into a model the agent can use. It strips code out of messy replies and turns failures into a clear ModelError.
- safedata.extract_code(text) is the helper that pulls bare Python code out of a reply (handling Markdown code fences and chatter). wrap uses it by default; you can call it yourself or pass your own version to wrap.
- safedata.ModelError is the error raised when a wrapped model call fails (bad key, no internet, unusable output).
Looking at the data
- safedata.summarize(df) returns the short text summary, including the data trap warnings. This is what gets sent to the model.
- safedata.report(df, path=None) writes an HTML quality report to path, or returns the HTML as a string if no path is given.
Running code safely on its own
- safedata.run_safely(code, df, result_var="result") runs a piece of code against a copy of df, blocks unsafe operations, checks that nothing was damaged, and returns the value of the result variable. Raises SafetyError if the code is unsafe.
- safedata.SafetyError is the error raised when code is blocked.
Token and cost estimates
- safedata.token_savings(df) returns a readable sentence describing the estimated token saving.
- safedata.token_stats(df) returns the raw numbers as a dictionary: summary_tokens, raw_tokens, saved_tokens, saved_percent.
- safedata.estimate_tokens(text) estimates the number of tokens in any piece of text, using a rough rule of about four characters per token.
All token figures are estimates. Each model provider counts tokens with its own method, so exact numbers vary, but the estimate shows the scale of the saving.
License
MIT