safedata-guard

A lightweight framework for safely letting LLMs analyze pandas/Polars data without exposing raw data or blindly running the code they generate.

Most "chat with your data" tools send your whole table to the model and run whatever code it writes, unchecked. safedata-guard changes both halves: it sends a compact, quality-aware summary instead of raw rows, and it runs the model's code behind guardrails on a copy of your data.

What it does

1. Summarises your data before it reaches the model. Instead of pushing 100,000 rows into a prompt, it sends the columns, their types, a few sample values, basic stats, and warnings about common data traps:

  • numbers stored as text ("$500", "1,000")
  • the same category written several ways ("North", "north ", "NORTH")
  • dates stored as text, or Excel serial dates stored as plain numbers (45292)
  • ID columns that are not actually unique
  • columns that are completely empty, or mostly empty
  • columns that hold the same value in every row
  • duplicated column names
  • negative values in columns whose names imply they should not have any
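To make the warnings above concrete, here is a minimal sketch of how three of these traps could be detected on a plain list of column values. It uses only the standard library. The helper names, the currency pattern, and the serial-date range are assumptions for illustration, not safedata-guard's actual checks.

```python
import re

# Text such as "$500", "1,000", or "-12.5": optional sign, optional dollar
# sign, digits with optional thousands separators, optional decimals.
_NUMERIC_TEXT = re.compile(r"\s*-?\$?\s*(\d{1,3}(,\d{3})+|\d+)(\.\d+)?\s*")


def looks_numeric_text(values):
    """True if every non-empty value is a string that reads as a number."""
    texts = [v for v in values if v not in (None, "")]
    return bool(texts) and all(
        isinstance(v, str) and _NUMERIC_TEXT.fullmatch(v) is not None
        for v in texts
    )


def inconsistent_categories(values):
    """Group spellings that differ only by case or surrounding whitespace."""
    groups = {}
    for v in values:
        if isinstance(v, str):
            groups.setdefault(v.strip().lower(), set()).add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}


def looks_like_excel_serial(values, low=20000, high=80000):
    """True if all values are whole numbers in an assumed serial-date range."""
    nums = [v for v in values if v is not None]
    return bool(nums) and all(
        isinstance(v, (int, float))
        and not isinstance(v, bool)
        and float(v).is_integer()
        and low <= v <= high
        for v in nums
    )
```

A real check would also need to tolerate a few bad values per column rather than requiring every value to match.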

2. Runs the model's code with guardrails. Before anything runs, an AST-based screen rejects imports outside a small analysis set (pandas, numpy, math, statistics, datetime, re), introspection/dunder tricks, dangerous builtins, and file/data readers and writers. The code then runs on a copy of your data in a separate process with a timeout. The model may add or transform columns freely, since it only touches the copy. Afterwards, the guardrail checks that the code did not silently drop rows or return an empty result, and feeds any error back so the model can fix its own code.
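As a rough illustration of what an AST-based screen looks like, the sketch below walks the parsed code with Python's ast module and rejects disallowed imports, dunder attribute access, and a few risky builtins. The allowed import set comes from the description above; the deny-list of names and the function name screen are assumptions, and the library's real screen is likely stricter.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics", "datetime", "re"}
# Hypothetical deny-list for illustration; the library's own list may differ.
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__",
                 "getattr", "setattr", "globals", "locals"}


def screen(code):
    """Return (safe, reason) for a code string without executing it."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"

    for node in ast.walk(tree):
        # Collect the module names this node tries to import, if any.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports are treated as disallowed.
            modules = [""] if node.level else [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                return False, f"import not allowed: {module or '(relative)'}"

        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False, f"dunder attribute access: {node.attr}"
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            return False, f"blocked builtin: {node.id}"

    return True, "ok"
```

Because the screen only inspects syntax, it never runs the code, which is why the same kind of check can be reused on its own as a pre-flight guard.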

Scope: please read this carefully

This is defense in depth for cooperative or semi-trusted model output: it stops the destructive accidents an honest model makes and the obvious escape attempts. It is not a security sandbox for deliberately malicious code. In-process Python sandboxes have a long history of clever escapes, and on Windows a child process still shares your filesystem permissions, so the subprocess gives you timeout and crash isolation, not a filesystem jail. To run code from an untrusted source, put safedata-guard inside OS-level isolation (a container, a locked-down user, or a VM). It also cannot prove the model's maths is correct, which no tool can do in general.
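The difference between timeout isolation and a filesystem jail can be seen in a small sketch. The helper below (a hypothetical name, built on the standard library's subprocess.run rather than on safedata-guard's own mechanism) runs code in a fresh interpreter and kills it if it overruns. The child still runs as your user, so nothing here stops it from reading or writing your files.

```python
import subprocess
import sys


def run_with_timeout(code, timeout=10.0):
    """Run code in a separate interpreter; return (ok, output_or_reason)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} s"
    if proc.returncode != 0:
        # A crash or non-zero exit in the child does not take down the parent.
        return False, f"exited with code {proc.returncode}"
    return True, proc.stdout.strip()
```

An infinite loop or a crash in the child is contained, but file access is not, which is why OS-level isolation is still needed for untrusted code.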

Install

pip install safedata-guard
pip install "safedata-guard[polars]"   # optional, for Polars support

Pass a pandas or Polars DataFrame to any of the functions below; the library detects the type and applies the same summary and safety checks to both.

Quick example

import safedata
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model: take the prompt text,
    # return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"

agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")

print(out.answer)     # 300.0
print(out.blocked)    # False if the code passed the safety checks
print(out.attempts)   # list of code attempts that were made

Connecting a real model

Real models return messy text: code wrapped in Markdown fences, chatter like "Here is the code:", and occasional failures. safedata.wrap() takes any function that sends text to a model and returns text, pulls the bare code out of the reply, and turns failures into a clear ModelError instead of a crash. Any text-in/text-out function works, hosted or local, so you are not tied to one provider.

import safedata

def my_call(prompt):
    return some_model_that_takes_text_and_returns_text(prompt)

agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)

The guardrails apply in the same way whichever model you use; a stronger model just tends to produce working code on the first try, with fewer retries.
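For readers curious what the cleanup step involves, here is a minimal sketch of pulling code out of a fenced reply and converting call failures into a single error type. The names extract_fenced_code, wrap_call, and ModelCallError are hypothetical stand-ins, not the library's extract_code, wrap, or ModelError, and the real versions may handle more reply shapes.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example tidy

# An opening fence with an optional language tag, the body, then a closing fence.
_FENCED = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


class ModelCallError(Exception):
    """Raised when the underlying model call fails (hypothetical name)."""


def extract_fenced_code(reply):
    """Return the first fenced block's body, or the stripped reply if none."""
    match = _FENCED.search(reply)
    return (match.group(1) if match else reply).strip()


def wrap_call(call):
    """Turn a text-in/text-out function into one that returns bare code."""
    def model(prompt):
        try:
            reply = call(prompt)
        except Exception as exc:
            raise ModelCallError(f"model call failed: {exc}") from exc
        return extract_fenced_code(reply)
    return model
```

With this shape, the agent only ever sees bare code or one well-defined exception, regardless of which provider sits behind the call.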

Command line

After installing you get a safedata command that summarises a file (quality warnings and token estimate) without writing any Python. It only reads and summarises; it never executes code.

safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5

Supported file types: .csv, .tsv, .xlsx, .xls, .parquet, .json. --report writes an HTML quality report to the given path; --no-redact shows raw sample values instead of masking PII.

You can also run it as python -m safedata check sales.csv if the command is not on your PATH.

PII masking

The summary sends a few real sample values to the model, and those can contain personal data. By default safedata-guard masks obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary leaves your machine and notes which columns were masked.

safedata.summarize(df)                    # PII masked by default
safedata.summarize(df, redact_pii=False)  # raw samples, if you are sure

This is best-effort, regex-based redaction, not a compliance guarantee. It catches common patterns and will miss unusual formats, names, addresses, and free text. For regulated data, keep it out of third-party LLMs by policy rather than relying on a regex. Treat masking as a seatbelt, not a vault.
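The limits of regex-based redaction are easy to see in a small sketch. The patterns and the mask_pii helper below are assumptions for illustration only; they are not the patterns safedata-guard ships.

```python
import re

# Hypothetical patterns; real-world formats vary far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace matches with a [KIND] tag; return (masked_text, kinds_found)."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{kind.upper()}]", text)
        if count:
            found.append(kind)
    return text, found
```

A value like "Jane Doe, 12 Main St" passes through unchanged, which is exactly the kind of miss the paragraph above warns about.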

Use the parts on their own

print(safedata.summarize(df))                  # quality-aware summary
result = safedata.run_safely(code_string, df)  # run code through the guardrails
verdict = safedata.check_code(code_string)     # is this code safe? (does NOT run it)
safedata.report(df, "report.html")             # HTML quality report
print(safedata.token_savings(df))              # estimated token (cost) saving

check_code(code) returns a result with .safe (bool) and .reason, using the same screen as run_safely but without executing anything, so you can use it as a guardrail inside your own agent loop.

Token saving

Sending a whole table costs tokens for every row; the summary is far smaller.

print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).

Every agent.ask(...) result also carries a .tokens estimate. All token figures are estimates (each provider counts differently) but show the scale of the saving.

Function reference

Asking questions

  • safedata.Agent(model, max_retries=3) builds an agent. model takes a prompt and returns code (usually made with wrap); max_retries is how many times the model may correct itself after a block.
  • agent.ask(df, question, verbose=False) runs the full loop and returns a result with .answer, .blocked, .reason, .attempts, and .tokens.
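The ask loop can be pictured roughly as follows. This is a sketch under stated assumptions, not the library's implementation: ask_with_retries is a hypothetical name, check and run are injected callables standing in for the screen and the guarded runner, and counting one initial try plus up to max_retries corrections is a guess at the semantics.

```python
def ask_with_retries(model, check, run, question, max_retries=3):
    """Ask for code, screen it, run it; feed any failure back to the model."""
    prompt = question
    attempts = []
    reason = ""
    # Assumed semantics: one initial try plus up to max_retries corrections.
    for _ in range(max_retries + 1):
        code = model(prompt)
        attempts.append(code)
        safe, reason = check(code)
        if safe:
            try:
                answer = run(code)
                return {"answer": answer, "blocked": False,
                        "reason": "", "attempts": attempts}
            except Exception as exc:
                reason = f"runtime error: {exc}"
        # Tell the model what went wrong so it can correct its own code.
        prompt = (f"{question}\n\nYour previous code failed: {reason}\n"
                  "Please return corrected code.")
    return {"answer": None, "blocked": True,
            "reason": reason, "attempts": attempts}
```

Keeping every attempt in the result makes it possible to audit what the model tried, including the code that was refused.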

Connecting a model

  • safedata.wrap(call, clean=...) turns any text-in/text-out function into a model the agent can use, cleaning up messy replies and raising ModelError on failure.
  • safedata.extract_code(text) pulls bare Python code out of a reply (handles Markdown fences and chatter). wrap uses it by default.
  • safedata.ModelError is raised when a wrapped model call fails.

Looking at the data

  • safedata.summarize(df, redact_pii=True) returns the text summary with trap warnings. This is what gets sent to the model.
  • safedata.report(df, path=None) writes an HTML quality report to path, or returns the HTML string if no path is given.

Running code safely

  • safedata.run_safely(code, df, result_var="result", isolate=True, timeout=10.0) runs code against a copy of df, blocks unsafe operations, checks that no rows were silently dropped and that the result is not empty, and returns the value of the result variable. Raises SafetyError if the code is unsafe.
  • safedata.check_code(code) screens code without running it; returns a CodeCheck (.safe, .reason).
  • safedata.SafetyError is raised when code is blocked.

Token estimates

  • safedata.token_savings(df) returns a human-readable sentence describing the estimated saving.
  • safedata.token_stats(df) returns summary_tokens, raw_tokens, saved_tokens, saved_percent.
  • safedata.estimate_tokens(text) estimates tokens for any text (roughly 4 characters per token).
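The 4-characters-per-token heuristic is simple enough to sketch by hand. The function names below are hypothetical, and rounding up and rounding the percentage to two places are assumptions; only the field names mirror those listed for token_stats.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers differ by provider


def estimate_tokens_sketch(text):
    """Estimate a token count from character length (rounded up)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def savings_sketch(summary_text, raw_text):
    """Compare the estimated cost of a summary against the raw text."""
    summary_tokens = estimate_tokens_sketch(summary_text)
    raw_tokens = estimate_tokens_sketch(raw_text)
    saved = raw_tokens - summary_tokens
    percent = round(100 * saved / raw_tokens, 2) if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved,
        "saved_percent": percent,
    }
```

Since the estimate depends only on text length, it is useful for comparing scale, not for predicting a provider's bill.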

License

MIT
