A lightweight framework for safely enabling LLMs to analyze pandas/Polars data without exposing raw data or blindly executing generated code.

safedata-guard

Most "chat with your data" tools send your whole dataset to the model and then run whatever code it writes, unchecked. safedata-guard sits between the AI and your data and changes both halves of that: it sends a compact, quality-aware summary instead of raw rows (cheaper, and it keeps sensitive values out of the prompt), and it runs the AI's code behind guardrails on a copy of your data.

At a glance, it gives you:

  • a quality-aware summary of any DataFrame, with warnings about common data traps, used as a cheap prompt instead of raw rows
  • best-effort PII masking of sample values before they reach the model
  • a guardrail layer that screens and runs AI-written code on a copy, with a timeout, and feeds errors back so the model can correct itself
  • check_code() to screen code without running it, for use in your own agent
  • works with pandas or Polars, and ships with a safedata command-line tool
  • honest about its limits: defense in depth, not a security sandbox

What it does

1. It summarises your data before sending it to an AI.

Instead of pushing 100,000 rows into a prompt (slow and expensive), it sends a short summary: the columns, their types, a few sample values, and basic stats. The summary also flags common data problems that trip up analysis, such as:

  • numbers stored as text (like "$500" or "1,000")
  • the same category written several ways (like "California" and "CA")
  • ID columns that are not actually unique
  • columns that are completely empty
  • columns that are mostly empty
  • Excel dates stored as plain numbers (like 45292)
  • negative values in columns that should not have them
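A few of the checks listed above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: find_traps, looks_like_number, the "more than half missing" threshold, and the rule that a column whose name ends in "id" is an ID column are all assumptions made for this example.

```python
import re

# Hypothetical helpers for illustration; not part of safedata-guard's API.

def looks_like_number(text):
    """True if a string such as "$500" or "1,000" is a plain number once cleaned."""
    cleaned = text.strip().lstrip("$€£").replace(",", "")
    return re.fullmatch(r"-?\d+(\.\d+)?", cleaned) is not None


def find_traps(column_name, values):
    """Return a list of warning strings for one column (None means missing)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return [f"{column_name}: column is completely empty"]

    warnings = []
    missing_share = 1 - len(non_null) / len(values)
    if missing_share > 0.5:  # assumed threshold for "mostly empty"
        warnings.append(f"{column_name}: mostly empty ({missing_share:.0%} missing)")

    strings = [v for v in non_null if isinstance(v, str)]
    if strings and all(looks_like_number(s) for s in strings):
        warnings.append(f"{column_name}: numbers stored as text")

    # Assumed naming convention: a column ending in "id" should be unique.
    if column_name.lower().endswith("id") and len(set(non_null)) < len(non_null):
        warnings.append(f"{column_name}: ID column has duplicate values")
    return warnings


print(find_traps("amount", ["$500", "1,000", "42"]))
print(find_traps("customer_id", [1, 2, 2]))
```

Each warning is a short sentence, which is the kind of text that can go into a prompt in place of the rows themselves.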

2. It runs the AI's code with guardrails.

When the AI writes code to answer a question, safedata-guard runs it on a copy of your data, in a separate process with a timeout. Before running anything, an AST-based screen refuses unsafe imports, introspection/dunder tricks, dangerous builtins, and data/file readers and writers; a small set of analysis imports such as pandas, numpy, and datetime is allowed. The AI may add or transform columns freely, because it works on the copy. After the run, safedata-guard checks that the code did not silently drop rows and that the result is not silently empty. If something looks wrong, it sends the error back so the AI can fix its own code.
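To make the idea of an AST-based screen concrete, here is a minimal sketch using Python's standard ast module. It is not the library's screen: the function name screen, the allow and deny lists, and the wording of the messages are assumptions chosen for illustration, and a real screen would need to cover far more cases.

```python
import ast

# Illustrative lists only; the library's real lists are not shown in this document.
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime"}
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr", "setattr"}
BLOCKED_ATTRS = {"to_csv", "to_parquet", "to_pickle", "read_csv", "write_csv"}


def screen(code):
    """Return a list of reasons to refuse the code; an empty list means it passed."""
    try:
        tree = ast.parse(code)  # parsing only, nothing is executed
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of {alias.name!r} is not allowed")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_IMPORTS:
                problems.append(f"import from {node.module!r} is not allowed")
        elif isinstance(node, ast.Attribute):
            # Dunder access (such as __class__) and file readers/writers.
            if node.attr.startswith("__") or node.attr in BLOCKED_ATTRS:
                problems.append(f"attribute {node.attr!r} is not allowed")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"builtin {node.id!r} is not allowed")
    return problems


print(screen("result = df['amount'].sum()"))
print(screen("import os"))
```

Because the code is only parsed and walked, the screen can reject a snippet without ever running it. A name-based check like this is easy to evade on purpose, which is why the scope of the guardrails matters.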

Scope: please read this honestly. This is defense in depth for cooperative / semi-trusted model output: it stops the destructive accidents an honest model makes, and the obvious escape attempts. It is not a security sandbox for deliberately malicious untrusted code. In-process Python "sandboxes" have a long history of clever escapes, and on Windows a child process still shares your filesystem permissions, so the subprocess gives you timeout and crash isolation, not a filesystem jail. If you need to run code from an untrusted source, run safedata-guard inside OS-level isolation (a container, a locked-down user account, or a VM). It also does not prove the AI's maths is correct, which no tool can do in general.
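The difference between timeout isolation and a real sandbox can be seen in a small sketch built on the standard subprocess module. The helper run_with_timeout is hypothetical and is not how safedata-guard is necessarily implemented; it only shows what a separate process with a timeout does and does not provide.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_seconds=5.0):
    """Run code in a fresh Python process; return (ok, output_or_reason)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds} seconds"
    if completed.returncode != 0:
        # A crash in the child does not take the parent down with it.
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()


print(run_with_timeout("print(1 + 1)"))
print(run_with_timeout("while True: pass", timeout_seconds=1))
```

The child process here still runs as the same user, with the same access to files and the network. A runaway loop or a crash is contained, but nothing in this sketch stops the code from touching the filesystem, which is why OS-level isolation is still needed for untrusted code.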

Install

pip install safedata-guard

Using Polars instead of pandas

safedata-guard works with either pandas or Polars DataFrames. The safety screen, the copy-and-isolate execution, and the data-trap summary all handle both. To use Polars, install the extra:

pip install "safedata-guard[polars]"

Then pass a Polars frame anywhere you would pass a pandas one; the library detects the type. The safety screen blocks Polars' file writers and readers (write_csv, write_parquet, lazy sink_*, read_*, scan_*) the same way it blocks the pandas equivalents. The scope note above applies identically: it is defense in depth for cooperative model output, not a sandbox for malicious code.

PII masking in the summary

The summary sends a few real sample values to the LLM so it can write correct code. Those samples can contain personal data. By default, safedata-guard masks obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary leaves your machine, and notes which columns were masked.

safedata.summarize(df)                    # PII masked by default
safedata.summarize(df, redact_pii=False)  # raw samples, if you are sure

This is best-effort, regex-based redaction, not a compliance guarantee. It catches common well-formed patterns and will miss unusual formats, names, addresses, and free-text. If you handle regulated data, keep it out of third-party LLMs by policy; do not rely on a regex. Masking is on because leaking less is better than leaking more, but treat it as a seatbelt, not a vault.
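A small sketch shows both what regex-based masking catches and what it misses. The patterns, the mask_pii function, and the placeholder tags below are assumptions for illustration; the library's own patterns are not shown in this document.

```python
import re

# Illustrative patterns only; deliberately simple and far from exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace each match with a tag such as <email>; return (masked_text, kinds_found)."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<{kind}>", text)
        if count:
            found.append(kind)
    return text, found


print(mask_pii("contact jane@example.com"))
print(mask_pii("SSN 123-45-6789 from 10.0.0.1"))
print(mask_pii("Jane Doe, 12 High Street"))  # a name and address pass through unmasked
```

The last call returns its input unchanged: well-formed patterns are caught, while names, addresses, and free text are not.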

Command line: check a file in one line

After installing, you get a safedata command. Point it at a data file to see the quality summary, data-trap warnings, and token-saving estimate without writing any Python:

safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5

Supported file types: .csv, .tsv, .xlsx, .xls, .parquet, .json. The command only reads and summarises the file; it never executes code. --report also writes the HTML quality report, and --no-redact shows raw sample values instead of masking detected PII.

Quick example

This example runs end to end as written. The my_model function below stands in for a real model and returns hard-coded code as a string. In a real project you would replace its body with a call to a model of your choice.

import safedata
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model.
    # It should take the prompt text and return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"

agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")

print(out.answer)     # 300.0
print(out.blocked)    # False if the code passed the safety checks
print(out.attempts)   # the list of code attempts that were made

Connecting a real model with wrap()

Real models are messy. They wrap code in Markdown code fences, add sentences like "Here is the code:", and sometimes fail because of a bad key or no internet. safedata.wrap() takes any function that sends text to a model and returns the model's text, and it handles the messy parts for you: pulling the bare code out of the reply and turning failures into a clear message instead of a crash.

You write a small function that calls your model. It does not matter which model it is: a hosted one like Claude or GPT-4, a model running on your own machine, or your own custom function. As long as it takes text and returns text, it works.

import safedata

# Example shape of your own model call. Replace the body with your model.
def my_call(prompt):
    reply = some_model_that_takes_text_and_returns_text(prompt)
    return reply

agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)

Because wrap works with any text-in, text-out function, the library is not tied to a single provider. You can point it at whatever model you already have.
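For readers curious what "handling the messy parts" might involve, here is a rough sketch of the two jobs described above: pulling code out of a Markdown-fenced reply, and turning a failed call into a clear error. The names extract_code_sketch and wrap_sketch are hypothetical, RuntimeError stands in for the library's ModelError, and the real wrap() may behave differently.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example can sit inside a fence

# Match an optional "python"/"py" label, then capture everything up to the closing fence.
_FENCED = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code_sketch(reply):
    """Return the first fenced code block in reply, or the whole reply if none is found."""
    match = _FENCED.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def wrap_sketch(call):
    """Wrap a text-in, text-out function so it returns bare code or raises a clear error."""
    def model(prompt):
        try:
            reply = call(prompt)
        except Exception as exc:  # bad key, no network, and so on
            raise RuntimeError(f"model call failed: {exc}") from exc
        return extract_code_sketch(reply)
    return model


messy = "Here is the code:\n" + FENCE + "python\nresult = 1 + 1\n" + FENCE + "\nHope it helps."
print(extract_code_sketch(messy))
```

Falling back to the whole reply when no fence is found means a model that already returns bare code keeps working unchanged.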

A note on quality: the library connects to any model, but the quality of the answers depends on the model. A strong model writes good code on the first try. A weaker model may write code that gets blocked by the safety checks and then retried. The library stays safe either way, but a better model means fewer retries.

Use the parts on their own

print(safedata.summarize(df))                  # the data summary with warnings
result = safedata.run_safely(code_string, df)  # run code through the safety layer
verdict = safedata.check_code(code_string)     # is this code safe? (does NOT run it)
safedata.report(df, "report.html")             # write an HTML quality report
print(safedata.token_savings(df))              # estimated token (cost) saving

Token and cost saving

Sending a whole table to a model can cost a lot, because every row becomes tokens that you pay for. safedata sends a short summary instead, which is far smaller. safedata.token_savings(df) shows the estimated saving in plain words:

print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).

Every agent.ask(...) result also carries a tokens estimate you can inspect:

out = agent.ask(df, "What were total sales in 2025?")
print(out.tokens)   # {'summary_tokens': ..., 'raw_tokens': ..., 'saved_percent': ...}

These numbers are estimates. Each model provider counts tokens with its own method, so exact figures vary, but the estimate shows the scale of the saving.

HTML report

safedata.report(df, "report.html") writes a simple web page that lists each column with a red, amber or green status, the problems it found, and suggested fixes. It is meant to be readable by someone who does not write code. Call it without a path to get the HTML back as a string instead.

How the question-answering loop works

  1. The data is summarised, including the trap warnings.
  2. Your model writes code based on the summary and the question.
  3. The code is screened for safety, then run on a copy of the data.
  4. If it is blocked or fails, the error is sent back and the model tries again, up to max_retries times.
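The loop above can be sketched in a few lines. Everything here is hypothetical: ask_with_retries, its arguments, and the returned dictionary are made up for the example, ValueError stands in for the library's SafetyError, and counting one initial attempt plus max_retries corrections is an assumption about how retries are counted.

```python
def ask_with_retries(model, run, prompt, max_retries=3):
    """Ask for code, run it, and feed errors back until it works or retries run out.

    model(prompt) returns code as a string.
    run(code) returns a result, or raises ValueError when the code is refused.
    """
    attempts = []
    for _ in range(max_retries + 1):  # one first try, then up to max_retries fixes
        code = model(prompt)
        attempts.append(code)
        try:
            return {"answer": run(code), "blocked": False, "attempts": attempts}
        except ValueError as exc:
            # Send the error back so the next attempt can correct itself.
            prompt = f"{prompt}\n\nYour previous code was rejected: {exc}\nPlease fix it."
    return {"answer": None, "blocked": True, "attempts": attempts}


def fake_model(prompt):
    # Stand-in model: misbehaves once, then corrects itself after seeing the error.
    return "result = 300.0" if "rejected" in prompt else "import os"


def fake_run(code):
    # Stand-in safety layer: refuses imports, otherwise returns a fixed answer.
    if "import" in code:
        raise ValueError("imports are not allowed")
    return 300.0


out = ask_with_retries(fake_model, fake_run, "What were total sales in 2025?")
print(out["answer"], out["blocked"], len(out["attempts"]))
```

Keeping every attempt in a list mirrors the .attempts field described in the function reference and makes it easy to see how many corrections a given model needed.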

About the model

The library does not include a model. You supply one through a single function, so you can use a local model or a hosted one without changing anything else in your code.

Full function reference

Everything the library makes available:

Asking questions

  • safedata.Agent(model, max_retries=3) builds an agent. model is a function that takes a prompt and returns code (usually made with wrap). max_retries is how many times the model may correct itself after a block.
  • agent.ask(df, question, verbose=False) runs the full loop and returns a result object. Set verbose=True to print each code attempt as it happens.
  • The result object has: .answer (the result), .blocked (True if it could not be completed safely), .reason (why it was blocked, if so), .attempts (the list of code attempts), and .tokens (the token estimate for the call).

Connecting a model

  • safedata.wrap(call, clean=...) turns any text-in, text-out function into a model the agent can use. It strips code out of messy replies and turns failures into a clear ModelError.
  • safedata.extract_code(text) is the helper that pulls bare Python code out of a reply (handling Markdown code fences and chatter). wrap uses it by default; you can call it yourself or pass your own version to wrap.
  • safedata.ModelError is the error raised when a wrapped model call fails (bad key, no internet, unusable output).

Looking at the data

  • safedata.summarize(df) returns the short text summary, including the data trap warnings. This is what gets sent to the model.
  • safedata.report(df, path=None) writes an HTML quality report to path, or returns the HTML as a string if no path is given.

Running code safely on its own

  • safedata.run_safely(code, df, result_var="result") runs a piece of code against a copy of df, blocks unsafe operations, checks that nothing was damaged, and returns the value of the result variable. Raises SafetyError if the code is unsafe.
  • safedata.SafetyError is the error raised when code is blocked.

Token and cost estimates

  • safedata.token_savings(df) returns a readable sentence describing the estimated token saving.
  • safedata.token_stats(df) returns the raw numbers as a dictionary: summary_tokens, raw_tokens, saved_tokens, saved_percent.
  • safedata.estimate_tokens(text) estimates the number of tokens in any piece of text, using a rough rule of about four characters per token.

All token figures are estimates. Each model provider counts tokens with its own method, so exact numbers vary, but the estimate shows the scale of the saving.
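The four-characters-per-token rule mentioned above is simple enough to work through by hand. The sketch below is not the library's code: the function names are hypothetical, and rounding up and rounding the percentage to two decimals are assumptions.

```python
import math


def estimate_tokens_sketch(text, chars_per_token=4):
    """Rough token count: about one token per four characters, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def token_stats_sketch(summary_text, raw_text):
    """Compare the estimated cost of sending a summary with sending the raw data."""
    summary_tokens = estimate_tokens_sketch(summary_text)
    raw_tokens = estimate_tokens_sketch(raw_text)
    saved_tokens = raw_tokens - summary_tokens
    saved_percent = round(100 * saved_tokens / raw_tokens, 2) if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved_tokens,
        "saved_percent": saved_percent,
    }


# A 40-character summary standing in for 4,000 characters of raw rows.
print(token_stats_sketch("a" * 40, "b" * 4000))
```

With these inputs the summary is estimated at 10 tokens and the raw text at 1,000, a saving of 99%. Real tokenizers split text differently, so the figures only indicate scale.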

License

MIT
