A data-quality checker and guardrail layer between an AI and your data. It summarises data with quality warnings and runs AI-written code with guardrails (defense in depth, not a security sandbox).

safedata-guard

A safety layer and data-quality checker that sits between an AI and your data.

Most "chat with your data" tools let the AI run code on your DataFrame without any checks. safedata-guard adds two things those tools usually skip.

What it does

1. It summarises your data before sending it to an AI.

Instead of pushing 100,000 rows into a prompt (slow and expensive), it sends a short summary: the columns, their types, a few sample values, and basic stats. The summary also flags common data problems that trip up analysis, such as:

  • numbers stored as text (like "$500" or "1,000")
  • the same category written several ways (like "California" and "CA")
  • ID columns that are not actually unique
  • columns that are completely empty
  • columns that are mostly empty
  • Excel dates stored as plain numbers (like 45292)
  • negative values in columns that should not have them

This is more useful than df.describe(), which reports stats but does not warn you about these traps.
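To make a few of the traps above concrete, here is a minimal sketch of how they could be detected with plain Python. It is not safedata-guard's implementation: the helper names, the regular expression, and the serial-number range are all assumptions chosen for illustration.

```python
import re

# Hypothetical helpers for illustration; not part of safedata-guard's API.
# Optional currency symbol, optional minus sign, digits with optional
# thousands separators, optional decimal part.
_NUMERIC_TEXT = re.compile(r"^\s*[$€£]?-?\d[\d,]*(\.\d+)?\s*$")


def looks_like_number_text(values):
    """True if every non-empty value is a string that reads as a number."""
    present = [v for v in values if v not in (None, "")]
    return bool(present) and all(
        isinstance(v, str) and _NUMERIC_TEXT.match(v) is not None
        for v in present
    )


def has_duplicate_ids(values):
    """True if a supposed ID column contains repeated values."""
    return len(set(values)) != len(values)


def looks_like_excel_dates(values, low=20000, high=80000):
    """True if all values are whole numbers inside an assumed range of
    Excel serial dates (the bounds are a guess, not a standard)."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and float(v).is_integer() and low <= v <= high
        for v in present
    )
```

A column such as ["$500", "1,000"] would be flagged by looks_like_number_text, and [45292, 45293] by looks_like_excel_dates. Real data needs more care than this (locales, mixed types, legitimate large integers), which is why the summary reports these as warnings rather than fixing them silently.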

2. It runs the AI's code with guardrails.

When the AI writes code to answer a question, safedata-guard runs it on a copy of your data, in a separate process with a timeout. Before the code runs, an AST-based screen refuses unsafe imports, introspection and dunder tricks, dangerous builtins, and data or file readers and writers. A small set of analysis imports, such as pandas, numpy, and datetime, is allowed. The AI may add or transform columns freely because it works on the copy. After the run, safedata-guard checks that the code did not silently drop rows and that the result is not silently empty. If something looks wrong, it sends the error back so the AI can fix its own code.
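As a rough illustration of what an AST-based screen looks like, the sketch below walks the parsed code with Python's standard ast module and collects reasons to refuse it. The allowlist, the blocklist, and the function name screen are assumptions for this example; the library's real rules are broader and are not reproduced here.

```python
import ast

# Assumed lists for illustration only; the library's real lists may differ.
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime"}
BLOCKED_BUILTINS = {"eval", "exec", "open", "compile", "__import__", "getattr"}


def screen(code):
    """Return a list of reasons to refuse the code; an empty list means it passed.

    The code is only parsed, never executed. ast.parse raises SyntaxError
    if the text is not valid Python.
    """
    problems = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_IMPORTS:
                problems.append(f"import from '{node.module}' is not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute '{node.attr}' is not allowed")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_BUILTINS:
            problems.append(f"builtin '{node.id}' is not allowed")
    return problems
```

With these assumed lists, screen("import pandas as pd") returns an empty list, while screen("import os") returns one reason. A screen of this kind catches the obvious cases; it is one layer among several, not proof that code is harmless.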

Scope: please read this honestly. This is defense in depth for cooperative / semi-trusted model output: it stops the destructive accidents an honest model makes, and the obvious escape attempts. It is not a security sandbox for deliberately malicious untrusted code. In-process Python "sandboxes" have a long history of clever escapes, and on Windows a child process still shares your filesystem permissions, so the subprocess gives you timeout and crash isolation, not a filesystem jail. If you need to run code from an untrusted source, run safedata-guard inside OS-level isolation (a container, a locked-down user account, or a VM). It also does not prove the AI's maths is correct, which no tool can do in general.
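The difference between "timeout and crash isolation" and "a jail" is easy to see in a small sketch. The example below uses only the standard library's subprocess module; the function name run_with_timeout is hypothetical, and this is not how safedata-guard passes your data to the child process.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_seconds=5.0):
    """Run code in a separate Python process and return (ok, detail).

    A hang or crash in the child cannot take the parent down with it,
    but the child runs as the same user with the same filesystem
    permissions, so nothing here stops it from touching files.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return False, f"timed out after {timeout_seconds} seconds"
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()
```

An infinite loop passed to this function comes back as a timeout instead of freezing the caller, and an exception in the child comes back as its traceback text. Filesystem access is untouched, which is why the note above recommends OS-level isolation for untrusted code.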

Install

pip install safedata-guard

Using Polars instead of pandas

safedata-guard works with either pandas or Polars DataFrames. The safety screen, the copy-and-isolate execution, and the data-trap summary all handle both. To use Polars, install the extra:

pip install "safedata-guard[polars]"

Then pass a Polars frame anywhere you would pass a pandas one; the library detects the type. The safety screen blocks Polars' file writers and readers (write_csv, write_parquet, lazy sink_*, read_*, scan_*) the same way it blocks the pandas equivalents. The scope note above applies identically: it is defense in depth for cooperative model output, not a sandbox for malicious code.

PII masking in the summary

The summary sends a few real sample values to the LLM so it can write correct code. Those samples can contain personal data. By default, safedata-guard masks obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary leaves your machine, and notes which columns were masked.

safedata.summarize(df)                    # PII masked by default
safedata.summarize(df, redact_pii=False)  # raw samples, if you are sure

This is best-effort, regex-based redaction, not a compliance guarantee. It catches common well-formed patterns and will miss unusual formats, names, addresses, and free-text. If you handle regulated data, keep it out of third-party LLMs by policy; do not rely on a regex. Masking is on because leaking less is better than leaking more, but treat it as a seatbelt, not a vault.
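To show why regex-based masking is a seatbelt and not a vault, here is a minimal sketch of the technique. The patterns and the mask_pii name are assumptions for illustration and are not the library's actual rules.

```python
import re

# Assumed patterns for illustration; the library's real patterns may differ.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace each match with a [KIND] placeholder.

    Returns the masked text and the list of PII kinds that were found.
    """
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{kind.upper()}]", text)
        if count:
            found.append(kind)
    return text, found
```

Well-formed emails, SSNs, and IP addresses are replaced, but a value like "Jane Doe, 12 Elm Street" passes through unchanged because no pattern describes a name or an address.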

Command line: check a file in one line

After installing, you get a safedata command. Point it at a data file to see the quality summary, data-trap warnings, and token-saving estimate without writing any Python:

safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5

Supported file types: .csv, .tsv, .xlsx, .xls, .parquet, .json. The command only reads and summarises the file; it never executes code. --report also writes the HTML quality report, and --no-redact shows raw sample values instead of masking detected PII.

Quick example

This runs end to end today. The my_model function below returns the code as a string. In a real project you would replace its body with a call to a model of your choice.

import safedata
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model.
    # It should take the prompt text and return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"

agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")

print(out.answer)     # 300.0
print(out.blocked)    # False if the code passed the safety checks
print(out.attempts)   # the list of code attempts that were made

Connecting a real model with wrap()

Real models are messy. They wrap code in ```python fences, add sentences like "Here is the code:", and sometimes fail because of a bad key or no internet. safedata.wrap() takes any function that sends text to a model and returns the model's text, and it handles the messy parts for you: pulling the bare code out of the reply and turning failures into a clear message instead of a crash.

You write a small function that calls your model. It does not matter which model it is: a hosted one like Claude or GPT-4, a model running on your own machine, or your own custom function. As long as it takes text and returns text, it works.

import safedata

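# Example shape of your own model call. Replace the body with your model.
# some_model_that_takes_text_and_returns_text is a placeholder, not a real function.
def my_call(prompt):
    reply = some_model_that_takes_text_and_returns_text(prompt)
    return reply

# df is the DataFrame from the quick example above.
agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")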
print(out.answer)

Because wrap works with any text-in, text-out function, the library is not tied to a single provider. You can point it at whatever model you already have.
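For readers curious about what "handling the messy parts" involves, the sketch below shows one way to pull code out of a fenced reply and to turn a failed call into a single clear error. The names pull_code_from_reply, wrap_sketch, and ModelCallError are hypothetical stand-ins; they are not the library's extract_code, wrap, or ModelError, whose internals may differ.

```python
import re

# Three backticks, built this way so the example is easy to embed in docs.
FENCE = "`" * 3
_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


class ModelCallError(Exception):
    """Hypothetical stand-in for the library's ModelError."""


def pull_code_from_reply(reply):
    """Return the first fenced block if there is one, else the stripped reply."""
    match = _BLOCK.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def wrap_sketch(call):
    """Wrap a text-in, text-out function so it returns bare code."""
    def model(prompt):
        try:
            reply = call(prompt)
        except Exception as exc:
            # Bad key, no network, and so on: surface one clear error type.
            raise ModelCallError(f"model call failed: {exc}") from exc
        return pull_code_from_reply(reply)
    return model
```

A reply that says "Here is the code:" followed by a fenced block comes back as only the code inside the fence, and a reply with no fence is returned stripped of surrounding whitespace.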

A note on quality: the library connects to any model, but the quality of the answers depends on the model. A strong model writes good code on the first try. A weaker model may write code that gets blocked by the safety checks and then retried. The library stays safe either way, but a better model means fewer retries.

Use the parts on their own

print(safedata.summarize(df))                  # the data summary with warnings
result = safedata.run_safely(code_string, df)  # run code through the safety layer
verdict = safedata.check_code(code_string)     # is this code safe? (does NOT run it)
safedata.report(df, "report.html")             # write an HTML quality report
print(safedata.token_savings(df))              # estimated token (cost) saving

Token and cost saving

Sending a whole table to a model can cost a lot, because every row becomes tokens that you pay for. safedata-guard sends a short summary instead, which is far smaller. safedata.token_savings(df) shows the estimated saving in plain words:

print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).

Every agent.ask(...) result also carries a tokens estimate you can inspect:

out = agent.ask(df, "What were total sales in 2025?")
print(out.tokens)   # {'summary_tokens': ..., 'raw_tokens': ..., 'saved_percent': ...}

These numbers are estimates. Each model provider counts tokens with its own method, so exact figures vary, but the estimate shows the scale of the saving.
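The arithmetic behind such an estimate is simple. The sketch below assumes the rough rule of about four characters per token that the function reference mentions for estimate_tokens; the function names and the rounding choices are assumptions, so the figures will not match the library's output exactly.

```python
import math


def rough_tokens(text, chars_per_token=4):
    """Estimate tokens from character count; a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def rough_saving(summary_text, raw_text):
    """Compare the estimated size of a summary with the estimated size of the raw data."""
    summary_tokens = rough_tokens(summary_text)
    raw_tokens = rough_tokens(raw_text)
    saved_tokens = raw_tokens - summary_tokens
    saved_percent = round(100 * saved_tokens / raw_tokens, 2) if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved_tokens,
        "saved_percent": saved_percent,
    }
```

For a 400-character summary of a 40,000-character table, this gives 100 tokens against 10,000, a saving of 99.0%. The saving grows with the table because the summary stays roughly the same size while the raw data does not.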

HTML report

safedata.report(df, "report.html") writes a simple web page that lists each column with a red, amber or green status, the problems it found, and suggested fixes. It is meant to be readable by someone who does not write code. Call it without a path to get the HTML back as a string instead.

How the question-answering loop works

  1. The data is summarised, including the trap warnings.
  2. Your model writes code based on the summary and the question.
  3. The code is screened for safety, then run on a copy of the data, and the result is checked.
  4. If the code is blocked or a check fails, the error is sent back and the model tries again, up to max_retries times.
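The loop above can be sketched in a few lines. This is an illustration of the control flow only, not the Agent class: ask_loop and its check and run callables are hypothetical, and it assumes max_retries counts corrections after the first attempt.

```python
def ask_loop(summary, question, model, check, run, max_retries=3):
    """Sketch of a summarise, generate, check, retry loop.

    model(prompt) returns code as a string.
    check(code) returns a reason string if the code is refused, else None.
    run(code) executes accepted code and returns the answer.
    """
    prompt = f"{summary}\n\nQuestion: {question}"
    attempts = []
    reason = None
    # One first attempt plus up to max_retries corrections (an assumption).
    for _ in range(max_retries + 1):
        code = model(prompt)
        attempts.append(code)
        reason = check(code)
        if reason is None:
            return {"answer": run(code), "blocked": False,
                    "reason": None, "attempts": attempts}
        # Feed the refusal back so the model can correct its own code.
        prompt = f"{prompt}\n\nYour last code was blocked: {reason}. Please fix it."
    return {"answer": None, "blocked": True,
            "reason": reason, "attempts": attempts}
```

The returned dictionary mirrors the fields of the result object described in the function reference (answer, blocked, reason, attempts). The important design point is that a refusal is not a dead end: the reason goes back into the prompt, so a cooperative model gets a chance to repair its code.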

About the model

The library does not include a model. You supply one through a single function, so you can use a local model or a hosted one without changing anything else in your code.

Full function reference

Everything the library makes available:

Asking questions

  • safedata.Agent(model, max_retries=3) builds an agent. model is a function that takes a prompt and returns code (usually made with wrap). max_retries is how many times the model may correct itself after a block.
  • agent.ask(df, question, verbose=False) runs the full loop and returns a result object. Set verbose=True to print each code attempt as it happens.
  • The result object has: .answer (the result), .blocked (True if it could not be completed safely), .reason (why it was blocked, if so), .attempts (the list of code attempts), and .tokens (the token estimate for the call).

Connecting a model

  • safedata.wrap(call, clean=...) turns any text-in, text-out function into a model the agent can use. It strips code out of messy replies and turns failures into a clear ModelError.
  • safedata.extract_code(text) is the helper that pulls bare Python code out of a reply (handling ``` fences and chatter). wrap uses it by default; you can call it yourself or pass your own version to wrap.
  • safedata.ModelError is the error raised when a wrapped model call fails (bad key, no internet, unusable output).

Looking at the data

  • safedata.summarize(df) returns the short text summary, including the data trap warnings. This is what gets sent to the model.
  • safedata.report(df, path=None) writes an HTML quality report to path, or returns the HTML as a string if no path is given.

Running code safely on its own

  • safedata.run_safely(code, df, result_var="result") runs a piece of code against a copy of df, blocks unsafe operations, checks afterwards that no rows were silently dropped and that the result is not silently empty, and returns the value of the result variable. Raises SafetyError if the code is unsafe.
  • safedata.SafetyError is the error raised when code is blocked.

Token and cost estimates

  • safedata.token_savings(df) returns a readable sentence describing the estimated token saving.
  • safedata.token_stats(df) returns the raw numbers as a dictionary: summary_tokens, raw_tokens, saved_tokens, saved_percent.
  • safedata.estimate_tokens(text) estimates the number of tokens in any piece of text, using a rough rule of about four characters per token.

All token figures are estimates. Each model provider counts tokens with its own method, so exact numbers vary, but the estimate shows the scale of the saving.

License

MIT
