Semantic search, PII masking, and schema understanding for Polars DataFrames
Omna
Semantic search, PII masking, and schema understanding — directly on your Polars DataFrames. No vector database. No API key. Data never leaves your machine.
The problem
# Finding every insurance claim denial — painful
keywords = ["claim denied", "coverage rejected", "policy voided", ...]
pattern = re.compile("|".join(keywords), re.IGNORECASE)
results = df[df["text"].str.contains(pattern, na=False)]
# Still misses: "insurer refused to honour the policy"
# Still misses: "claim outcome: not payable"
# Still misses: medical claim rejections using clinical terminology
# ...50+ lines per task. Grows with every edge case. Still wrong.
# With Omna
results = df.omna.search("insurance claim denied", on="text", k=5)
# Finds ALL of them — including docs that never say "denied" literally.
# 9ms. 50,000 documents. Zero cloud.
filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# Every semantically matching document above the threshold.
# No keyword lists. No guesswork. Pure meaning.
answer = results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
# dates of birth, health plan numbers, and claimant identifiers."
# Instant. One line.
# Auditing for PII before the data ships — painful
for col in df.columns:
    for i, val in enumerate(df[col].to_list()):
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', str(val)):  # SSNs only
            print(f"row {i}, {col}: {str(val)[:60]}")
# Catches one pattern. Misses emails, phones, names, IBANs.
# No confidence score. No audit trail. No redaction.
# With Omna
df.omna.pii_report() # audit — find every leak, every column
df.omna.mask_pii() # redact — one line, full audit log
# Names, SSNs, emails, phone numbers — all gone. Local. No cloud.
Demo
The Sword — semantic search, filter, and ask across 50,000 documents:
The Shield — PII audit and redaction in one line:
Dataset: Gretel PII Benchmark (acquired by NVIDIA) — 50,000 synthetic documents built to test data privacy tools.
Install
pip install omna
python -m spacy download en_core_web_lg # one-time, for PII detection
Requires Python 3.10+. No API key needed for search, filter, embed, pii_report, mask_pii, or understand. Only ask() requires ANTHROPIC_API_KEY.
Quick start
import polars as pl
import omna
df = pl.read_csv("documents.csv")
# 1 — explore the schema
omna.understand_df(df)
# 2 — audit for PII before anything touches the data
df.omna.pii_report()
# 3 — redact
clean = df.omna.mask_pii()
# 4 — build a search index once
clean.omna.embed("text")
# 5 — search by meaning
results = clean.omna.search("insurance claim denied", on="text", k=5)
# 6 — filter everything above a threshold
flagged = clean.omna.filter("insurance claim denied", on="text", threshold=0.73)
# 7 — ask a question in plain English
results.omna.ask("What personal data do these documents expose?")
What Omna does
| Method | What it does |
|---|---|
| omna.understand_df(df) | Schema inference: labels, null rates, samples. No LLM. |
| df.omna.embed(column) | Vectorize a text column once; reuse across sessions |
| df.omna.search(query, on, k) | Top-k results by semantic meaning |
| df.omna.filter(query, on, threshold) | Every row above a similarity threshold |
| df.omna.pii_report() | Audit every string column for PII |
| df.omna.mask_pii() | Redact PII, auto-save audit log |
| df.omna.ask(question) | Natural language queries over your DataFrame |
API reference
omna.understand_df(df) — explore before you do anything
No LLM. No API call. Analyzes column names, dtypes, null rates, and sample values.
omna.understand_df(df)
column dtype null_pct label sample
uid String 0.0% category 24bb757...
domain String 0.0% category insurance, healthcare...
document_type String 0.0% category Invoice, ClaimForm...
document_description String 0.0% text An insurance claim...
text String 0.0% text **Claim ID: 285-14...
Labels: email phone name id date text numeric boolean category unknown
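To make the labels concrete, here is a minimal sketch of how a column could be assigned one of them from its name and sample values alone. Omna's actual rules are not documented here, so `infer_label`, `null_pct`, the regexes, and the 50-character cutoff between text and category are all assumptions for illustration.

```python
import re
from datetime import date, datetime

# Hypothetical patterns; the real detection rules may differ.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def null_pct(values):
    """Percentage of missing values in a column."""
    return 100.0 * sum(v is None for v in values) / len(values) if values else 0.0


def infer_label(name, values):
    """Pick a label for a column from its name and values (heuristic sketch)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    # bool is a subclass of int, so test for it before the numeric check.
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in non_null):
        return "numeric"
    if all(isinstance(v, (date, datetime)) for v in non_null):
        return "date"
    if not all(isinstance(v, str) for v in non_null):
        return "unknown"
    if all(EMAIL_RE.match(v) for v in non_null):
        return "email"
    if all(PHONE_RE.match(v) for v in non_null):
        return "phone"
    lowered = name.lower()
    if lowered == "id" or lowered.endswith("_id"):
        return "id"
    if "name" in lowered:
        return "name"
    # Assumed cutoff: long strings are free text, short ones are categories.
    avg_len = sum(len(v) for v in non_null) / len(non_null)
    return "text" if avg_len > 50 else "category"
```

No model is involved at any point, which is consistent with the "No LLM. No API call." description above.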
df.omna.embed(column) — vectorize once, search forever
Converts text to 384-dimensional vectors using FastEmbed (local ONNX, no API key). Saves to .omna/{column}.parquet. Run once — search() and filter() load it automatically on every subsequent call.
df.omna.embed("text")
# → .omna/text.parquet
Model: BAAI/bge-small-en-v1.5 (~130 MB, downloaded once). Embed is a one-time cost.
| Hardware | 50k rows |
|---|---|
| MacBook Air M5 | ~45 min |
| MacBook Pro M4 Max | ~15 min |
| AWS GPU instance | ~2 min |
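The "embed once, reuse afterwards" pattern can be sketched with the standard library alone. This is not Omna's implementation: `toy_embed` is a non-semantic stand-in for the FastEmbed model, the cache is a JSON file rather than Parquet, and `load_or_build_index` is a hypothetical helper name.

```python
import hashlib
import json
import math
import tempfile
from pathlib import Path

DIM = 16  # toy size; the model described above produces 384 dimensions


def toy_embed(text):
    """Stand-in embedding: hashed bag-of-words, L2-normalised. Not semantic."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


calls = {"embed": 0}  # counts how often the expensive step actually runs


def load_or_build_index(texts, column, cache_dir):
    """Embed `texts` once and serve the saved vectors on later calls."""
    path = Path(cache_dir) / f"{column}.json"
    if path.exists():
        return json.loads(path.read_text())
    calls["embed"] += 1
    vectors = [toy_embed(t) for t in texts]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(vectors))
    return vectors


docs = ["claim denied by insurer", "invoice for services"]
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build_index(docs, "text", tmp)
    second = load_or_build_index(docs, "text", tmp)  # read back from disk
```

The second call never touches the embedding function, which is why the embedding time in the table is paid only once per column.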
df.omna.search(query, on, k) — semantic search
Requires df.omna.embed("column") first.
results = df.omna.search("insurance claim denied", on="text", k=5)
uid document_type domain text _score
67fccc1e207… ClaimSummary insurance **Claim ID: 285-14-1755, Policy… 0.762
b8ae088cd21… ClaimSummary insurance **Claim Summary**… 0.749
de5bba0a2cc… Insurance Claim Form healthcare **Insurance Claim Form**… 0.748
ebccdde3b42… Insurance Claim healthcare Insurance Claim for MED74974358… 0.747
aebb0eb55fb… ClaimForm healthcare **Claim Form** - Patient ID… 0.747
_score is cosine similarity (0–1). None of these documents contain the phrase "insurance claim denied" — Omna finds them by meaning.
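As a rough illustration of what a top-k cosine search computes, the sketch below scores every stored vector against the query and keeps the k best. It is plain Python rather than the Rust kernel, and `cosine` and `top_k` are hypothetical names. Note that cosine similarity can in general be negative, as the last toy vector shows.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return (row_index, score) pairs for the k most similar vectors, best first."""
    scored = ((i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings"; real ones have 384 dimensions.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], doc_vecs, k=2)
```

The returned row indices are what would be used to slice the original DataFrame, with each score attached as the `_score` column.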
df.omna.filter(query, on, threshold) — semantic filter
Requires df.omna.embed("column") first.
filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# → N documents matched — all semantically related to claim denials
Returns every row above the threshold. Default: 0.3. Raise for precision, lower for recall.
Use search() for the top k. Use filter() for everything above a threshold.
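The precision/recall trade-off is easy to see in a small sketch. `filter_rows` is a hypothetical stand-in for the real method, and treating "above the threshold" as a strict comparison is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_rows(query_vec, doc_vecs, threshold=0.3):
    """Indices of every row scoring above the threshold, in original row order."""
    return [i for i, v in enumerate(doc_vecs) if cosine(query_vec, v) > threshold]


# Toy vectors with scores of about 1.0, 0.8, 0.71 and 0.0 against the query.
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.5, 0.5], [0.0, 1.0]]
query = [1.0, 0.0]

loose = filter_rows(query, doc_vecs)                   # default 0.3: favours recall
strict = filter_rows(query, doc_vecs, threshold=0.73)  # favours precision
```

Unlike a top-k search, the number of rows returned is not fixed in advance; it depends entirely on how many rows clear the threshold.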
df.omna.pii_report() — audit before you redact
df.omna.pii_report()
column detected types hit rate flagged
entities CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER 85.4% ✓ YES
text CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER 78.1% ✓ YES
Scans every string column. Returns hit rates, PII types, and confidence scores. Nothing is modified.
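The shape of such a report can be sketched with two regex detectors. This is only an illustration: the real detection uses Presidio and spaCy, `scan_columns` and `DETECTORS` are hypothetical names, and defining hit rate as the share of rows with at least one detection is an assumption.

```python
import re

# Two illustrative detectors; a real scan covers many more entity types.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_columns(columns):
    """columns: dict of column name -> list of values. Read-only summary per string column."""
    report = {}
    for name, values in columns.items():
        strings = [v for v in values if isinstance(v, str)]
        if not strings:
            continue  # non-string columns are skipped
        types, hits = set(), 0
        for text in strings:
            found = {t for t, rx in DETECTORS.items() if rx.search(text)}
            types |= found
            hits += bool(found)
        report[name] = {
            "detected_types": sorted(types),
            "hit_rate": hits / len(strings),
            "flagged": hits > 0,
        }
    return report


table = {
    "text": [
        "Contact jo@example.com",
        "SSN 123-45-6789 on file",
        "No identifiers here",
        "All clear",
    ],
    "amount": [10, 20, 30, 40],
}
report = scan_columns(table)
```

The input is never modified, matching the audit-only behaviour described above.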
df.omna.mask_pii() — redact in one line
clean = df.omna.mask_pii()
# → <REDACTED> replaces every detected entity
# → audit log saved to .omna/pii_audit.parquet automatically
# Fast mode — regex only, ~10x faster, catches email/phone/SSN/URL
clean = df.omna.mask_pii(fast=True)
Detects: PERSON EMAIL_ADDRESS PHONE_NUMBER CREDIT_CARD US_SSN US_PASSPORT IP_ADDRESS IBAN_CODE URL and more.
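A regex-only pass like the fast mode described above might look roughly like this. The patterns, `mask_text`, `mask_column`, and the audit record layout are assumptions for illustration; the real audit log is a Parquet file whose schema is not shown here.

```python
import re

# A subset of what a regex-only mode could catch; patterns are illustrative.
FAST_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "URL": re.compile(r"https?://\S+"),
}


def mask_text(text, row, column, audit):
    """Replace each match with <REDACTED> and append one audit record per match."""
    for entity_type, pattern in FAST_PATTERNS.items():
        def redact(match, entity_type=entity_type):
            # Record where and what kind, but never the sensitive value itself.
            audit.append({"row": row, "column": column, "type": entity_type})
            return "<REDACTED>"
        text = pattern.sub(redact, text)
    return text


def mask_column(values, column):
    """Mask every string in a column; returns (masked_values, audit_records)."""
    audit = []
    masked = [
        mask_text(v, i, column, audit) if isinstance(v, str) else v
        for i, v in enumerate(values)
    ]
    return masked, audit


rows = [
    "Email jo@example.com, SSN 123-45-6789, portal https://example.com/claims",
    "Nothing sensitive",
    None,
]
masked, audit = mask_column(rows, "text")
```

Regexes cannot recognise names or locations, which is the usual reason a slower NER-based mode exists alongside a fast one.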
df.omna.ask(question) — natural language queries
Sends schema + up to 20 sample rows to Claude. Requires ANTHROPIC_API_KEY.
export ANTHROPIC_API_KEY=sk-ant-...
results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
# dates of birth, health plan numbers, and claimant identifiers."
# Override model
results.omna.ask("Summarise the key themes", model="claude-sonnet-4-6")
Default model: claude-haiku-4-5-20251001.
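Because ask() is the one call that sends data off the machine, it can help to picture what "schema + up to 20 sample rows" amounts to. The sketch below only builds the text; it makes no network call. The payload layout and the `build_ask_payload` helper are assumptions, not Omna's actual prompt format.

```python
import json

MAX_SAMPLE_ROWS = 20  # "up to 20 sample rows", per the description above


def build_ask_payload(schema, rows, question, max_rows=MAX_SAMPLE_ROWS):
    """Assemble schema, a capped row sample, and the question into one prompt string."""
    sample = rows[:max_rows]
    return "\n".join([
        "Schema:",
        json.dumps(schema, indent=2),
        f"Sample rows ({len(sample)} of {len(rows)}):",
        *(json.dumps(r) for r in sample),
        f"Question: {question}",
    ])


schema = {"uid": "String", "text": "String"}
rows = [{"uid": f"doc-{i}", "text": f"document {i}"} for i in range(50)]
payload = build_ask_payload(
    schema, rows, "What personal data do these documents expose?"
)
```

Whatever the real format is, the sample rows are sent as they are, so calling ask() on the output of mask_pii() keeps raw PII out of the request.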
How it works
df.omna.search("insurance claim denied", on="text", k=5)
│
▼
embedder.py FastEmbed — BAAI/bge-small-en-v1.5, local ONNX
query → [0.12, -0.34, 0.87, ...] 384-dim vector
│
▼
index.py loads .omna/text.parquet → Arrow memory, zero-copy
50,000 stored vectors in Polars' own allocation
│
▼
similarity.rs Rust kernel — cosine similarity over all vectors
returns top-k sorted descending, no Python loop
│
▼
frame.py slices result rows, attaches _score → pl.DataFrame
The Rust kernel is 23 lines. Dot products and norms in machine code, no intermediate allocations. 500,000 × 384-dim in under 10ms on a single core.
Performance
| | 50k rows | 500k rows |
|---|---|---|
| Omna search | 9ms | 27ms |
| Omna filter | 9ms | 27ms |
| Pandas + FAISS | ~25ms + index build | ~25ms + index build |
| Polars keyword regex | 1ms — exact match only | 1ms — exact match only |
Benchmarked on MacBook Air M5, BAAI/bge-small-en-v1.5 (384-dim), 10-query median, warm index.
Omna inherits Polars' Arrow columnar memory. The Rust similarity kernel operates on the same memory — no copy into NumPy, no copy into a C buffer.
FAQ
Does Omna send my data to the cloud?
No. Embedding, search, filter, PII detection, and masking all run locally. The only method that makes a network call is ask(), which sends schema metadata and sample rows to Claude via the Anthropic API — and only when you explicitly call it.
Do I need a GPU?
No. FastEmbed uses ONNX and runs on CPU. On Apple Silicon, it uses CoreML automatically. Embedding 50,000 documents takes ~45 minutes on a MacBook Air M5 — a one-time cost. After that, search() and filter() run in milliseconds from the saved index.
Why not FAISS / ChromaDB / Pinecone?
Those are vector databases. Omna is a Polars plugin. If your data already lives in a DataFrame, Omna adds semantic search with zero infrastructure — no separate process, no index server, no network hop. It's the difference between df.omna.search(...) and spinning up a separate service just to query your own data.
What PII types does Omna detect?
PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, US_PASSPORT, IP_ADDRESS, IBAN_CODE, URL, DATE_TIME, LOCATION, and more. Detection uses Microsoft Presidio + spaCy NER, running fully local.
Which Polars versions are supported?
Omna is tested on Polars 0.20+. It installs as a namespace plugin: after import omna, every method is available as df.omna.* with no further imports.
The embed step took 45 minutes. Do I have to redo it every time?
No. embed() saves the index to .omna/{column}.parquet. Every subsequent search() or filter() call loads it in ~300ms. You only re-run embed() if your data changes.
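Whether Omna itself notices that a column has changed is not stated here, so one cautious option is to track it yourself. The sketch below fingerprints a column's contents so a stored digest can be compared before reusing an index; `column_fingerprint` and `needs_reembed` are hypothetical helpers, not part of Omna.

```python
import hashlib


def column_fingerprint(values):
    """Hash the column contents in order; any edit, insert, or reorder changes the digest."""
    digest = hashlib.sha256()
    for v in values:
        data = b"\x00" if v is None else b"\x01" + str(v).encode("utf-8")
        # Length-prefix each value so ["ab", ""] and ["a", "b"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def needs_reembed(values, stored_fingerprint):
    """True if the column no longer matches the fingerprint saved at embed time."""
    return column_fingerprint(values) != stored_fingerprint


docs = ["claim denied", "invoice issued"]
saved = column_fingerprint(docs)  # store this next to the index after embedding

unchanged = needs_reembed(docs, saved)
after_append = needs_reembed(docs + ["new document"], saved)
```

Hashing 50,000 short documents takes a fraction of a second, so the check is cheap compared with re-embedding.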
Roadmap
# Coming in v0.2
matched = transactions.omna.join(regulatory_categories, on="description")
# Match rows between two DataFrames by meaning, not exact key.
Star the repo to follow progress.
License
| Layer | License |
|---|---|
| Python package (omna/) | MIT |
| Rust engine (src/) | Proprietary; ships as a compiled binary in the pip wheel |