Semantic search, PII masking, and schema understanding for Polars DataFrames

Omna

Semantic search, PII masking, and schema understanding, directly on your Polars DataFrames. No vector database. No API key for anything except ask(). Your data stays on your machine unless you call ask().


The problem

# Finding every insurance claim denial — painful
import polars as pl

keywords = ["claim denied", "coverage rejected", "policy voided"]  # ...and many more
pattern = "(?i)" + "|".join(keywords)  # case-insensitive alternation of every keyword
results = df.filter(pl.col("text").str.contains(pattern))
# Still misses: "insurer refused to honour the policy"
# Still misses: "claim outcome: not payable"
# Still misses: medical claim rejections using clinical terminology
# ...50+ lines per task. Grows with every edge case. Still wrong.
# With Omna
results = df.omna.search("insurance claim denied", on="text", k=5)
# Returns the 5 closest matches by meaning, including docs that never say "denied" literally.
# 9ms. 50,000 documents. Zero cloud.

filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# Every semantically matching document above the threshold.
# No keyword lists. No guesswork. Pure meaning.

answer = results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
#    dates of birth, health plan numbers, and claimant identifiers."
# Instant. One line.

# Auditing for PII before the data ships — painful
import re

for col in df.columns:
    for i, val in enumerate(df[col].to_list()):
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', str(val)):   # SSNs only
            print(f"row {i}, {col}: {str(val)[:60]}")
# Catches one pattern. Misses emails, phones, names, IBANs.
# No confidence score. No audit trail. No redaction.
# With Omna
df.omna.pii_report()   # audit — find every leak, every column
df.omna.mask_pii()     # redact — one line, full audit log
# Names, SSNs, emails, phone numbers — all gone. Local. No cloud.

Demo

The Sword — semantic search, filter, and ask across 50,000 documents:

Omna Sword Demo

The Shield — PII audit and redaction in one line:

Omna Shield Demo

Dataset: Gretel PII Benchmark (Gretel was acquired by NVIDIA), 50,000 synthetic documents built to test data privacy tools.


Install

pip install omna
python -m spacy download en_core_web_lg   # one-time, for PII detection

Requires Python 3.10+. No API key needed for search, filter, embed, pii_report, mask_pii, or understand_df. Only ask() requires ANTHROPIC_API_KEY.


Quick start

import polars as pl
import omna

df = pl.read_csv("documents.csv")

# 1 — explore the schema
omna.understand_df(df)

# 2 — audit for PII before anything touches the data
df.omna.pii_report()

# 3 — redact
clean = df.omna.mask_pii()

# 4 — build a search index once
clean.omna.embed("text")

# 5 — search by meaning
results = clean.omna.search("insurance claim denied", on="text", k=5)

# 6 — filter everything above a threshold
flagged = clean.omna.filter("insurance claim denied", on="text", threshold=0.73)

# 7 — ask a question in plain English
results.omna.ask("What personal data do these documents expose?")

What Omna does

 Method                                  What it does
 omna.understand_df(df)                  Schema inference: labels, null rates, samples. No LLM.
 df.omna.embed(column)                   Vectorize a text column once; reuse across sessions
 df.omna.search(query, on, k)            Top-k results by semantic meaning
 df.omna.filter(query, on, threshold)    Every row above a similarity threshold
 df.omna.pii_report()                    Audit every string column for PII
 df.omna.mask_pii()                      Redact PII, auto-save audit log
 df.omna.ask(question)                   Natural language queries over your DataFrame

API reference

omna.understand_df(df) — explore before you do anything

No LLM. No API call. Analyzes column names, dtypes, null rates, and sample values.

omna.understand_df(df)
 column                dtype    null_pct   label     sample
 uid                   String     0.0%     category  24bb757...
 domain                String     0.0%     category  insurance, healthcare...
 document_type         String     0.0%     category  Invoice, ClaimForm...
 document_description  String     0.0%     text      An insurance claim...
 text                  String     0.0%     text      **Claim ID: 285-14...

Labels: email phone name id date text numeric boolean category unknown
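
To make the idea of LLM-free schema inference concrete, here is a minimal sketch of how a label could be guessed from sample values alone. The rules, the `guess_label` and `null_pct` helper names, and the 50-character cutoff are assumptions made for illustration; they are not Omna's actual heuristics, and only a few of the labels listed above are covered.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def null_pct(values):
    """Percentage of None entries in a column."""
    return 100.0 * sum(v is None for v in values) / len(values) if values else 0.0


def guess_label(values):
    """Guess a coarse label for a column from its non-null sample values."""
    sample = [v for v in values if v is not None]
    if not sample:
        return "unknown"
    # bool is a subclass of int, so test for booleans before numerics.
    if all(isinstance(v, bool) for v in sample):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in sample):
        return "numeric"
    if all(isinstance(v, str) for v in sample):
        if all(EMAIL_RE.match(v) for v in sample):
            return "email"
        avg_len = sum(len(v) for v in sample) / len(sample)
        # Hypothetical cutoff: long strings look like free text, short ones like categories.
        return "text" if avg_len > 50 else "category"
    return "unknown"


print(guess_label(["Invoice", "ClaimForm"]), null_pct(["Invoice", None]))
```

A real implementation would also look at column names and dtypes, as the description above says.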

df.omna.embed(column) — vectorize once, search forever

Converts text to 384-dimensional vectors using FastEmbed (local ONNX, no API key). Saves to .omna/{column}.parquet. Run once — search() and filter() load it automatically on every subsequent call.

df.omna.embed("text")
# → .omna/text.parquet

Model: BAAI/bge-small-en-v1.5 (~130 MB, downloaded once). Embed is a one-time cost.

 Hardware              50k rows
 MacBook Air M5        ~45 min
 MacBook Pro M4 Max    ~15 min
 AWS GPU instance      ~2 min

df.omna.search(query, on, k) — semantic search

Requires df.omna.embed("column") first.

results = df.omna.search("insurance claim denied", on="text", k=5)
 uid            document_type         domain      text                               _score
 67fccc1e207…   ClaimSummary          insurance   **Claim ID: 285-14-1755, Policy…   0.762
 b8ae088cd21…   ClaimSummary          insurance   **Claim Summary**…                 0.749
 de5bba0a2cc…   Insurance Claim Form  healthcare  **Insurance Claim Form**…          0.748
 ebccdde3b42…   Insurance Claim       healthcare  Insurance Claim for MED74974358…   0.747
 aebb0eb55fb…   ClaimForm             healthcare  **Claim Form** - Patient ID…       0.747

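_score is cosine similarity: higher means closer in meaning. Cosine similarity can range from -1 to 1 in general, though the scores shown here fall between 0 and 1. None of these documents contain the phrase "insurance claim denied"; Omna finds them by meaning.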
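
For readers who want to see what a cosine-similarity ranking amounts to, the following pure-Python sketch scores a query vector against a handful of toy vectors and keeps the top k. It is an illustration under simplifying assumptions (3-dimensional vectors, plain lists, hypothetical `cosine` and `top_k` helpers), not Omna's Rust kernel.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return (row_index, score) pairs for the k highest-scoring rows."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; the real ones have 384 dimensions.
docs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.1, 0.0]
hits = top_k(query, docs, k=2)
print(hits)  # rows 0 and 1 rank highest
```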

df.omna.filter(query, on, threshold) — semantic filter

Requires df.omna.embed("column") first.

filtered = df.omna.filter("insurance claim denied", on="text", threshold=0.73)
# → N documents matched — all semantically related to claim denials

Returns every row above the threshold. Default: 0.3. Raise for precision, lower for recall.

Use search() for the top k. Use filter() for everything above a threshold.
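
The precision/recall trade-off of the threshold can be illustrated with a few lines of plain Python. The first five scores below are taken from the search() output shown earlier; the last two are made up for the example, and treating "above" as a strict comparison is an assumption, as is the `filter_by_threshold` helper itself.

```python
def filter_by_threshold(scores, threshold):
    """Keep the indices of rows whose score is strictly above the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Five scores from the search() example, plus two invented low scores.
scores = [0.762, 0.749, 0.748, 0.747, 0.747, 0.41, 0.12]

for threshold in (0.3, 0.73, 0.75):
    kept = filter_by_threshold(scores, threshold)
    print(f"threshold={threshold}: {len(kept)} rows kept")
```

A low threshold keeps loosely related rows (recall); a high one keeps only the closest matches (precision).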

df.omna.pii_report() — audit before you redact
df.omna.pii_report()
 column    detected types                                    hit rate   flagged
 entities  CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER   85.4%    ✓ YES
 text      CREDIT_CARD, EMAIL_ADDRESS, PERSON, PHONE_NUMBER   78.1%    ✓ YES

Scans every string column. Returns hit rates, PII types, and confidence scores. Nothing is modified.
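
As a rough illustration of what a per-column hit rate could mean, the sketch below counts the share of rows in which at least one detector fires. Both the definition of "hit rate" and the two regex detectors are assumptions for the example; Omna's report is built on Presidio and spaCy, not on this code.

```python
import re

# Two illustrative detectors; real PII detection covers far more cases.
DETECTORS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def column_report(values):
    """Return (sorted detected types, hit rate in percent) for one string column."""
    found_types = set()
    rows_with_hit = 0
    for value in values:
        hits = {name for name, rx in DETECTORS.items() if rx.search(value)}
        if hits:
            rows_with_hit += 1
            found_types |= hits
    rate = 100.0 * rows_with_hit / len(values) if values else 0.0
    return sorted(found_types), rate


column = ["SSN 123-45-6789", "mail a@b.io", "nothing", "clean"]
types, rate = column_report(column)
print(types, rate)
```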

df.omna.mask_pii() — redact in one line
clean = df.omna.mask_pii()
# → <REDACTED> replaces every detected entity
# → audit log saved to .omna/pii_audit.parquet automatically

# Fast mode — regex only, ~10x faster, catches email/phone/SSN/URL
clean = df.omna.mask_pii(fast=True)

Detects: PERSON EMAIL_ADDRESS PHONE_NUMBER CREDIT_CARD US_SSN US_PASSPORT IP_ADDRESS IBAN_CODE URL and more.
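
To show what regex-only redaction with an audit trail looks like in principle, here is a small standalone sketch. The patterns, the `mask_text` helper, and the audit record layout are hypothetical and deliberately simple; they are not the patterns used by mask_pii(fast=True).

```python
import re

# Illustrative patterns only; production detectors need to be far more thorough.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "URL": re.compile(r"https?://\S+"),
}


def mask_text(text, audit, row):
    """Replace every match with <REDACTED> and record which entity type was found."""
    for entity, pattern in PATTERNS.items():
        def _redact(match, entity=entity):
            # Log the entity type and row, never the raw value.
            audit.append({"row": row, "entity": entity})
            return "<REDACTED>"
        text = pattern.sub(_redact, text)
    return text


audit_log = []
rows = ["Contact jane@example.com, SSN 123-45-6789.", "No PII here."]
clean = [mask_text(t, audit_log, i) for i, t in enumerate(rows)]
print(clean)
print(audit_log)
```

Regexes like these cannot catch names or free-form addresses, which is why the default mode relies on NER as well.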

df.omna.ask(question) — natural language queries

Sends schema + up to 20 sample rows to Claude. Requires ANTHROPIC_API_KEY.

# In your shell:
export ANTHROPIC_API_KEY=sk-ant-...

# Then in Python:
results.omna.ask("What personal data do these documents expose?")
# → "These insurance documents expose SSNs, medical record numbers,
#    dates of birth, health plan numbers, and claimant identifiers."

# Override model
results.omna.ask("Summarise the key themes", model="claude-sonnet-4-6")

Default model: claude-haiku-4-5-20251001.
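
Because ask() is the one call that sends data off the machine, it helps to picture what "schema + up to 20 sample rows" could look like as a prompt. The sketch below only builds a string and makes no network call; the prompt layout and the `build_prompt` helper are assumptions for illustration, not Omna's actual request format.

```python
MAX_SAMPLE_ROWS = 20  # cap stated in the description above


def build_prompt(schema, rows, question):
    """Assemble a text prompt from column types, a capped row sample, and the question."""
    schema_lines = [f"- {name}: {dtype}" for name, dtype in schema.items()]
    sample = rows[:MAX_SAMPLE_ROWS]
    row_lines = [", ".join(f"{k}={v!r}" for k, v in row.items()) for row in sample]
    return "\n".join(
        [
            "Schema:",
            *schema_lines,
            f"Sample rows ({len(sample)}):",
            *row_lines,
            "Question: " + question,
        ]
    )


schema = {"uid": "String", "text": "String"}
rows = [{"uid": str(i), "text": f"doc {i}"} for i in range(50)]
prompt = build_prompt(schema, rows, "What personal data do these documents expose?")
print(prompt.splitlines()[3])  # header line of the sample section
```

Since sample rows are part of the request, running mask_pii() before ask(), as the quick start does, keeps raw PII out of it.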


How it works

df.omna.search("insurance claim denied", on="text", k=5)
         │
         ▼
   embedder.py       FastEmbed — BAAI/bge-small-en-v1.5, local ONNX
                     query → [0.12, -0.34, 0.87, ...]  384-dim vector
         │
         ▼
   index.py          loads .omna/text.parquet → Arrow memory, zero-copy
                     50,000 stored vectors in Polars' own allocation
         │
         ▼
   similarity.rs     Rust kernel — cosine similarity over all vectors
                     returns top-k sorted descending, no Python loop
         │
         ▼
   frame.py          slices result rows, attaches _score → pl.DataFrame

The Rust kernel is 23 lines. Dot products and norms in machine code, no intermediate allocations. 50,000 × 384-dim in under 10ms on a single core.
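
One common way to keep such a kernel small is to normalize vectors up front, so that cosine similarity reduces to a dot product, and to select the top k with a heap instead of sorting every score. The Python sketch below shows that general technique; whether Omna's Rust code pre-normalizes or uses a heap is not stated above, so treat both as assumptions.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, unit_docs, k):
    """Top-k (score, row_index) pairs; docs must already be unit length."""
    q = normalize(query)
    scores = ((dot(q, d), i) for i, d in enumerate(unit_docs))
    return heapq.nlargest(k, scores)  # avoids sorting the full score list


unit_docs = [normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [0.0, 2.0])]
best = search([1.0, 1.0], unit_docs, k=1)
print(best)
```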


Performance

                        50k rows                  500k rows
 Omna search            9ms                       27ms
 Omna filter            9ms                       27ms
 Pandas + FAISS         ~25ms + index build       ~25ms + index build
 Polars keyword regex   1ms (exact match only)    1ms (exact match only)

Benchmarked on MacBook Air M5, BAAI/bge-small-en-v1.5 (384-dim), 10-query median, warm index.
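
A "10-query median, warm index" measurement can be reproduced with a small timing harness like the one below. The `median_latency_ms` helper and the `dummy_search` stand-in are hypothetical; to benchmark Omna itself you would pass something like `lambda q: df.omna.search(q, on="text", k=5)` instead.

```python
import statistics
import time


def median_latency_ms(fn, queries, warmup=1):
    """Median wall-clock latency of fn(query) in milliseconds."""
    for q in queries[:warmup]:
        fn(q)  # warm-up call so one-time loading cost is not timed
    timings = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_search(query):
    """Stand-in workload; replace with the real search call."""
    return sorted(range(10_000), key=lambda i: (i * 31 + len(query)) % 97)[:5]


queries = [f"query {i}" for i in range(10)]
ms = median_latency_ms(dummy_search, queries)
print(f"median over {len(queries)} queries: {ms:.2f} ms")
```

The median is less sensitive than the mean to a single slow call, which is why it is a common choice for latency figures.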

Omna inherits Polars' Arrow columnar memory. The Rust similarity kernel operates on the same memory — no copy into NumPy, no copy into a C buffer.


FAQ

Does Omna send my data to the cloud?

No. Embedding, search, filter, PII detection, and masking all run locally. The only method that makes a network call is ask(), which sends schema metadata and sample rows to Claude via the Anthropic API — and only when you explicitly call it.

Do I need a GPU?

No. FastEmbed uses ONNX and runs on CPU. On Apple Silicon, it uses CoreML automatically. Embedding 50,000 documents takes ~45 minutes on a MacBook Air M5 — a one-time cost. After that, search() and filter() run in milliseconds from the saved index.

Why not FAISS / ChromaDB / Pinecone?

FAISS is a vector index library, and ChromaDB and Pinecone are vector databases. Omna is a Polars plugin. If your data already lives in a DataFrame, Omna adds semantic search with zero infrastructure: no separate process, no index server, no network hop. It's the difference between df.omna.search(...) and spinning up a separate service just to query your own data.

What PII types does Omna detect?

PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, US_PASSPORT, IP_ADDRESS, IBAN_CODE, URL, DATE_TIME, LOCATION, and more. Detection uses Microsoft Presidio + spaCy NER, running fully local.

Which Polars versions are supported?

Omna is tested on Polars 0.20+. It registers a DataFrame namespace, so df.omna.* is available as soon as you run import omna; no further imports are needed.

The embed step took 45 minutes. Do I have to redo it every time?

No. embed() saves the index to .omna/{column}.parquet. Every subsequent search() or filter() call loads it in ~300ms. You only re-run embed() if your data changes.
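
The "build once, reload afterwards" pattern can be sketched in a few lines. The example below is a generic cache with a content fingerprint to detect changed data; it uses JSON instead of Parquet to stay within the standard library, and the `fingerprint` and `load_or_build` helpers, the file layout, and the automatic change detection are all assumptions rather than a description of how embed() works.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(texts):
    """Stable hash of the column contents, used to detect changed data."""
    h = hashlib.sha256()
    for t in texts:
        h.update(t.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return h.hexdigest()


def load_or_build(texts, cache_path, build):
    """Return (vectors, was_rebuilt); `build` is the expensive embedding step."""
    key = fingerprint(texts)
    if cache_path.exists():
        cached = json.loads(cache_path.read_text())
        if cached["fingerprint"] == key:
            return cached["vectors"], False
    vectors = build(texts)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps({"fingerprint": key, "vectors": vectors}))
    return vectors, True


def fake_embed(texts):
    """Stand-in for a real embedding model."""
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".omna" / "text.json"
    _, first = load_or_build(["a", "bb"], path, fake_embed)
    _, second = load_or_build(["a", "bb"], path, fake_embed)
    _, third = load_or_build(["a", "bb", "ccc"], path, fake_embed)

print(first, second, third)  # rebuilt, reused, rebuilt after the data changed
```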


Roadmap

# Coming in v0.2
matched = transactions.omna.join(regulatory_categories, on="description")
# Match rows between two DataFrames by meaning, not exact key.

Star the repo to follow progress.


License

 Layer                     License
 Python package (omna/)    MIT
 Rust engine (src/)        Proprietary; ships as a compiled binary in the pip wheel

omna.dev · PyPI · GitHub
