One-command orchestration for multimodal semantic search in BigQuery

grepctl - Semantic Search For Your Data Lake

grepctl is a utility that converts unstructured content in your data lake into a semantically searchable index with a single command:

```
grepctl ingest -b <bucket>
```

Data Modalities & Processing

grepctl processes 9 different data types automatically:

| Modality | Processing Method |
| --- | --- |
| Text/Markdown | Direct content extraction, preserving structure |
| PDF | OCR via Google Document AI for text extraction |
| Office Documents | Document AI extracts content from .docx, .xlsx, .pptx |
| Images | Vision API extracts labels, text, objects, and faces |
| Audio | Speech-to-Text API transcribes to searchable text |
| Video | Video Intelligence API analyzes frames and transcribes speech |
| JSON/CSV | Structured data parsing with field preservation |

Each modality is converted to text, chunked intelligently, and embedded using Vertex AI's text-embedding-004 model (768 dimensions) for semantic understanding.

Four Search Interfaces

Access your indexed data through multiple interfaces:

CLI - Command-line search:
```
grepctl search "your query"
```
Web Interface - Interactive UI:
```
grepctl serve
```

Python Interface - Programmatic access:

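```
from grepctl.search.vector_search import SemanticSearch

# `searcher` must be a SemanticSearch instance. Its constructor arguments
# are not shown here, so create it as the package documentation describes.
results = searcher.search("query", top_k=10)
```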

SQL Interface - Direct BigQuery queries:

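```
-- Embed the query string, then find the 10 nearest rows in the corpus.
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.grepmm.text_embedding_model`,
    (SELECT 'your search string' AS content)
  )
)
-- VECTOR_SEARCH returns each matched row under `base`;
-- a smaller distance means a closer match.
SELECT base.doc_id, base.text_content, distance AS score
FROM VECTOR_SEARCH(
  TABLE `project.grepmm.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 10
)
```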

All interfaces leverage BigQuery's VECTOR_SEARCH for sub-second semantic search across your entire data lake.
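
To make the nearest-neighbor step concrete, here is a minimal brute-force sketch in plain Python of what a vector search computes: score every stored embedding against the query and keep the `k` closest. The function names and the tiny 2-dimensional vectors are illustrative assumptions, not grepctl or BigQuery code, and BigQuery may use an index and approximate search rather than a full scan.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query, corpus, k=10, distance=euclidean):
    """Return the k (doc_id, distance) pairs closest to the query.

    corpus is a list of (doc_id, embedding) pairs (hypothetical layout).
    """
    scored = sorted((distance(query, emb), doc_id) for doc_id, emb in corpus)
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
hits = top_k([1.0, 0.0], corpus, k=2)
```

Because results are ordered by distance, the `score` column in the SQL example is "lower is better".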

Detailed Processing Pipeline

1. Text Files (.txt, .log, .md)

- Direct extraction from Google Cloud Storage via BigQuery's EXTERNAL_QUERY function
- No intermediate processing needed - text content read directly into BigQuery tables
- Content is chunked into 1000-character segments with 100-character overlap
- Markdown structure preserved with heading hierarchy
- Each chunk maintains context through overlapping windows
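
The chunking rule above (1000-character segments, 100-character overlap) can be sketched as a sliding window. This is a minimal illustration under those two stated numbers; the function name `chunk_text` is hypothetical and grepctl's actual chunker may differ, for example by respecting word or heading boundaries.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = chunk_text(sample)
```

With a 2500-character input the windows start at offsets 0, 900, and 1800, so the last 100 characters of one chunk are repeated at the start of the next.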

2. PDF Documents (.pdf)

- Google Document AI performs OCR on all pages
- Handles both text-based and scanned PDFs
- Extracted text is chunked semantically by paragraphs
- Page numbers and document structure preserved in metadata
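
As a rough sketch of paragraph-based chunking with page metadata, the function below packs whole paragraphs into chunks and records the page each chunk came from. It assumes the OCR output is already available as one string per page with blank lines between paragraphs; the function name, the `max_chars` limit, and the dictionary layout are hypothetical, not grepctl's schema.

```python
def chunk_pages(pages: list[str], max_chars: int = 1000) -> list[dict]:
    """Group whole paragraphs into chunks, tagging each with its page number."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would exceed the limit.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"page": page_number, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"page": page_number, "text": current})
    return chunks


pages = ["Intro paragraph.\n\nSecond paragraph.", "Page two text."]
small = chunk_pages(pages, max_chars=20)
large = chunk_pages(pages)
```

Paragraphs are never split here, so a single paragraph longer than `max_chars` becomes one oversized chunk; a real pipeline would need a fallback for that case.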

3. Office Documents (.docx, .xlsx, .pptx)

- Document AI extracts text content
- Preserves document structure (headings, tables, slides)
- Excel sheets converted to structured text representation
- PowerPoint slides maintain slide order and notes
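
One plausible reading of "structured text representation" for spreadsheets is to render each row as labeled `column: value` pairs so that the header names stay searchable. The sketch below illustrates that idea only; `sheet_to_text` and its output format are assumptions, not the format grepctl emits.

```python
def sheet_to_text(sheet_name: str, rows: list[list]) -> list[str]:
    """Render each data row as 'column: value' pairs; the first row is the header."""
    if not rows:
        return []
    header, *body = rows
    lines = []
    # Spreadsheet rows are 1-based and the header occupies row 1.
    for index, row in enumerate(body, start=2):
        cells = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        lines.append(f"{sheet_name} row {index}: {cells}")
    return lines


lines = sheet_to_text("Q1", [["region", "revenue"], ["EMEA", 1200], ["APAC", 950]])
```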

4. Audio Files (.mp3, .wav, .m4a, .flac)

- Speech-to-Text API v2 provides accurate transcription
- Automatic punctuation and speaker diarization
- Supports long-form audio (up to 480 minutes)
- Transcripts chunked by natural speech boundaries
- Timestamps preserved for temporal search
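
Chunking by speech boundaries while keeping timestamps could look like the sketch below, which merges consecutive transcript segments until a pause longer than `max_gap` seconds appears. The segment tuple layout, the 1.5-second threshold, and the function name are all hypothetical choices for illustration.

```python
def chunk_segments(segments, max_gap: float = 1.5) -> list[dict]:
    """Merge (start_sec, end_sec, text) segments separated by short pauses.

    Segments are assumed to be sorted by start time.
    """
    chunks = []
    for start, end, text in segments:
        if chunks and start - chunks[-1]["end"] <= max_gap:
            # Short pause: extend the current chunk and its end timestamp.
            chunks[-1]["end"] = end
            chunks[-1]["text"] += " " + text
        else:
            # Long pause (or first segment): start a new chunk.
            chunks.append({"start": start, "end": end, "text": text})
    return chunks


segments = [
    (0.0, 2.0, "Hello everyone."),
    (2.3, 5.0, "Welcome back."),
    (9.0, 11.0, "Next topic."),
]
audio_chunks = chunk_segments(segments)
```

Each chunk keeps its own `start` and `end`, which is what makes a search hit traceable back to a position in the recording.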

5. Video Files (.mp4, .avi, .mov, .mkv)

- Video Intelligence API analyzes visual content:
  - Shot detection and scene changes
  - Object tracking and label detection
  - OCR on text appearing in frames
  - Face and logo detection
- Speech-to-Text transcribes audio track separately
- Frame descriptions combined with transcripts
- Temporal alignment between visual and audio elements
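
Temporal alignment can be pictured as attaching each visual annotation to the transcript segment whose time span contains it. The following is a simplified sketch under that assumption; the input layouts and the `[visual: ...]` text format are invented for illustration and do not reflect the Video Intelligence API response shape.

```python
def align(transcript, frame_labels) -> list[dict]:
    """Attach visual labels to the transcript segment they occur in.

    transcript: list of (start_sec, end_sec, text)
    frame_labels: list of (time_sec, label)
    """
    rows = []
    for start, end, text in transcript:
        # Collect distinct labels whose timestamp falls inside this segment.
        labels = sorted({label for t, label in frame_labels if start <= t < end})
        combined = f"{text} [visual: {', '.join(labels)}]" if labels else text
        rows.append({"start": start, "end": end, "text": combined})
    return rows


transcript = [(0.0, 5.0, "Welcome to the demo."), (5.0, 10.0, "Here is the dashboard.")]
frame_labels = [(1.0, "person"), (6.0, "screen"), (7.5, "chart"), (12.0, "logo")]
aligned = align(transcript, frame_labels)
```

Labels that fall outside every transcript segment (the `logo` at 12 seconds here) are dropped in this sketch; a fuller version might emit them as visual-only rows.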

Each processing pipeline outputs structured text that is then embedded using Vertex AI's text-embedding-004 model, creating 768-dimensional vectors optimized for semantic similarity search.

SQL Interface Functions

Setup

The search functions are automatically created when you run:

```
grepctl setup
```

This creates the following functions in your BigQuery dataset:

- search(query) - Simple search with defaults
- semantic_search(query, top_k, min_relevance) - Full control search
- search_by_source(query, sources, top_k) - Filter by file types
- search_by_date(query, start_date, end_date, top_k) - Date range search
- search_content(query, limit) - Just return content strings

Function Reference

Core Search Functions

```
-- Simple search (defaults: top_k=10, min_relevance=0.0)
CALL `your-project.grepmm.search`("your query");

-- Full semantic search
CALL `your-project.grepmm.semantic_search`(
    "query text",           -- Search query
    20,                     -- Number of results
    0.7                     -- Minimum relevance (0-1)
);

-- Returns:
-- doc_id, uri, source, modality, text_content,
-- relevance_score, created_at, metadata
```
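
The interaction between `top_k` and `min_relevance` is not spelled out above. One reasonable interpretation, sketched here in Python, is to drop rows below the relevance threshold and then keep the highest-scoring `top_k`. The helper name and the order of the two steps are assumptions; the stored routine may apply them differently.

```python
def apply_relevance(results, top_k: int = 10, min_relevance: float = 0.0) -> list[dict]:
    """Keep rows scoring at least min_relevance, best first, capped at top_k."""
    kept = [row for row in results if row["relevance_score"] >= min_relevance]
    kept.sort(key=lambda row: row["relevance_score"], reverse=True)
    return kept[:top_k]


rows = [
    {"doc_id": "a", "relevance_score": 0.91},
    {"doc_id": "b", "relevance_score": 0.42},
    {"doc_id": "c", "relevance_score": 0.78},
]
strict = apply_relevance(rows, top_k=20, min_relevance=0.7)
```

Under this reading, asking for 20 results with a 0.7 threshold can return fewer than 20 rows.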

Filtered Search Functions

```
-- Search by source types
CALL `your-project.grepmm.search_by_source`(
    "query",
    ["pdf", "markdown"],    -- Array of sources
    10                      -- Top K results
);

-- Search by date range
CALL `your-project.grepmm.search_by_date`(
    "query",
    DATE('2024-01-01'),     -- Start date
    CURRENT_DATE(),         -- End date
    15                      -- Top K results
);

-- Get just content
CALL `your-project.grepmm.search_content`(
    "query",
    5                       -- Limit
);
```

The functions handle all the complexity of embeddings and vector search - you just write simple SQL queries!

Python API Functions

Installation

```
# Using uv (recommended)
uv add grepctl

# Using pip
pip install grepctl

# For development
git clone <repo>
cd bq-semgrep
uv sync
```

Configuration

The SearchClient will automatically use your existing grepctl configuration from ~/.grepctl/config.yaml:

```
project_id: your-project
dataset_name: grepmm
location: us-central1
```

Or you can specify a custom config path:

```
client = SearchClient(config_path="/path/to/config.yaml")
```

Or override the project ID:

```
client = SearchClient(project_id="my-project-id")
```
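
The precedence implied above (values from the config file, replaced by anything passed explicitly) can be sketched as follows. This is not SearchClient's implementation: `parse_flat_yaml` is a deliberately tiny stand-in that only understands flat `key: value` lines like the example file, and a real loader would use a proper YAML parser.

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse flat 'key: value' lines; nested YAML is not supported."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def resolve_config(file_text: str, **overrides) -> dict:
    """File values first, then explicit keyword overrides that are not None."""
    config = parse_flat_yaml(file_text)
    config.update({k: v for k, v in overrides.items() if v is not None})
    return config


file_text = "project_id: your-project\ndataset_name: grepmm\nlocation: us-central1\n"
resolved = resolve_config(file_text, project_id="my-project-id")
```

Here only `project_id` changes; `dataset_name` and `location` still come from the file.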

API Reference

SearchClient Methods

```
# Full search with all options
results = client.search(
    query="search text",           # Search query
    top_k=10,                      # Number of results
    sources=['pdf', 'text'],       # Filter by source types
    rerank=False,                  # Use LLM reranking
    regex_filter=r"pattern",       # Regex filter
    start_date="2023-01-01",       # Date range start
    end_date="2024-12-31"          # Date range end
)

# Simple search - just returns content strings
contents = client.search_simple("query", limit=5)

# Get system statistics
stats = client.get_stats()
```
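
To illustrate what the `sources`, `regex_filter`, `start_date`, and `end_date` options narrow down, here is a local sketch that applies the same kinds of filters to a list of result rows. It assumes rows are dictionaries with `source`, `text_content`, and an ISO-formatted `created_at` string; both that row shape and the `post_filter` helper are hypothetical, and the real client may push these filters into the BigQuery query instead.

```python
import re


def post_filter(results, sources=None, regex_filter=None,
                start_date=None, end_date=None) -> list[dict]:
    """Keep rows matching every filter that was supplied."""
    pattern = re.compile(regex_filter) if regex_filter else None
    kept = []
    for row in results:
        if sources and row["source"] not in sources:
            continue
        if pattern and not pattern.search(row["text_content"]):
            continue
        day = row["created_at"][:10]  # ISO dates compare correctly as strings
        if start_date and day < start_date:
            continue
        if end_date and day > end_date:
            continue
        kept.append(row)
    return kept


rows = [
    {"source": "pdf", "text_content": "Invoice INV-1042 overdue", "created_at": "2023-06-01"},
    {"source": "text", "text_content": "Meeting notes", "created_at": "2024-02-10"},
    {"source": "audio", "text_content": "Invoice INV-7 call", "created_at": "2022-12-31"},
]
filtered = post_filter(rows, sources=["pdf", "text"], regex_filter=r"INV-\d+",
                       start_date="2023-01-01", end_date="2024-12-31")
```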

Convenience Function

```
from grepctl import search

# Quick search without client
results = search("query", top_k=10, rerank=True)
```

You now have a powerful, simple Python API for semantic search across all your data. The SearchClient handles all the complexity of BigQuery connections, embedding models, and vector search - you just focus on building great applications!

Documentation

- Python Interface Guide - Complete examples and API reference for Python integration
- SQL Interface Guide - BigQuery SQL functions and advanced query examples

Release history

- 0.3.4 (Sep 22, 2025)
- 0.3.3 (Sep 22, 2025)
- 0.3.2 (Sep 22, 2025)
- 0.3.1 (Sep 22, 2025)
- 0.3.0 (Sep 22, 2025), this version
- 0.2.2 (Sep 22, 2025)
- 0.2.1 (Sep 21, 2025)
- 0.2.0 (Sep 21, 2025)
- 0.1.1 (Sep 14, 2025)
- 0.1.0 (Sep 14, 2025)
