One-command orchestration for multimodal semantic search in BigQuery
Project description
grepctl is a utility that converts unstructured content in your data lake into a semantically searchable index with a single command:
grepctl ingest -b <bucket>
Data Modalities & Processing
grepctl processes 9 different data types automatically:
| Modality | Processing Method |
|---|---|
| Text/Markdown | Direct content extraction, preserving structure |
| PDF | OCR via Google Document AI for text extraction |
| Office Documents | Document AI extracts content from .docx, .xlsx, .pptx |
| Images | Vision API extracts labels, text, objects, and faces |
| Audio | Speech-to-Text API transcribes to searchable text |
| Video | Video Intelligence API analyzes frames and transcribes speech |
| JSON/CSV | Structured data parsing with field preservation |
Each modality is converted to text, chunked (fixed-size overlapping windows for plain text, paragraph or speech boundaries for documents and transcripts), and embedded using Vertex AI's text-embedding-004 model (768 dimensions) for semantic understanding.
Four Search Interfaces
Access your indexed data through multiple interfaces:
- CLI - Command-line search:
  grepctl search "your query"
- Web Interface - Interactive UI:
  grepctl serve
- Python Interface - Programmatic access:
  from grepctl.search.vector_search import SemanticSearch
  # searcher is a SemanticSearch instance; its construction is not shown here
  results = searcher.search("query", top_k=10)
- SQL Interface - Direct BigQuery queries:
  WITH query_embedding AS (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `project.mmgrep.text_embedding_model`,
      (SELECT 'your search string' AS content)
    )
  )
  -- VECTOR_SEARCH returns each matched row inside the base struct
  SELECT base.doc_id, base.text_content, distance AS score
  FROM VECTOR_SEARCH(
    TABLE `project.mmgrep.search_corpus`,
    'embedding',
    (SELECT embedding FROM query_embedding),
    top_k => 10
  )
All interfaces leverage BigQuery's VECTOR_SEARCH for sub-second semantic search across your entire data lake.
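The SQL interface above can also be parameterized from Python. The following is a minimal sketch, not grepctl's own code: the helper name `build_vector_search_sql` is hypothetical, the table and model names simply mirror the SQL example, and `@query` is a BigQuery named query parameter that the caller must supply.

```python
import re

# Hypothetical helper (not part of grepctl): builds the vector-search SQL
# from the example above, with the search string left as the @query parameter.
_IDENT = re.compile(r"^[A-Za-z0-9_\-]+$")


def build_vector_search_sql(project: str, dataset: str, top_k: int = 10) -> str:
    """Return SQL that embeds @query and searches the corpus table."""
    for name in (project, dataset):
        # Identifiers are interpolated into the SQL, so reject anything unusual.
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{project}.{dataset}.text_embedding_model`,
    (SELECT @query AS content)
  )
)
SELECT base.doc_id, base.text_content, distance AS score
FROM VECTOR_SEARCH(
  TABLE `{project}.{dataset}.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => {top_k}
)
ORDER BY score
""".strip()


if __name__ == "__main__":
    print(build_vector_search_sql("project", "mmgrep", top_k=5))
```

The returned string could then be run with the google-cloud-bigquery client, passing the search text as a `ScalarQueryParameter` named `query`; keeping the user's text out of the SQL string avoids quoting problems.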
Detailed Processing Pipeline
1. Text Files (.txt, .log, .md)
- Direct extraction from Google Cloud Storage via BigQuery's EXTERNAL_QUERY function
- No intermediate processing needed; text content is read directly into BigQuery tables
- Content is chunked into 1000-character segments with 100-character overlap
- Markdown structure preserved with heading hierarchy
- Each chunk maintains context through overlapping windows
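The fixed-size chunking described above can be sketched as follows. This is an illustration under the stated sizes (1000 characters, 100 overlap), not grepctl's actual implementation; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2500))
    print([len(c) for c in chunk_text(sample)])
```

Because consecutive windows share their boundary region, a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.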
2. PDF Documents (.pdf)
- Google Document AI performs OCR on all pages
- Handles both text-based and scanned PDFs
- Extracted text is chunked semantically by paragraphs
- Page numbers and document structure preserved in metadata
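As a rough sketch of paragraph-based chunking with page metadata, assume the OCR step yields one text string per page. The function name and the record fields (`page`, `paragraph_index`, `text`) are hypothetical and chosen for illustration only; the source does not specify grepctl's metadata schema.

```python
import re


def chunk_pages_by_paragraph(pages: list[str]) -> list[dict]:
    """Split each page on blank lines and tag every paragraph with its page number."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        # A paragraph boundary is one or more blank lines.
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            chunks.append(
                {"page": page_number, "paragraph_index": index, "text": paragraph}
            )
    return chunks


if __name__ == "__main__":
    pages = ["Intro line.\n\nSecond paragraph.", "Only paragraph on page two.\n\n"]
    for chunk in chunk_pages_by_paragraph(pages):
        print(chunk)
```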
3. Office Documents (.docx, .xlsx, .pptx)
- Document AI extracts text content
- Preserves document structure (headings, tables, slides)
- Excel sheets converted to structured text representation
- PowerPoint slides maintain slide order and notes
4. Audio Files (.mp3, .wav, .m4a, .flac)
- Speech-to-Text API v2 provides accurate transcription
- Automatic punctuation and speaker diarization
- Supports long-form audio (up to 480 minutes)
- Transcripts chunked by natural speech boundaries
- Timestamps preserved for temporal search
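One way to chunk a transcript while keeping timestamps is to merge consecutive recognized segments until a long pause or a size limit is reached. The sketch below approximates "natural speech boundaries" by pauses, which is an assumption; the `Segment` type, `chunk_transcript`, and the thresholds are hypothetical and not taken from grepctl.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def _merge(segments: list[Segment]) -> Segment:
    """Combine consecutive segments into one chunk spanning their full time range."""
    return Segment(segments[0].start, segments[-1].end, " ".join(s.text for s in segments))


def chunk_transcript(
    segments: list[Segment], max_pause: float = 1.5, max_chars: int = 1000
) -> list[Segment]:
    """Start a new chunk after a long pause or when the chunk would grow too large."""
    chunks: list[Segment] = []
    current: list[Segment] = []
    for seg in segments:
        if current:
            pause = seg.start - current[-1].end
            length = sum(len(s.text) + 1 for s in current) + len(seg.text)
            if pause > max_pause or length > max_chars:
                chunks.append(_merge(current))
                current = []
        current.append(seg)
    if current:
        chunks.append(_merge(current))
    return chunks


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.0, "Hello everyone."),
        Segment(2.2, 4.0, "Welcome back."),
        Segment(7.0, 9.0, "Next topic."),
    ]
    for chunk in chunk_transcript(demo):
        print(chunk)
```

Each chunk keeps the start time of its first segment and the end time of its last, so a search hit can be mapped back to a position in the recording.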
5. Video Files (.mp4, .avi, .mov, .mkv)
- Video Intelligence API analyzes visual content:
- Shot detection and scene changes
- Object tracking and label detection
- OCR on text appearing in frames
- Face and logo detection
- Speech-to-Text transcribes audio track separately
- Frame descriptions combined with transcripts
- Temporal alignment between visual and audio elements
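The temporal alignment step can be illustrated with a simple interval-overlap join between detected shots and transcript segments. This is a sketch under assumed input shapes (tuples of start, end, and labels or text); the function name and the output text format are hypothetical.

```python
def align_shots_with_speech(
    shots: list[tuple[float, float, list[str]]],
    speech: list[tuple[float, float, str]],
) -> list[dict]:
    """Attach to each shot the speech segments whose time range overlaps it."""
    aligned = []
    for shot_start, shot_end, labels in shots:
        # Two intervals overlap when each one starts before the other ends.
        spoken = [
            text
            for start, end, text in speech
            if start < shot_end and end > shot_start
        ]
        aligned.append(
            {
                "start": shot_start,
                "end": shot_end,
                "text": f"Visual: {', '.join(labels)}. Speech: {' '.join(spoken)}",
            }
        )
    return aligned


if __name__ == "__main__":
    shots = [(0.0, 5.0, ["whiteboard", "person"]), (5.0, 10.0, ["laptop"])]
    speech = [
        (1.0, 4.0, "Let's start."),
        (4.5, 6.0, "Open the editor."),
        (8.0, 9.0, "Run it."),
    ]
    for row in align_shots_with_speech(shots, speech):
        print(row)
```

A speech segment that crosses a shot boundary is attached to both shots, which keeps the spoken context available on either side of a scene change.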
Each processing pipeline outputs structured text that is then embedded using Vertex AI's text-embedding-004 model, creating 768-dimensional vectors optimized for semantic similarity search.
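To make "semantic similarity" concrete, the sketch below computes cosine similarity between two embedding vectors in plain Python. Which distance metric grepctl configures for its vector search is not stated here, so treat cosine similarity as an illustrative assumption; the function name is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Two 768-dimensional vectors pointing in the same direction.
    print(cosine_similarity([1.0] * 768, [2.0] * 768))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.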
SQL Interface Functions
Setup
The search functions are automatically created when you run:
grepctl setup
This creates the following functions in your BigQuery dataset:
- search(query) - Simple search with defaults
- semantic_search(query, top_k, min_relevance) - Full control search
- search_by_source(query, sources, top_k) - Filter by file types
- search_by_date(query, start_date, end_date, top_k) - Date range search
- search_content(query, limit) - Just return content strings
Function Reference
Core Search Functions
-- Simple search (defaults: top_k=10, min_relevance=0.0)
CALL `your-project.grepmm.search`("your query");
-- Full semantic search
CALL `your-project.grepmm.semantic_search`(
"query text", -- Search query
20, -- Number of results
0.7 -- Minimum relevance (0-1)
);
-- Returns:
-- doc_id, uri, source, modality, text_content,
-- relevance_score, created_at, metadata
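The effect of the `top_k` and `min_relevance` arguments can be pictured with a small Python sketch over already-scored rows. This only illustrates the semantics implied by the parameter names; how the SQL functions compute `relevance_score` internally is not described here, and `apply_search_limits` is a hypothetical name.

```python
def apply_search_limits(
    rows: list[dict], top_k: int = 10, min_relevance: float = 0.0
) -> list[dict]:
    """Keep rows at or above min_relevance, best first, and return at most top_k."""
    kept = [row for row in rows if row["relevance_score"] >= min_relevance]
    kept.sort(key=lambda row: row["relevance_score"], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    rows = [
        {"doc_id": "a", "relevance_score": 0.65},
        {"doc_id": "b", "relevance_score": 0.91},
        {"doc_id": "c", "relevance_score": 0.78},
    ]
    print(apply_search_limits(rows, top_k=20, min_relevance=0.7))
```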
Filtered Search Functions
-- Search by source types
CALL `your-project.grepmm.search_by_source`(
"query",
["pdf", "markdown"], -- Array of sources
10 -- Top K results
);
-- Search by date range
CALL `your-project.grepmm.search_by_date`(
"query",
DATE('2024-01-01'), -- Start date
CURRENT_DATE(), -- End date
15 -- Top K results
);
-- Get just content
CALL `your-project.grepmm.search_content`(
"query",
5 -- Limit
);
The functions handle all the complexity of embeddings and vector search - you just write simple SQL queries!
Python API Functions
Installation
# Using uv (recommended)
uv add grepctl
# Using pip
pip install grepctl
# For development
git clone <repo>
cd bq-semgrep
uv sync
Configuration
The SearchClient will automatically use your existing grepctl configuration from ~/.grepctl/config.yaml:
project_id: your-project
dataset_name: grepmm
location: us-central1
Or you can specify a custom config path:
client = SearchClient(config_path="/path/to/config.yaml")
Or override the project ID:
client = SearchClient(project_id="my-project-id")
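The precedence described above (values from the config file, with an explicit `project_id` taking priority) can be sketched as follows. This is not SearchClient's code: `parse_flat_config` is a deliberately simplified stand-in for a real YAML parser that only handles flat `key: value` lines, and both function names are hypothetical.

```python
from typing import Optional


def parse_flat_config(text: str) -> dict:
    """Parse flat `key: value` lines; a simplified stand-in for a YAML parser."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding space
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def resolve_settings(config_text: str, project_id: Optional[str] = None) -> dict:
    """Start from the config file and let an explicit project_id override it."""
    settings = parse_flat_config(config_text)
    if project_id is not None:
        settings["project_id"] = project_id
    return settings


if __name__ == "__main__":
    config = "project_id: your-project\ndataset_name: grepmm\nlocation: us-central1\n"
    print(resolve_settings(config))
    print(resolve_settings(config, project_id="my-project-id"))
```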
API Reference
SearchClient Methods
# Full search with all options
results = client.search(
query="search text", # Search query
top_k=10, # Number of results
sources=['pdf', 'text'], # Filter by source types
rerank=False, # Use LLM reranking
regex_filter=r"pattern", # Regex filter
start_date="2023-01-01", # Date range start
end_date="2024-12-31" # Date range end
)
# Simple search - just returns content strings
contents = client.search_simple("query", limit=5)
# Get system statistics
stats = client.get_stats()
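# The sketch below illustrates what the sources, regex_filter, and date-range
# options mean by applying them to rows held in memory. It is an assumption-laden
# illustration: filter_results is a hypothetical name, the row fields mirror the
# columns listed for the SQL functions, and grepctl may apply these filters
# inside BigQuery instead.

```python
import re
from datetime import date
from typing import Optional


def filter_results(
    rows: list[dict],
    sources: Optional[list[str]] = None,
    regex_filter: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> list[dict]:
    """Keep rows matching every filter that was provided."""
    pattern = re.compile(regex_filter) if regex_filter else None
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None
    kept = []
    for row in rows:
        if sources and row["source"] not in sources:
            continue
        if pattern and not pattern.search(row["text_content"]):
            continue
        if start and row["created_at"] < start:
            continue
        if end and row["created_at"] > end:
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    rows = [
        {"source": "pdf", "text_content": "error code E42", "created_at": date(2023, 6, 1)},
        {"source": "audio", "text_content": "error code E7", "created_at": date(2024, 2, 1)},
        {"source": "pdf", "text_content": "all good", "created_at": date(2022, 1, 1)},
    ]
    print(filter_results(rows, sources=["pdf", "text"], regex_filter=r"E\d+",
                         start_date="2023-01-01", end_date="2024-12-31"))
```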
Convenience Function
from grepctl import search
# Quick search without client
results = search("query", top_k=10, rerank=True)
You now have a powerful, simple Python API for semantic search across all your data. The SearchClient handles all the complexity of BigQuery connections, embedding models, and vector search - you just focus on building great applications!
Documentation
- Python Interface Guide - Complete examples and API reference for Python integration
- SQL Interface Guide - BigQuery SQL functions and advanced query examples
File details
Details for the file grepctl-0.3.0.tar.gz.
File metadata
- Download URL: grepctl-0.3.0.tar.gz
- Upload date:
- Size: 118.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec2e687325d790be4c98c47130ebd24f6544244efdca3b48e261c36190b1dba6 |
| MD5 | cd3ffb69243612b6ed82a8dee189ab1a |
| BLAKE2b-256 | 0f1b0a4f76ca1a152716522d2ad74d35569ea7ec1c596ed016ab1705fd5130e8 |
File details
Details for the file grepctl-0.3.0-py3-none-any.whl.
File metadata
- Download URL: grepctl-0.3.0-py3-none-any.whl
- Upload date:
- Size: 104.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29ad918281b88815b9f6889bcd113c5dae4b5a640fb16a05d33ccef05d8c1b4d |
| MD5 | 42755e3d090db574b3e4e360cdb4012c |
| BLAKE2b-256 | bccc3f562fe9202a16fd007a53f5dc9f29e2bbd521e10ba11653f17532457d80 |