One-command orchestration for multimodal semantic search in BigQuery

BigQuery Semantic Grep (bq-semgrep)

🚀 One-command multimodal semantic search across your entire data lake using BigQuery ML and Google Cloud AI.

🎯 Quick Start - From Zero to Search in One Command

# Complete setup with automatic data ingestion
grepctl init all --bucket your-bucket --auto-ingest

# Start searching immediately
grepctl search "find all mentions of machine learning"

That's it! The system automatically:

  • ✅ Enables all required Google Cloud APIs
  • ✅ Creates BigQuery dataset and tables
  • ✅ Deploys Vertex AI embedding models
  • ✅ Ingests all 8 data modalities from your GCS bucket
  • ✅ Generates 768-dimensional embeddings
  • ✅ Configures semantic search with VECTOR_SEARCH

📊 What is BigQuery Semantic Grep?

A unified SQL interface for searching across 8 different data types stored in Google Cloud Storage:

  • 📄 Text & Markdown - Direct content extraction
  • 📑 PDF Documents - OCR with Document AI
  • 🖼️ Images - Vision API analysis (labels, text, objects, faces)
  • 🎵 Audio Files - Speech-to-Text transcription
  • 🎬 Video Files - Video Intelligence analysis
  • 📊 JSON & CSV - Structured data parsing

All searchable through semantic understanding, not just keywords!

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     GCS DATA LAKE                           │
│                    (Your Documents)                         │
│  📄 Text  📑 PDF  🖼️ Images  🎵 Audio  🎬 Video  📊 Data    │
└─────────────────────────┬───────────────────────────────────┘
                          │
                    ┌─────▼─────┐
                    │  grepctl  │ ← One command orchestration
                    └─────┬─────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Ingestion    │  │ Google APIs  │  │ Processing   │
│ • 6 scripts  │  │ • Vision     │  │ • Extract    │
│ • All types  │  │ • Speech     │  │ • Transform  │
│              │  │ • Video      │  │ • Enrich     │
└───────┬──────┘  └───────┬──────┘  └───────┬──────┘
        └─────────────────┼─────────────────┘
                          ▼
                ┌─────────────────────┐
                │  BigQuery Dataset   │
                │   search_corpus     │
                │  425+ documents     │
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │   Vertex AI         │
                │ text-embedding-004  │
                │  768 dimensions     │
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │  Semantic Search    │
                │   VECTOR_SEARCH     │
                │  <1 second query    │
                └─────────────────────┘

🛠️ Installation & Setup

Prerequisites

  1. Google Cloud Project with billing enabled
  2. Python 3.11+ and uv package manager
  3. gcloud CLI authenticated

Install grepctl

# Clone repository
git clone https://github.com/yourusername/bq-semgrep.git
cd bq-semgrep

# Install dependencies
uv sync

# Verify installation
uv run python grepctl.py --help

Complete System Setup

Option 1: Fully Automated (Recommended)

# One command does everything!
uv run python grepctl.py init all --bucket your-bucket --auto-ingest

# This single command:
# 1. Enables 7 Google Cloud APIs
# 2. Creates BigQuery dataset and 3 tables
# 3. Deploys 3 Vertex AI models
# 4. Ingests all files from GCS
# 5. Generates embeddings
# 6. Sets up semantic search

Option 2: Step-by-Step Control

# Enable APIs
grepctl apis enable --all

# Initialize BigQuery
grepctl init dataset
grepctl init models

# Ingest data
grepctl ingest all

# Generate embeddings
grepctl index update

# Start searching
grepctl search "your query"
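
The step-by-step sequence above can also be driven from a script. The following is a minimal sketch, not part of grepctl itself: it only shells out to the commands listed above, and the `run_steps` helper and its `dry_run` flag are hypothetical names introduced for illustration.

```python
import shlex
import subprocess

# The setup commands, in the order given in the step-by-step section.
STEPS = [
    ["grepctl", "apis", "enable", "--all"],
    ["grepctl", "init", "dataset"],
    ["grepctl", "init", "models"],
    ["grepctl", "ingest", "all"],
    ["grepctl", "index", "update"],
]


def run_steps(steps=STEPS, dry_run=True):
    """Run each command in order, stopping at the first failure.

    With dry_run=True nothing is executed; the commands are only
    returned as shell-quoted strings so they can be reviewed first.
    """
    planned = []
    for argv in steps:
        planned.append(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails
            subprocess.run(argv, check=True)
    return planned


plan = run_steps(dry_run=True)
for line in plan:
    print(line)
```

Running with `dry_run=False` would execute the commands for real, so each later step only starts once the previous one has exited successfully.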

🔍 Using the System

Command Line Interface

# Search with grepctl
grepctl search "machine learning algorithms"
grepctl search "error handling" -k 20 -m pdf -m markdown

# Search with bq-semgrep
uv run bq-semgrep search "data visualization" --top-k 10 --rerank

# Check system status
grepctl status
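
The search examples above repeat `-m` to select several modalities and use `-k` to set the number of results. As an illustration of how such flags are commonly parsed, here is a small `argparse` sketch. It is not grepctl's actual parser: the program name, the destination names, and the default values are all assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags shown in the examples.
    parser = argparse.ArgumentParser(prog="grepctl-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    search = sub.add_parser("search")
    search.add_argument("query")
    # -k: number of results to return (default of 10 is an assumption)
    search.add_argument("-k", dest="top_k", type=int, default=10)
    # -m: may be given several times; each use appends one modality
    search.add_argument("-m", dest="modalities", action="append", default=None)
    return parser


args = build_parser().parse_args(
    ["search", "error handling", "-k", "20", "-m", "pdf", "-m", "markdown"]
)
print(args.query, args.top_k, args.modalities)
```

With `action="append"`, leaving out `-m` entirely yields `None`, which a caller could treat as "search every modality".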

SQL Interface

-- Direct semantic search
WITH query_embedding AS (
  -- Embed the query text with the same model used for the corpus
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `your-project.mmgrep.text_embedding_model`,
    (SELECT 'machine learning' AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
-- VECTOR_SEARCH returns each matched corpus row in the `base` struct
SELECT base.doc_id, base.source, base.text_content, distance
FROM VECTOR_SEARCH(
  TABLE `your-project.mmgrep.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 10
)
ORDER BY distance;  -- smaller distance means a closer match
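
Conceptually, the query above embeds the search text and then returns the `top_k` corpus rows whose embeddings lie closest to it. The sketch below shows that idea as a brute-force nearest-neighbour search in plain Python. It is an illustration only: Euclidean distance is assumed as the metric, the document IDs are made up, and 3-dimensional toy vectors stand in for the 768-dimensional embeddings.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embedding dimension mismatch")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float],
          corpus: Dict[str, Sequence[float]],
          k: int = 10) -> List[Tuple[str, float]]:
    """Return the k (doc_id, distance) pairs closest to `query`, nearest first."""
    scored = ((doc_id, euclidean(query, vec)) for doc_id, vec in corpus.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical corpus: doc_id -> embedding vector.
corpus = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], corpus, k=2)
for doc_id, distance in hits:
    print(doc_id, round(distance, 4))
```

As in the SQL, results are ordered by ascending distance, so the first hit is the best match.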

Python API

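from bq_semgrep.search.vector_search import SemanticSearch

# `client` and `config` are not defined in this snippet; create them first
# (presumably a BigQuery client and the loaded bq-semgrep configuration).
searcher = SemanticSearch(client, config)

# Search PDFs and images only, with reranking enabled
results = searcher.search(
    query="neural networks",
    top_k=20,
    source_filter=['pdf', 'images'],
    use_rerank=True
)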

📈 System Capabilities

Current Status (Production Ready)

  • ✅ 425+ documents indexed across 8 modalities
  • ✅ 768-dimensional embeddings for semantic understanding
  • ✅ Sub-second query response times
  • ✅ 100% embedding coverage for all documents
  • ✅ 5 Google Cloud APIs integrated
  • ✅ Auto-recovery from embedding issues

Supported Operations

| Operation | Command | Description |
|-----------|---------|-------------|
| Setup | grepctl init all --auto-ingest | Complete one-command setup |
| Ingest | grepctl ingest all | Process all file types |
| Index | grepctl index update | Generate embeddings |
| Fix | grepctl fix embeddings | Auto-fix dimension issues |
| Search | grepctl search "query" | Semantic search |
| Status | grepctl status | System health check |

🧰 Management Tools

grepctl - Complete CLI Management

# System initialization
grepctl init all --bucket your-bucket --auto-ingest

# API management
grepctl apis enable --all
grepctl apis check

# Data ingestion
grepctl ingest pdf        # Process PDFs
grepctl ingest images     # Analyze images with Vision API
grepctl ingest audio      # Transcribe audio
grepctl ingest video      # Analyze videos

# Index management
grepctl index rebuild     # Rebuild from scratch
grepctl index update      # Update missing embeddings
grepctl index verify      # Check embedding health

# Troubleshooting
grepctl fix embeddings    # Fix dimension issues
grepctl fix stuck         # Handle stuck processing
grepctl fix validate      # Check data integrity

# Search
grepctl search "query" -k 20 -o json

Configuration

grepctl uses ~/.grepctl.yaml for configuration:

project_id: your-project
dataset: mmgrep
bucket: your-bucket
location: US
batch_size: 100
chunk_size: 1000
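
To make the shape of this file concrete, here is a sketch that reads the six keys above into a typed object using only the standard library. It is an assumption-laden illustration: `GrepctlConfig` and `parse_flat_yaml` are hypothetical names, the defaults are simply the example values shown above rather than grepctl's real defaults, and the parser handles only flat `key: value` lines (a real tool would normally use a YAML library).

```python
from dataclasses import dataclass, fields


@dataclass
class GrepctlConfig:
    # Defaults are the example values from the README, not confirmed defaults.
    project_id: str = "your-project"
    dataset: str = "mmgrep"
    bucket: str = "your-bucket"
    location: str = "US"
    batch_size: int = 100
    chunk_size: int = 1000


def parse_flat_yaml(text: str) -> GrepctlConfig:
    """Parse flat `key: value` lines; missing keys keep their defaults."""
    types = {f.name: f.type for f in fields(GrepctlConfig)}
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in types:
            raise ValueError(f"unexpected config line: {raw!r}")
        values[key] = types[key](value)  # convert to str or int
    return GrepctlConfig(**values)


cfg = parse_flat_yaml("project_id: demo\nbatch_size: 50\n")
print(cfg)
```

Rejecting unknown keys up front catches typos such as `batchsize` before any cloud resources are touched.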

📊 Supported Data Types

| Modality | Extensions | Processing Method | Google API Used |
|----------|------------|-------------------|-----------------|
| Text | .txt, .log | Direct extraction | None |
| Markdown | .md | Markdown parsing | None |
| PDF | .pdf | OCR extraction | Document AI |
| Images | .jpg, .png, .gif | Visual analysis | Vision API |
| Audio | .mp3, .wav, .m4a | Transcription | Speech-to-Text |
| Video | .mp4, .avi, .mov | Frame + audio analysis | Video Intelligence |
| JSON | .json, .jsonl | Structured parsing | None |
| CSV | .csv, .tsv | Tabular analysis | None |
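
The table above amounts to a lookup from file extension to modality. A minimal sketch of that routing step follows. The extension lists are copied from the table; the lowercase modality labels and the `detect_modality` helper are assumptions made for illustration and may not match grepctl's internal names.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extension -> modality, following the supported data types table.
MODALITY_BY_EXTENSION = {
    ".txt": "text", ".log": "text",
    ".md": "markdown",
    ".pdf": "pdf",
    ".jpg": "images", ".png": "images", ".gif": "images",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".json": "json", ".jsonl": "json",
    ".csv": "csv", ".tsv": "csv",
}


def detect_modality(object_name: str) -> Optional[str]:
    """Return the modality for a GCS object name, or None if unsupported."""
    suffix = PurePosixPath(object_name).suffix.lower()
    return MODALITY_BY_EXTENSION.get(suffix)


for name in ["reports/Q3.PDF", "media/clip.mov", "archive.zip"]:
    print(name, "->", detect_modality(name))
```

Lowercasing the suffix means `Q3.PDF` and `q3.pdf` are routed the same way, and unsupported files return `None` so they can be skipped or logged.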

🚀 Advanced Features

Multimodal Search

Search across all data types simultaneously:

# Find mentions across PDFs, images, and videos
grepctl search "quarterly revenue" -m pdf -m images -m video

Automatic Processing

  • Vision API extracts text, labels, objects from images
  • Document AI performs OCR on scanned PDFs
  • Speech-to-Text transcribes audio with punctuation
  • Video Intelligence analyzes frames and transcribes speech

Error Recovery

# Automatic fix for common issues
grepctl fix embeddings    # Fixes dimension mismatches
grepctl fix stuck         # Clears stuck processing

📚 Documentation

🔧 Troubleshooting

Common Issues & Solutions

| Issue | Solution |
|-------|----------|
| "Permission denied" | Run gcloud auth login and make sure your account has the BigQuery Admin role |
| "Dataset not found" | Run grepctl init dataset |
| "Embedding dimension mismatch" | Run grepctl fix embeddings |
| "No search results" | Check grepctl status, then run grepctl index update |
| "API not enabled" | Run grepctl apis enable --all |
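
The issue-to-solution pairs above can be turned into a small lookup, for example in a wrapper script that prints a hint when a command fails. This is only a sketch: the `suggest_fix` helper is hypothetical, and matching on lowercase message fragments is an assumption about how the error text would appear.

```python
# Message fragments and suggested commands, taken from the table above.
FIXES = {
    "permission denied": ["gcloud auth login"],
    "dataset not found": ["grepctl init dataset"],
    "embedding dimension mismatch": ["grepctl fix embeddings"],
    "no search results": ["grepctl status", "grepctl index update"],
    "api not enabled": ["grepctl apis enable --all"],
}


def suggest_fix(error_message: str) -> list:
    """Return the suggested commands for the first matching issue, else []."""
    lowered = error_message.lower()
    for fragment, commands in FIXES.items():
        if fragment in lowered:
            return commands
    return []


print(suggest_fix("Error: Dataset not found: mmgrep"))
print(suggest_fix("connection timed out"))
```

For "Permission denied", the role check (BigQuery Admin) still has to be done by hand; the lookup only covers the command part of the fix.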

Quick Diagnostics

# Check everything
grepctl status

# Verify APIs
grepctl apis check

# Check embeddings
grepctl index verify

# Fix any issues
grepctl fix embeddings

🎯 Example Use Cases

  1. Code Search: Find code patterns across repositories
  2. Document Discovery: Search PDFs for specific topics
  3. Media Analysis: Find content in images and videos
  4. Log Analysis: Semantic search through log files
  5. Data Mining: Query structured data semantically

📈 Performance

  • Ingestion: ~50 docs/second for text
  • Embedding Generation: ~20 docs/second
  • Search Latency: <1 second for most queries
  • Storage: ~500MB for 425+ documents
  • Embeddings: 768 dimensions per document (text-embedding-004)
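
The throughput and storage figures above can be used for rough capacity planning. The sketch below does that arithmetic under two loud assumptions: that the quoted rates scale linearly with corpus size, and that the text ingestion rate applies to every document (audio, video, and OCR workloads will likely be slower). The `estimate` helper is hypothetical.

```python
INGEST_DOCS_PER_SEC = 50          # text ingestion rate quoted above
EMBED_DOCS_PER_SEC = 20           # embedding generation rate quoted above
STORAGE_MB_PER_DOC = 500 / 425    # ~500 MB spread over 425 documents


def estimate(n_docs: int) -> dict:
    """Back-of-the-envelope time and storage estimate for n_docs documents."""
    return {
        "ingest_seconds": n_docs / INGEST_DOCS_PER_SEC,
        "embed_seconds": n_docs / EMBED_DOCS_PER_SEC,
        "storage_mb": n_docs * STORAGE_MB_PER_DOC,
    }


est = estimate(10_000)
for key, value in est.items():
    print(f"{key}: {value:,.1f}")
```

Because embedding runs at well under half the ingestion rate, it is the step that dominates total indexing time in this simple model.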

🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Built with:

  • Google BigQuery ML
  • Vertex AI (text-embedding-004)
  • Google Cloud Vision, Document AI, Speech-to-Text, Video Intelligence APIs
  • Python, uv, and rich CLI library

Ready to search your entire data lake semantically?

grepctl init all --bucket your-bucket --auto-ingest

🎉 That's all it takes!
