One-command orchestration for multimodal semantic search in BigQuery
One-command multimodal semantic search across your entire data lake using BigQuery ML and Google Cloud AI.
Installation
# Install from PyPI
pip install grepctl
Quick Start - From Zero to Search in One Command
# Complete setup with automatic data ingestion
grepctl init all --bucket your-bucket --auto-ingest
# Start searching immediately
grepctl search "find all mentions of machine learning"
That's it! The system automatically:
- Enables all required Google Cloud APIs
- Creates the BigQuery dataset and tables
- Deploys Vertex AI embedding models
- Ingests all 8 data modalities from your GCS bucket
- Generates 768-dimensional embeddings
- Configures semantic search with VECTOR_SEARCH
What is grepctl?
grepctl is a command-line orchestration tool that turns your Google Cloud Storage data lake into a searchable knowledge base. It provides a unified interface for searching across 8 data types:
- Text & Markdown - Direct content extraction
- PDF Documents - OCR with Document AI
- Images - Vision API analysis (labels, text, objects, faces)
- Audio Files - Speech-to-Text transcription
- Video Files - Video Intelligence analysis
- JSON & CSV - Structured data parsing
All of these are searchable by meaning, not just by keyword.
Architecture Overview
Installation & Setup
Prerequisites
- Google Cloud Project with billing enabled
- Python 3.11+
- gcloud CLI authenticated with appropriate permissions
Install from PyPI
# Install the package
pip install grepctl
# Verify installation
grepctl --help
Install from Source
# Clone repository
git clone https://github.com/gregorymulla/grepctl.git
cd grepctl
# Install with uv (recommended)
uv sync
uv run grepctl --help
# Or install with pip
pip install -e .
grepctl --help
Complete System Setup
Option 1: Fully Automated (Recommended)
# One command does everything!
grepctl init all --bucket your-bucket --auto-ingest
# This single command:
# 1. Enables 7 Google Cloud APIs
# 2. Creates BigQuery dataset and 3 tables
# 3. Deploys 3 Vertex AI models
# 4. Ingests all files from GCS
# 5. Generates embeddings
# 6. Sets up semantic search
Option 2: Step-by-Step Control
# Enable APIs
grepctl apis enable --all
# Initialize BigQuery
grepctl init dataset
grepctl init models
# Ingest data
grepctl ingest all
# Generate embeddings
grepctl index update
# Start searching
grepctl search "your query"
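The step-by-step sequence above can also be driven from a script. The following is a minimal sketch, not part of grepctl itself: it assumes the grepctl executable is on your PATH, and the helper names setup_commands and run_setup are hypothetical. It uses only the commands listed above and defaults to a dry run that prints them.

```python
import shlex
import subprocess

# The commands from "Option 2: Step-by-Step Control", in order.
SETUP_STEPS = [
    "grepctl apis enable --all",
    "grepctl init dataset",
    "grepctl init models",
    "grepctl ingest all",
    "grepctl index update",
]


def setup_commands():
    """Return each setup step as an argv list suitable for subprocess."""
    return [shlex.split(step) for step in SETUP_STEPS]


def run_setup(dry_run=True):
    """Run the steps in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    executed = []
    for argv in setup_commands():
        print("$", " ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails,
            # so later steps never run against a half-built system.
            subprocess.run(argv, check=True)
        executed.append(argv)
    return executed


if __name__ == "__main__":
    run_setup(dry_run=True)
```

Calling run_setup(dry_run=False) would execute the steps for real; stopping at the first failure keeps ingestion from running before the dataset and models exist.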
Using the System
Web User Interface
Start the web server with a beautiful, responsive search interface:
# Start the web server (default port 8000)
grepctl serve
# Start with custom port
grepctl serve --port 3000
# Start with auto-reload for development
grepctl serve --reload
Then open your browser to http://localhost:8000 to access:
- Clean, modern interface with your company logo
- Mobile-friendly responsive design - Works seamlessly on phones, tablets, and desktops
- Real-time search across all modalities
- Live system status and document count
- Relevance scoring and result highlighting
- Touch-optimized interface for mobile devices
Command Line Interface
# Search across all data
grepctl search "machine learning algorithms"
# Search specific modalities
grepctl search "error handling" -k 20 -m pdf -m markdown
# Check system status
grepctl status
# View all available commands
grepctl --help
grepctl search "bird"
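When calling the CLI from Python, it can help to build the argument list programmatically. This is a small sketch under stated assumptions: build_search_argv and run_search are hypothetical helpers, not part of grepctl, and only the -k and -m flags shown above are used.

```python
import subprocess


def build_search_argv(query, top_k=None, modalities=()):
    """Build the argv for `grepctl search` using the -k and -m flags."""
    argv = ["grepctl", "search", query]
    if top_k is not None:
        argv += ["-k", str(top_k)]
    for modality in modalities:
        # -m is repeated once per modality, as in the examples above.
        argv += ["-m", modality]
    return argv


def run_search(query, **kwargs):
    """Run the search and return whatever the CLI prints to stdout."""
    argv = build_search_argv(query, **kwargs)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


print(build_search_argv("error handling", top_k=20, modalities=["pdf", "markdown"]))
```

Passing the query as a single list element avoids shell quoting problems with queries that contain spaces or quotes.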
SQL Interface
-- Direct semantic search
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `your-project.mmgrep.text_embedding_model`,
    (SELECT 'machine learning' AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
-- VECTOR_SEARCH returns each matched row in a `base` struct alongside `distance`
SELECT base.doc_id, base.source, base.text_content, distance AS score
FROM VECTOR_SEARCH(
  TABLE `your-project.mmgrep.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 10
)
ORDER BY distance;
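Conceptually, the query above embeds the search text and returns the top_k rows whose stored embeddings are nearest to it. The sketch below illustrates that idea in plain Python with a brute-force scan over toy 3-dimensional vectors, using Euclidean distance as an example metric. It is an illustration only, not how BigQuery executes VECTOR_SEARCH; the corpus, document IDs, and function names are invented for the example.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query, corpus, k):
    """Return (doc_id, distance) pairs for the k closest vectors, nearest first."""
    scored = ((doc_id, euclidean(query, vec)) for doc_id, vec in corpus.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy stand-ins for stored embeddings (real ones have 768 dimensions).
corpus = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}

results = top_k_nearest([1.0, 0.0, 0.0], corpus, k=2)
for doc_id, dist in results:
    print(doc_id, round(dist, 3))
```

A smaller distance means a closer match, which is why the SQL query sorts by distance in ascending order.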
Python API (When installed from source)
from grepctl.search.vector_search import SemanticSearch
from grepctl.bigquery.connection import BigQueryClient
from grepctl.config import load_config
# Load configuration
config = load_config()
client = BigQueryClient(config)
# Initialize searcher
searcher = SemanticSearch(client, config)
# Search across all modalities
results = searcher.search(
    query="neural networks",
    top_k=20,
    source_filter=['pdf', 'images'],
    use_rerank=True
)
System Capabilities
Current Status (Production Ready)
- 425+ documents indexed across 8 modalities
- 768-dimensional embeddings for semantic understanding
- Sub-second query response times
- 100% embedding coverage for all documents
- 5 Google Cloud APIs integrated
- Auto-recovery from embedding issues
Supported Operations
| Operation | Command | Description |
|---|---|---|
| Setup | grepctl init all --auto-ingest | Complete one-command setup |
| Ingest | grepctl ingest all | Process all file types |
| Index | grepctl index update | Generate embeddings |
| Fix | grepctl fix embeddings | Auto-fix dimension issues |
| Search | grepctl search "query" | Semantic search |
| Status | grepctl status | System health check |
grepctl Commands
Complete CLI Management
# System initialization
grepctl init all --bucket your-bucket --auto-ingest
# API management
grepctl apis enable --all
grepctl apis check
# Data ingestion
grepctl ingest pdf # Process PDFs
grepctl ingest images # Analyze images with Vision API
grepctl ingest audio # Transcribe audio
grepctl ingest video # Analyze videos
# Index management
grepctl index rebuild # Rebuild from scratch
grepctl index update # Update missing embeddings
grepctl index verify # Check embedding health
# Troubleshooting
grepctl fix embeddings # Fix dimension issues
grepctl fix stuck # Handle stuck processing
grepctl fix validate # Check data integrity
# Search
grepctl search "query" -k 20 -o json
# Web Interface (NEW!)
grepctl serve # Start web UI at http://localhost:8000
grepctl serve --port 3000 --theme-config ./my-theme.yaml
Configuration
grepctl uses ~/.grepctl.yaml for configuration:
project_id: your-project
dataset: mmgrep
bucket: your-bucket
location: US
batch_size: 100
chunk_size: 1000
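To inspect or validate this file from a script, a flat key: value layout like the one above can be read with the standard library alone. The following is a hedged sketch, not grepctl's own loader: parse_flat_config, load_user_config, and the choice of required keys are assumptions made for illustration, and the parser handles only the flat scalar keys shown above, not general YAML.

```python
from pathlib import Path

INT_KEYS = {"batch_size", "chunk_size"}
# Assumption: these three keys are treated as mandatory for this example.
REQUIRED_KEYS = {"project_id", "dataset", "bucket"}


def parse_flat_config(text):
    """Parse flat 'key: value' lines into a dict (not a full YAML parser)."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {raw!r}")
        key, value = key.strip(), value.strip()
        config[key] = int(value) if key in INT_KEYS else value
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return config


def load_user_config(path=None):
    """Read ~/.grepctl.yaml (or the given path) with the parser above."""
    path = Path(path) if path else Path.home() / ".grepctl.yaml"
    return parse_flat_config(path.read_text())


EXAMPLE = """\
project_id: your-project
dataset: mmgrep
bucket: your-bucket
location: US
batch_size: 100
chunk_size: 1000
"""

config = parse_flat_config(EXAMPLE)
print(config["dataset"], config["batch_size"])
```

For anything beyond flat scalar values, a real YAML parser is the safer choice.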
Supported Data Types
| Modality | Extensions | Processing Method | Google API Used |
|---|---|---|---|
| Text | .txt, .log | Direct extraction | None |
| Markdown | .md | Markdown parsing | None |
| PDF | .pdf | OCR extraction | Document AI |
| Images | .jpg, .png, .gif | Visual analysis | Vision API |
| Audio | .mp3, .wav, .m4a | Transcription | Speech-to-Text |
| Video | .mp4, .avi, .mov | Frame + audio analysis | Video Intelligence |
| JSON | .json, .jsonl | Structured parsing | None |
| CSV | .csv, .tsv | Tabular analysis | None |
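The extension-to-modality mapping in the table can be expressed as a simple lookup, which is handy for checking ahead of time how the files in a bucket would be classified. This is a sketch under assumptions: the modality names pdf, images, audio, video, and markdown follow the CLI flags used elsewhere in this README, while text, json, csv, the .pdf extension, and the function name modality_for are assumed for illustration.

```python
from pathlib import PurePosixPath

EXTENSION_TO_MODALITY = {
    ".txt": "text", ".log": "text",
    ".md": "markdown",
    ".pdf": "pdf",  # assumed extension for the PDF modality
    ".jpg": "images", ".png": "images", ".gif": "images",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".json": "json", ".jsonl": "json",
    ".csv": "csv", ".tsv": "csv",
}


def modality_for(path):
    """Return the modality for a file path or gs:// URI, or None if unsupported."""
    suffix = PurePosixPath(path).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_MODALITY.get(suffix)


for name in ["gs://your-bucket/reports/q3.PDF", "notes.md", "archive.zip"]:
    print(name, "->", modality_for(name))
```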
Web Interface
grepctl now includes a beautiful, customizable web UI for semantic search!
Starting the Web Interface
# Quick start - serves at http://localhost:8000
grepctl serve
# Custom port and theme
grepctl serve --port 3000 --theme-config ./my-company-theme.yaml
# Development mode with auto-reload
grepctl serve --reload
Features
- Fully Customizable - Change colors, logo, and branding
- Dark Mode - Built-in dark/light theme support
- Google-like Interface - Familiar, intuitive search experience
- Real-time Search - Search as you type with debouncing
- Advanced Filters - Filter by modality, date, and more
- Responsive Design - Works on desktop, tablet, and mobile
- Keyboard Shortcuts - Press / to focus search
Customization
See UI Customization Guide for detailed instructions on:
- Adding your company logo
- Changing color schemes
- Creating custom themes
- White-labeling the interface
Quick customization example:
# my-company-theme.yaml
branding:
  companyName: "Acme Corp"
  logo: "/static/acme-logo.svg"
colors:
  primary: "#FF5722"
  secondary: "#FFC107"
Advanced Features
Multimodal Search
Search across all data types simultaneously:
# Find mentions across PDFs, images, and videos
grepctl search "quarterly revenue" -m pdf -m images -m video
Automatic Processing
- Vision API extracts text, labels, objects from images
- Document AI performs OCR on scanned PDFs
- Speech-to-Text transcribes audio with punctuation
- Video Intelligence analyzes frames and transcribes speech
Error Recovery
# Automatic fix for common issues
grepctl fix embeddings # Fixes dimension mismatches
grepctl fix stuck # Clears stuck processing
Troubleshooting
Common Issues & Solutions
| Issue | Solution |
|---|---|
| "Permission denied" | Run gcloud auth login and ensure your account has the BigQuery Admin role |
| "Dataset not found" | Run grepctl init dataset |
| "Embedding dimension mismatch" | Run grepctl fix embeddings |
| "No search results" | Check grepctl status and run grepctl index update |
| "API not enabled" | Run grepctl apis enable --all |
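The table above can be turned into a small lookup that suggests a command for a given error message. This is only a sketch: suggest_fix is a hypothetical helper, and it assumes that real error output contains the phrases from the Issue column, which may not hold for every failure.

```python
# (phrase from the Issue column, first command from the Solution column)
FIXES = [
    ("permission denied", "gcloud auth login"),
    ("dataset not found", "grepctl init dataset"),
    ("embedding dimension mismatch", "grepctl fix embeddings"),
    ("no search results", "grepctl index update"),
    ("api not enabled", "grepctl apis enable --all"),
]


def suggest_fix(error_message):
    """Return the suggested command for a known issue, or None."""
    lowered = error_message.lower()
    for phrase, command in FIXES:
        if phrase in lowered:
            return command
    return None


print(suggest_fix("Error: Dataset not found in project your-project"))
```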
Quick Diagnostics
# Check everything
grepctl status
# Verify APIs
grepctl apis check
# Check embeddings
grepctl index verify
# Fix any issues
grepctl fix embeddings
Example Use Cases
- Code Search: Find code patterns across repositories
- Document Discovery: Search PDFs for specific topics
- Media Analysis: Find content in images and videos
- Log Analysis: Semantic search through log files
- Data Mining: Query structured data semantically
Performance
- Ingestion: ~50 docs/second for text
- Embedding Generation: ~20 docs/second
- Search Latency: <1 second for most queries
- Storage: ~500MB for 425+ documents
- Accuracy: 768-dimensional embeddings for semantic precision
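These throughput figures allow a rough estimate of how long a corpus takes to process. The sketch below is back-of-the-envelope arithmetic only: it assumes ingestion and embedding run one after the other and that the quoted rates scale linearly. The ingestion rate is quoted for text, so other modalities will likely be slower. The function name estimate_seconds is hypothetical.

```python
INGEST_DOCS_PER_SEC = 50.0  # "~50 docs/second for text"
EMBED_DOCS_PER_SEC = 20.0   # "~20 docs/second"


def estimate_seconds(num_docs):
    """Estimate sequential ingest + embedding time for num_docs text documents."""
    ingest = num_docs / INGEST_DOCS_PER_SEC
    embed = num_docs / EMBED_DOCS_PER_SEC
    return {"ingest": ingest, "embed": embed, "total": ingest + embed}


estimate = estimate_seconds(425)
print(estimate)  # ingest 8.5 s, embed 21.25 s, total 29.75 s
```

At these rates embedding generation dominates, so it is the stage to watch when scaling to larger buckets.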
Package Information
- PyPI: https://pypi.org/project/grepctl/
- Version: 0.1.0
- Requirements: Python 3.11+, Google Cloud Project
- License: MIT
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Development Setup
# Clone the repository
git clone https://github.com/yourusername/grepctl.git
cd grepctl
# Install in development mode with uv
uv sync
uv add --dev pytest black flake8
# Run tests
uv run pytest
# Format code
uv run black .
License
MIT License - see LICENSE for details.
Acknowledgments
Built with:
- Google BigQuery ML
- Vertex AI (text-embedding-004)
- Google Cloud Vision, Document AI, Speech-to-Text, Video Intelligence APIs
- Python, Click, and Rich CLI libraries
Citation
If you use grepctl in your research or project, please cite:
@software{grepctl2024,
title = {grepctl: One-Command Orchestration for Multimodal Semantic Search in BigQuery},
author = {Mulla, Gregory},
year = {2024},
url = {https://github.com/yourusername/grepctl},
version = {0.1.0}
}
Ready to transform your data lake into a searchable knowledge base?
pip install grepctl
grepctl init all --bucket your-bucket --auto-ingest
That's all it takes!