Skip to main content

SUTRA

Project description

QuerySUTRA

SUTRA: Structured-Unstructured-Text-Retrieval-Architecture

AI-powered data analysis library. Upload PDFs, query with natural language, export to MySQL automatically.

Installation

pip install QuerySUTRA
pip install QuerySUTRA[mysql]       # For MySQL export
pip install QuerySUTRA[embeddings]  # For smart caching
pip install QuerySUTRA[all]         # All features

Quick Start

from sutra import SUTRA

sutra = SUTRA(api_key="your-openai-key")
sutra.upload("data.pdf")
result = sutra.ask("Show me all people")
print(result.data)

Core Features

1. Automatic MySQL Export

Database auto-created if not exists.

# Upload and export to MySQL automatically
sutra.upload("data.pdf", auto_export_mysql={
    'host': 'localhost',
    'user': 'root',
    'password': 'your_password',
    'database': 'my_database'  # Auto-creates
})

2. Complete Data Extraction

Processes entire PDF in chunks. Extracts ALL data.

sutra.upload("large_document.pdf")  # Extracts all 50+ employees
sutra.tables()

3. Natural Language Queries

result = sutra.ask("Show all people from California")
result = sutra.ask("Who has Python skills?")
result = sutra.ask("Count employees by state", viz="pie")

4. Custom Visualizations

result = sutra.ask("Sales by region", viz="pie")
result = sutra.ask("Trends", viz="line")
result = sutra.ask("Compare", viz="bar")
result = sutra.ask("Correlation", viz="scatter")
result = sutra.ask("Data", viz="table")
result = sutra.ask("Analysis", viz="heatmap")

5. Load Existing Databases

# Load SQLite
sutra = SUTRA.load_from_db("data.db", api_key="key")

# Connect to MySQL
sutra = SUTRA.connect_mysql("localhost", "root", "pass", "database")

# Connect to PostgreSQL  
sutra = SUTRA.connect_postgres("localhost", "postgres", "pass", "database")

6. Fuzzy Matching for Better NLP

Automatically matches similar terms.

sutra = SUTRA(api_key="your-key", fuzzy_match=True)

# "New York City" automatically matches "New York" in database
result = sutra.ask("Who are from New York City?")
# Output: Fuzzy: 'City' -> 'New York'

How it works:

  • Uses Python's difflib.get_close_matches
  • 60% similarity threshold
  • Matches query terms to actual database values
  • Example: "NYC" → "New York", "Cali" → "California"

7. Embeddings for Smart Caching (Saves API Calls)

Cache similar queries to save OpenAI API costs.

sutra = SUTRA(api_key="your-key", use_embeddings=True)

# First query - calls OpenAI API
result = sutra.ask("Show sales data")

# Similar query - uses cache (NO API call, FREE!)
result = sutra.ask("Display sales information")
# Output: Similar (92%): 'Show sales data'

How it works:

  • Uses sentence-transformers library
  • Model: all-MiniLM-L6-v2 (80MB, runs locally)
  • Converts queries to 384-dimensional vectors
  • Similarity threshold: 85%
  • Completely offline (no external API calls)

Technical details:

Query 1: "Show sales" → Vector: [0.23, -0.45, 0.67, ...]
Query 2: "Display sales" → Vector: [0.25, -0.43, 0.69, ...]
Similarity: 92% → Uses cached result (saves API call)

Query 3: "What's the weather?" → Vector: [-0.89, 0.12, -0.34, ...]
Similarity: 15% → New API call (different topic)

Cost savings:

# Without embeddings: 10 similar queries = 10 API calls = $0.10
# With embeddings: 10 similar queries = 1 API call = $0.01 (90% savings)

8. Irrelevant Query Detection

Detects when queries don't relate to your database.

sutra = SUTRA(api_key="your-key", check_relevance=True)

result = sutra.ask("What is the weather today?")
# Output: Warning: Query may be irrelevant to your database
#         Database contains tables about: employee_data_people, employee_data_skills
#         Continue anyway? (yes/no):

How it works:

  • Sends database context (table names, column names) to AI
  • AI determines if query is relevant
  • Prompts user before wasting API call
  • Can proceed anyway if desired

9. Query Caching

Simple caching for exact query matches.

sutra = SUTRA(api_key="your-key", cache_queries=True)

result = sutra.ask("Show total sales")  # API call
result = sutra.ask("Show total sales")  # From cache (FREE)
# Output: From cache

10. Direct SQL (Free, No API Cost)

result = sutra.sql("SELECT * FROM people WHERE state='CA'")
print(result.data)

Advanced Configuration

Enable all optional features:

sutra = SUTRA(
    api_key="your-openai-key",
    db="database.db",              # SQLite database path
    use_embeddings=True,           # Smart caching with embeddings (saves 90% API costs)
    check_relevance=True,          # Detect irrelevant queries before API call
    fuzzy_match=True,              # Better NLP matching
    cache_queries=True             # Cache exact query matches
)

Feature comparison:

Feature Benefit When to Use
use_embeddings=True Saves 90% on API costs for similar queries Always recommended
fuzzy_match=True Better query matching When data has city/location names
check_relevance=True Prevents wasted API calls When users ask random questions
cache_queries=True Saves on exact query repeats Always recommended

Import SQLite to MySQL

Step 1: In Colab - Export Database

sutra.upload("data.pdf")
sutra.export_db("my_data.db", format="sqlite")

from google.colab import files
files.download("my_data.db")

Step 2: On Windows - Import to MySQL

Method A: Using QuerySUTRA

sutra = SUTRA.load_from_db("my_data.db", api_key="key")
sutra.save_to_mysql("localhost", "root", "password", "my_database")

Method B: Using simple_import.py script

Create simple_import.py:

import sqlite3, mysql.connector, pandas as pd

SQLITE_DB = "my_data.db"
MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD = "localhost", "root", "password"
MYSQL_DATABASE = "my_database"

sqlite_conn = sqlite3.connect(SQLITE_DB)
cursor = sqlite_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]

temp_conn = mysql.connector.connect(host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD)
temp_cursor = temp_conn.cursor()
temp_cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{MYSQL_DATABASE}`")
temp_cursor.close()
temp_conn.close()

mysql_conn = mysql.connector.connect(host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD, database=MYSQL_DATABASE)
mysql_cursor = mysql_conn.cursor()

for table in tables:
    df = pd.read_sql_query(f"SELECT * FROM {table}", sqlite_conn)
    mysql_cursor.execute(f"DROP TABLE IF EXISTS {table}")
    
    cols = [f"`{col}` {'INT' if df[col].dtype == 'int64' else 'FLOAT' if df[col].dtype == 'float64' else 'TEXT'}" for col in df.columns]
    mysql_cursor.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    
    if len(df) > 0:
        placeholders = ', '.join(['%s'] * len(df.columns))
        for _, row in df.iterrows():
            vals = [None if pd.isna(v) else v for v in row.values]
            mysql_cursor.execute(f"INSERT INTO {table} VALUES ({placeholders})", vals)
    mysql_conn.commit()

sqlite_conn.close()
mysql_cursor.close()
mysql_conn.close()
print(f"Complete! Data in MySQL database '{MYSQL_DATABASE}'")

Run: python simple_import.py

Supported Formats

CSV, Excel, JSON, SQL, PDF, Word, Text, Pandas DataFrame

How Embeddings Work

QuerySUTRA uses sentence-transformers to create semantic embeddings of your queries:

Model: all-MiniLM-L6-v2

  • Size: 80MB (downloads once, cached locally)
  • Embedding dimension: 384
  • Speed: Very fast, runs locally
  • No external API calls

Process:

  1. Query is converted to a 384-dimensional vector
  2. Compared to cached query vectors using cosine similarity
  3. If similarity > 85%, uses cached result
  4. Otherwise, makes new API call

Example:

sutra = SUTRA(api_key="key", use_embeddings=True)

# Query 1: "Show me sales data"
# → Embedding: [0.234, -0.456, 0.678, -0.123, ...]
# → API call made
# → Result cached

# Query 2: "Display sales information"  
# → Embedding: [0.238, -0.451, 0.682, -0.119, ...]
# → Similarity: 92% with Query 1
# → Uses cached result (NO API CALL)

# Query 3: "What's the weather?"
# → Embedding: [-0.891, 0.123, -0.345, 0.567, ...]
# → Similarity: 15% with Query 1
# → Makes new API call (different topic)

Cost Comparison:

Without embeddings:

10 queries about sales = 10 API calls = $0.10

With embeddings:

10 similar queries about sales = 1 API call + 9 cached = $0.01 (90% savings)

Installation:

pip install QuerySUTRA[embeddings]

Usage:

sutra = SUTRA(api_key="key", use_embeddings=True)

# All similar queries are cached automatically
result1 = sutra.ask("Show sales")
result2 = sutra.ask("Display sales data")      # Cached
result3 = sutra.ask("Give me sales information") # Cached
result4 = sutra.ask("Sales data please")        # Cached
# Only 1 API call for all 4 queries!

API Reference

Initialize

SUTRA(
    api_key: str,                  # OpenAI API key
    db: str = "sutra.db",         # SQLite database path
    use_embeddings: bool = False,  # Enable smart caching
    check_relevance: bool = False, # Check query relevance
    fuzzy_match: bool = True,      # Enable fuzzy matching
    cache_queries: bool = True     # Cache exact matches
)

Class Methods

  • load_from_db(path, api_key, **kwargs) - Load existing SQLite database
  • connect_mysql(host, user, password, database, port, api_key, **kwargs) - Connect to MySQL
  • connect_postgres(host, user, password, database, port, api_key, **kwargs) - Connect to PostgreSQL

Instance Methods

  • upload(data, name, extract_entities, auto_export_mysql) - Upload data
  • ask(question, viz, table) - Natural language query
  • sql(query, viz) - Direct SQL query
  • tables() - List all tables
  • schema(table) - Show schema
  • peek(table, n) - Preview data
  • export_db(path, format) - Export (sqlite/sql/json/excel)
  • save_to_mysql(...) - Export to MySQL (auto-creates database)
  • save_to_postgres(...) - Export to PostgreSQL
  • backup(path) - Create backup
  • close() - Close connection

Troubleshooting

MySQL database doesn't exist

  • Fixed - auto-creates automatically

Only 10 records from large PDF

  • Fixed - processes entire document in chunks

connect_mysql() not found

  • Update: pip install --upgrade QuerySUTRA

Embeddings not working

  • Install: pip install QuerySUTRA[embeddings]

Requirements

  • Python 3.8+
  • OpenAI API key
  • MySQL/PostgreSQL (optional)

License

MIT License

Changelog

v0.4.2

  • Complete embeddings documentation
  • Simplified workflows

v0.4.0

  • Auto-creates MySQL database
  • Complete PDF extraction
  • Chunk processing
  • Auto-export feature

Made by Aditya Batta

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

querysutra-0.4.3.tar.gz (45.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

querysutra-0.4.3-py3-none-any.whl (48.4 kB view details)

Uploaded Python 3

File details

Details for the file querysutra-0.4.3.tar.gz.

File metadata

  • Download URL: querysutra-0.4.3.tar.gz
  • Upload date:
  • Size: 45.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.3.tar.gz
Algorithm Hash digest
SHA256 6ed1b7fb8967f069f7175ff2fe4b029157ddb3932a139d938eb2adb0868b9760
MD5 4eecc1ceda8ced527b5ea8ef14190118
BLAKE2b-256 4451c9b51981452405bf4e7eb53c6cbbda373846ffd33496e76db3fbd99b6a04

See more details on using hashes here.

File details

Details for the file querysutra-0.4.3-py3-none-any.whl.

File metadata

  • Download URL: querysutra-0.4.3-py3-none-any.whl
  • Upload date:
  • Size: 48.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bcc07130d6b1762c47bab6ebdaf809f271614f01a574f2ed3bf343b621cf3405
MD5 edee7e24fcb2e3636549b0ea5e1ad110
BLAKE2b-256 5da671ac9c350acf19d3195ac59af9c9637459bef4e973a9805fa6a66fd06c94

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page