Skip to main content

SUTRA: Structured-Unstructured-Text-Retrieval-Architecture - Complete data extraction

Project description

QuerySUTRA

SUTRA: Structured-Unstructured-Text-Retrieval-Architecture

Professional Python library for AI-powered data analysis with automatic entity extraction, natural language querying, and intelligent caching.

Installation

pip install QuerySUTRA

# With optional features
pip install QuerySUTRA[embeddings]  # Smart caching
pip install QuerySUTRA[mysql]       # MySQL support
pip install QuerySUTRA[postgres]    # PostgreSQL support
pip install QuerySUTRA[all]         # All features

Key Features

1. Complete Data Extraction from Large Documents

Processes entire documents in chunks - no data loss on large PDFs.

from sutra import SUTRA

sutra = SUTRA(api_key="your-openai-key")
sutra.upload("large_document.pdf")  # Extracts ALL data, not just first page

# Automatically creates multiple tables:
# - document_people (40 rows, 8 columns)
# - document_skills (50 rows, 5 columns)
# - document_technologies (30 rows, 4 columns)
# - document_projects (25 rows, 6 columns)
# etc.

2. Natural Language Querying

result = sutra.ask("Show me all people from New York")
print(result.data)

# With visualization
result = sutra.ask("Show sales by region", viz="pie")

3. Load Existing Databases

# Load SQLite database
sutra = SUTRA.load_from_db("sutra.db", api_key="your-key")

# Connect to MySQL
sutra = SUTRA.connect_mysql("localhost", "root", "password", "database")

# Connect to PostgreSQL
sutra = SUTRA.connect_postgres("localhost", "postgres", "password", "database")

4. Import SQLite to MySQL

Step 1: In Colab or Python - Export Database

# After uploading your data
sutra.upload("data.pdf")
sutra.tables()

# Export to SQLite
sutra.export_db("my_data.db", format="sqlite")

# In Colab, download the file
from google.colab import files
files.download("my_data.db")

Step 2: On Windows - Import to MySQL

Method A: Using QuerySUTRA

from sutra import SUTRA

# Load the SQLite database
sutra = SUTRA(api_key="your-key", db="my_data.db")

# Verify tables
sutra.tables()

# Export to MySQL
sutra.save_to_mysql("localhost", "root", "password", "my_database")

Method B: Using simple_import.py Script

Download the conversion script from the repository or create simple_import.py:

import sqlite3
import mysql.connector
import pandas as pd

# Configuration
SQLITE_DB = "my_data.db"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"
MYSQL_DATABASE = "my_database"

# Connect to SQLite
sqlite_conn = sqlite3.connect(SQLITE_DB)
cursor = sqlite_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]

# Connect to MySQL and create database
mysql_conn_temp = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD
)
temp_cursor = mysql_conn_temp.cursor()
temp_cursor.execute(f"CREATE DATABASE IF NOT EXISTS {MYSQL_DATABASE}")
temp_cursor.close()
mysql_conn_temp.close()

# Connect to database and import
mysql_conn = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, 
    password=MYSQL_PASSWORD, database=MYSQL_DATABASE
)
mysql_cursor = mysql_conn.cursor()

for table in tables:
    df = pd.read_sql_query(f"SELECT * FROM {table}", sqlite_conn)
    
    # Create table
    mysql_cursor.execute(f"DROP TABLE IF EXISTS {table}")
    cols = []
    for col in df.columns:
        dtype = 'INT' if df[col].dtype == 'int64' else 'FLOAT' if df[col].dtype == 'float64' else 'TEXT'
        cols.append(f"`{col}` {dtype}")
    mysql_cursor.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    
    # Insert data
    if len(df) > 0:
        placeholders = ', '.join(['%s'] * len(df.columns))
        for _, row in df.iterrows():
            vals = [None if pd.isna(v) else v for v in row.values]
            mysql_cursor.execute(f"INSERT INTO {table} VALUES ({placeholders})", vals)
    
    mysql_conn.commit()
    print(f"Imported {table}: {len(df)} rows")

sqlite_conn.close()
mysql_cursor.close()
mysql_conn.close()
print("Complete!")

Run: python simple_import.py

Step 3: Verify in MySQL

USE my_database;
SHOW TABLES;
SELECT * FROM employee_data_people;

5. Custom Visualizations

result = sutra.ask("Sales by region", viz="pie")       # Pie chart
result = sutra.ask("Trends", viz="line")               # Line chart
result = sutra.ask("Compare", viz="bar")               # Bar chart
result = sutra.ask("Correlation", viz="scatter")       # Scatter plot
result = sutra.ask("Data", viz="table")                # Table view
result = sutra.ask("Analysis", viz="heatmap")          # Heatmap
result = sutra.ask("Auto", viz=True)                   # Auto-detect

6. Smart Fuzzy Matching

sutra = SUTRA(api_key="your-key", fuzzy_match=True)

# "New York City" automatically matches "New York"
result = sutra.ask("Who are from New York City?")

7. Intelligent Caching with Embeddings

sutra = SUTRA(api_key="your-key", use_embeddings=True)

result = sutra.ask("Show sales")           # Calls API
result = sutra.ask("Display sales data")   # Uses cache (no API call)

8. Irrelevant Query Detection

sutra = SUTRA(api_key="your-key", check_relevance=True)

result = sutra.ask("What is the weather?")
# Warns: "Query may be irrelevant to your database"

9. Direct SQL Access

result = sutra.sql("SELECT * FROM people WHERE city='New York'")
print(result.data)

Complete Configuration

sutra = SUTRA(
    api_key="your-openai-key",
    db="database.db",              # SQLite path
    use_embeddings=True,           # Smart caching (saves API calls)
    check_relevance=True,          # Detect irrelevant queries
    fuzzy_match=True,              # Better NLP
    cache_queries=True             # Simple caching
)

Supported Formats

CSV, Excel, JSON, SQL, PDF, Word, Text, Pandas DataFrame

How It Works

Multi-Table Entity Extraction

From a single PDF, QuerySUTRA automatically creates multiple related tables:

Input: Employee PDF with 40 employees

Output Tables:

  • employee_data_people (40 rows)
  • employee_data_skills (50 rows)
  • employee_data_technologies (30 rows)
  • employee_data_projects (25 rows)
  • employee_data_certifications (20 rows)
  • employee_data_education (40 rows)
  • employee_data_work_experience (35 rows)

Proper Relational Structure

Tables have unique primary keys and proper foreign key relationships:

people table:
  id=1, name="John Doe", city="Dallas"
  id=2, name="Jane Smith", city="New York"

skills table:
  id=1, person_id=1, skill_name="Python"
  id=2, person_id=1, skill_name="SQL"
  id=3, person_id=2, skill_name="Java"

Chunk Processing for Large Documents

v0.4.0 processes documents in 10,000 character chunks, ensuring ALL data is extracted:

  • PDF with 50 employees: Extracts all 50 (not just first 10)
  • Large documents: Processes entire content
  • Merges results with unique IDs across chunks

API Reference

Class Methods

SUTRA.load_from_db(db_path, api_key, **kwargs) - Load existing SQLite

SUTRA.connect_mysql(host, user, password, database, ...) - Connect to MySQL

SUTRA.connect_postgres(host, user, password, database, ...) - Connect to PostgreSQL

Instance Methods

upload(data, name, extract_entities) - Upload data

ask(question, viz, table) - Natural language query

sql(query, viz) - Raw SQL query

tables() - List all tables

schema(table) - Show schema

peek(table, n) - Preview data

export_db(path, format) - Export database (sqlite/sql/json/excel)

save_to_mysql(...) - Export to MySQL

save_to_postgres(...) - Export to PostgreSQL

backup(path) - Create backup

close() - Close connection

Common Workflows

Workflow 1: Analyze PDF in Colab, Export to Local MySQL

# In Colab
from sutra import SUTRA

sutra = SUTRA(api_key="your-key")
sutra.upload("document.pdf")
sutra.tables()

# Export and download
sutra.export_db("data.db", format="sqlite")
from google.colab import files
files.download("data.db")

# On Windows
sutra = SUTRA(api_key="your-key", db="data.db")
sutra.save_to_mysql("localhost", "root", "password", "my_database")

Workflow 2: Load Existing Database and Query

# No need to re-upload data
sutra = SUTRA.load_from_db("data.db", api_key="your-key")
result = sutra.ask("Your question", viz="pie")

Workflow 3: Query MySQL Directly

# Connect and query MySQL database
sutra = SUTRA.connect_mysql("localhost", "root", "password", "production_db")
result = sutra.ask("Show me latest transactions")

Performance Tips

  1. Use load_from_db() to avoid re-uploading
  2. Use sql() for complex queries (no API cost)
  3. Enable use_embeddings=True for caching similar queries
  4. Enable cache_queries=True for exact query matches
  5. For large PDFs (50+ pages), allow extra processing time

Troubleshooting

Only extracting 10 records instead of 50:

  • Fixed in v0.4.0 with chunk processing
  • Upgrade: pip install --upgrade QuerySUTRA

MySQL import fails:

  • Ensure MySQL database exists: CREATE DATABASE my_database;
  • Install dependencies: pip install QuerySUTRA[mysql]
  • Check MySQL is running: mysql -u root -p

Colab disk I/O error:

  • Fixed in v0.4.0 with better connection handling
  • Restart Colab runtime and try again

connect_mysql() not found:

  • Update QuerySUTRA: pip install --upgrade QuerySUTRA
  • Requires v0.3.0 or higher

System Requirements

  • Python 3.8+
  • OpenAI API key
  • 100MB disk space (for embeddings)
  • MySQL/PostgreSQL (optional, for database export)

License

MIT License

Changelog

v0.4.0 (Latest)

  • FIXED: Complete data extraction from large documents
  • Chunk processing for PDFs with 50+ pages
  • All employees extracted (not just first 10)
  • Improved MySQL/PostgreSQL import
  • Better error handling for Colab

v0.3.x

  • Added MySQL/PostgreSQL connectivity
  • Smart caching with embeddings
  • Fuzzy matching for better NLP
  • Custom visualizations
  • Irrelevant query detection

v0.2.x

  • Multi-table entity extraction
  • Proper primary and foreign keys

v0.1.x

  • Initial release

Made by Aditya Batta

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

querysutra-0.4.0.tar.gz (45.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

querysutra-0.4.0-py3-none-any.whl (48.4 kB view details)

Uploaded Python 3

File details

Details for the file querysutra-0.4.0.tar.gz.

File metadata

  • Download URL: querysutra-0.4.0.tar.gz
  • Upload date:
  • Size: 45.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.0.tar.gz
Algorithm Hash digest
SHA256 6ebd10d3fe92ea5010a57c82db7efe0202c040c652d90c792150809180417e0b
MD5 8915dd8682686db0a4e449e332a7105d
BLAKE2b-256 74bbe823bec45f421b83b2e878d67e378d3028491b38367e462eb4a4003b3a8e

See more details on using hashes here.

File details

Details for the file querysutra-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: querysutra-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 48.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7db1c4fac6d969533ed8062f3e7a8d8fbcb072181686e9a880e392516766870a
MD5 0a1c9f8478be6119940746761fa4076a
BLAKE2b-256 ec455143ca5d10ab54add59f1af17d975836bf238a5a101dadc83e4a220f334a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page