SUTRA: Structured-Unstructured-Text-Retrieval-Architecture - Complete data extraction

These details have not been verified by PyPI

Project description

QuerySUTRA

SUTRA: Structured-Unstructured-Text-Retrieval-Architecture

Professional Python library for AI-powered data analysis with automatic entity extraction, natural language querying, and intelligent caching.

Installation

pip install QuerySUTRA

# With optional features
pip install QuerySUTRA[embeddings]  # Smart caching
pip install QuerySUTRA[mysql]       # MySQL support
pip install QuerySUTRA[postgres]    # PostgreSQL support
pip install QuerySUTRA[all]         # All features

Key Features

1. Complete Data Extraction from Large Documents

Processes entire documents in chunks - no data loss on large PDFs.

from sutra import SUTRA

sutra = SUTRA(api_key="your-openai-key")
sutra.upload("large_document.pdf")  # Extracts ALL data, not just first page

# Automatically creates multiple tables:
# - document_people (40 rows, 8 columns)
# - document_skills (50 rows, 5 columns)
# - document_technologies (30 rows, 4 columns)
# - document_projects (25 rows, 6 columns)
# etc.

2. Natural Language Querying

result = sutra.ask("Show me all people from New York")
print(result.data)

# With visualization
result = sutra.ask("Show sales by region", viz="pie")

3. Load Existing Databases

# Load SQLite database
sutra = SUTRA.load_from_db("sutra.db", api_key="your-key")

# Connect to MySQL
sutra = SUTRA.connect_mysql("localhost", "root", "password", "database")

# Connect to PostgreSQL
sutra = SUTRA.connect_postgres("localhost", "postgres", "password", "database")

4. Import SQLite to MySQL

Step 1: In Colab or Python - Export Database

# After uploading your data
sutra.upload("data.pdf")
sutra.tables()

# Export to SQLite
sutra.export_db("my_data.db", format="sqlite")

# In Colab, download the file
from google.colab import files
files.download("my_data.db")

Step 2: On Windows - Import to MySQL

Method A: Using QuerySUTRA

from sutra import SUTRA

# Load the SQLite database
sutra = SUTRA(api_key="your-key", db="my_data.db")

# Verify tables
sutra.tables()

# Export to MySQL
sutra.save_to_mysql("localhost", "root", "password", "my_database")

Method B: Using simple_import.py Script

Download the conversion script from the repository or create simple_import.py:

import sqlite3
import mysql.connector
import pandas as pd

# Configuration
SQLITE_DB = "my_data.db"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"
MYSQL_DATABASE = "my_database"

# Connect to SQLite
sqlite_conn = sqlite3.connect(SQLITE_DB)
cursor = sqlite_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]

# Connect to MySQL and create database
mysql_conn_temp = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD
)
temp_cursor = mysql_conn_temp.cursor()
temp_cursor.execute(f"CREATE DATABASE IF NOT EXISTS {MYSQL_DATABASE}")
temp_cursor.close()
mysql_conn_temp.close()

# Connect to database and import
mysql_conn = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, 
    password=MYSQL_PASSWORD, database=MYSQL_DATABASE
)
mysql_cursor = mysql_conn.cursor()

for table in tables:
    df = pd.read_sql_query(f"SELECT * FROM {table}", sqlite_conn)
    
    # Create table
    mysql_cursor.execute(f"DROP TABLE IF EXISTS {table}")
    cols = []
    for col in df.columns:
        dtype = 'INT' if df[col].dtype == 'int64' else 'FLOAT' if df[col].dtype == 'float64' else 'TEXT'
        cols.append(f"`{col}` {dtype}")
    mysql_cursor.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    
    # Insert data
    if len(df) > 0:
        placeholders = ', '.join(['%s'] * len(df.columns))
        for _, row in df.iterrows():
            vals = [None if pd.isna(v) else v for v in row.values]
            mysql_cursor.execute(f"INSERT INTO {table} VALUES ({placeholders})", vals)
    
    mysql_conn.commit()
    print(f"Imported {table}: {len(df)} rows")

sqlite_conn.close()
mysql_cursor.close()
mysql_conn.close()
print("Complete!")

Run: python simple_import.py

Step 3: Verify in MySQL

USE my_database;
SHOW TABLES;
SELECT * FROM employee_data_people;

5. Custom Visualizations

result = sutra.ask("Sales by region", viz="pie")       # Pie chart
result = sutra.ask("Trends", viz="line")               # Line chart
result = sutra.ask("Compare", viz="bar")               # Bar chart
result = sutra.ask("Correlation", viz="scatter")       # Scatter plot
result = sutra.ask("Data", viz="table")                # Table view
result = sutra.ask("Analysis", viz="heatmap")          # Heatmap
result = sutra.ask("Auto", viz=True)                   # Auto-detect

6. Smart Fuzzy Matching

sutra = SUTRA(api_key="your-key", fuzzy_match=True)

# "New York City" automatically matches "New York"
result = sutra.ask("Who are from New York City?")

7. Intelligent Caching with Embeddings

sutra = SUTRA(api_key="your-key", use_embeddings=True)

result = sutra.ask("Show sales")           # Calls API
result = sutra.ask("Display sales data")   # Uses cache (no API call)

8. Irrelevant Query Detection

sutra = SUTRA(api_key="your-key", check_relevance=True)

result = sutra.ask("What is the weather?")
# Warns: "Query may be irrelevant to your database"

9. Direct SQL Access

result = sutra.sql("SELECT * FROM people WHERE city='New York'")
print(result.data)

Complete Configuration

sutra = SUTRA(
    api_key="your-openai-key",
    db="database.db",              # SQLite path
    use_embeddings=True,           # Smart caching (saves API calls)
    check_relevance=True,          # Detect irrelevant queries
    fuzzy_match=True,              # Better NLP
    cache_queries=True             # Simple caching
)

Supported Formats

CSV, Excel, JSON, SQL, PDF, Word, Text, Pandas DataFrame

How It Works

Multi-Table Entity Extraction

From a single PDF, QuerySUTRA automatically creates multiple related tables:

Input: Employee PDF with 40 employees

Output Tables:

employee_data_people (40 rows)
employee_data_skills (50 rows)
employee_data_technologies (30 rows)
employee_data_projects (25 rows)
employee_data_certifications (20 rows)
employee_data_education (40 rows)
employee_data_work_experience (35 rows)

Proper Relational Structure

Tables have unique primary keys and proper foreign key relationships:

people table:
  id=1, name="John Doe", city="Dallas"
  id=2, name="Jane Smith", city="New York"

skills table:
  id=1, person_id=1, skill_name="Python"
  id=2, person_id=1, skill_name="SQL"
  id=3, person_id=2, skill_name="Java"

Chunk Processing for Large Documents

v0.4.0 processes documents in 10,000 character chunks, ensuring ALL data is extracted:

PDF with 50 employees: Extracts all 50 (not just first 10)
Large documents: Processes entire content
Merges results with unique IDs across chunks

API Reference

Class Methods

SUTRA.load_from_db(db_path, api_key, **kwargs) - Load existing SQLite

SUTRA.connect_mysql(host, user, password, database, ...) - Connect to MySQL

SUTRA.connect_postgres(host, user, password, database, ...) - Connect to PostgreSQL

Instance Methods

upload(data, name, extract_entities) - Upload data

ask(question, viz, table) - Natural language query

sql(query, viz) - Raw SQL query

tables() - List all tables

schema(table) - Show schema

peek(table, n) - Preview data

export_db(path, format) - Export database (sqlite/sql/json/excel)

save_to_mysql(...) - Export to MySQL

save_to_postgres(...) - Export to PostgreSQL

backup(path) - Create backup

close() - Close connection

Common Workflows

Workflow 1: Analyze PDF in Colab, Export to Local MySQL

# In Colab
from sutra import SUTRA

sutra = SUTRA(api_key="your-key")
sutra.upload("document.pdf")
sutra.tables()

# Export and download
sutra.export_db("data.db", format="sqlite")
from google.colab import files
files.download("data.db")

# On Windows
sutra = SUTRA(api_key="your-key", db="data.db")
sutra.save_to_mysql("localhost", "root", "password", "my_database")

Workflow 2: Load Existing Database and Query

# No need to re-upload data
sutra = SUTRA.load_from_db("data.db", api_key="your-key")
result = sutra.ask("Your question", viz="pie")

Workflow 3: Query MySQL Directly

# Connect and query MySQL database
sutra = SUTRA.connect_mysql("localhost", "root", "password", "production_db")
result = sutra.ask("Show me latest transactions")

Performance Tips

Use load_from_db() to avoid re-uploading
Use sql() for complex queries (no API cost)
Enable use_embeddings=True for caching similar queries
Enable cache_queries=True for exact query matches
For large PDFs (50+ pages), allow extra processing time

Troubleshooting

Only extracting 10 records instead of 50:

Fixed in v0.4.0 with chunk processing
Upgrade: pip install --upgrade QuerySUTRA

MySQL import fails:

Ensure MySQL database exists: CREATE DATABASE my_database;
Install dependencies: pip install QuerySUTRA[mysql]
Check MySQL is running: mysql -u root -p

Colab disk I/O error:

Fixed in v0.4.0 with better connection handling
Restart Colab runtime and try again

connect_mysql() not found:

Update QuerySUTRA: pip install --upgrade QuerySUTRA
Requires v0.3.0 or higher

System Requirements

Python 3.8+
OpenAI API key
100MB disk space (for embeddings)
MySQL/PostgreSQL (optional, for database export)

License

MIT License

Changelog

v0.4.0 (Latest)

FIXED: Complete data extraction from large documents
Chunk processing for PDFs with 50+ pages
All employees extracted (not just first 10)
Improved MySQL/PostgreSQL import
Better error handling for Colab

v0.3.x

Added MySQL/PostgreSQL connectivity
Smart caching with embeddings
Fuzzy matching for better NLP
Custom visualizations
Irrelevant query detection

v0.2.x

Multi-table entity extraction
Proper primary and foreign keys

v0.1.x

Initial release

Made by Aditya Batta

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.6.2

Feb 6, 2026

0.6.1

Feb 5, 2026

0.6.0

Feb 5, 2026

0.5.3

Nov 18, 2025

0.5.2

Nov 17, 2025

0.5.1

Nov 17, 2025

0.5.0

Nov 17, 2025

0.4.6

Nov 17, 2025

0.4.5

Nov 17, 2025

0.4.4

Nov 17, 2025

0.4.3

Nov 17, 2025

0.4.2

Nov 17, 2025

0.4.1

Nov 17, 2025

This version

0.4.0

Nov 17, 2025

0.3.3

Nov 16, 2025

0.3.2

Nov 14, 2025

0.3.1

Nov 14, 2025

0.3.0

Nov 14, 2025

0.2.3

Nov 14, 2025

0.2.1

Nov 14, 2025

0.2.0

Nov 14, 2025

0.1.4

Nov 13, 2025

0.1.3

Nov 13, 2025

0.1.2

Nov 13, 2025

0.1.0

Nov 13, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

querysutra-0.4.0.tar.gz (45.1 kB view details)

Uploaded Nov 17, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

querysutra-0.4.0-py3-none-any.whl (48.4 kB view details)

Uploaded Nov 17, 2025 Python 3

File details

Details for the file querysutra-0.4.0.tar.gz.

File metadata

Download URL: querysutra-0.4.0.tar.gz
Upload date: Nov 17, 2025
Size: 45.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`6ebd10d3fe92ea5010a57c82db7efe0202c040c652d90c792150809180417e0b`
MD5	`8915dd8682686db0a4e449e332a7105d`
BLAKE2b-256	`74bbe823bec45f421b83b2e878d67e378d3028491b38367e462eb4a4003b3a8e`

See more details on using hashes here.

File details

Details for the file querysutra-0.4.0-py3-none-any.whl.

File metadata

Download URL: querysutra-0.4.0-py3-none-any.whl
Upload date: Nov 17, 2025
Size: 48.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for querysutra-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7db1c4fac6d969533ed8062f3e7a8d8fbcb072181686e9a880e392516766870a`
MD5	`0a1c9f8478be6119940746761fa4076a`
BLAKE2b-256	`ec455143ca5d10ab54add59f1af17d975836bf238a5a101dadc83e4a220f334a`

See more details on using hashes here.

QuerySUTRA 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

QuerySUTRA

Installation

Key Features

1. Complete Data Extraction from Large Documents

2. Natural Language Querying

3. Load Existing Databases

4. Import SQLite to MySQL

5. Custom Visualizations

6. Smart Fuzzy Matching

7. Intelligent Caching with Embeddings

8. Irrelevant Query Detection

9. Direct SQL Access

Complete Configuration

Supported Formats

How It Works

Multi-Table Entity Extraction

Proper Relational Structure

Chunk Processing for Large Documents

API Reference

Class Methods

Instance Methods

Common Workflows

Workflow 1: Analyze PDF in Colab, Export to Local MySQL

Workflow 2: Load Existing Database and Query

Workflow 3: Query MySQL Directly

Performance Tips

Troubleshooting

System Requirements

License

Changelog

v0.4.0 (Latest)

v0.3.x

v0.2.x

v0.1.x

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes