SUTRA: Structured-Unstructured-Text-Retrieval-Architecture - Complete data extraction
Project description
QuerySUTRA
SUTRA: Structured-Unstructured-Text-Retrieval-Architecture
Professional Python library for AI-powered data analysis with automatic entity extraction, natural language querying, and intelligent caching.
Installation
pip install QuerySUTRA
# With optional features
pip install QuerySUTRA[embeddings] # Smart caching
pip install QuerySUTRA[mysql] # MySQL support
pip install QuerySUTRA[postgres] # PostgreSQL support
pip install QuerySUTRA[all] # All features
Key Features
1. Complete Data Extraction from Large Documents
Processes entire documents in chunks - no data loss on large PDFs.
from sutra import SUTRA
sutra = SUTRA(api_key="your-openai-key")
sutra.upload("large_document.pdf") # Extracts ALL data, not just first page
# Automatically creates multiple tables:
# - document_people (40 rows, 8 columns)
# - document_skills (50 rows, 5 columns)
# - document_technologies (30 rows, 4 columns)
# - document_projects (25 rows, 6 columns)
# etc.
2. Natural Language Querying
result = sutra.ask("Show me all people from New York")
print(result.data)
# With visualization
result = sutra.ask("Show sales by region", viz="pie")
3. Load Existing Databases
# Load SQLite database
sutra = SUTRA.load_from_db("sutra.db", api_key="your-key")
# Connect to MySQL
sutra = SUTRA.connect_mysql("localhost", "root", "password", "database")
# Connect to PostgreSQL
sutra = SUTRA.connect_postgres("localhost", "postgres", "password", "database")
4. Import SQLite to MySQL
Step 1: In Colab or Python - Export Database
# After uploading your data
sutra.upload("data.pdf")
sutra.tables()
# Export to SQLite
sutra.export_db("my_data.db", format="sqlite")
# In Colab, download the file
from google.colab import files
files.download("my_data.db")
Step 2: On Windows - Import to MySQL
Method A: Using QuerySUTRA
from sutra import SUTRA
# Load the SQLite database
sutra = SUTRA(api_key="your-key", db="my_data.db")
# Verify tables
sutra.tables()
# Export to MySQL
sutra.save_to_mysql("localhost", "root", "password", "my_database")
Method B: Using simple_import.py Script
Download the conversion script from the repository or create simple_import.py:
import sqlite3
import mysql.connector
import pandas as pd
# Configuration
SQLITE_DB = "my_data.db"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"
MYSQL_DATABASE = "my_database"
# Connect to SQLite
sqlite_conn = sqlite3.connect(SQLITE_DB)
cursor = sqlite_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]
# Connect to MySQL and create database
mysql_conn_temp = mysql.connector.connect(
host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD
)
temp_cursor = mysql_conn_temp.cursor()
temp_cursor.execute(f"CREATE DATABASE IF NOT EXISTS {MYSQL_DATABASE}")
temp_cursor.close()
mysql_conn_temp.close()
# Connect to database and import
mysql_conn = mysql.connector.connect(
host=MYSQL_HOST, user=MYSQL_USER,
password=MYSQL_PASSWORD, database=MYSQL_DATABASE
)
mysql_cursor = mysql_conn.cursor()
for table in tables:
df = pd.read_sql_query(f"SELECT * FROM {table}", sqlite_conn)
# Create table
mysql_cursor.execute(f"DROP TABLE IF EXISTS {table}")
cols = []
for col in df.columns:
dtype = 'INT' if df[col].dtype == 'int64' else 'FLOAT' if df[col].dtype == 'float64' else 'TEXT'
cols.append(f"`{col}` {dtype}")
mysql_cursor.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
# Insert data
if len(df) > 0:
placeholders = ', '.join(['%s'] * len(df.columns))
for _, row in df.iterrows():
vals = [None if pd.isna(v) else v for v in row.values]
mysql_cursor.execute(f"INSERT INTO {table} VALUES ({placeholders})", vals)
mysql_conn.commit()
print(f"Imported {table}: {len(df)} rows")
sqlite_conn.close()
mysql_cursor.close()
mysql_conn.close()
print("Complete!")
Run: python simple_import.py
Step 3: Verify in MySQL
USE my_database;
SHOW TABLES;
SELECT * FROM employee_data_people;
5. Custom Visualizations
result = sutra.ask("Sales by region", viz="pie") # Pie chart
result = sutra.ask("Trends", viz="line") # Line chart
result = sutra.ask("Compare", viz="bar") # Bar chart
result = sutra.ask("Correlation", viz="scatter") # Scatter plot
result = sutra.ask("Data", viz="table") # Table view
result = sutra.ask("Analysis", viz="heatmap") # Heatmap
result = sutra.ask("Auto", viz=True) # Auto-detect
6. Smart Fuzzy Matching
sutra = SUTRA(api_key="your-key", fuzzy_match=True)
# "New York City" automatically matches "New York"
result = sutra.ask("Who are from New York City?")
7. Intelligent Caching with Embeddings
sutra = SUTRA(api_key="your-key", use_embeddings=True)
result = sutra.ask("Show sales") # Calls API
result = sutra.ask("Display sales data") # Uses cache (no API call)
8. Irrelevant Query Detection
sutra = SUTRA(api_key="your-key", check_relevance=True)
result = sutra.ask("What is the weather?")
# Warns: "Query may be irrelevant to your database"
9. Direct SQL Access
result = sutra.sql("SELECT * FROM people WHERE city='New York'")
print(result.data)
Complete Configuration
sutra = SUTRA(
api_key="your-openai-key",
db="database.db", # SQLite path
use_embeddings=True, # Smart caching (saves API calls)
check_relevance=True, # Detect irrelevant queries
fuzzy_match=True, # Better NLP
cache_queries=True # Simple caching
)
Supported Formats
CSV, Excel, JSON, SQL, PDF, Word, Text, Pandas DataFrame
How It Works
Multi-Table Entity Extraction
From a single PDF, QuerySUTRA automatically creates multiple related tables:
Input: Employee PDF with 40 employees
Output Tables:
- employee_data_people (40 rows)
- employee_data_skills (50 rows)
- employee_data_technologies (30 rows)
- employee_data_projects (25 rows)
- employee_data_certifications (20 rows)
- employee_data_education (40 rows)
- employee_data_work_experience (35 rows)
Proper Relational Structure
Tables have unique primary keys and proper foreign key relationships:
people table:
id=1, name="John Doe", city="Dallas"
id=2, name="Jane Smith", city="New York"
skills table:
id=1, person_id=1, skill_name="Python"
id=2, person_id=1, skill_name="SQL"
id=3, person_id=2, skill_name="Java"
Chunk Processing for Large Documents
v0.4.0 processes documents in 10,000 character chunks, ensuring ALL data is extracted:
- PDF with 50 employees: Extracts all 50 (not just first 10)
- Large documents: Processes entire content
- Merges results with unique IDs across chunks
API Reference
Class Methods
SUTRA.load_from_db(db_path, api_key, **kwargs) - Load existing SQLite
SUTRA.connect_mysql(host, user, password, database, ...) - Connect to MySQL
SUTRA.connect_postgres(host, user, password, database, ...) - Connect to PostgreSQL
Instance Methods
upload(data, name, extract_entities) - Upload data
ask(question, viz, table) - Natural language query
sql(query, viz) - Raw SQL query
tables() - List all tables
schema(table) - Show schema
peek(table, n) - Preview data
export_db(path, format) - Export database (sqlite/sql/json/excel)
save_to_mysql(...) - Export to MySQL
save_to_postgres(...) - Export to PostgreSQL
backup(path) - Create backup
close() - Close connection
Common Workflows
Workflow 1: Analyze PDF in Colab, Export to Local MySQL
# In Colab
from sutra import SUTRA
sutra = SUTRA(api_key="your-key")
sutra.upload("document.pdf")
sutra.tables()
# Export and download
sutra.export_db("data.db", format="sqlite")
from google.colab import files
files.download("data.db")
# On Windows
sutra = SUTRA(api_key="your-key", db="data.db")
sutra.save_to_mysql("localhost", "root", "password", "my_database")
Workflow 2: Load Existing Database and Query
# No need to re-upload data
sutra = SUTRA.load_from_db("data.db", api_key="your-key")
result = sutra.ask("Your question", viz="pie")
Workflow 3: Query MySQL Directly
# Connect and query MySQL database
sutra = SUTRA.connect_mysql("localhost", "root", "password", "production_db")
result = sutra.ask("Show me latest transactions")
Performance Tips
- Use
load_from_db()to avoid re-uploading - Use
sql()for complex queries (no API cost) - Enable
use_embeddings=Truefor caching similar queries - Enable
cache_queries=Truefor exact query matches - For large PDFs (50+ pages), allow extra processing time
Troubleshooting
Only extracting 10 records instead of 50:
- Fixed in v0.4.0 with chunk processing
- Upgrade:
pip install --upgrade QuerySUTRA
MySQL import fails:
- Ensure MySQL database exists:
CREATE DATABASE my_database; - Install dependencies:
pip install QuerySUTRA[mysql] - Check MySQL is running:
mysql -u root -p
Colab disk I/O error:
- Fixed in v0.4.0 with better connection handling
- Restart Colab runtime and try again
connect_mysql() not found:
- Update QuerySUTRA:
pip install --upgrade QuerySUTRA - Requires v0.3.0 or higher
System Requirements
- Python 3.8+
- OpenAI API key
- 100MB disk space (for embeddings)
- MySQL/PostgreSQL (optional, for database export)
License
MIT License
Changelog
v0.4.0 (Latest)
- FIXED: Complete data extraction from large documents
- Chunk processing for PDFs with 50+ pages
- All employees extracted (not just first 10)
- Improved MySQL/PostgreSQL import
- Better error handling for Colab
v0.3.x
- Added MySQL/PostgreSQL connectivity
- Smart caching with embeddings
- Fuzzy matching for better NLP
- Custom visualizations
- Irrelevant query detection
v0.2.x
- Multi-table entity extraction
- Proper primary and foreign keys
v0.1.x
- Initial release
Made by Aditya Batta
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file querysutra-0.4.0.tar.gz.
File metadata
- Download URL: querysutra-0.4.0.tar.gz
- Upload date:
- Size: 45.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6ebd10d3fe92ea5010a57c82db7efe0202c040c652d90c792150809180417e0b
|
|
| MD5 |
8915dd8682686db0a4e449e332a7105d
|
|
| BLAKE2b-256 |
74bbe823bec45f421b83b2e878d67e378d3028491b38367e462eb4a4003b3a8e
|
File details
Details for the file querysutra-0.4.0-py3-none-any.whl.
File metadata
- Download URL: querysutra-0.4.0-py3-none-any.whl
- Upload date:
- Size: 48.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7db1c4fac6d969533ed8062f3e7a8d8fbcb072181686e9a880e392516766870a
|
|
| MD5 |
0a1c9f8478be6119940746761fa4076a
|
|
| BLAKE2b-256 |
ec455143ca5d10ab54add59f1af17d975836bf238a5a101dadc83e4a220f334a
|