
SmartDataAI: Intelligent Data Cleaning, Transformation, Charting, and Analysis with LLM


SmartDataAI is a Python library for developers who want to build interactive, conversational interfaces for data cleaning, transformation, charting, and analysis. It uses large language models (LLMs) to answer questions about your data.

Need to update or clean your dataset? Request the changes, and SmartDataAI returns a new, updated dataframe. For charting, it returns a rendered Matplotlib figure object, ready to be displayed in your interface or saved locally.

Bring your data to life with SmartDataAI!

Installation

Use the package manager pip to install SmartDataAI.

pip install smartdataai

Features

SmartDataAI acts as an LLM agent designed specifically to handle dataframe-related requests. You can plug in any LLM class from LangChain, with GPT-4-mini as the default. While LangChain agents are simple to use, getting actionable outputs such as dataframes and charts from them can be tricky. That's where SmartDataAI shines: it returns answers, dataframes, and visualizations in a single interaction.

Key Features:

  • Supports any LLM class from LangChain (default: GPT-4-mini)
  • Conversational interface with memory
  • Built-in automatic data cleaning
  • Effortlessly generate answers, update dataframes, and create charts
  • Easy integration with FastAPI or Streamlit

🔥 If you use an LLM that requires an API key (such as GPT), set the key as an environment variable. See Example 1.
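
One way to fail fast when the key is missing is a small guard that runs before the model is created. This is a minimal sketch using only the standard library; `require_env` is a hypothetical helper and not part of SmartDataAI.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to a .env file."
        )
    return value


# Usage (assumes an OpenAI-backed model, as in Example 1):
# api_key = require_env("OPENAI_API_KEY")
```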

Auto Data Cleaning Guideline

General:

  • Remove empty rows and remove columns with over 90% missing data.

Numeric Data:

  • Impute missing and N/A values with the mean.
  • Outliers are capped between the 1st and 99th percentiles.
  • Replace unreasonable values (e.g., negative salary or age over 200) with the mean.

Text Data:

  • Mark missing and N/A values as "Not Specified"
  • Merge similar categories by treating lowercase and uppercase values as equivalent and by combining abbreviations (e.g., 'US' and 'USA' merged into 'United States', or 'Women', 'F', and 'female' merged into 'Female').
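
The mechanical parts of these guidelines can be approximated in plain pandas. The sketch below is an illustration only and is not SmartDataAI's implementation; the function name `basic_clean`, the `max_missing` parameter, and the order of the steps are assumptions. It leaves out category merging and the detection of unreasonable values, since both need knowledge of what each column means.

```python
import pandas as pd
from pandas.api import types as ptypes


def basic_clean(df: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Apply the general, numeric, and text rules with fixed heuristics."""
    # General: drop fully empty rows, then columns that are mostly missing.
    out = df.dropna(how="all")
    out = out.loc[:, out.isna().mean() <= max_missing].copy()

    for col in out.columns:
        series = out[col]
        if ptypes.is_numeric_dtype(series) and not ptypes.is_bool_dtype(series):
            # Numeric: impute with the mean, then cap at the 1st/99th percentiles.
            series = series.fillna(series.mean())
            low, high = series.quantile([0.01, 0.99])
            out[col] = series.clip(lower=low, upper=high)
        elif ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series):
            # Text: mark missing values explicitly.
            out[col] = series.fillna("Not Specified")
    return out


if __name__ == "__main__":
    sample = pd.DataFrame(
        {"age": [20.0, None, 40.0], "city": ["A", "B", None]}
    )
    print(basic_clean(sample))
```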

Integration with Streamlit

Integration with Streamlit is natural: use st.pyplot to show the charts and st.dataframe to show the dataframe. Here is an application we developed for a human resources dataset.

Watch the demo for data designer

Watch the demo for data analyst
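
As a rough sketch of how the outputs of `run_model` (see Example 2) might be rendered in a Streamlit page: `render_result` is a hypothetical helper, not part of SmartDataAI. It takes the Streamlit module as an argument so that it can be exercised without a running app, and it relies only on `st.markdown`, `st.pyplot`, and `st.dataframe`.

```python
def render_result(st, answer, has_plots, image_fig_list, has_changes_to_df, df_new):
    """Show one SmartData response in a Streamlit page."""
    # The answer is markdown, so render it as such.
    st.markdown(answer)

    # Charts are only present when has_plots is True.
    if has_plots:
        for fig in image_fig_list:
            st.pyplot(fig)

    # Show the dataframe only when the request actually changed it.
    if has_changes_to_df:
        st.dataframe(df_new)


# Usage inside a Streamlit script:
# import streamlit as st
# render_result(st, answer, has_plots, image_fig_list, has_changes_to_df, df_new)
```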

Integration with FastAPI

Once you have the Matplotlib figures from the model, you can convert them to PNG bytes and return them from a FastAPI endpoint:

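import io

import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, suitable for a server
import matplotlib.pyplot as plt
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


def get_figure():
    # Placeholder figure for illustration only. In a real application, use a
    # figure from the image_fig_list returned by run_model (see Example 2).
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])
    return fig


@app.get("/plot")
async def get_plot():
    fig = get_figure()

    # 1. Convert the figure to PNG image bytes using BytesIO
    img_io = io.BytesIO()
    fig.savefig(img_io, format="png")
    plt.close(fig)  # Release the figure once it has been rendered
    img_io.seek(0)  # Rewind the buffer to the beginning

    # 2. Return the image as a StreamingResponse
    return StreamingResponse(img_io, media_type="image/png")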

Example 1 - Getting Started with Automatic Data Cleaning

import os
import pandas as pd
from smartdata import SmartData
from dotenv import load_dotenv

load_dotenv()
os.getenv('OPENAI_API_KEY')

# Or Set OpenAI API key here :)
# os.environ["OPENAI_API_KEY"] = "Your openai key"

# Read sample data
df = pd.read_csv(r"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv", index_col=0)

# Create SmartData Model
sd = SmartData(df, memory_size = 0, show_detail = True)
prompt, sd_model = sd.create_model()

# Clean data
# - summary: a summary of the cleaning result, including actions taken and records affected.
# - has_changes_to_df: a boolean indicating whether the existing df was changed.
# - df_new: the cleaned dataframe after all cleaning steps.
summary, has_changes_to_df, df_new = sd.clean_data()
print(summary)
print("has_changes_to_df: " + str(has_changes_to_df))  # str() is needed: the flag is a boolean
print(df_new.head(5))

Example 2 - Q&A with Automatically Cleaned Data

import os
import pandas as pd
from smartdata import SmartData
from dotenv import load_dotenv
from matplotlib import pyplot as plt

load_dotenv()
os.getenv('OPENAI_API_KEY')

# Or Set OpenAI API key here :)
# os.environ["OPENAI_API_KEY"] = "Your openai key"

# Load sample data
df_clean = pd.read_csv(r"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv", index_col=0)

# Initialize SmartData Model to clean up data
sd_clean = SmartData(df_list=df_clean, memory_size=0, show_detail=False)
prompt, sd_model = sd_clean.create_model()
summary, has_changes_to_df, df_new = sd_clean.clean_data()

# Initialize a SmartData model with memory for the last 3 conversations
# (show_detail=False suppresses detailed output) and load the cleaned data
smartdata_qa = SmartData(df_list=df_new, memory_size=3, show_detail=False)
qa_prompt, qa_model = smartdata_qa.create_model()

# Start Q&A session -------------------------------------------------

# Output Explanation:
# answer: The response to your question, formatted in markdown.
# has_plots: Boolean indicating if a chart was generated.
# has_changes_to_df: Boolean indicating if the dataframe was updated.
# image_fig_list: List of matplotlib figures (if has_plots is True).
# df_new: Updated dataframe (if has_changes_to_df is True); otherwise, a copy of the original dataframe.
# response: Detailed output of all intermediate steps generated by the model.
# code_list_plot_with_add_on: Python code to generate the figures in image_fig_list.
# code_list_datachange_with_add_on: Python code to apply the dataframe updates resulting in df_new.

# Q1 - General analytics question - no charting no new dataframe
question_1 = "Please show me the average fare by sex in a table."
answer, has_plots, has_changes_to_df, image_fig_list, df_new, response, code_list, code_list_plot_with_add_on, code_list_datachange_with_add_on = smartdata_qa.run_model(question=question_1)
print("\n------------Q1------------\n")
print(answer)
print("has_plots - " + str(has_plots))
print("has_changes_to_df - " + str(has_changes_to_df))

# Q2 - Ask for making a chart
question_2 = "Please make a bar chart with average Age by Pclass."
answer, has_plots, has_changes_to_df, image_fig_list, df_new, response, code_list, code_list_plot_with_add_on, code_list_datachange_with_add_on = smartdata_qa.run_model(question=question_2)
print("\n------------Q2------------\n")
print(answer)
print("has_plots - " + str(has_plots))
print("has_changes_to_df - " + str(has_changes_to_df))
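# plt.show() takes no figure argument; it displays every open pyplot figure.
# This assumes the figures were created through pyplot; otherwise use fig.savefig(...).
if has_plots:
    plt.show()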

# Q3 - Ask for data transformation
question_3 = "Can you create a new column called age over 30, valid entries are yes or no."
answer, has_plots, has_changes_to_df, image_fig_list, df_new, response, code_list, code_list_plot_with_add_on, code_list_datachange_with_add_on = smartdata_qa.run_model(question=question_3)
print("\n------------Q3------------\n")
print(answer)
print("has_plots - " + str(has_plots))
print("has_changes_to_df - " + str(has_changes_to_df))
print(df_new.head(3))

# Q4 - Chat with memory
question_4 = "Can you delete the new column you just created?"
answer, has_plots, has_changes_to_df, image_fig_list, df_new, response, code_list, code_list_plot_with_add_on, code_list_datachange_with_add_on = smartdata_qa.run_model(question=question_4)
print("\n------------Q4------------\n")
print(answer)
print("has_plots - " + str(has_plots))
print("has_changes_to_df - " + str(has_changes_to_df))
print(df_new.head(3))
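
Because `run_model` returns nine positional values in the example above, a small wrapper can make call sites easier to read. The sketch below is a suggestion and not part of the library: `RunResult` and `run` are hypothetical names, and the field order simply mirrors the unpacking used in Example 2.

```python
from typing import Any, NamedTuple


class RunResult(NamedTuple):
    """Named view of the nine values unpacked from run_model in Example 2."""
    answer: str
    has_plots: bool
    has_changes_to_df: bool
    image_fig_list: list
    df_new: Any
    response: Any
    code_list: Any
    code_list_plot_with_add_on: Any
    code_list_datachange_with_add_on: Any


def run(model: Any, question: str) -> RunResult:
    """Call model.run_model and wrap the returned tuple in a RunResult."""
    return RunResult(*model.run_model(question=question))


# Usage (assumes smartdata_qa from Example 2):
# result = run(smartdata_qa, "Please show me the average fare by sex in a table.")
# print(result.answer)
```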

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Email us at contact@talentainow.com

License

MIT
