Use LLMs to automate data science tasks
Project description
Data Pilot
DISCLAIMER: This is a work in progress and currently under development.
DataPilot is a Python package that automates certain data science and AI engineering tasks, such as data preprocessing and data analysis, by using language models. It has been developed primarily to aid data scientists, data analysts, and ML/AI engineers with routine tasks on textual and numeric datasets, but it can be used by anyone who has to work with data, for example a financial analyst or a market researcher (provided they know a bit of Python).
In simple terms, DataPilot uses LLMs to perform operations on datasets, primarily textual datasets in tabular or key-value (KV) format; it is not meant for audio or image datasets. You can use locally stored CSV or JSON files, or APIs to databases, warehouses, or lakes. When using APIs, it is preferred that the data received is in tabular or KV format.
Similarly, you can use AI models via APIs or by providing the path to a locally stored model.
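For reference, the tabular and KV inputs described above are typically loaded into Python with pandas, as in this minimal sketch (the file paths are placeholders, not files shipped with the package):

```python
import pandas as pd

# Tabular data from a locally stored CSV file (placeholder path).
df = pd.read_csv("data/customers.csv")

# Key-value (KV) data from a locally stored JSON file (placeholder path).
kv = pd.read_json("data/events.json")

# Data pulled from a database / warehouse / lake API would be loaded into
# the same tabular form, e.g. with pd.read_sql(...) or from an API response.
```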
Types of tasks you can perform using DataPilot:
- Data-preprocessing
- Data-analysis
- NLP tasks
Limitations
- Capped context length: Currently the context length can be at most 32k tokens (for GPT-4 and GPT-3.5, which is still quite large), and it is smaller for other models. However, DataPilot provides a workaround for this limitation by sending data in batches (a rough sketch of the idea follows this list).
- LLMs can make mistakes: Responses may sometimes have an inconsistent data format, or they may be incomplete. Please cross-check the outputs you receive to ensure accuracy.
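As a rough illustration of the batching workaround mentioned above, the sketch below splits a DataFrame into chunks small enough to fit within a model's context window and sends each chunk separately. The `call_llm` helper, the prompt, and the batch size are hypothetical placeholders, not part of DataPilot's documented API.

```python
import pandas as pd

def batch_process(df: pd.DataFrame, rows_per_batch: int, call_llm) -> list[str]:
    """Send a large DataFrame to an LLM in batches to stay under the context limit.

    `call_llm` is a hypothetical callable that takes a prompt string and
    returns the model's text response.
    """
    responses = []
    for start in range(0, len(df), rows_per_batch):
        chunk = df.iloc[start:start + rows_per_batch]
        prompt = (
            "Clean the following rows and return them as CSV:\n"
            + chunk.to_csv(index=False)
        )
        responses.append(call_llm(prompt))
    return responses
```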
Data pre-processing
- Data Exploration: Analyze the dataset to identify missing values, outliers, inconsistent formats, and other types of errors, for example by creating bar charts or histograms of missing data across different features.
- Data Cleaning: Depending on the dataset, you may need to perform data cleaning operations such as removing duplicates, handling missing values, removing special characters, standardizing text, correcting spelling mistakes, performing text normalization, or filtering out irrelevant data. Missing data can be handled with different approaches, such as imputation (mean, median, mode), deletion of rows or columns, or advanced imputation methods (see the sketch after this list).
- Outlier Detection: Identify and handle outliers in a dataset using statistical techniques.
- Data Transformation: Transform the data to address issues like inconsistent formatting, converting data types, handling outliers, and normalizing or scaling variables.
- Data Validation: Validate the data against predefined rules or constraints to identify data integrity issues.
- Feature Extraction: Automatically select the relevant columns (features) from the dataset that you would ideally want to use for your task. In some cases, you may need to transform or preprocess the raw features to make them suitable for the model.
- Feature Engineering: Create new features or derive meaningful information from existing features to improve the quality and usefulness of the dataset.
- Encoding Categorical Variables: If your dataset contains categorical variables (e.g., gender, country), you might need to encode them into numerical representations using techniques like one-hot encoding or label encoding.
- Splitting the Dataset: Divide the dataset into training, validation, and test sets.
- Data Loading (for Deep Learning): For deep learning, you'll typically use data loaders to efficiently load and process data in mini-batches during training and evaluation.
- Tokenization (for NLP tasks): If you are working with natural language processing (NLP) tasks, you'll need to tokenize the text data into numerical representations that the language model can understand. This step is crucial for tasks like text classification, sentiment analysis, and language generation.
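To make a few of the steps above concrete, here is a minimal hand-written sketch of median/mode imputation, deduplication, one-hot encoding, and a train/validation/test split using pandas and scikit-learn. The file and column names are assumptions; DataPilot aims to automate equivalent work through an LLM.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a numeric "age" column and a categorical "country" column.
df = pd.read_csv("customers.csv")

# Handle missing values: median imputation for numeric, mode for categorical.
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna(df["country"].mode()[0])

# Remove duplicate rows.
df = df.drop_duplicates()

# One-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["country"])

# Split into training, validation, and test sets (70/15/15).
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
```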
Data-analysis
- EDA: Visualizing distributions of numerical variables with histograms, density plots, and box plots. Plotting bar charts or pie charts to explore categorical variables and their frequencies (see the sketch after this list).
- Time Series Analysis: Plotting time series data with line charts to observe trends and patterns. Creating seasonal decomposition plots to identify seasonal patterns.
- Correlation Analysis: Generating scatter plots to visualize the relationship between two numerical variables. Creating heatmap plots to display the correlation matrix of multiple variables.
- Comparisons and Rankings: Building bar charts or grouped bar charts to compare multiple categories side by side. Creating horizontal bar charts for ranking entities based on specific criteria.
- Geospatial Data Visualization: Plotting data on geographical maps to observe spatial patterns and trends.
- Hierarchical Data Visualization: Generating tree maps or sunburst charts to visualize hierarchical data structures.
- Data Clustering and Dimensionality Reduction: Creating scatter plots colored by clusters or reduced dimensions to analyze data patterns.
- Anomaly Detection: Generating box plots or scatter plots to identify outliers or anomalies in data.
- Time Series Forecasting: Plotting original and forecasted time series data for evaluation.
- Text Data Analysis: Visualizing word frequencies with word clouds or bar charts, e.g., creating sentiment analysis visualizations.
- Network Analysis: Plotting network graphs to visualize relationships between nodes and edges.
- Statistical Analysis: Creating violin plots or box plots to visualize statistical distributions.
- Interactive Data Visualization: Building interactive plots, such as scatter plots with tooltips or linked visualizations.
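For reference, a minimal hand-written sketch of the EDA and correlation plots listed above, using pandas and matplotlib. The file and column names are assumptions; DataPilot's goal is to produce this kind of analysis from a natural-language request.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with a numeric "revenue" column and a categorical "region" column.
df = pd.read_csv("sales.csv")

# Histogram of a numerical variable.
df["revenue"].plot(kind="hist", bins=30, title="Revenue distribution")
plt.show()

# Bar chart of a categorical variable's frequencies.
df["region"].value_counts().plot(kind="bar", title="Rows per region")
plt.show()

# Heatmap of the correlation matrix of the numerical columns.
corr = df.select_dtypes("number").corr()
plt.imshow(corr, cmap="coolwarm")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Pearson correlation")
plt.title("Correlation matrix")
plt.show()
```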
NLP tasks
- Language Translation: Translate text between different languages. Provide the input text in one language, specify the target language, and receive the translated text as output.
- Text Summarization: Summarize lengthy texts. Feed the dataset into DataPilot and receive a concise summary of the content as the result.
- Sentiment Analysis: Analyze the sentiment expressed in a dataset and determine whether it is positive, negative, or neutral (see the sketch after this list).
- Named Entity Recognition (NER): Identify and classify named entities such as names of people, organizations, and locations in a given dataset.
- Text Classification: Categorize text into predefined classes. Provide the input data, and the model assigns it to the appropriate category.
- Question Answering: Ask questions based on a given context. The model processes the context and returns relevant answers to the questions. It is advised that you provide a set of Q&A samples for best results.
- Text Generation: Generate creative and coherent text based on a given dataset. This can be used for various tasks like story generation, poetry, and code generation.
- Chatbots and Conversational AI: Build chatbots and conversational AI systems that can interact with users by processing their queries and providing relevant responses in a conversational manner. Use a dataset with sample conversations between a user and a chatbot.
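As an illustration of the prompt-driven approach behind these tasks, below is a minimal sentiment-analysis sketch that sends each row of a dataset to an LLM. It assumes the OpenAI Python client (openai>=1.0) with an API key configured, and the dataset, column name, and prompt are placeholders; DataPilot's own interface may differ.

```python
import pandas as pd
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY is set

client = OpenAI()
df = pd.read_csv("reviews.csv")  # hypothetical dataset with a "text" column

def classify_sentiment(text: str) -> str:
    """Ask the model to label a single review as positive, negative, or neutral."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Classify the sentiment of this review as positive, "
                       "negative, or neutral. Reply with one word.\n\n" + text,
        }],
    )
    return response.choices[0].message.content.strip().lower()

df["sentiment"] = df["text"].apply(classify_sentiment)
print(df["sentiment"].value_counts())
```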
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file data_pilot-0.0.2.tar.gz.
File metadata
- Download URL: data_pilot-0.0.2.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.19.0-46-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 517374a1a18d99ca12170fefae8edd2c489921555febf4f1f3266958f80f5609
MD5 | 720c02bb529315e4796f284ec2ec25ad
BLAKE2b-256 | 213a7d912b4fb2a845a1b6324aa8ef8aebd32173c21ce5ecaeb56f365866623e
File details
Details for the file data_pilot-0.0.2-py3-none-any.whl.
File metadata
- Download URL: data_pilot-0.0.2-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.19.0-46-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 912cc2b1f316dd3e395c20a1f7bbe885e6ded95384a560749e8919cbc28082c8
MD5 | 6685e0ab188a9334490c5de51e39f90b
BLAKE2b-256 | 800450bc47f32a73663b1ba252df0662273a5c38462b6e4a8c086370c72a66ba