Generative AutoML for Tabular Data

SapientML

SapientML is an AutoML technology that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.

Getting Started

Installation

From PyPI repository:

pip install sapientml

From source code:

git clone https://github.com/sapientml/sapientml.git
cd sapientml
pip install poetry
poetry install
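
After either installation route, it can help to confirm that the package is visible to the interpreter you plan to use. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of SapientML.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string when the package is installed, otherwise None.
    print(installed_version("sapientml"))
```

When installing from source with Poetry, run the check inside the Poetry environment, for example with poetry run python.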

Run AutoML

import pandas as pd
from sapientml import SapientML
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Titanic dataset and split it into train and test sets.
data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(data)

# Keep the true labels aside and remove the target column from the test set.
y_true = test_data["survived"].reset_index(drop=True)
test_data = test_data.drop(columns=["survived"])

# Generate a pipeline for the "survived" target, train it, and predict.
sml = SapientML(["survived"])
sml.fit(train_data)
y_pred = sml.predict(test_data)

print(f"F1 score: {f1_score(y_true, y_pred)}")

Running Generated Code Manually

After the fit method has run, the generated code is available in the outputs folder.

Hold-out Validation

Run outputs/final_script.py to see the result of hold-out validation on the training data:

cd outputs/
python final_script.py
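
For readers unfamiliar with the term, hold-out validation fits a model on one part of the training data and scores it on the remainder. The sketch below is a generic illustration with scikit-learn on synthetic data; it is not the content of the generated final_script.py, and the model, metric, split ratio, and the function name holdout_f1 are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def holdout_f1(test_size: float = 0.25, seed: int = 0) -> float:
    """Fit on one part of the data and score on the held-out remainder."""
    # Synthetic binary-classification data stands in for a real training set.
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    # Hold out a fraction of the rows; the model never sees them during fitting.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return float(f1_score(y_val, model.predict(X_val)))


if __name__ == "__main__":
    print(f"Hold-out F1 score: {holdout_f1():.3f}")
```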

Train a Model with the Generated Code

Run outputs/final_train.py to produce several .pkl files containing the trained model and the preprocessing components:

cd outputs/
python final_train.py
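
To check which .pkl files the training run produced, a small listing helper may be enough. This is a minimal sketch using only the standard library; the function name list_pickle_files is hypothetical, and the actual file names depend on the generated pipeline.

```python
from pathlib import Path
from typing import List


def list_pickle_files(output_dir: str = "outputs") -> List[str]:
    """Return the sorted names of the .pkl files directly inside `output_dir`."""
    return sorted(p.name for p in Path(output_dir).glob("*.pkl"))


if __name__ == "__main__":
    for name in list_pickle_files():
        print(name)
```

If the current directory is already outputs/, pass "." as the argument.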

Predict with the Trained Model

Run outputs/final_predict.py. It requires outputs/test.pkl; if that file does not exist yet, prepare it manually. test.pkl must contain a pandas.DataFrame object created from the CSV file to be predicted.

cd outputs/
python final_predict.py
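
If outputs/test.pkl is missing, it could be prepared before running the script with something like the following. This is a minimal sketch: the helper name csv_to_test_pickle and the input path new_data.csv are hypothetical, and the CSV is copied as-is without any column checks.

```python
from pathlib import Path

import pandas as pd


def csv_to_test_pickle(csv_path: str, pickle_path: str = "outputs/test.pkl") -> pd.DataFrame:
    """Read a CSV file into a DataFrame and save it as a pickle file."""
    df = pd.read_csv(csv_path)
    # Create the target folder if it does not exist yet.
    Path(pickle_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(pickle_path)
    return df


if __name__ == "__main__":
    # "new_data.csv" is a placeholder for the CSV file to be predicted.
    csv_to_test_pickle("new_data.csv")
```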

Publications

The technology behind this software originates from the following research paper, published at the International Conference on Software Engineering (ICSE), one of the premier conferences on software engineering.

Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad (2022, May). SapientML: synthesizing machine learning pipelines by learning from human-written solutions. In Proceedings of the 44th International Conference on Software Engineering (pp. 1932-1944).

@inproceedings{10.1145/3510003.3510226,
author = {Saha, Ripon K. and Ura, Akira and Mahajan, Sonal and Zhu, Chenguang and Li, Linyi and Hu, Yang and Yoshida, Hiroaki and Khurshid, Sarfraz and Prasad, Mukul R.},
title = {SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions},
year = {2022},
isbn = {9781450392211},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3510003.3510226},
doi = {10.1145/3510003.3510226},
abstract = {Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses meta-learning to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using a pipeline dataflow model derived from the corpus. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1,094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 4 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances. 
This difference is amplified on the 10 most challenging benchmarks, where SapientML wins on 9 instances with the other tools failing to produce pipelines on 4 or more benchmarks.},
booktitle = {Proceedings of the 44th International Conference on Software Engineering},
pages = {1932–1944},
numpages = {13},
keywords = {AutoML, program synthesis, program analysis, machine learning},
location = {Pittsburgh, Pennsylvania},
series = {ICSE '22}
}
