Features search library for supervised machine learning on tabular data
Upgini: free feature search library for Machine Learning
Automatically searches through thousands of ready-to-use features from public, community and commercial data sources and enriches your dataset with new external features in minutes
Live DEMO in Colab » |
Upgini.com |
Sign In |
Slack Community
❔ Overview
Upgini is a Python library for automated feature search that boosts the accuracy of supervised ML models on tabular data. It enriches your dataset with intelligently crafted features from a broad range of curated data sources, including public datasets and scraped data. The search is conducted for any combination of public IDs contained in your tabular dataset: IP, date, etc.
Only features that improve the prediction power of your ML model are returned.
Motivation: for most supervised ML models, external data and features boost accuracy significantly more than any hyperparameter tuning. But the lack of automated, time-efficient search tools for external data blocks mass adoption of external features in ML pipelines.
We want to radically simplify feature search and delivery for ML pipelines to make external data a standard approach, just as hyperparameter tuning is for machine learning today.
🚀 Awesome features
⭐️ Automatically find only the features that improve the accuracy of the ML algorithm according to metrics: ROC AUC, RMSE, Accuracy. Not just features correlated with the target variable, which in 9 out of 10 cases give zero accuracy improvement for production ML cases
⭐️ Calculate the accuracy metrics and uplifts you would get by enriching your existing ML model with the found external features, right in the search results
⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate risks of unstable external data dependencies in ML pipelines
⭐️ Scikit-learn compatible interface for quick data integration with your existing ML pipelines
⭐️ Curated and updated data sources, including public datasets and scraped data
⭐️ Support for several search key types (including date/datetime, SHA256 hashed email, IPv4, phone), more to come...
⭐️ Supported supervised ML tasks:
- ☑️ binary classification
- ☑️ multiclass classification
- ☑️ regression
- 🔜 time series prediction
- 🔜 recommender system
🏁 Quick start with Live demo
🏎 Live demo with Kaggle competition data
Notebook with a Kaggle example: kaggle_example.ipynb. The problem being solved is a Kaggle competition Store Item Demand Forecasting Challenge. The goal is to predict future sales of different goods in different stores based on a 5-year history of sales. The evaluation metric is SMAPE.
Launch notebook inside your browser:
The competition dataset was split into train (years 2013-2016) and test (year 2017) parts. FeaturesEnricher was fitted on the train part, and both datasets were enriched with external features. Finally, an ML algorithm was fitted on both the initial and the enriched datasets to compare accuracy. The enriched ML model achieved a solid improvement of the evaluation metric.
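For reference, here is a minimal sketch of the SMAPE metric used in this competition. It assumes the common definition (mean of 2·|F − A| / (|A| + |F|), in percent) and the convention that a pair where both values are zero contributes 0. The function name `smape` is ours and is not part of the Upgini library.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent (range 0..200)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        # Assumed convention: when both values are zero, the term counts as 0.
        if denominator != 0:
            total += 2.0 * abs(f - a) / denominator
    return 100.0 * total / len(actual)


# Hypothetical sales figures, for illustration only.
print(round(smape([100, 200], [110, 180]), 3))  # 10.025
```

A lower SMAPE is better, so an enriched model that helps should score lower than the baseline on the same test part.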
🐍 Install from PyPI
%pip install upgini
🐳 Docker-way
Clone the repo:
git clone https://github.com/upgini/upgini
or download the upgini git repo locally, then follow the steps below to build the docker container 👇
Build docker image
- ... from cloned git repo:
cd upgini
docker build -t upgini .
- ...or directly from GitHub:
DOCKER_BUILDKIT=0 docker build -t upgini git@github.com:upgini/upgini.git#main
Run docker image:
docker run -p 8888:8888 upgini
Open http://localhost:8888?token=<your_token_from_console_output> in your browser
💻 How does it work?
1. 💡 Use your existing labeled training datasets for search
You can use your existing labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
- search keys from training dataset to match records from potential external data sources / features
- labels from training dataset to estimate relevancy of feature or dataset for your ML task and calculate feature importance metrics
- your features from training dataset to find external datasets and features that only give accuracy improvement on top of your existing data and estimate accuracy uplift (optional)
Load the training dataset into a pandas dataframe and separate the feature columns from the label column in a Scikit-learn way:
import pandas as pd
# labeled training dataset - customer_churn_prediction_train.csv
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
2. 🔦 Choose at least one column as a search key
Search key columns will be used to match records from all potential external data sources / features 👓. Define at least one search key when initializing the FeaturesEnricher class.
from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(search_keys={"subscription_activation_date": SearchKey.DATE})
✨ Search key types we support (more are coming!)
Our team works hard to introduce new search key types. Currently we support:
Search Key Meaning Type | Description | Example |
---|---|---|
SearchKey.EMAIL | email | support@upgini.com |
SearchKey.HEM | sha256(lowercase(email)) | 0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955 |
SearchKey.IP | IP address (version 4) | 192.168.0.1 |
SearchKey.PHONE | phone number, E.164 standard | 443451925138 |
SearchKey.DATE | date | 2020-02-12 (ISO-8601 standard); 12.02.2020 (non standard notation) |
SearchKey.DATETIME | datetime | 2020-02-12 12:46:18; 12:46:18 12.02.2020; unixtimestamp |
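If you prefer to prepare key columns yourself, the sketch below shows one way to build a SearchKey.HEM value as sha256(lowercase(email)) and to bring the two date notations from the table to ISO-8601. The helper names `email_to_hem` and `to_iso_date` are hypothetical and not part of the Upgini library, and reading 12.02.2020 as day-first is an assumption based on the table pairing it with 2020-02-12.

```python
import hashlib
from datetime import datetime


def email_to_hem(email: str) -> str:
    """Return sha256(lowercase(email)) as a 64-character hex digest."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


def to_iso_date(value: str) -> str:
    """Normalize 'YYYY-MM-DD' or 'DD.MM.YYYY' strings to an ISO-8601 date."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known notation
    raise ValueError(f"unsupported date format: {value!r}")


hem = email_to_hem("Support@Upgini.com")
print(len(hem))                   # 64
print(to_iso_date("12.02.2020"))  # 2020-02-12
```

Lowercasing before hashing matters: without it, the same address written with different capitalization would produce different hashes and fail to match.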
⚠️ Requirements for search initialization dataset
We do dataset verification and cleaning under the hood, but still there are some requirements to follow:
- Pandas dataframe representation
- Correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression
- At least one column defined as a search key
- Min size after deduplication by search key column and NaNs removal: 1000 records
- Max size after deduplication by search key column and NaNs removal: 1 000 000 records
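The size limits above can be checked locally before starting a search. This is a rough sketch under two assumptions: that "deduplication by search key column" means one record per distinct key value, and that only missing key values are dropped. The function names are ours, not Upgini API.

```python
import math
from typing import Hashable, Iterable, Optional

MIN_RECORDS = 1_000
MAX_RECORDS = 1_000_000


def is_missing(value: object) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def usable_record_count(keys: Iterable[Optional[Hashable]]) -> int:
    """Count distinct, non-missing search key values."""
    return len({key for key in keys if not is_missing(key)})


def check_search_size(keys: Iterable[Optional[Hashable]]) -> None:
    """Raise ValueError if the usable record count is outside the limits."""
    count = usable_record_count(keys)
    if not MIN_RECORDS <= count <= MAX_RECORDS:
        raise ValueError(
            f"{count} usable records, expected {MIN_RECORDS}..{MAX_RECORDS}"
        )


print(usable_record_count(["2020-02-12", "2020-02-12", None, "2020-02-13"]))  # 2
```

With a pandas dataframe, the key column can be passed as a plain list, for example `check_search_size(X["subscription_activation_date"].tolist())`.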
3. 🔍 Start your first feature search!
The main abstraction you interact with is FeaturesEnricher. FeaturesEnricher is a Scikit-learn compatible estimator, so you can easily add it to your existing ML pipelines. First, create an instance of the FeaturesEnricher class. Once it is created, call:
- fit to search for relevant datasets & features
- then transform to enrich your dataset with features from the search result
Let's try it out!
import pandas as pd
from upgini import FeaturesEnricher, SearchKey
# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
# now we're going to create an instance of the `FeaturesEnricher` class
enricher = FeaturesEnricher(search_keys={"subscription_activation_date": SearchKey.DATE})
# everything is ready to fit! For 200k records fitting should take around 10 minutes,
# we can send an email notification, just register with your email
enricher.fit(X, y)
That's all! We have fitted FeaturesEnricher, and any pandas dataframe with exactly the same data schema can be enriched with features from the search results. Use the transform method, and let the magic do the rest 🪄
# load dataset for enrichment
test_x = pd.read_csv("test.csv")
# enrich it!
enriched_test_features = enricher.transform(test_x)
enriched_test_features.head()
4. 📈 Evaluate feature importances (SHAP values) from the search result
The FeaturesEnricher class has two properties for feature importances, which are filled after fit: feature_names_ and feature_importances_:
- feature_names_ - feature names from the search result and, if parameter keep_input=True was used, initial columns from the search dataset as well
- feature_importances_ - SHAP values for features from the search result, in the same order as in feature_names_
It also has the method get_features_info(), which returns a pandas dataframe with features and full statistics after fit, including SHAP values and match rates:
enricher.get_features_info().sort_values("shap_value", ascending=False).head(10)
You can get more details about FeaturesEnricher at runtime using docstrings, for example, via help(FeaturesEnricher) or help(FeaturesEnricher.fit).
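The statistics above mention match rates. As an illustration only, the sketch below assumes that a match rate is the share of rows in your dataset whose search key is found in an external source. This definition and the function `match_rate` are our assumptions, not the library's documented behavior.

```python
from typing import Hashable, Iterable, Set


def match_rate(dataset_keys: Iterable[Hashable], source_keys: Set[Hashable]) -> float:
    """Share of dataset rows whose search key exists in the external source."""
    keys = list(dataset_keys)
    if not keys:
        return 0.0
    matched = sum(1 for key in keys if key in source_keys)
    return matched / len(keys)


# Hypothetical data: dates in the dataset vs. dates covered by a source.
dataset_dates = ["2020-02-10", "2020-02-11", "2020-02-12", "2020-02-13"]
source_dates = {"2020-02-11", "2020-02-12", "2020-02-13", "2020-02-14"}
print(match_rate(dataset_dates, source_dates))  # 0.75
```

Under this reading, a feature with a low match rate is filled for only a small part of your rows, which is worth keeping in mind next to its SHAP value.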
✅ Optional: find features that only give accuracy gain to your existing data in the ML model
If you already have features or other external data sources, you can specifically search for new datasets & features that only give accuracy gain "on top" of them.
Just leave all these existing features in the labeled training dataset, and the Upgini library will automatically use them during the feature search process and as a baseline ML model to calculate accuracy metric uplift. It won't return any features that might not give an accuracy gain to the existing feature space.
✅ Optional: check robustness of accuracy improvement from external features
You can validate the robustness of external features on an out-of-time dataset using the eval_set parameter. Let's do that:
# load train dataset
train_df = pd.read_csv("train.csv")
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]
# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]
# create FeaturesEnricher
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})
# now we fit WITH eval_set parameter to calculate accuracy metrics on OOT dataset.
# the output will contain quality metrics for both the training data set and
# the eval set (validation OOT data set)
enricher.fit(
    train_ids_and_features,
    train_label,
    eval_set=[(eval_ids_and_features, eval_label)]
)
⚠️ Requirements for out-of-time dataset
- Same data schema as for search initialization dataset
- Pandas dataframe representation
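If you do not have a separate out-of-time file yet, one simple way to get it is to split your labeled data by a cutoff date, similar to the 2013-2016 / 2017 split in the Kaggle example. The sketch below is a plain-Python illustration with a hypothetical row layout of (date, label); it is not part of the library.

```python
from datetime import date
from typing import List, Tuple

Row = Tuple[date, float]  # (search key date, label): a hypothetical layout


def split_out_of_time(rows: List[Row], cutoff: date) -> Tuple[List[Row], List[Row]]:
    """Rows before the cutoff go to train; rows on or after it go to the OOT set."""
    train = [row for row in rows if row[0] < cutoff]
    out_of_time = [row for row in rows if row[0] >= cutoff]
    return train, out_of_time


rows = [
    (date(2015, 6, 1), 10.0),
    (date(2016, 12, 31), 12.0),
    (date(2017, 1, 1), 11.0),
    (date(2017, 8, 15), 14.0),
]
train_rows, oot_rows = split_out_of_time(rows, cutoff=date(2017, 1, 1))
print(len(train_rows), len(oot_rows))  # 2 2
```

Because both parts come from the same table, they share the same data schema, which is the first requirement listed above.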
✅ Optional: return initial dataframe enriched with TOP external features by importance
FeaturesEnricher can be used with the fit_transform method and two parameters:
- importance_threshold: float = 0 - only features with importance >= threshold will be added to the output dataframe
- max_features: int - only the first TOP N features by importance will be returned, where N = max_features
And keep_input=True will keep all initial columns from the search dataset X:
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    keep_input=True,
    max_features=2,
)
# fit_transform is a method of the enricher; keep its result as the enriched dataframe
enriched_dataframe = enricher.fit_transform(X, y)
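To make the selection semantics concrete, here is a small sketch of how importance_threshold and max_features could interact: keep features with importance >= threshold, then take the top N by importance. The function `select_top_features` and the feature names are hypothetical, and how the library combines the two filters internally is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def select_top_features(
    names: Sequence[str],
    importances: Sequence[float],
    importance_threshold: float = 0.0,
    max_features: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Keep features with importance >= threshold, then the top N by importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    kept = [(name, value) for name, value in ranked if value >= importance_threshold]
    return kept if max_features is None else kept[:max_features]


# Hypothetical feature names and importances, for illustration only.
names = ["f_weather", "f_holiday", "f_traffic", "f_noise"]
importances = [0.12, 0.30, 0.05, 0.0]
print(select_top_features(names, importances, importance_threshold=0.05, max_features=2))
# [('f_holiday', 0.3), ('f_weather', 0.12)]
```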
✅ Optional: reuse completed enrichment
FeaturesEnricher can be initialized with the search id of a completed search:
- search_id: str - id of a completed fit operation (enricher.get_search_id()). Search keys and features in X should be the same as on fit
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    search_id="abcdef00-0000-0000-0000-999999999999"
)
enricher.transform(X)
🧹 Search dataset validation
We validate and clean search initialization dataset under the hood:
✂️ Check your search key columns format
✂️ Check zero variance for label column
✂️ Check dataset for full row duplicates. If we find any, we remove duplicated rows and make a note on share of row duplicates
✂️ Check inconsistent labels - rows with the same features and keys but different labels. If we find any, we remove them and make a note on share of row duplicates
✂️ Remove columns with zero variance - we treat any non search key column in search dataset as a feature, so columns with zero variance will be removed
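The duplicate and inconsistent-label checks above can be pictured with the following sketch. It is a simplified illustration on plain tuples, not the library's implementation; the row layout and the choice to report both shares relative to the original row count are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Row = Tuple[Tuple[Hashable, ...], Hashable]  # ((keys and features), label)


def clean_search_dataset(rows: List[Row]) -> Tuple[List[Row], float, float]:
    """Drop full-row duplicates, then rows with inconsistent labels.

    Returns the cleaned rows, the share of duplicate rows and the share of
    inconsistent rows (both relative to the original number of rows).
    """
    if not rows:
        return [], 0.0, 0.0
    # 1. Remove full row duplicates, keeping the first occurrence and the order.
    unique_rows = list(dict.fromkeys(rows))
    duplicate_share = (len(rows) - len(unique_rows)) / len(rows)
    # 2. Collect the labels seen for each (keys and features) combination.
    labels_by_features: Dict[Tuple[Hashable, ...], Set[Hashable]] = defaultdict(set)
    for features, label in unique_rows:
        labels_by_features[features].add(label)
    # 3. Keep only rows whose features map to exactly one label.
    consistent = [row for row in unique_rows if len(labels_by_features[row[0]]) == 1]
    inconsistent_share = (len(unique_rows) - len(consistent)) / len(rows)
    return consistent, duplicate_share, inconsistent_share


rows = [
    (("2020-02-12", 1), 0),
    (("2020-02-12", 1), 0),  # full duplicate
    (("2020-02-13", 2), 1),
    (("2020-02-13", 2), 0),  # same keys and features, different label
    (("2020-02-14", 3), 1),
]
cleaned, dup_share, bad_share = clean_search_dataset(rows)
print(cleaned)    # [(('2020-02-12', 1), 0), (('2020-02-14', 3), 1)]
print(dup_share)  # 0.2
print(bad_share)  # 0.4
```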
❔ Supervised ML tasks detection
We detect ML task under the hood based on label column values. Currently we support:
- ModelTaskType.BINARY
- ModelTaskType.MULTICLASS
- ModelTaskType.REGRESSION
In most cases, you don't need to do anything, but for certain search datasets, this detection might fail.
In this case, you can pass the correct ML task type to FeaturesEnricher via the model_task_type parameter:
from upgini import ModelTaskType
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    model_task_type=ModelTaskType.REGRESSION
)
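To see why detection from label values can fail, consider the toy heuristic below. It is purely hypothetical and is not how Upgini detects the task: two distinct values mean binary, non-integer floats mean regression, everything else means multiclass. The function `guess_task_type` and its string results are ours.

```python
from typing import Sequence, Union

Label = Union[bool, int, float, str]


def guess_task_type(labels: Sequence[Label]) -> str:
    """Toy heuristic for guessing the ML task from label values."""
    distinct = set(labels)
    if len(distinct) < 2:
        raise ValueError("label column has zero variance")
    if len(distinct) == 2:
        return "BINARY"
    # Non-integer float labels suggest a continuous target.
    if any(isinstance(v, float) and not v.is_integer() for v in distinct):
        return "REGRESSION"
    return "MULTICLASS"


print(guess_task_type([0, 1, 1, 0]))       # BINARY
print(guess_task_type([1.5, 2.0, 3.7]))    # REGRESSION
# Integer-valued sales counts are really a regression target,
# but this heuristic reads them as classes:
print(guess_task_type([3, 7, 12, 7, 25]))  # MULTICLASS
```

The last case is the kind of dataset where passing model_task_type explicitly is the safer choice.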
🆙 Accuracy and uplift metrics calculations
We calculate all the accuracy metrics and uplifts for non-linear machine learning algorithms, like gradient boosting or neural networks. If your external data consumer is a linear ML algorithm (like logistic regression), you might notice different accuracy metrics after data enrichment.
💸 Is it a free or paid service?
We have three types of data sources with pre-computed features: Public data, Community data and Commercial data:
- Public data is free. Both features search and usage.
- Community data is free of charge, if you share your royalty / license free datasets or features with DS community.
- Commercial data is paid, as their owners set a price tag. We have no influence on this price policy.
How can I share data/features to get free access to community data?
If you have ANY data which you might consider as royalty / license free (Open Data) and potentially valuable for supervised ML applications, you may publish it for community usage and get free access for community data tier:
- Please Sign Up here
- Copy Upgini API key from profile and upload your data from Upgini python library with this key:
import pandas as pd
from upgini import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key which might not be supported yet, just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY, "stats_date": SearchKey.DATE
})
- After data verification, search results on community data will be available in the usual way.
If I can help with testing or development, will I get community data for free?
Yes, participate in beta testing and get credits for Upgini usage! The service is still in a beta stage, so registered beta testers will get free community data access for 6 months. Feel free to start with the registration form 👉 here
Please note that the number of slots for beta testing is limited and we won't be able to handle all the requests.
🛠 Getting Help & Community
Requests and support, in preferred order
Please try to create bug reports that are:
- Reproducible. Include steps to reproduce the problem.
- Specific. Include as much detail as possible: which Python version, what environment, etc.
- Unique. Do not duplicate existing opened issues.
- Scoped to a Single Bug. One bug per report.
🧩 Contributing
We are a very small team and this is a part-time project for us, so most probably we won't be able to:
- implement ALL the data delivery and integration interfaces for most common ML stacks and frameworks
- implement ALL data verification and normalization capabilities for different types of search keys (we just started with the current 4)
So we might need some help from the community. We'll be happy about every pull request you open and every issue you find to make this library more awesome. Please note that it might sometimes take us a while to get back to you. For major changes, please open an issue first to discuss what you would like to change.
Developing
Some convenient ways to start contributing are:
⚙️ Visual Studio Code You can remotely open this repo in VS Code without cloning or automatically clone and open it inside a docker container.
⚙️ Gitpod You can use Gitpod to launch a fully functional development environment right in your browser.
🔗 Useful links
😔 Found a typo or a bug in a code snippet? Our bad! Please report it here.
Hashes for upgini-0.10.0a82-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 81af91d0e0db34104ad2df75464138fc8647dd28d174c2a71529c50ff3faa062 |
MD5 | b0edbb66d980310d700a84ebd1caaf30 |
BLAKE2b-256 | d40c9a8369eadc6624aa69ce5f5894f506954c0c0a772cc6af7bc29d001d1047 |