
Low-code feature search and enrichment library for machine learning


Upgini • Free production-ready automated data enrichment library for machine learning

Automatically searches through thousands of ready-to-use features from public and community data sources and
enriches your ML pipeline with only the relevant features


Quick Start in Colab » | Upgini.com | Sign In | Slack Community | Propose new Data source


❔ Overview

Upgini is a Python data enrichment library that automatically finds only the relevant features to improve the performance of your ML model. It searches through thousands of features and data sources, including public datasets and scraped data shared by the data science community. Save time on external data search and engineering: just use your labeled dataset to initiate a search, and Upgini will do the rest.

Motivation: for most supervised ML models, external data & features boost accuracy significantly better than any hyperparameter tuning. But the lack of automated, time-efficient search tools for external data blocks massive adoption of external features in ML pipelines.
We want to radically simplify feature search and delivery to make external data a standard approach, just like hyperparameter tuning is for machine learning today.

Mission: Democratize access to data sources for the data science community.

🚀 Awesome features

⭐️ Automatically find only the relevant features that give an accuracy improvement for your ML model, not just features correlated with the target variable, which in 9 out of 10 cases gives zero accuracy improvement
⭐️ Calculate accuracy metrics and uplifts after enriching an existing ML model with external features
⭐️ Check the stability of the accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in your ML pipeline
⭐️ Curated and updated data sources, including public datasets and community-shared data
⭐️ Easy to use: a single request enriches your training dataset with all of the supported keys at once:

  • date / datetime
  • phone number
  • postal / ZIP code
  • hashed email / HEM
  • country
  • IP-address

⭐️ Scikit-learn compatible interface for quick data integration with existing ML pipelines
⭐️ Support for most common supervised ML tasks on tabular data:

  ☑️ binary classification
  ☑️ multiclass classification
  ☑️ regression
  ☑️ time series prediction

🌎 Connected data sources and coverage

  • Public data: public sector, academic institutions, other sources through open data portals
  • Community-shared data: royalty / license-free datasets or features from the data science community (our users). This includes both public and scraped data.

👉 Details on datasets and features are here

📊 Total: 239 countries and up to 41 years of history

| Data source | Countries | History, years | Update | Search keys | API Key required |
|---|---|---|---|---|---|
| Historical weather & Climate normals | 68 | 22 | Monthly | date, country, postal/ZIP code | No |
| Location/Places/POI/Area information from OpenStreetMap | 221 | 2 | Monthly | date, country, postal/ZIP code | No |
| International holidays & events, Workweek calendar | 232 | 22 | Monthly | date, country | No |
| Consumer Confidence index | 44 | 22 | Monthly | date, country | No |
| World economic indicators | 191 | 41 | Monthly | date, country | No |
| Markets data | - | 17 | Monthly | date, datetime | No |
| World mobile network coverage | 167 | - | Monthly | country, postal/ZIP code | No |
| World demographic data | 90 | - | Annual | country, postal/ZIP code | No |
| World house prices | 44 | - | Annual | country, postal/ZIP code | No |
| Public social media profile data | 104 | - | Monthly | date, email/HEM, phone | Yes |
| Car ownership data and Parking statistics | 3 | - | Annual | country, postal/ZIP code, email/HEM, phone | Yes |
| Geolocation profile for phone & IPv4 & email | 239 | - | Monthly | date, email/HEM, phone, IPv4 | Yes |
| 🔜 Email/WWW domain profile | - | - | - | - | - |

Know other useful data sources for machine learning? Give us a hint and we'll add it for free.

💼 Use-cases

🏁 Simple sales predictions (use as a template)

Search new features for Kaggle Store Item Demand Forecasting Challenge. The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is SMAPE.
Run quick start guide notebook inside your browser:

Open example in Google Colab   Open in Binder  

The competition dataset was split into train (years 2013-2016) and test (year 2017) parts. FeaturesEnricher was fitted on the train part, and both datasets were enriched with external features. To compare the accuracy improvement, an ML model was fitted on both the initial and the enriched datasets. The evaluation metric was significantly improved by the enriched ML model.
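A minimal sketch of that evaluation flow, assuming the competition's column names (date, store, item, sales); see the quick start notebook above for the full version:

import pandas as pd
from upgini import FeaturesEnricher, SearchKey

# split the competition data by year: 2013-2016 for train, 2017 for hold-out
df = pd.read_csv("train.csv", parse_dates=["date"])
train_df = df[df["date"].dt.year <= 2016]
test_df = df[df["date"].dt.year == 2017]

X_train, y_train = train_df.drop(columns="sales"), train_df["sales"]
X_test, y_test = test_df.drop(columns="sales"), test_df["sales"]

# fit the enricher on the train part only, then enrich both parts
enricher = FeaturesEnricher(search_keys={"date": SearchKey.DATE})
X_train_enriched = enricher.fit_transform(X_train, y_train)
X_test_enriched = enricher.transform(X_test)

# train the same estimator on the initial and the enriched features
# and compare SMAPE on the 2017 hold-out to measure the uplift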

❓ How to boost ML model accuracy for Kaggle TOP1 Leaderboard in 10 minutes

  • The goal is accuracy improvement for TOP1 winning Kaggle solution from new relevant external features & data.
  • Kaggle Competition is a product sales forecasting, evaluation metric is SMAPE.

❓ How to do low-code feature engineering for AutoML tools

  • Save time on feature search and engineering. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box.
  • Kaggle Competition is a product sales forecasting, evaluation metric is SMAPE.
  • Low-code AutoML tools: Upgini and PyCaret

❓ How to improve accuracy of a Multivariate Time Series forecast from external features & data

  • The goal is accuracy improvement of a Multivariate Time Series prediction from new relevant external features & data. The main challenge here is the data & feature enrichment strategy, since a component of a Multivariate TS depends not only on its own past values but also on other components.
  • Kaggle Competition is a product sales forecasting, evaluation metric is RMSLE.

❓ How to speed up feature engineering hypothesis tests with ready-to-use external features

  • Save time on external data wrangling and feature calculation code for hypothesis tests. The key challenge here is a time-dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, a special data enrichment strategy is used.
  • Kaggle Competition is a credit default prediction, evaluation metric is normalized Gini coefficient.

🏁 Quick start

1. Install from PyPI

%pip install upgini
🐳 Docker-way
Clone the repo ($ git clone https://github.com/upgini/upgini) or download it locally,
then follow the steps below to build a docker container 👇

1. Build docker image from cloned git repo:
cd upgini
docker build -t upgini .


...or directly from GitHub:
DOCKER_BUILDKIT=0 docker build -t upgini git@github.com:upgini/upgini.git#main

2. Run docker image:
docker run -p 8888:8888 upgini

3. Open http://localhost:8888?token=<your_token_from_console_output> in your browser

2. 💡 Use your labeled training dataset for search

You can use your labeled training dataset "as is" to initiate the search. Under the hood, we'll search for relevant data using:

  • search keys from the training dataset to match records from potential data sources with new features
  • labels from the training dataset to estimate the relevance of a feature or dataset for your ML task and calculate feature importance metrics
  • your existing features from the training dataset to find external datasets and features which give an accuracy improvement on top of your existing data, and to estimate the accuracy uplift (optional)

Load the training dataset into a pandas dataframe and separate the feature columns from the label column in a Scikit-learn way:

import pandas as pd
# labeled training dataset - customer_churn_prediction_train.csv
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
⚠️ Requirements for search initialization dataset
We do dataset verification and cleaning under the hood, but there are still a few requirements to follow (a quick self-check sketch follows the list):
1. pandas.DataFrame, pandas.Series or numpy.ndarray representation;
2. correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression;
3. at least one column selected as a search key;
4. min size after deduplication by the search key column and NaN removal: 100 records
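A quick self-check you might run before initiating a search. This is only a sketch, not part of the library; it reuses the churn example's file and column names from below as assumptions:

import pandas as pd

df = pd.read_csv("customer_churn_prediction_train.csv")

# 2. label column type: boolean/integer/string for classification, float for regression
assert df["churn_flag"].dtype.kind in ("b", "i", "O", "f")

# 3. at least one column that can serve as a search key, e.g. an activation date
assert "subscription_activation_date" in df.columns

# 4. 100+ records left after deduplication by the search key column and NaN removal
cleaned = df.dropna().drop_duplicates(subset=["subscription_activation_date"])
assert len(cleaned) >= 100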

3. 🔦 Choose one or multiple columns as search keys

Search key columns will be used to match records from all potential external data sources / features.
Define one or multiple columns as search keys when initializing the FeaturesEnricher class.

from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE,
        "hashed_email": SearchKey.HEM,
        "last_visit_ip_address": SearchKey.IP,
        "registered_with_phone": SearchKey.PHONE
    })

✨ Search key types we support (more to come!)

| Search Key Meaning Type | Description | Allowed pandas dtypes (python types) | Example |
|---|---|---|---|
| SearchKey.EMAIL | e-mail | object(str), string | support@upgini.com |
| SearchKey.HEM | sha256(lowercase(email)) | object(str), string | 0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955 |
| SearchKey.IP | IP address (version 4) | object(str, ipaddress.IPv4Address), string, int64 | 192.168.0.1 |
| SearchKey.PHONE | phone number, E.164 standard | object(str), string, int64, float64 | 443451925138 |
| SearchKey.DATE | date | object(str), string, datetime64[ns], period[D] | 2020-02-12 (ISO-8601 standard), 12.02.2020 (non-standard notation) |
| SearchKey.DATETIME | datetime | object(str), string, datetime64[ns], period[D] | 2020-02-12 12:46:18, 12:46:18 12.02.2020 |
| SearchKey.COUNTRY | Country ISO-3166 code, Country name | object(str), string | GB, US, IN |
| SearchKey.POSTAL_CODE | Postal code a.k.a. ZIP code. Could be used only with SearchKey.COUNTRY | object(str), string | 21174, 061107, SE-999-99 |

For the meaning types SearchKey.DATE / SearchKey.DATETIME with dtypes object or string, you have to clarify the date/datetime format by passing the date_format parameter to FeaturesEnricher. For example:

from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE,
        "hashed_email": SearchKey.HEM,
        "last_visit_ip_address": SearchKey.IP,
        "registered_with_phone": SearchKey.PHONE
    },
    date_format="%Y-%d-%m"
)

To use a datetime that is not in the UTC timezone, you can cast the datetime column explicitly to your timezone (example for Warsaw):

df["date"] = df.date.astype("datetime64").dt.tz_localize("Europe/Warsaw")

A single country for the whole training dataset can be passed with the country_code parameter:

from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "zip_code": SearchKey.POSTAL_CODE,
    },
    country_code="US",
    date_format="%Y-%d-%m"
)

4. 🔍 Start your first feature search!

The main abstraction you interact with is FeaturesEnricher, a Scikit-learn compatible estimator. You can easily add it to your existing ML pipelines. Create an instance of the FeaturesEnricher class and call:

  • fit to search for relevant datasets & features
  • then transform to enrich your dataset with features from the search result

Let's try it out!

import pandas as pd
from upgini import FeaturesEnricher, SearchKey

# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]

# now we're going to create a `FeaturesEnricher` instance
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE
    })

# everything is ready to fit! For 200K records fitting should take around 10 minutes;
# we'll send an email notification, just register on profile.upgini.com
enricher.fit(X, y)

That's all! We've fitted FeaturesEnricher.

5. 📈 Evaluate feature importances (SHAP values) from the search result

The FeaturesEnricher class has two properties for feature importances, which are filled after fit: feature_names_ and feature_importances_:

  • feature_names_ - feature names from the search result and, if the parameter keep_input=True was used, the initial columns from the search dataset as well
  • feature_importances_ - SHAP values for the features from the search result, in the same order as in feature_names_

The get_features_info() method returns a pandas dataframe with features and full statistics after fit, including SHAP values and match rates:

enricher.get_features_info()
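For a quick look you can also pair the two properties yourself; a minimal sketch, assuming the enricher has already been fitted as above:

import pandas as pd

# pair feature names with their SHAP-based importances and sort them descending
importances = pd.DataFrame({
    "feature": enricher.feature_names_,
    "shap_value": enricher.feature_importances_,
}).sort_values("shap_value", ascending=False)
print(importances.head(10))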

Get more details about FeaturesEnricher at runtime using docstrings via help(FeaturesEnricher) or help(FeaturesEnricher.fit).

6. 🏭 Enrich Production ML pipeline with relevant external features

FeaturesEnricher is a Scikit-learn compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after fit).
Use the transform method of FeaturesEnricher, and let the magic do the rest 🪄

# load dataset for enrichment
test_x = pd.read_csv("test.csv")
# enrich it!
enriched_test_features = enricher.transform(test_x)

6.1 Reuse a completed search for enrichment without a 'fit' run

FeaturesEnricher can be initialized with a search_id parameter from a completed search (after a fit call).
Just use enricher.get_search_id() or copy the search id string from the fit() output.
Search keys and features in X should be the same as for fit().

enricher = FeaturesEnricher(
    # same set of search keys as for the fit step
    search_keys={"date": SearchKey.DATE},
    search_id="abcdef00-0000-0000-0000-999999999999"
)
enriched_prod_dataframe = enricher.transform(input_dataframe)

6.2 Enrichment with updated external data sources and features

For most ML cases, the training step requires a labeled dataset with historical observations from the past. But for the production step you'll need updated, current data sources and features for the present time to calculate a prediction.
When FeaturesEnricher is initialized with a set of search keys that includes SearchKey.DATE, it will match records from all potential external data sources exactly on the specific date/datetime given by SearchKey.DATE, to avoid enrichment with features "from the future" during the fit step.
Then, for transform in a production ML pipeline, you'll get enrichment with relevant features that are current as of the present date.

⚠️ Initialize FeaturesEnricher with the SearchKey.DATE search key in the key set to get current features for production and avoid features from the future during training:

enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE,
    },
)

💻 How it works?

🧹 Search dataset validation

We validate and clean the search initialization dataset under the hood (a rough pandas illustration follows the list):

  • check the format of your search key columns;
  • check the label column for zero variance;
  • check the dataset for full row duplicates. If we find any, we remove the duplicated rows and make a note on the share of row duplicates;
  • check for inconsistent labels - rows with the same features and keys but different labels; we remove them and make a note on the share of such rows;
  • remove columns with zero variance - we treat any non-search-key column in the search dataset as a feature, so columns with zero variance will be removed
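The snippet below is not the library's internal implementation, just a rough pandas illustration of the cleaning steps listed above; the file and column names reuse the churn example and are assumptions:

import pandas as pd

df = pd.read_csv("customer_churn_prediction_train.csv")
feature_cols = [c for c in df.columns if c != "churn_flag"]

# full row duplicates are removed
df = df.drop_duplicates()

# inconsistent labels: remaining rows that share the same features & keys
# but have different labels are dropped entirely
df = df[~df.duplicated(subset=feature_cols, keep=False)]

# columns with zero variance are useless as features and are removed
df = df.drop(columns=[c for c in feature_cols if df[c].nunique(dropna=False) <= 1])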

❔ Supervised ML tasks detection

We detect the ML task type under the hood based on the label column values. Currently we support:

  • ModelTaskType.BINARY
  • ModelTaskType.MULTICLASS
  • ModelTaskType.REGRESSION

But for certain search datasets you can pass a parameter to FeaturesEnricher with the correct ML task type:

from upgini import ModelTaskType
enricher = FeaturesEnricher(
	search_keys={"subscription_activation_date": SearchKey.DATE},
	model_task_type=ModelTaskType.REGRESSION
)

⏰ Time Series prediction support

Time series prediction is supported as a ModelTaskType.REGRESSION or ModelTaskType.BINARY task with a time-series-specific cross-validation split.

To initiate a feature search, you can pass the cross-validation type parameter to FeaturesEnricher with a time-series-specific CV type:

from upgini.metadata import CVType
enricher = FeaturesEnricher(
    search_keys={"sales_date": SearchKey.DATE},
    cv=CVType.time_series
)

⚠️ Pre-process the search dataset in case of time series prediction:
sort rows in the dataset according to the observation order, in most cases in ascending order by date/datetime.
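For example, for the sales_date key used in the snippet above:

# sort observations in ascending order by date before starting the search
df = df.sort_values("sales_date").reset_index(drop=True)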

🆙 Accuracy and uplift metrics calculations

FeaturesEnricher automatically calculates model metrics and uplift from new relevant features, either via the calculate_metrics() method or with the calculate_metrics=True parameter in the fit or fit_transform methods (example below).
You can use any model estimator with a scikit-learn compatible interface.

The evaluation metric should be passed to calculate_metrics() via the scoring parameter; out of the box Upgini supports:
| Metric | Description |
|---|---|
| explained_variance | Explained variance regression score function |
| r2 | R2 (coefficient of determination) regression score function |
| max_error | Calculates the maximum residual error (negative - greater is better) |
| median_absolute_error | Median absolute error regression loss |
| mean_absolute_error | Mean absolute error regression loss |
| mean_absolute_percentage_error | Mean absolute percentage error regression loss |
| mean_squared_error | Mean squared error regression loss |
| mean_squared_log_error (or aliases: msle, MSLE) | Mean squared logarithmic error regression loss |
| root_mean_squared_log_error (or aliases: rmsle, RMSLE) | Root mean squared logarithmic error regression loss |
| root_mean_squared_error | Root mean squared error regression loss |
| mean_poisson_deviance | Mean Poisson deviance regression loss |
| mean_gamma_deviance | Mean Gamma deviance regression loss |
| accuracy | Accuracy classification score |
| top_k_accuracy | Top-k Accuracy classification score |
| roc_auc | Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores |
| roc_auc_ovr | ROC AUC from prediction scores (multi_class="ovr") |
| roc_auc_ovo | ROC AUC from prediction scores (multi_class="ovo") |
| roc_auc_ovr_weighted | ROC AUC from prediction scores (multi_class="ovr", average="weighted") |
| roc_auc_ovo_weighted | ROC AUC from prediction scores (multi_class="ovo", average="weighted") |
| balanced_accuracy | Compute the balanced accuracy |
| average_precision | Compute average precision (AP) from prediction scores |
| log_loss | Log loss, aka logistic loss or cross-entropy loss |
| brier_score | Compute the Brier score loss |

In addition to that list, you can define a custom evaluation metric function using scikit-learn's make_scorer, for example SMAPE.
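A sketch of such a scorer for SMAPE; the exact SMAPE formula below is one common variant and an assumption, so adjust it to your competition's definition:

import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    # symmetric mean absolute percentage error, in percent
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denominator) * 100

# lower SMAPE is better, so mark the scorer accordingly
smape_scorer = make_scorer(smape, greater_is_better=False)
enricher.calculate_metrics(scoring=smape_scorer)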

By default, the calculate_metrics() method calculates the evaluation metric with the same cross-validation split as selected for FeaturesEnricher.fit() by the parameter cv = CVType.<cross-validation-split>.
But you can easily define a new split by passing a child of BaseCrossValidator to the cv parameter of calculate_metrics().

Example with more tips-and-tricks:

from upgini import FeaturesEnricher, SearchKey
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit

enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# Fit with the default setup for metrics calculation
# CatBoost will be used
enricher.fit(X, y, eval_set=eval_set, calculate_metrics=True)

# LightGBM estimator for metrics
custom_estimator = LGBMRegressor()
enricher.calculate_metrics(estimator=custom_estimator)

# Custom metric function for the scoring param (callable or name)
custom_scoring = "RMSLE"
enricher.calculate_metrics(scoring=custom_scoring)

# Custom cross-validator
custom_cv = TimeSeriesSplit(n_splits=5)
enricher.calculate_metrics(cv=custom_cv)

# All these custom parameters can be combined in fit, fit_transform and calculate_metrics:
enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)

✅ More tips-and-tricks

Find features that only give an accuracy gain on top of the existing data in the ML model

If you already have features or other external data sources, you can specifically search for new datasets & features that only give an accuracy gain "on top" of them.

Just leave all these existing features in the labeled training dataset, and the Upgini library will automatically use them during the feature search process and as a baseline ML model to calculate the accuracy metric uplift. Only features which improve accuracy will be returned.
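In practice this just means keeping your existing feature columns in X when calling fit; a minimal sketch with hypothetical column names:

# X contains your own features (e.g. "avg_monthly_spend") next to the search keys;
# Upgini uses them as the baseline and returns only features that add accuracy on top
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})
enricher.fit(X, y, calculate_metrics=True)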

Check the robustness of the accuracy improvement from external features

You can validate the robustness of external features on an out-of-time dataset using the eval_set parameter:

# load train dataset
train_df = pd.read_csv("train.csv")
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]

# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]
# create FeaturesEnricher
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# now we fit WITH eval_set parameter to calculate accuracy metrics on Out-of-time dataset.
# the output will contain quality metrics for both the training data set and
# the eval set (validation OOT data set)
enricher.fit(
  train_ids_and_features,
  train_label,
  eval_set = [(eval_ids_and_features, eval_label)]
)

⚠️ Requirements for out-of-time dataset

  • Same data schema as for search initialization dataset
  • Pandas dataframe representation

Return the initial dataframe enriched with TOP external features by importance

The transform and fit_transform methods of FeaturesEnricher can be used with two additional parameters:

  • importance_threshold: float = 0 - only features with importance >= threshold will be added to the output dataframe
  • max_features: int - only the TOP N features by importance will be returned, where N = max_features

And keep_input=True will keep all initial columns from the search dataset X:

enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE}
)
enriched_dataframe = enricher.fit_transform(X, y, keep_input=True, max_features=2)

Exclude feature sources from fit, transform and metrics calculation

The fit, fit_transform, transform and calculate_metrics methods of FeaturesEnricher accept the parameter exclude_features_sources, which allows you to exclude Trial or Paid features that cannot be used for enrichment or metrics calculation:

enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE}
)
enricher.fit(X, y, calculate_metrics=False)

features_info = enricher.get_features_info()
trial_features = features_info[features_info["Feature type"] == "Trial"]["Feature name"].values.tolist()
paid_features = features_info[features_info["Feature type"] == "Paid"]["Feature name"].values.tolist()

enricher.calculate_metrics(exclude_features_sources=(trial_features + paid_features))
enricher.transform(X, exclude_features_sources=(trial_features + paid_features))

Turn off autodetection for search key columns

Upgini autodetects missing search keys by default. To turn this off, use detect_missing_search_keys=False:

enricher = FeaturesEnricher(
   search_keys={"date": SearchKey.DATE},
   detect_missing_search_keys=False,
)

enricher.fit(X, y)

🔑 Open up all capabilities of Upgini

Register and get a free API key for exclusive data sources and features: 600 mln+ phone numbers, 350 mln+ emails, 2^32 IP addresses

| Benefit | No Sign-up | Registered user |
|---|---|---|
| Enrichment with date/datetime, postal/ZIP code and country keys | Yes | Yes |
| Enrichment with phone number, hashed email/HEM and IP-address keys | No | Yes |
| Email notification on search task completion | No | Yes |
| Email notification on new data source activation 🔜 | No | Yes |
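Once you have a key, set it before initializing FeaturesEnricher, for example via the UPGINI_API_KEY environment variable (the same variable used in the data-sharing example below); the search key shown is just an illustration:

import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"

# keys like phone, hashed email / HEM and IP require a registered API key
enricher = FeaturesEnricher(
    search_keys={"registered_with_phone": SearchKey.PHONE}
)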

👩🏻‍💻 How to share data/features with the community?

You may publish ANY data which you consider royalty / license free (Open Data) and potentially valuable for ML applications for community usage:

  1. Please Sign Up here
  2. Copy the Upgini API key from your profile and upload your data from the Upgini python library with this key:
import pandas as pd
from upgini import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key which might not be supported yet; just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY,
    "stats_date": SearchKey.DATE
})
  3. After data verification, search results on community data will be available in the usual way.

🛠 Getting Help & Community

Please note that we are still in a beta stage. For requests and support, in preferred order:

  • Claim help in Slack
  • Open a GitHub issue

❗Please try to create bug reports that are:

  • reproducible - include steps to reproduce the problem.
  • specific - include as much detail as possible: which Python version, what environment, etc.
  • unique - do not duplicate existing opened issues.
  • scoped to a single bug - one bug per report.

🧩 Contributing

We are a very small team and this is a part-time project for us, so most probably we won't be able to:

  • implement smooth integration with the most common low-code ML libraries and platforms (PyCaret, H2O AutoML, etc.)
  • implement all possible data verification and normalization capabilities for different types of search keys (we just started with the current 6 types)

And we need some help from the community! So we'll be happy about every pull request you open and every issue you find to make this library more incredible. Please note that it might sometimes take us a while to get back to you. For major changes, please open an issue first to discuss what you would like to change.

Developing

Some convenient ways to start contributing are:

⚙️ Open in Visual Studio Code: you can remotely open this repo in VS Code without cloning, or automatically clone and open it inside a docker container.
⚙️ Gitpod: you can use Gitpod to launch a fully functional development environment right in your browser.

🔗 Useful links

😔 Found a typo or a bug in a code snippet? Our bad! Please report it here.
