Run FeatureTools on Spark to automate feature engineering in a distributed fashion.

## FeatureTools for Spark (featuretools4s)

### 1. What's FeatureTools?
FeatureTools is an open-source Python library from Feature Labs,
a company that originated at MIT. It aims to automate
the process of feature engineering in machine learning
applications.

Please visit the [official website](https://docs.featuretools.com/index.html)
for more details about FeatureTools.

*FeatureTools4S* is a Python library I wrote to scale
FeatureTools with **Spark**, so that it can generate
features for billions of rows of data. Data of that size is usually
considered impossible to process on a single machine with
the original, Pandas-based FeatureTools library.

*FeatureTools4S* provides **almost the same** API as the original
FeatureTools, so users can move between FeatureTools and
FeatureTools4S with little effort. **We therefore suggest that readers
learn FeatureTools first; working with FeatureTools4S is then straightforward.**
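
To make the scaling idea concrete, here is a minimal pure-Python sketch of partitioned feature generation. It is an illustration under assumptions, not the actual implementation of featuretools4s: the toy rows, the `amount` column, and the helper functions `customer_features`, `partition_by_key`, and `partitioned_features` are all hypothetical. The point it shows is that if every row sharing a key lands in the same partition, each partition can be aggregated independently and the combined result matches a single-machine run.

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in for an order table; the "amount" column is made up for illustration.
orders = [
    {"order_num": 1, "cust_num": "a", "amount": 10.0},
    {"order_num": 2, "cust_num": "b", "amount": 4.0},
    {"order_num": 3, "cust_num": "a", "amount": 6.0},
    {"order_num": 4, "cust_num": "c", "amount": 5.0},
    {"order_num": 5, "cust_num": "b", "amount": 8.0},
]


def customer_features(rows, key="cust_num"):
    """Aggregate child (order) rows into one feature row per customer."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row[key]].append(row["amount"])
    # Feature names imitate the FeatureTools naming style.
    return {
        cust: {
            "COUNT(order)": len(values),
            "SUM(order.amount)": sum(values),
            "MEAN(order.amount)": mean(values),
        }
        for cust, values in amounts.items()
    }


def partition_by_key(rows, key, num_partition):
    """Route rows to partitions so that all rows sharing a key stay together."""
    keys = sorted({row[key] for row in rows})
    bucket_of = {k: i % num_partition for i, k in enumerate(keys)}
    partitions = [[] for _ in range(num_partition)]
    for row in rows:
        partitions[bucket_of[row[key]]].append(row)
    return partitions


def partitioned_features(rows, key, num_partition):
    """Compute features partition by partition, then merge the results."""
    result = {}
    for part in partition_by_key(rows, key, num_partition):
        # Each partition is self-contained, so this loop could run in parallel.
        result.update(customer_features(part, key))
    return result


single = customer_features(orders)
partitioned = partitioned_features(orders, "cust_num", num_partition=2)
```

Here `single` and `partitioned` hold the same feature rows. In a real cluster the per-partition step would run on separate workers rather than in a loop.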

### 2. How to use FeatureTools4S?
First install *featuretools4s* through pip:
```bash
pip3 install featuretools4s
```

A simple example of using *featuretools4s* follows:
```python
import featuretools4s as fts
from pyspark.sql import SparkSession

import os
import pandas as pd

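# Point Spark at the pip-installed pyspark package; adjust both paths for your machine.
# Raw strings keep the backslashes in Windows paths from being read as escape sequences.
os.environ["SPARK_HOME"] = r"C:\Python36\Lib\site-packages\pyspark"
os.environ["PATH"] = r"C:\Python36;" + os.environ["PATH"]
# Print wide DataFrames on one line instead of wrapping them.
pd.set_option('display.expand_frame_repr', False)
# Start a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()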

# Load the two tables as Spark DataFrames.
order_df = spark.read.csv("resources/order.csv", header=True, inferSchema=True).sort("sales_tax")
customer_df = spark.read.csv("resources/customer.csv", header=True, inferSchema=True)

# Build the entity set as in FeatureTools: register each entity,
# then relate the parent column (customer) to the child column (order).
es = fts.EntitySetSpark(id="test")
es.entity_from_dataframe("order", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("customer", customer_df, index="cust_num")
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order"]["cust_num"]))

# Run Deep Feature Synthesis for the "customer" entity and print the first rows.
features = fts.dfs(spark, entityset=es, target_entity="customer", primary_col="cust_num", num_partition=5)
features.show()
```
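
The example above hardcodes a Windows path for `SPARK_HOME`. As a hedged alternative, the sketch below looks up where a package is installed and uses that directory instead. The helper `package_dir` is hypothetical and not part of featuretools4s or pyspark; it relies only on the standard library, and it assumes, as the example above does, that the pip-installed `pyspark` directory works as `SPARK_HOME`.

```python
import importlib.util
import os


def package_dir(package_name):
    """Return the install directory of a package, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        return None
    # Plain modules (not packages) have no search locations, so treat them as missing.
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


# Only set SPARK_HOME when pyspark is installed and the variable is not already set.
spark_home = package_dir("pyspark")
if spark_home is not None:
    os.environ.setdefault("SPARK_HOME", spark_home)
```

Using `setdefault` leaves an existing `SPARK_HOME` untouched, so a full Spark installation configured elsewhere still takes precedence.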
