Run FeatureTools on Spark to automate feature engineering in a distributed fashion.
## FeatureTools for Spark (featuretools4s)
### 1. What's FeatureTools?
FeatureTools is a Python library open-sourced by MIT's
FeatureLab that automates
feature engineering for machine learning
applications.
See the [official website](https://docs.featuretools.com/index.html)
for more details about FeatureTools.
*FeatureTools4S* is a Python library I wrote to scale
FeatureTools with **Spark**, so that it can generate
features for billions of rows of data, a volume usually
considered impossible to process on a single machine with
the original, Pandas-based FeatureTools library.
*FeatureTools4S* provides **almost the same** API as the original
FeatureTools, so users can move
between FeatureTools and FeatureTools4S with little effort. **We therefore suggest that readers
learn FeatureTools first; working with FeatureTools4S is then straightforward.**
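
How FeatureTools4S distributes the work internally is not described here. As a rough intuition only, the sketch below shows one common way such scaling can be arranged: rows are split by a shared key so that each parent row and all of its child rows land in the same partition, and each partition can then be processed independently. The helper names (`partition_of`, `split_by_key`) and the toy rows are hypothetical and are not part of the FeatureTools4S API.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, num_partitions):
    """Map a key to a stable partition number in [0, num_partitions)."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def split_by_key(rows, key_col, num_partitions):
    """Group rows (dicts) into partitions according to the value in key_col."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_col], num_partitions)].append(row)
    return parts


# Hypothetical toy data: two customers and their orders.
customers = [{"cust_num": 1}, {"cust_num": 2}]
orders = [
    {"order_num": 10, "cust_num": 1},
    {"order_num": 11, "cust_num": 2},
    {"order_num": 12, "cust_num": 1},
]

customer_parts = split_by_key(customers, "cust_num", num_partitions=2)
order_parts = split_by_key(orders, "cust_num", num_partitions=2)

# Every order ends up in the same partition as its customer, so each
# partition could be handed to a single-machine worker independently.
for order in orders:
    p = partition_of(order["cust_num"], 2)
    assert any(c["cust_num"] == order["cust_num"] for c in customer_parts[p])
```

Because the partition number depends only on the key, both tables can be split separately and still line up, which is what makes per-partition processing possible in this sketch.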
### 2. How to use FeatureTools4S?
First install *featuretools4s* through pip:
```bash
pip3 install featuretools4s
```
A simple example of using *featuretools4s* follows:
```python
import os

import pandas as pd
from pyspark.sql import SparkSession

import featuretools4s as fts

# Windows example paths; adjust them to your own installation.
# Raw strings keep the backslashes from being treated as escape sequences.
os.environ["SPARK_HOME"] = r"C:\Python36\Lib\site-packages\pyspark"
os.environ["PATH"] = r"C:\Python36;" + os.environ["PATH"]
pd.set_option('display.expand_frame_repr', False)  # print wide frames without wrapping

# Start a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load the two tables as Spark DataFrames.
order_df = spark.read.csv("resources/order.csv", header=True, inferSchema=True).sort("sales_tax")
customer_df = spark.read.csv("resources/customer.csv", header=True, inferSchema=True)

# Register both tables as entities and link orders to customers via cust_num.
es = fts.EntitySetSpark(id="test")
es.entity_from_dataframe("order", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("customer", customer_df, index="cust_num")
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order"]["cust_num"]))

# Generate features for the customer entity and display them.
features = fts.dfs(spark, entityset=es, target_entity="customer", primary_col="cust_num", num_partition=5)
features.show()
```
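
To give a feel for what the example above produces, here is a minimal pure-Python sketch of the kind of aggregation features that FeatureTools derives for a parent entity from its child rows. It is an illustration only, not the library's implementation: the function `aggregate_orders` and the toy rows are hypothetical, the real `order.csv` is not shown in this document, and the actual feature set depends on the primitives in use. The feature names imitate the FeatureTools naming style.

```python
from collections import defaultdict
from statistics import mean


def aggregate_orders(orders, key_col, value_col):
    """Compute COUNT, SUM and MEAN of value_col for each distinct key_col value."""
    grouped = defaultdict(list)
    for row in orders:
        grouped[row[key_col]].append(row[value_col])
    return {
        key: {
            "COUNT(order)": len(values),
            "SUM(order.%s)" % value_col: sum(values),
            "MEAN(order.%s)" % value_col: mean(values),
        }
        for key, values in grouped.items()
    }


# Hypothetical rows standing in for the order table.
toy_orders = [
    {"order_num": 10, "cust_num": 1, "sales_tax": 2.0},
    {"order_num": 11, "cust_num": 2, "sales_tax": 1.0},
    {"order_num": 12, "cust_num": 1, "sales_tax": 4.0},
]

customer_features = aggregate_orders(toy_orders, "cust_num", "sales_tax")
print(customer_features[1])
```

Each customer gets one row of features summarising its orders, which mirrors the role of `target_entity="customer"` in the example.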