Run FeatureTools on Spark to automate feature engineering in a distributed fashion.

## FeatureTools for Spark (featuretools4s)

### 1. What's FeatureTools?
FeatureTools is a Python library open-sourced by Feature Labs,
an MIT spin-off, that aims to automate
the process of feature engineering in machine learning
applications.

Please visit the [official website](https://docs.featuretools.com/index.html)
for more details about FeatureTools.
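
To make "automated feature engineering" concrete, here is a minimal pure-Python sketch of the kind of aggregation features such a tool derives for a parent table from its child rows. It is only an illustration, not FeatureTools code: the toy data, the `amount` column, and the `aggregate_features` helper are all hypothetical, and only the `MEAN(order.amount)`-style naming mirrors how FeatureTools labels its features.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: a parent table (customers) and a child table (orders).
customers = [{"cust_num": 1}, {"cust_num": 2}]
orders = [
    {"order_num": 10, "cust_num": 1, "amount": 20.0},
    {"order_num": 11, "cust_num": 1, "amount": 40.0},
    {"order_num": 12, "cust_num": 2, "amount": 5.0},
]

# Aggregation "primitives" applied to the child values of each parent row.
PRIMITIVES = {
    "COUNT": len,
    "SUM": sum,
    "MEAN": lambda values: mean(values) if values else None,
}


def aggregate_features(parents, children, key, value):
    """Build one feature per primitive for every parent row."""
    # Group the child values by the key they share with the parent.
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[value])

    rows = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        row = {key: parent[key]}
        for name, func in PRIMITIVES.items():
            row[f"{name}(order.{value})"] = func(values)
        rows.append(row)
    return rows


customer_features = aggregate_features(customers, orders, "cust_num", "amount")
for row in customer_features:
    print(row)
```

A real run of FeatureTools applies many more primitives, and stacks them across several relationships, which is why the feature matrix grows quickly with the data.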

*FeatureTools4S* is a Python library I wrote to scale
FeatureTools with **Spark**, so that it can generate
features for billions of rows of data. Data at that scale is usually
considered impossible to process on a single machine with the
original, Pandas-based FeatureTools library.

*FeatureTools4S* provides **almost the same** API as the original
FeatureTools, so users can move between FeatureTools and
FeatureTools4S with little effort. **We therefore suggest that readers
learn FeatureTools first; working with FeatureTools4S should then be easy.**

### 2. How to use FeatureTools4S?
First install *featuretools4s* through pip:
```bash
pip3 install featuretools4s
```

A simple example of using *featuretools4s* follows:
```python
import os

import pandas as pd
from pyspark.sql import SparkSession

import featuretools4s as fts

# Point Spark at a local PySpark installation (Windows example paths; adjust for your machine).
# Raw strings keep the backslashes from being treated as escape sequences.
os.environ["SPARK_HOME"] = r"C:\Python36\Lib\site-packages\pyspark"
os.environ["PATH"] = r"C:\Python36;" + os.environ["PATH"]
pd.set_option('display.expand_frame_repr', False)

# Start a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load the two tables as Spark DataFrames.
order_df = spark.read.csv("resources/order.csv", header=True, inferSchema=True).sort("sales_tax")
customer_df = spark.read.csv("resources/customer.csv", header=True, inferSchema=True)

# Build the entity set: two entities linked by cust_num (customer is the parent of order).
es = fts.EntitySetSpark(id="test")
es.entity_from_dataframe("order", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("customer", customer_df, index="cust_num")
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order"]["cust_num"]))

# Run Deep Feature Synthesis on Spark to generate features for each customer.
features = fts.dfs(spark, entityset=es, target_entity="customer", primary_col="cust_num", num_partition=5)
features.show()
```
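
The `primary_col` and `num_partition` arguments suggest that the work is split by key, although this description does not say how featuretools4s distributes it internally. The following is a minimal pure-Python sketch of the general idea of key-based partitioning, assuming that all rows sharing a key land in the same partition. It is not the library's implementation, and the helpers `partition_by_key` and `order_counts` are hypothetical.

```python
from collections import defaultdict


def partition_by_key(rows, key, num_partition):
    """Assign each row to a partition based on the hash of its key value."""
    partitions = [[] for _ in range(num_partition)]
    for row in rows:
        partitions[hash(row[key]) % num_partition].append(row)
    return partitions


def order_counts(rows, key):
    """Count rows per key value (a stand-in for real feature computation)."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


# Hypothetical toy data: 50 orders spread over 7 customers.
orders = [{"order_num": n, "cust_num": n % 7} for n in range(50)]

partitions = partition_by_key(orders, "cust_num", num_partition=5)

# Each partition holds complete groups of rows for its keys, so it can be
# processed independently (for example, on a different worker).
merged = {}
for part in partitions:
    merged.update(order_counts(part, "cust_num"))

# The merged per-partition results match a single-machine computation.
print(merged == order_counts(orders, "cust_num"))
```

Because no customer's orders are split across partitions, the per-partition results can simply be combined without any cross-partition aggregation.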
