features-factory

A small Python package that helps with handling machine learning features (using pandas).

Installing the package

Latest available code:

pip install git+https://gitlab.com/francesco-calcavecchia/features_factory.git

Specific version:

pip install git+https://gitlab.com/francesco-calcavecchia/features_factory.git@vX.Y.Z

Quickstart

Verify the input column Country Code in a pandas DataFrame df:

feat = CountryCodeInputFeature('Country Code')
input_error = feat.verify_input(df)
input_error.has_missing_columns(), input_error.has_columns_with_nan(), input_error.has_columns_with_wrong_format()
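
The three checks reported above can be pictured with plain pandas. The following is only a sketch, not the library's implementation: the helper `check_country_code_column` and the rule "exactly two uppercase letters" are assumptions made for illustration.

```python
import pandas as pd


def check_country_code_column(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical helper mirroring the three checks: missing column, NaN, wrong format."""
    report = {"missing_column": False, "has_nan": False, "wrong_format": False}
    if column not in df.columns:
        report["missing_column"] = True
        return report
    values = df[column]
    report["has_nan"] = bool(values.isna().any())
    # Assumed format: exactly two uppercase ASCII letters, e.g. "DE".
    non_null = values.dropna().astype(str)
    report["wrong_format"] = bool((~non_null.str.fullmatch(r"[A-Z]{2}")).any())
    return report


df = pd.DataFrame({"Country Code": ["DE", "IT", None, "fra"]})
report = check_country_code_column(df, "Country Code")
print(report)
```

Running the checks before any transformation keeps data problems separate from bugs in the feature code.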

Map a column into another through a lambda function:

original_feat = StringInputFeature('Full Name')  # declare a feature corresponding to the column
feat = OneComponentFeature('First Name', original_feat, lambda x: x.split(' ')[0])
enriched_df = feat.insert_into(df)

Map two columns into one using a lambda function:

original_feat1 = StringInputFeature('First Name')
original_feat2 = StringInputFeature('Second Name')
feat = TwoComponentFeature('Full Name', original_feat1, original_feat2, lambda r: r[0] + ' ' + r[1])
new_df = feat.insert_into(df)
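
For comparison, the two mappings above correspond roughly to the following plain pandas operations. This is a sketch of the idea only and does not use the library; the sample data and the `Initial` column are made up, and `.iloc` is used for positional access within a row.

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ada", "Alan"], "Second Name": ["Lovelace", "Turing"]})

# Two columns into one: apply the function row by row (axis=1).
df["Full Name"] = df[["First Name", "Second Name"]].apply(
    lambda r: r.iloc[0] + " " + r.iloc[1], axis=1
)

# One column into another: map each value through the function.
df["Initial"] = df["First Name"].map(lambda x: x[0])
print(df)
```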

Create a stack of features:

original_feat1 = StringInputFeature('First Name')
original_feat2 = StringInputFeature('Second Name')
feat = TwoComponentFeature('Full Name', original_feat1, original_feat2, lambda r: r[0] + ' ' + r[1])
stack = Stack([original_feat1, original_feat2, feat])

Use a stack like a list:

stack.add(feat)

stack.remove(feat)

print('Number of features in this stack = ', len(stack))

for feat in stack:
    print(feat.name())

feat = stack[1]

stack = stack1 + stack2
stack += stack3

A stack automatically ignores duplicates:

stack = Stack([feat1, feat2])
stack.add([feat1])
len(stack) == 2

Handy stack functionalities:

original_feat1 = StringInputFeature('First Name')
original_feat2 = StringInputFeature('Second Name')
feat = TwoComponentFeature('Full Name', original_feat1, original_feat2, lambda r: r[0] + ' ' + r[1])
stack = Stack([feat])
stack = stack.with_dependencies()
stack.names() == ['Full Name', 'First Name', 'Second Name']

original_feat1 = StringInputFeature('First Name')
original_feat2 = StringInputFeature('Second Name')
feat = TwoComponentFeature('Full Name', original_feat1, original_feat2, lambda r: r[0] + ' ' + r[1])
stack = Stack([original_feat1, original_feat2, feat])
stack = stack.only_inputs()
stack.names() == ['First Name', 'Second Name']
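
To make the idea behind with_dependencies() and only_inputs() concrete, here is a toy model in pure Python. It is a sketch under assumptions: `ToyFeature` and both functions are hypothetical stand-ins, and the ordering used by the real library may differ from the breadth-first order chosen here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyFeature:
    """Hypothetical stand-in for a feature: a name plus the features it depends on."""
    name: str
    dependencies: Tuple["ToyFeature", ...] = ()


def with_dependencies(features: List[ToyFeature]) -> List[ToyFeature]:
    """Return the features followed by everything they depend on, without duplicates."""
    ordered: List[ToyFeature] = []
    queue = list(features)
    while queue:
        feat = queue.pop(0)
        if feat not in ordered:
            ordered.append(feat)
            queue.extend(feat.dependencies)
    return ordered


def only_inputs(features: List[ToyFeature]) -> List[ToyFeature]:
    """Keep only the features that have no dependencies."""
    return [feat for feat in features if not feat.dependencies]


first = ToyFeature("First Name")
second = ToyFeature("Second Name")
full = ToyFeature("Full Name", (first, second))

names = [f.name for f in with_dependencies([full])]
input_names = [f.name for f in only_inputs([first, second, full])]
print(names)
print(input_names)
```

Because already collected features are skipped, a dependency shared by several derived features appears only once in the result.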

Verify multiple input columns:

stack = Stack([feat1, feat2, feat3])
input_error = stack.verify(df).get_input_data_error()

Create a stack of features, verify the input data and the feature dependencies, and insert the features into df:

# input features
distance = FloatInputFeature('Distance [m]')
duration = IntInputFeature('Duration [s]')
runner_first_name = StringInputFeature('Runner First Name')
runner_last_name = StringInputFeature('Runner Last Name')
runner_age = IntInputFeature('Runner Age')
# derived features
speed = TwoComponentFeature('Average Speed [km/h]', distance, duration,
                     lambda r: 3.6*r[0]/r[1])
full_name = TwoComponentFeature('Full Name', runner_first_name, runner_last_name,
                         lambda r: r[0] + ' ' + r[1])
full_name_with_age = TwoComponentFeature('Full Name With Age', full_name, runner_age,
                                  lambda r: r[0] + ' (age {})'.format(r[1]))
# final feature
summary = TwoComponentFeature('Summary', full_name_with_age, speed,
                       lambda r: 'The runner {} ran with an average speed of {} km/h'.format(r[0], r[1]))
# create a stack
stack = Stack([summary]).with_dependencies()
# look for errors
stack_error = stack.verify(df)
# populate the df with all the features
if stack_error.is_empty():
    new_df = stack.insert_into(df)

Are you working with a multitude of features and need to apply the same operation to all of them? Check out the StackFactory class. For example:

import pandas as pd

int1 = IntInputFeature('int1')
int2 = IntInputFeature('int2')
float1 = FloatInputFeature('float1')

names = ['2 x int1', '2 x int2', '2 x float1']
dependencies = [int1, int2, float1]
args = [{'name': name, 'dependency': feat, 'map_function': lambda x: 2*x}
        for name, feat in zip(names, dependencies)]
stack = StackFactory.clones(OneComponentFeature, args)

df = pd.DataFrame({int1.name(): [3, 5, 7], int2.name(): [15, 20, 50], float1.name(): [2.2, 0.1, 5.5]})
df = stack.with_dependencies().insert_into(df)
print(df)
#    int1  int2  float1  2 x float1  2 x int1  2 x int2
# 0     3    15     2.2         4.4         6        30
# 1     5    20     0.1         0.2        10        40
# 2     7    50     5.5        11.0        14       100

Pre-Built Features

Input Features

BoolInputFeature: boolean

IntInputFeature: integer

FloatInputFeature: floating point

DateTimeInputFeature: datetime

DateInputFeature: date

StringInputFeature: string

StringTimestampInputFeature: string encoding a timestamp readable via pandas.to_datetime, or according to a specific [format](https://docs.python.org/3.7/library/datetime.html#strftime-strptime-behavior)

CountryCodeInputFeature: two-letter country code (e.g. DE, IT, FR, ES)
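
As a reminder of what "readable via pandas.to_datetime" and "a specific format" mean in practice, here is a short pandas-only sketch. The sample strings are made up, and how StringTimestampInputFeature receives the format is not shown because it is not documented here.

```python
import pandas as pd

# Strings in a standard layout are parsed without any hint.
iso = pd.to_datetime(pd.Series(["2019-11-21 10:30:00", "2020-02-13 08:00:00"]))

# A non-standard layout needs an explicit strptime-style format.
custom = pd.to_datetime(pd.Series(["21/11/2019", "13/02/2020"]), format="%d/%m/%Y")

print(iso.dt.year.tolist())
print(custom.dt.day.tolist())
```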

One-Component Features

OneComponentFeature: define a new feature starting from another one, simply by specifying a lambda function

RenamedFeature: rename a feature column

DateTimeFromStringFeature: extract the datetime from a string which encodes a timestamp

DateFromStringFeature: extract the date from a string which encodes a timestamp

MonthFromDateFeature: extract the month from a date-like object

WeekdayFromDateFeature: extract the weekday from a date-like object (0=Monday, 6=Sunday)
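
The month and weekday conventions above match those of pandas itself. The sketch below uses only pandas (not the library's classes) on made-up dates to show the 0=Monday, 6=Sunday numbering.

```python
import pandas as pd

# A Thursday, a Monday and a Sunday.
dates = pd.to_datetime(pd.Series(["2019-11-21", "2020-09-28", "2020-09-27"]))

months = dates.dt.month.tolist()      # 1 = January ... 12 = December
weekdays = dates.dt.weekday.tolist()  # 0 = Monday ... 6 = Sunday
print(months, weekdays)
```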

Two-Component Features

TwoComponentFeature: define a new feature starting from two others, simply by specifying a lambda function

DurationFeature: given a start datetime and an end datetime, compute the duration
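
In plain pandas, a duration between two datetime columns is a subtraction. The sketch below illustrates that idea on made-up data; the column names are hypothetical, and the unit in which DurationFeature reports its result is not stated here, so the conversion to seconds is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2020-02-13 08:00:00", "2020-02-13 09:15:00"]),
    "end": pd.to_datetime(["2020-02-13 08:45:30", "2020-02-13 10:00:00"]),
})

# Subtracting datetimes yields timedeltas; convert to seconds for numeric use.
df["duration"] = df["end"] - df["start"]
df["duration_s"] = df["duration"].dt.total_seconds()
print(df)
```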

Multi-Component Features

MultiComponentFeature: define a new feature starting from multiple other ones, simply by specifying a lambda function

Composed Features

MeanValueForKeyFeature: given a column with keys and one with values, aggregate the values according to the keys and compute their averages. Finally assign the averages to the new column, according to the keys.
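
The aggregation described above can be expressed in plain pandas with groupby and transform. This is a sketch of the technique on made-up data, not the library's code; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Rome", "Berlin", "Rome", "Berlin"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# transform("mean") computes one mean per key and broadcasts it back to every row.
df["mean price per city"] = df.groupby("city")["price"].transform("mean")
print(df)
```

Unlike a plain aggregation, transform keeps the original row count, so the result can be assigned directly as a new column.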

Why You Should Use This Library

Data verification is a rather painful and tricky task. This library can help in several ways:
1. It makes you think about it.
2. It gives you checks that others have already used to identify issues such as missing columns, NaN values, and wrongly formatted data.
3. How often have you checked the data and found it fine, then modified it somehow and skipped a second check (because what could have changed?), only for something to go wrong? With this library you build a stack that lets you repeat the verification very easily, avoiding these situations.

Features are often built one on top of another, creating a rather complicated tree of dependencies that is annoying to manage manually. This library lets you define the feature structure and then takes care of everything for you.

Think for a moment about how many times people have written the same verification code for a feature, or the code to generate one, again and again. And how many times have silly mistakes led to a big waste of time? The idea of this open source library is to avoid this.

Using this library forces a separation of concerns, so your code will look cleaner.

Developers should know

Set up the right PYTHONPATH:
```
export PYTHONPATH=$(pwd)/src
```

To set up a new virtualenv starting from the Pipfile:

pipenv install                    # create the virtualenv
pipenv shell                      # activate the virtualenv

If you want to update the environment using the Pipfile.lock run:

pipenv sync                       # install the packages pinned in Pipfile.lock

If you need to install a new Python package, use pipenv instead of pip.

To run the tests, execute:
```
python -m unittest discover tests
```

Making a release

rm -rf dist && mkdir dist
pipenv run python setup.py sdist bdist_wheel bdist_egg bdist


Release history

2.0.1: Sep 30, 2021

2.0.0: Feb 4, 2021

1.0.1: Sep 29, 2020

1.0.0: Sep 28, 2020

0.3.0: Feb 26, 2020

0.2.9: Feb 25, 2020

0.2.8: Feb 13, 2020

0.2.7: Feb 13, 2020

0.2.6 (this version): Nov 21, 2019

Download files

Source Distribution

features_factory-0.2.6.tar.gz (64.5 kB), uploaded Nov 21, 2019, Source

Built Distributions

features_factory-0.2.6-py3.7.egg (23.1 kB), uploaded Nov 21, 2019, Source

features_factory-0.2.6-py2.py3-none-any.whl (24.6 kB), uploaded Nov 21, 2019, Python 2 Python 3

Hashes for features_factory-0.2.6.tar.gz
Algorithm	Hash digest
SHA256	`d4b97dc047e1314a2f7fdd3076be956f81e14ad72b770a7ee44835c7b4c87915`
MD5	`67f720c2ec140a4699f8f1ca0ccbf8c9`
BLAKE2b-256	`4bb39c9f2a70991dc75bb590e74d7f22b60d477481285b9c809ba75e55905a7b`

Hashes for features_factory-0.2.6-py3.7.egg
Algorithm	Hash digest
SHA256	`444964366804eb75e851bc762f6abd621c61e8219a7db5906d4604225a826b5f`
MD5	`46383204a97a41ab921500c55ca7c0a1`
BLAKE2b-256	`0dd4f2c67fc38022f161afe83571a23d691e70df6e4656b5d2c9b444a2c4f096`

Hashes for features_factory-0.2.6-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e01e60b863ab1c9fafbd99f179382517b03c6bd5894a1cc3328990aaba3d2c1`
MD5	`ce6cd36cc43c03f8edcdd83355b58675`
BLAKE2b-256	`329a0750566d0c053cf700baca73e13172e5c4963c1b9e1ca64e586f606e38d9`