A library to enhance your feature engineering workflow
Project description
Kodiak
======
.. image:: https://img.shields.io/github/license/mashape/apistatus.svg
.. image:: https://readthedocs.org/projects/kodiak/badge/?version=latest
:target: http://kodiak.readthedocs.io/en/latest/?badge=latest
Overview
--------
**Kodiak** enhances your feature engineering workflow extracting common
patterns so you can create more features faster.
Ex: You have the ``writers`` dataframe, where ``born`` is a datetime
+-----------------------+--------------+
| name | born |
+=======================+==============+
| Miguel de Cervantes | 09-29-1547 |
+-----------------------+--------------+
| William Shakespeare | 04-23-1617 |
+-----------------------+--------------+
and you want to extract from the ``born`` column: ``day``, ``month`` and
``year`` and create 3 new columns
+-----------------------+--------------+---------------+-------------+--------------+
| name | born | born\_month | born\_day | born\_year |
+=======================+==============+===============+=============+==============+
| Miguel de Cervantes | 09-29-1547 | 9 | 29 | 1547 |
+-----------------------+--------------+---------------+-------------+--------------+
| William Shakespeare | 04-23-1617 | 4 | 23 | 1617 |
+-----------------------+--------------+---------------+-------------+--------------+
The simplest thing you could do in Pandas is:
.. code:: python
writers["born_month"] = writers.born.map(lambda x: x.month)
writers["born_day"] = writers.born.map(lambda x: x.day)
writers["born_year"] = writers.born.map(lambda x: x.year)
With Kodiak you could get the same result in one line:
.. code:: python
writers.gencol("born_{month,day,year}", "born", lambda x, y: getattr(x, y))
# or more succinctly
writers.gencol("born_{.month,.day,.year}", "born")
But, how does it work? Kodiak uses ``"born_{month,day,year}"`` as a
template for the columns: ``born_month``, ``born_day`` and ``born_year``
and passes ``month``,\ ``day`` and ``year`` as arguments to a provided
function that also has the current 'born' as an an argument, so you're
basically doing:
.. code:: python
for y in ['month', 'day', 'year']:
writers["born_{}".format(y)] = writers.born.map(lambda x: getattr(x, y))
Kodiak does a lot of other things to boost your workflow, for that, see
the *Basic Usage* and *Advanced Usage* sections
Installation
------------
To install Kodiak, ``cd`` to the Kodiak folder and run the install
command:
.. code:: sh
sudo python setup.py install
You can also install Kodiak from PyPI:
.. code:: sh
sudo pip install kodiak
Basic Usage
-----------
**Kodiak** main object is ``KodiakDataFrame`` an extension of
``pandas.Dataframe`` that provides the instance method ``colgen`` to
create one or more columns. You create ``KodiakDataFrames`` the same way
you create a ``pandas.DataFrame``
``colgen`` signature is:
``colgen(newcols, col, colbuilder=None, enum=False, config=None)``
newcols
has a double function, it works as a specification of the columns you
want to create, and it also contains the values passed to ``colbuilder``
.. code:: python
# If you want to create the columns `first_name`, `last_name`
# and pass `first` and `last` as arguments to `colbuilder` you write
>>> newcols = "{first,last}_name"
# More complex patterns allowed
>>> groups = "col_{a,b}_{c,d}"
# Will create the columns: `col_a_c`, `col_b_d`
# The way `a,b` and `c,d` is combined can be configured
col
is the `KodiakDataframe` column from where you'll extract information
to create your new column/s
colbuilder
is a function or lambda used to extract info from ``col``
and create the columns specified in ``newcols`` with the
corresponding ``col`` instance and the ``newcols`` values.
The signature of `colbuilder` is `colbuilder(x, y)` or
`colbuilder(i, x, y)` `x` is an instance of the column passed
in `col` and `y` is an argument extracted from `newcols`. The
extra argument `i` is an index of the arguments.
config
tweak Kodiak inner workings with your own config, see the
dedicated section for more info
Advanced Usage
--------------
In this section we're going to describe the main components and concepts
that are essential to Kodiak
Templating
~~~~~~~~~~
The template language is minimal but has some extensions to help you:
Ranges
^^^^^^
The range notation is ``start:end:step``. Reverse ranges are permitted
setting ``end`` bigger than ``start``. ``step`` default value is ``1``, and
``start`` is ``0``, finally if ``end`` is absent, it'll be setted to ``0`` and
you'll have a reversed range. Ranges are inclusive.
.. code:: python
simple_range = "col_{1:3}" # -> col_1, col_2, col_3
step_range = "col_{:3:2}" # -> col_0, col_3
inverse_range = "col_{3:1}" # -> col_3,col_2,col_1
no_end = "2:" # -> col_2,col_1,col_0
Key-Value
^^^^^^^^^
If you want the column name and argument passed to the ``colbuilder`` to
differ you can use key-values.
.. code:: python
dataframe.gencol("{short=very_long_name}_col", "col", alambda)
# In this case the column name will be ``short_col`` but you'll pass
# ``very_long_name`` to ``alambda``
# key-value notation can be extended to more arguments:
dataframe.gencol("{k1=v1,k2=v2,k3=v3}_col", "col", alambda)
.. WARNING::
values are always interpreted as *strings* so in:
``col_{k=1:5}`` the value ``1:5`` is interpreted as ``"1:5"`` and not as
a range, the same for ``col_{k=[1,2,3]}`` and any other object, also if
you pass a number it will also be interpreted as string so you will need
to convert it if you intend to use it as an ``int``.
Transforms
~~~~~~~~~~
Under the hood when you pass ``newcols`` to ``gencol``, Kodiak builds an
``OrderedDict`` where it's keys are column names and it's values are
tuples of ``Match`` objects -even if it's just one Match it's wrapped
inside a tuple-
.. code:: python
newcol = "{first,last}_name"
# will build
args_dict = {'first_name': (Match('first'),), 'last_name': (Match('last'),)}
``Transforms`` are a way to pre-process the values and change them
enriching the ``Match`` object with a payload as you will see in the
``Default colbuilder`` section.
So, if the values are ``Match`` objects, how is that when you write your
``colbuilder`` you deal with ``strings``? Kodiak understands that if the
``Match`` object doesn't have a payload it's better to pass strings
arguments to ``colbuilder``, this behaviour can be controlled.
What's the use of ``Match`` objects and their ``payload``? What're some
examples of ``Transforms``? The next section will answer this questions
Default colbuilder
~~~~~~~~~~~~~~~~~~
As you can see in the ``colgen`` signature, ``colbuilder`` default
argument is ``None``, in special cases Kodiak can infer the
``colbuilder`` method, let's revisit the opening example.
.. code:: python
writers.gencol("born_{.month,.day,.year}", "born")
The ``colbuilder`` in this case is inferred from the hint you gave
Kodiak in the template: ``.month``, prefixing ``month`` with a ``.``
indicates that you're referring to an attribute of ``born``, so
internally Kodiak builds a ``colbuider`` that extracts the ``month``
from a ``born`` instance. Another way of omitting the ``colbuilder`` is
when you have an instance method:
.. code:: python
# Notice the `!` after weekday
writers.gencol("born_{weekday!}", "born")
.. WARNING::
This hint only works for methods with no arguments, passing
a method with one or more arguments will raise an error
How is that Kodiak infers the ``colbuilder``? When the ``newcols`` are
processed they go through a pipeline of ``Transforms``, one of them:
``PropertyTransform`` detects that ``.month`` refers to an attribute and
enriches de ``Match`` object hinting in the payload the corresponding
``colbuilder``, that's why you don't need to pass the ``colbuilder``
argument. But what happens if you give a ``colbuilder``? In this case,
as the ``Match`` object has a ``payload`` instead of working with plain
strings you will work with tuples of ``Match`` objects
.. Note::
Kodiak will raise an exception when it can't figure out a
default colbuilder
Enumerations
~~~~~~~~~~~~
Sometimes you care about the position of the arguments not the exact
value, in that case you can use the ``enum`` param or the implicit
``enum`` with a function or lambda of arity 3, the first argument will
be the index starting at 0.
.. code:: python
writers.gencol("{first=0,last=1}_name", "name", lambda x,y: x.split(" ")[int(y)])
# Another way with enum=True
writers.gencol("{first,last}_name", "name", lambda i,x,y: x.split(" ")[i], enum=True)
# Without enum=True but with a colbuilder with arity 3
writers.gencol("{first,last}_name", "name", lambda i,x,y: x.split(" ")[i])
Configuration
-------------
Almost everything is configurable in Kodiak, you could have a per-method
configuration or system-wide config.
The ``Config`` object has the following customizable params:
parser
Kodiak by default uses the ``ArgsParser`` class to parse ``newcols``
match\_transform
data passed to the ``colbuilder`` could be
transformed first, by default we use the ``default_transform`` pipeline,
you could replace it with an array of ``Transforms`` objects.
new\_col\_combiner
in the newcols template if you have
``"col_{a,b}_{c,d}"``, this results in the columns: ``"col_a_c"`` and
``"col_b,d"``, how the different groups ``['a','b']`` and ``['c', 'd']``
are combined is controlled with this param, currently we use the ``zip``
function, and you could replace it with a function with similar
signature.
unpack
Boolean Default True, when ``newcols`` is simple, ``foo_{a,b}``
instead of ``foo_{.a,b!}`` instead of passing to ``colbuilder``
tuples of ``Match`` objects we just pass individual items,
``a``, ``b``, so it's easier to build a ``colbuilder`` without
having to unwrap the ``Match`` tuples
col\_pair\_combiner
Once you have the arguments that you're going to
pass to the ``colbuilder`` you can combine them in different ways, currently
we use ``product`` from itertools, ie: the cartesian product between an
element, ex: ``event``, and the other n-columns, creating the following
tuples:
.. code:: python
[('event', 'day') , ('event', 'month'), ('event', 'year')]
You can replace this method with another with the same signature as ``product``
Config can be accessed, modified and restored with:
.. code:: python
>> import config
>> from config import cfg
>> config.options
# Global change on config
>> config.options["unpack"] = False
>> config.options["col_pair_combiner"] = zip
# Restoring one or more fields of the configuration
>> config.restore_default_config("col_pair_combiner")
# Restoring all the options
>> config.restore_default_config()
# With `base_config` or it's alias `cfg` you can create modified versions
# of the default config
>> dataframe.gencol("col_{a!,b!}","col", func, config=cfg(unpack=False))
=======
History
=======
0.1.0 (2017-09-19)
------------------
* First release on PyPI.
======
.. image:: https://img.shields.io/github/license/mashape/apistatus.svg
.. image:: https://readthedocs.org/projects/kodiak/badge/?version=latest
:target: http://kodiak.readthedocs.io/en/latest/?badge=latest
Overview
--------
**Kodiak** enhances your feature engineering workflow extracting common
patterns so you can create more features faster.
Ex: You have the ``writers`` dataframe, where ``born`` is a datetime
+-----------------------+--------------+
| name | born |
+=======================+==============+
| Miguel de Cervantes | 09-29-1547 |
+-----------------------+--------------+
| William Shakespeare | 04-23-1617 |
+-----------------------+--------------+
and you want to extract from the ``born`` column: ``day``, ``month`` and
``year`` and create 3 new columns
+-----------------------+--------------+---------------+-------------+--------------+
| name | born | born\_month | born\_day | born\_year |
+=======================+==============+===============+=============+==============+
| Miguel de Cervantes | 09-29-1547 | 9 | 29 | 1547 |
+-----------------------+--------------+---------------+-------------+--------------+
| William Shakespeare | 04-23-1617 | 4 | 23 | 1617 |
+-----------------------+--------------+---------------+-------------+--------------+
The simplest thing you could do in Pandas is:
.. code:: python
writers["born_month"] = writers.born.map(lambda x: x.month)
writers["born_day"] = writers.born.map(lambda x: x.day)
writers["born_year"] = writers.born.map(lambda x: x.year)
With Kodiak you could get the same result in one line:
.. code:: python
writers.gencol("born_{month,day,year}", "born", lambda x, y: getattr(x, y))
# or more succinctly
writers.gencol("born_{.month,.day,.year}", "born")
But, how does it work? Kodiak uses ``"born_{month,day,year}"`` as a
template for the columns: ``born_month``, ``born_day`` and ``born_year``
and passes ``month``,\ ``day`` and ``year`` as arguments to a provided
function that also has the current 'born' as an an argument, so you're
basically doing:
.. code:: python
for y in ['month', 'day', 'year']:
writers["born_{}".format(y)] = writers.born.map(lambda x: getattr(x, y))
Kodiak does a lot of other things to boost your workflow, for that, see
the *Basic Usage* and *Advanced Usage* sections
Installation
------------
To install Kodiak, ``cd`` to the Kodiak folder and run the install
command:
.. code:: sh
sudo python setup.py install
You can also install Kodiak from PyPI:
.. code:: sh
sudo pip install kodiak
Basic Usage
-----------
**Kodiak** main object is ``KodiakDataFrame`` an extension of
``pandas.Dataframe`` that provides the instance method ``colgen`` to
create one or more columns. You create ``KodiakDataFrames`` the same way
you create a ``pandas.DataFrame``
``colgen`` signature is:
``colgen(newcols, col, colbuilder=None, enum=False, config=None)``
newcols
has a double function, it works as a specification of the columns you
want to create, and it also contains the values passed to ``colbuilder``
.. code:: python
# If you want to create the columns `first_name`, `last_name`
# and pass `first` and `last` as arguments to `colbuilder` you write
>>> newcols = "{first,last}_name"
# More complex patterns allowed
>>> groups = "col_{a,b}_{c,d}"
# Will create the columns: `col_a_c`, `col_b_d`
# The way `a,b` and `c,d` is combined can be configured
col
is the `KodiakDataframe` column from where you'll extract information
to create your new column/s
colbuilder
is a function or lambda used to extract info from ``col``
and create the columns specified in ``newcols`` with the
corresponding ``col`` instance and the ``newcols`` values.
The signature of `colbuilder` is `colbuilder(x, y)` or
`colbuilder(i, x, y)` `x` is an instance of the column passed
in `col` and `y` is an argument extracted from `newcols`. The
extra argument `i` is an index of the arguments.
config
tweak Kodiak inner workings with your own config, see the
dedicated section for more info
Advanced Usage
--------------
In this section we're going to describe the main components and concepts
that are essential to Kodiak
Templating
~~~~~~~~~~
The template language is minimal but has some extensions to help you:
Ranges
^^^^^^
The range notation is ``start:end:step``. Reverse ranges are permitted
setting ``end`` bigger than ``start``. ``step`` default value is ``1``, and
``start`` is ``0``, finally if ``end`` is absent, it'll be setted to ``0`` and
you'll have a reversed range. Ranges are inclusive.
.. code:: python
simple_range = "col_{1:3}" # -> col_1, col_2, col_3
step_range = "col_{:3:2}" # -> col_0, col_3
inverse_range = "col_{3:1}" # -> col_3,col_2,col_1
no_end = "2:" # -> col_2,col_1,col_0
Key-Value
^^^^^^^^^
If you want the column name and argument passed to the ``colbuilder`` to
differ you can use key-values.
.. code:: python
dataframe.gencol("{short=very_long_name}_col", "col", alambda)
# In this case the column name will be ``short_col`` but you'll pass
# ``very_long_name`` to ``alambda``
# key-value notation can be extended to more arguments:
dataframe.gencol("{k1=v1,k2=v2,k3=v3}_col", "col", alambda)
.. WARNING::
values are always interpreted as *strings* so in:
``col_{k=1:5}`` the value ``1:5`` is interpreted as ``"1:5"`` and not as
a range, the same for ``col_{k=[1,2,3]}`` and any other object, also if
you pass a number it will also be interpreted as string so you will need
to convert it if you intend to use it as an ``int``.
Transforms
~~~~~~~~~~
Under the hood when you pass ``newcols`` to ``gencol``, Kodiak builds an
``OrderedDict`` where it's keys are column names and it's values are
tuples of ``Match`` objects -even if it's just one Match it's wrapped
inside a tuple-
.. code:: python
newcol = "{first,last}_name"
# will build
args_dict = {'first_name': (Match('first'),), 'last_name': (Match('last'),)}
``Transforms`` are a way to pre-process the values and change them
enriching the ``Match`` object with a payload as you will see in the
``Default colbuilder`` section.
So, if the values are ``Match`` objects, how is that when you write your
``colbuilder`` you deal with ``strings``? Kodiak understands that if the
``Match`` object doesn't have a payload it's better to pass strings
arguments to ``colbuilder``, this behaviour can be controlled.
What's the use of ``Match`` objects and their ``payload``? What're some
examples of ``Transforms``? The next section will answer this questions
Default colbuilder
~~~~~~~~~~~~~~~~~~
As you can see in the ``colgen`` signature, ``colbuilder`` default
argument is ``None``, in special cases Kodiak can infer the
``colbuilder`` method, let's revisit the opening example.
.. code:: python
writers.gencol("born_{.month,.day,.year}", "born")
The ``colbuilder`` in this case is inferred from the hint you gave
Kodiak in the template: ``.month``, prefixing ``month`` with a ``.``
indicates that you're referring to an attribute of ``born``, so
internally Kodiak builds a ``colbuider`` that extracts the ``month``
from a ``born`` instance. Another way of omitting the ``colbuilder`` is
when you have an instance method:
.. code:: python
# Notice the `!` after weekday
writers.gencol("born_{weekday!}", "born")
.. WARNING::
This hint only works for methods with no arguments, passing
a method with one or more arguments will raise an error
How is that Kodiak infers the ``colbuilder``? When the ``newcols`` are
processed they go through a pipeline of ``Transforms``, one of them:
``PropertyTransform`` detects that ``.month`` refers to an attribute and
enriches de ``Match`` object hinting in the payload the corresponding
``colbuilder``, that's why you don't need to pass the ``colbuilder``
argument. But what happens if you give a ``colbuilder``? In this case,
as the ``Match`` object has a ``payload`` instead of working with plain
strings you will work with tuples of ``Match`` objects
.. Note::
Kodiak will raise an exception when it can't figure out a
default colbuilder
Enumerations
~~~~~~~~~~~~
Sometimes you care about the position of the arguments not the exact
value, in that case you can use the ``enum`` param or the implicit
``enum`` with a function or lambda of arity 3, the first argument will
be the index starting at 0.
.. code:: python
writers.gencol("{first=0,last=1}_name", "name", lambda x,y: x.split(" ")[int(y)])
# Another way with enum=True
writers.gencol("{first,last}_name", "name", lambda i,x,y: x.split(" ")[i], enum=True)
# Without enum=True but with a colbuilder with arity 3
writers.gencol("{first,last}_name", "name", lambda i,x,y: x.split(" ")[i])
Configuration
-------------
Almost everything is configurable in Kodiak, you could have a per-method
configuration or system-wide config.
The ``Config`` object has the following customizable params:
parser
Kodiak by default uses the ``ArgsParser`` class to parse ``newcols``
match\_transform
data passed to the ``colbuilder`` could be
transformed first, by default we use the ``default_transform`` pipeline,
you could replace it with an array of ``Transforms`` objects.
new\_col\_combiner
in the newcols template if you have
``"col_{a,b}_{c,d}"``, this results in the columns: ``"col_a_c"`` and
``"col_b,d"``, how the different groups ``['a','b']`` and ``['c', 'd']``
are combined is controlled with this param, currently we use the ``zip``
function, and you could replace it with a function with similar
signature.
unpack
Boolean Default True, when ``newcols`` is simple, ``foo_{a,b}``
instead of ``foo_{.a,b!}`` instead of passing to ``colbuilder``
tuples of ``Match`` objects we just pass individual items,
``a``, ``b``, so it's easier to build a ``colbuilder`` without
having to unwrap the ``Match`` tuples
col\_pair\_combiner
Once you have the arguments that you're going to
pass to the ``colbuilder`` you can combine them in different ways, currently
we use ``product`` from itertools, ie: the cartesian product between an
element, ex: ``event``, and the other n-columns, creating the following
tuples:
.. code:: python
[('event', 'day') , ('event', 'month'), ('event', 'year')]
You can replace this method with another with the same signature as ``product``
Config can be accessed, modified and restored with:
.. code:: python
>> import config
>> from config import cfg
>> config.options
# Global change on config
>> config.options["unpack"] = False
>> config.options["col_pair_combiner"] = zip
# Restoring one or more fields of the configuration
>> config.restore_default_config("col_pair_combiner")
# Restoring all the options
>> config.restore_default_config()
# With `base_config` or it's alias `cfg` you can create modified versions
# of the default config
>> dataframe.gencol("col_{a!,b!}","col", func, config=cfg(unpack=False))
=======
History
=======
0.1.0 (2017-09-19)
------------------
* First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
kodiak-0.1.0.tar.gz
(22.7 kB
view details)
Built Distribution
File details
Details for the file kodiak-0.1.0.tar.gz
.
File metadata
- Download URL: kodiak-0.1.0.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d66200850e4c84edb5142d907f77828461a4c933eebdfdd4aac4e1135a92889d |
|
MD5 | 506605bf0ef4ea2c2bc7e7e213e66df2 |
|
BLAKE2b-256 | 1cdc4a3bf903323df70a2e02c190e839902de66a6b75c3c519240497bfcb9196 |
File details
Details for the file kodiak-0.1.0-py2.py3-none-any.whl
.
File metadata
- Download URL: kodiak-0.1.0-py2.py3-none-any.whl
- Upload date:
- Size: 17.9 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e1046ea84a3582788b3bf621d0243c338ca07a45dda09e77ee2627a2ed74776a |
|
MD5 | c7ee2cc0e966cc62d68915d240560150 |
|
BLAKE2b-256 | 34ac6479dc471ecf95a5ff083148be08246ddc679ef0f9ca6cedb90b34d543e2 |