Stochastic gradient descent algorithms
Project description
pysgd
=====
The ``pysgd`` package contains a function that performs various
stochastic gradient descent algorithms. The function accepts data, an
objective function, a gradient descent adaptation and algorithm
hyperparameters as its arguments.
The intention of this package is to present reasonably efficient,
working algorithms that are easy to understand.
The code is structured with one main function in the main module,
``sgd``, one class, ``Objective``, and modules for adaptations and
objective functions. ``sgd`` creates an ``Objective`` instance using the
adaptation and objective function modules and any data specified. The
objective modules include both gradient and cost functions. The gradient
functions accept ``theta`` and ``batch`` as arguments and the cost
functions accept ``theta`` and ``data``. The adaptations are set up as
decorated generator functions that accept ``params`` and ``grad_fun`` as
arguments. ``params`` is a dict of hyperparameters and ``grad_fun`` is a
gradient function. ``Objective`` creates a gradient generator and a cost
function (injecting batch generator and data as required) that each
accept ``theta`` as a single argument. These functions are then used in
the main gradient descent algorithm in ``sgd``.
The package is structured to facilitate the inclusion of additional
algorithms with minimal additional boilerplate. Additional objective
functions and gradient adaptations can be added by following the basic
form of the included ones and placing them in their respective folders.
::
pysgd
 ` __init__.py
 adaptations
  __init__.py
  adagrad.py
  adam.py
 ` constant.py
 objectives
  __init__.py
  linear.py
  logistic.py
 ` stab_tang.py
` tests
Gradient Descent
~~~~~~~~~~~~~~~~
Gradient descent is a method for minimizing an objective function. In
machine learning applications the objective function to be minimized is
the error (or cost), ``J``, of a predictive model. A predictive model
consists of a parameters, ``theta``, that are applied to inputs, ``X``,
(also called training samples, features, observations or independent
variables) in order to estimate an output, ``y_hat`` (also called a
label or dependent variable). Gradient descent attempts to determine the
parameters that when applied to a set of inputs result in the lowest
total error (the difference between the actual outcome and the one
predicted by the model). Below is the basic predictive formula.
``H(X,theta) = y_hat``
And here is an illustrative formula for determining the total error of a
model.
``J(theta) = sum(h_i(theta,x_i)  y_i for each training observation, i)``
Different formulas for computing cost are used depending on the
application, but the formula above expresses the essence of predicting
actual outcomes as closely as possible.
In order to minimze ``J`` with respect to ``theta``, the algorithm
starts with an abitrary value of ``theta``, determines the "direction"
that would result in the fastest decrease in cost (called the
``gradient``), updates ``theta`` in that direction by a small amount
(called the learning rate or ``alpha``) and then repeats until cost
``J`` has been minimized.
``theta_(j+1) = theta_j  alpha * gradient_j``
API
~~~
The package has one main function, ``sgd``, that returns a ``j x (n+2)``
array, where ``j`` is the number of iterations and ``n`` is the number
of features. ``theta_j`` is in the first ``n+1`` columns and the cost
``J_j`` in the last column.
\Argument \Definition \
\\\
\\ ``theta0`` \Starting value of ``theta`` in the form of an
``1 x (n+1)`` array. \ \\ ``obj='stab_tang'`` \Objective function to
be minimized in the form of a string with a value of ``stab_tang``,
``linear`` or ``logistic``. ``stab_tang`` is for the `StablinskyTang
function <https://en.wikipedia.org/wiki/Test_functions_for_optimization>`__,
included for testing and illustrative purposes. \
\\ ``adapt='constant'`` \Gradient descent adaptation in the form of a
string with a value of ``constant``, ``adagrad`` or ``adam``.
.. raw:: html
<ul>
.. raw:: html
<li>
``constant`` applies no adaptation
.. raw:: html
</li>
.. raw:: html
<li>
``adagrad`` implements `Adaptive Gradient
Algorithm <http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf>`__
.. raw:: html
</li>
.. raw:: html
<li>
``adam`` implements `Adaptive Moment
Estimation <https://arxiv.org/pdf/1412.6980v8.pdf>`__
.. raw:: html
</li>
.. raw:: html
</ul>
::

\\ ``data=np.array.(closed brackets)``\ \Data in the form of an
``m x (n+1)`` array, including ``ones`` in the first column, if
necessary, where ``m`` is the number of training observations. \
\\ ``size=50`` \Batch size in the form of an integer between ``1`` and
``m``. Batches are generated contiguously over the data until theta has
converged or all observations have been included in a batch, at which
point the data is shuffled before additional batches are used.\
\\ ``alpha=.01`` \Learning rate ``alpha`` in the form of a floating
point integer. \ \\ ``epsilon=10**8`` \Hyperparameter used by
``adagrad`` and ``adam`` for smoothing. \ \\ ``beta1=0.9``
\Hyperparamter used by ``adam`` that controls the decay rates of the
moving gradient averages. \ \\ ``beta2=0.999`` \Hyperparamter used by
``adam`` that controls the decay rates of the moving gradient averages.
\ \\ ``delta_min=10**6`` \Maximum change in all elements of
``theta`` required to establish convergence, in the form of a floating
point integer.\ \\ ``iters=1000`` \Maximum number of batches to
evaluate if convergence is not achieved in fewer iterations. \
Tests
^^^^^
Tests are are performed with
`pytest <http://doc.pytest.org/en/latest/index.html>`__ and cover 100%
of the code.
In addition to sample data sets, the `StablinskyTang
function <https://en.wikipedia.org/wiki/Test_functions_for_optimization>`__
is used for testing. This function is nonconvex with straightforward
gradient computations that makes it possible to compare the values
produced by the algorithms with known values. By using two dimensional
inputs all of the possible values of ``J`` within a range of ``theta``
values can be plotted as a surface. ``J(theta)`` for each iteration of
the algorithm can then be plotted on the surface in order to visualize
the gradient descent.
The color scale of the surfaces in the plots in the included notebook
corresponds to the value of ``J``. The color scale of the points on the
surface, which represent ``J_(theta_j)`` at each iteration, corresponds
to the iteration of the algorithm.
=====
The ``pysgd`` package contains a function that performs various
stochastic gradient descent algorithms. The function accepts data, an
objective function, a gradient descent adaptation and algorithm
hyperparameters as its arguments.
The intention of this package is to present reasonably efficient,
working algorithms that are easy to understand.
The code is structured with one main function in the main module,
``sgd``, one class, ``Objective``, and modules for adaptations and
objective functions. ``sgd`` creates an ``Objective`` instance using the
adaptation and objective function modules and any data specified. The
objective modules include both gradient and cost functions. The gradient
functions accept ``theta`` and ``batch`` as arguments and the cost
functions accept ``theta`` and ``data``. The adaptations are set up as
decorated generator functions that accept ``params`` and ``grad_fun`` as
arguments. ``params`` is a dict of hyperparameters and ``grad_fun`` is a
gradient function. ``Objective`` creates a gradient generator and a cost
function (injecting batch generator and data as required) that each
accept ``theta`` as a single argument. These functions are then used in
the main gradient descent algorithm in ``sgd``.
The package is structured to facilitate the inclusion of additional
algorithms with minimal additional boilerplate. Additional objective
functions and gradient adaptations can be added by following the basic
form of the included ones and placing them in their respective folders.
::
pysgd
 ` __init__.py
 adaptations
  __init__.py
  adagrad.py
  adam.py
 ` constant.py
 objectives
  __init__.py
  linear.py
  logistic.py
 ` stab_tang.py
` tests
Gradient Descent
~~~~~~~~~~~~~~~~
Gradient descent is a method for minimizing an objective function. In
machine learning applications the objective function to be minimized is
the error (or cost), ``J``, of a predictive model. A predictive model
consists of a parameters, ``theta``, that are applied to inputs, ``X``,
(also called training samples, features, observations or independent
variables) in order to estimate an output, ``y_hat`` (also called a
label or dependent variable). Gradient descent attempts to determine the
parameters that when applied to a set of inputs result in the lowest
total error (the difference between the actual outcome and the one
predicted by the model). Below is the basic predictive formula.
``H(X,theta) = y_hat``
And here is an illustrative formula for determining the total error of a
model.
``J(theta) = sum(h_i(theta,x_i)  y_i for each training observation, i)``
Different formulas for computing cost are used depending on the
application, but the formula above expresses the essence of predicting
actual outcomes as closely as possible.
In order to minimze ``J`` with respect to ``theta``, the algorithm
starts with an abitrary value of ``theta``, determines the "direction"
that would result in the fastest decrease in cost (called the
``gradient``), updates ``theta`` in that direction by a small amount
(called the learning rate or ``alpha``) and then repeats until cost
``J`` has been minimized.
``theta_(j+1) = theta_j  alpha * gradient_j``
API
~~~
The package has one main function, ``sgd``, that returns a ``j x (n+2)``
array, where ``j`` is the number of iterations and ``n`` is the number
of features. ``theta_j`` is in the first ``n+1`` columns and the cost
``J_j`` in the last column.
\Argument \Definition \
\\\
\\ ``theta0`` \Starting value of ``theta`` in the form of an
``1 x (n+1)`` array. \ \\ ``obj='stab_tang'`` \Objective function to
be minimized in the form of a string with a value of ``stab_tang``,
``linear`` or ``logistic``. ``stab_tang`` is for the `StablinskyTang
function <https://en.wikipedia.org/wiki/Test_functions_for_optimization>`__,
included for testing and illustrative purposes. \
\\ ``adapt='constant'`` \Gradient descent adaptation in the form of a
string with a value of ``constant``, ``adagrad`` or ``adam``.
.. raw:: html
<ul>
.. raw:: html
<li>
``constant`` applies no adaptation
.. raw:: html
</li>
.. raw:: html
<li>
``adagrad`` implements `Adaptive Gradient
Algorithm <http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf>`__
.. raw:: html
</li>
.. raw:: html
<li>
``adam`` implements `Adaptive Moment
Estimation <https://arxiv.org/pdf/1412.6980v8.pdf>`__
.. raw:: html
</li>
.. raw:: html
</ul>
::

\\ ``data=np.array.(closed brackets)``\ \Data in the form of an
``m x (n+1)`` array, including ``ones`` in the first column, if
necessary, where ``m`` is the number of training observations. \
\\ ``size=50`` \Batch size in the form of an integer between ``1`` and
``m``. Batches are generated contiguously over the data until theta has
converged or all observations have been included in a batch, at which
point the data is shuffled before additional batches are used.\
\\ ``alpha=.01`` \Learning rate ``alpha`` in the form of a floating
point integer. \ \\ ``epsilon=10**8`` \Hyperparameter used by
``adagrad`` and ``adam`` for smoothing. \ \\ ``beta1=0.9``
\Hyperparamter used by ``adam`` that controls the decay rates of the
moving gradient averages. \ \\ ``beta2=0.999`` \Hyperparamter used by
``adam`` that controls the decay rates of the moving gradient averages.
\ \\ ``delta_min=10**6`` \Maximum change in all elements of
``theta`` required to establish convergence, in the form of a floating
point integer.\ \\ ``iters=1000`` \Maximum number of batches to
evaluate if convergence is not achieved in fewer iterations. \
Tests
^^^^^
Tests are are performed with
`pytest <http://doc.pytest.org/en/latest/index.html>`__ and cover 100%
of the code.
In addition to sample data sets, the `StablinskyTang
function <https://en.wikipedia.org/wiki/Test_functions_for_optimization>`__
is used for testing. This function is nonconvex with straightforward
gradient computations that makes it possible to compare the values
produced by the algorithms with known values. By using two dimensional
inputs all of the possible values of ``J`` within a range of ``theta``
values can be plotted as a surface. ``J(theta)`` for each iteration of
the algorithm can then be plotted on the surface in order to visualize
the gradient descent.
The color scale of the surfaces in the plots in the included notebook
corresponds to the value of ``J``. The color scale of the points on the
surface, which represent ``J_(theta_j)`` at each iteration, corresponds
to the iteration of the algorithm.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size pysgd0.0.1a2py2.py3noneany.whl (13.8 kB)  File type Wheel  Python version py2.py3  Upload date  Hashes View 
Filename, size pysgd0.0.1a2.tar.gz (7.8 kB)  File type Source  Python version None  Upload date  Hashes View 
Close
Hashes for pysgd0.0.1a2py2.py3noneany.whl
Algorithm  Hash digest  

SHA256  bfb72e1c81afc46c0a6d10fcb3001b377d73cfbce171e6d6d50b66fa6f0bf605 

MD5  0dc4ed4d18597a69cbd8e8f2ec40cbcf 

BLAKE2256  51cd36522ed8161f46ae6ea874d4c72eeb0119039f671f447a88c3d85341cf62 