
Data augmentation with the mixup method


mixupy


mixupy is a Python package for data augmentation inspired by mixup: Beyond Empirical Risk Minimization

If you like mixupy, give it a star, or fork it and contribute!

Usage

Create additional training data for the iris dataset:

import numpy as np
import pandas as pd
from mixupy import mixup

# Use 'iris' dataset from seaborn package
import seaborn as sns
iris = sns.load_dataset('iris')

# One-hot encode species column
iris_df = pd.get_dummies(iris, columns=['species'], prefix='', prefix_sep='')
iris_df

# Strictly speaking this is 'input mixup' (see Details section below)
# Seed NumPy's global RNG for reproducibility (assumes mixup draws from it)
np.random.seed(42)
iris_mix = mixup(iris_df)
iris_mix.describe()
iris_df.describe()

# Further info
help(mixup)

Installation

pip install mixupy

Requires Python 3.7 or higher, plus pandas and NumPy:

pip install numpy pandas

Details

The mixup function enlarges training sets using linear interpolations of features and associated labels as described in https://arxiv.org/abs/1710.09412.

Virtual feature-target pairs are produced from randomly drawn feature-target pairs in the training data. The method is straightforward and data-agnostic, and it should reduce generalisation error.

mixup constructs additional training examples:

x' = λ * x_i + (1 - λ) * x_j, where x_i, x_j are raw input vectors

y' = λ * y_i + (1 - λ) * y_j, where y_i, y_j are one-hot label encodings

(x_i, y_i) and (x_j, y_j) are two examples drawn at random from the training data, and λ ∈ [0, 1] with λ ∼ Beta(α, α) for α ∈ (0, ∞). The mixup hyper-parameter α controls the strength of interpolation between feature-target pairs.
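The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not mixupy's own implementation: the helper name `mixup_arrays` is hypothetical, and pairing each row with a partner from a random permutation is an assumption about how the random pairs are drawn.

```python
import numpy as np

def mixup_arrays(x, y, alpha=1.0, rng=None):
    """Return one mixed example per input row (hypothetical helper, not part of mixupy)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)  # features, shape (n, n_features)
    y = np.asarray(y, dtype=float)  # one-hot labels, shape (n, n_classes)
    n = x.shape[0]
    # Assumption: pair example i with the i-th entry of a random permutation.
    j = rng.permutation(n)
    # One lambda per pair, drawn from Beta(alpha, alpha); shape (n, 1) broadcasts over columns.
    lam = rng.beta(alpha, alpha, size=(n, 1))
    x_mix = lam * x + (1.0 - lam) * x[j]
    y_mix = lam * y + (1.0 - lam) * y[j]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                # six toy feature vectors
y = np.eye(3)[rng.integers(0, 3, size=6)]  # six one-hot labels over three classes
x_mix, y_mix = mixup_arrays(x, y, alpha=0.4, rng=rng)
```

Because the same λ is applied to features and labels, each mixed label row still sums to one and each mixed feature lies between the two values it was built from.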

mixup() parameters

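| Parameter  | Description                                          | Type              | Notes                                  |
|------------|------------------------------------------------------|-------------------|----------------------------------------|
| data       | Original data                                        | pandas data frame | Required parameter                     |
| alpha      | Hyperparameter specifying strength of interpolation  | numeric           | Defaults to 4                          |
| concat     | Concatenate mixup data with original data            | boolean           | Defaults to False                      |
| batch_size | How many mixup values to produce                     | integer           | Defaults to number of 'data' examples  |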

The 'data' parameter must be a numeric (integers and/or floats) pandas data frame. Non-finite values are not permitted. Categorical variables should be one-hot encoded.
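These requirements can be checked before the data frame is passed on. The sketch below is only an illustration: `check_mixup_input` is a hypothetical helper and not part of mixupy, which may perform its own validation and report errors differently.

```python
import numpy as np
import pandas as pd

def check_mixup_input(df):
    """Raise ValueError unless df is all-numeric and finite (hypothetical helper)."""
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    if not np.isfinite(df.to_numpy(dtype=float)).all():
        raise ValueError("data contains NaN or infinite values")

# Toy data with one categorical column.
raw = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})

# One-hot encode the categorical column so that every column is numeric.
encoded = pd.get_dummies(raw, columns=["species"], prefix="", prefix_sep="", dtype=float)
check_mixup_input(encoded)  # passes silently
```

Calling `check_mixup_input(raw)` instead would raise, because the `species` column still holds strings.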

Alpha values must be greater than or equal to zero. Alpha equal to zero specifies no interpolation.
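To get a feel for how alpha shapes the interpolation, the sketch below draws λ for several alpha values and reports how far λ sits from 0.5 on average. The helper `sample_lambda` is hypothetical, and treating alpha = 0 as λ = 1 (returning the first example unchanged) is an assumption about what "no interpolation" means, since NumPy's Beta sampler rejects a zero shape parameter.

```python
import numpy as np

def sample_lambda(alpha, size, rng):
    """Draw mixing weights for a given alpha (hypothetical helper, not part of mixupy)."""
    if alpha < 0:
        raise ValueError("alpha must be greater than or equal to zero")
    if alpha == 0:
        # Assumption: no interpolation is modelled as lambda = 1.
        return np.ones(size)
    return rng.beta(alpha, alpha, size=size)

rng = np.random.default_rng(0)
spread = {}
for alpha in (0, 0.2, 1, 4):
    lam = sample_lambda(alpha, 10_000, rng)
    # Mean distance of lambda from 0.5: larger means the mix stays closer to one example.
    spread[alpha] = float(np.abs(lam - 0.5).mean())
    print(alpha, round(spread[alpha], 3))
```

Small alpha values push λ towards 0 or 1, so mixed examples stay close to one of the originals; larger values such as the default of 4 concentrate λ around 0.5 and blend the pair more evenly.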

The mixup function returns a pandas data frame containing interpolated values. Optionally, the original values can be concatenated with the new values by passing concat=True.

Mixup with deep learning versus other learning methods

It is worth distinguishing between mixup usage with deep learning and with other learning methods. Mixup with deep learning can improve generalisation when a new mixed dataset is generated every epoch or, even better, for every minibatch. This level of granularity may not be possible with other learning methods. For example, simple linear modeling may not benefit much from training on a single (potentially greatly expanded) pre-mixed dataset. This single pre-mixed dataset approach is sometimes referred to as 'input mixup'.
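The per-minibatch variant can be sketched as a generator that mixes each minibatch afresh, so no two epochs see the same virtual examples. This is a plain NumPy illustration under stated assumptions: `mixed_minibatches` is a hypothetical helper rather than a mixupy function, partners are drawn within each minibatch, and the training step is left as a placeholder.

```python
import numpy as np

def mixed_minibatches(x, y, batch_size, alpha, rng):
    """Yield freshly mixed minibatches for one epoch (hypothetical helper)."""
    order = rng.permutation(len(x))  # shuffle the data once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        # Assumption: each example is mixed with a partner from the same minibatch.
        partner = rng.permutation(len(idx))
        lam = rng.beta(alpha, alpha, size=(len(idx), 1))
        yield (lam * xb + (1 - lam) * xb[partner],
               lam * yb + (1 - lam) * yb[partner])

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))                # toy features
y = np.eye(3)[rng.integers(0, 3, size=10)]  # toy one-hot labels

for epoch in range(2):
    for xb, yb in mixed_minibatches(x, y, batch_size=4, alpha=0.4, rng=rng):
        pass  # a training step on (xb, yb) would go here
```

By contrast, 'input mixup' corresponds to mixing the data once up front and reusing that fixed dataset for every epoch.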

In certain situations, under-fitting can occur when conflicts between synthetic labels of the mixup examples and labels of the original training data are present. Some learning methods may be more prone to this under-fitting than others.

Data augmentation as regularisation

Data augmentation is occasionally referred to as a regularisation technique. Regularisation decreases a model's variance by adding prior knowledge (sometimes using shrinkage). Increasing training data (using augmentation) also decreases a model's variance. Data augmentation is also a form of adding prior knowledge to a model.

Citing

If you use mixup in a scientific publication, then consider citing the original paper:

mixup: Beyond Empirical Risk Minimization

By Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz

https://arxiv.org/abs/1710.09412

I have no affiliation with MIT, FAIR or any of the authors.

Roadmap

  • Improve docs
    • Add before and after mixup plots for iris data
    • Add more detailed examples
      • Different data types e.g. image, temporal etc
      • Different parameters
  • Add my time series mixup variant
  • Add label preserving option
  • Add support for mixing within the same class
    • Usually doesn't perform as well as mixing within all classes
    • May still have some utility e.g. unbalanced data sets

Alternatives

Other implementations:

See Also

Discussion:

Closely related research:

Loosely related research:

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

GPL-3

