Mixed-type multivariate time series modeling with generative adversarial networks.
Project description
An open source project from Data to AI Lab at MIT.
DeepEcho
- License: MIT
- Development Status: Pre-Alpha
- Homepage: https://github.com/sdv-dev/DeepEcho
Overview
DeepEcho is a Synthetic Data Generation Python library for mixed-type, multivariate time series. It provides:
- Multiple models based both on classical statistical modeling of time series and the latest in Deep Learning techniques.
- A robust benchmarking framework for evaluating these methods on multiple datasets and with multiple metrics.
- Ability for Machine Learning researchers to submit new methods following our model and sample API and get evaluated.
Join our Slack Workspace
If you want to be part of the SDV community to receive announcements of the latest releases, ask questions, suggest new features or participate in the development meetings, please join our Slack Workspace!
Install
Requirements
DeepEcho has been developed and tested on Python 3.6, 3.7 and 3.8.
Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where DeepEcho runs.
Install with pip
The easiest and recommended way to install DeepEcho is using pip:
pip install deepecho
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
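Before running the quickstart, it can help to confirm that the interpreter is one of the tested versions and that the package can be found. The following is a minimal sketch using only the standard library. The helper names is_tested_python and is_installed are hypothetical and are not part of DeepEcho.

```python
import importlib.util
import sys

# Python versions listed as tested in the Requirements section.
TESTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_tested_python(version_info=None):
    """Return True if the (major, minor) version is one DeepEcho was tested on."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in TESTED_VERSIONS


def is_installed(package_name="deepecho"):
    """Return True if the top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("Tested Python version:", is_tested_python())
    print("deepecho importable:", is_installed())
```

A False result from is_tested_python does not mean DeepEcho will fail, only that the version is outside the range stated above.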
Quickstart
In this short quickstart, we show how to learn a mixed-type multivariate time series dataset and then generate synthetic data that resembles it.
We will start by loading the data and preparing the instance of our model.
from deepecho import PARModel
from deepecho.demo import load_demo
# Load demo data
data = load_demo()
# Define data types for all the columns
data_types = {
    'region': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}
model = PARModel(cuda=False)
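A mistyped column name or type label in data_types is easy to miss, so a quick check before fitting may be useful. The sketch below is plain Python and does not call DeepEcho. The helper check_data_types is hypothetical, and the set of type names is an assumption drawn only from the values used in this quickstart, so the library may accept others. The column list is likewise assembled from the names that appear in this quickstart.

```python
# Type names that appear in the quickstart. This set is an assumption and
# may not cover everything the library accepts.
KNOWN_TYPES = {"categorical", "continuous", "count"}


def check_data_types(columns, data_types, ignore=()):
    """Return a list of human-readable problems found in `data_types`.

    `columns` is an iterable of column names in the table, and `ignore`
    lists columns that are expected to have no entry (for example the
    entity and sequence index columns).
    """
    columns = list(columns)
    problems = []
    for name, kind in data_types.items():
        if name not in columns:
            problems.append(f"'{name}' is not a column in the data")
        if kind not in KNOWN_TYPES:
            problems.append(f"'{name}' has unrecognized type '{kind}'")
    for name in columns:
        if name not in data_types and name not in ignore:
            problems.append(f"'{name}' has no declared type")
    return problems


# Column names taken from the quickstart; the demo table may differ.
columns = ["store_id", "date", "region", "day_of_week", "total_sales", "nb_customers"]
data_types = {
    "region": "categorical",
    "day_of_week": "categorical",
    "total_sales": "continuous",
    "nb_customers": "count",
}
problems = check_data_types(columns, data_types, ignore=("store_id", "date"))
print(problems or "data_types looks consistent")
```

With a real table, the column list could be taken from the loaded data instead of being written by hand.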
If we want to use different settings for our model, like increasing the number of epochs or enabling CUDA, we can pass the arguments when creating the model:
model = PARModel(epochs=1024, cuda=True)
Note that for small datasets like the one used in this demo, the overhead of using CUDA outweighs the gains from parallelization, so training is more efficient without CUDA even when it is available.
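This trade-off can be written as a small guard that requests CUDA only when a GPU is present and the dataset is large enough to justify it. The sketch below assumes a PyTorch backend, which this document does not state, and falls back to False if PyTorch is not installed. The helper names and the min_rows cutoff are hypothetical. The cutoff in particular is an arbitrary illustrative number, not a measured threshold.

```python
def cuda_available():
    """Return True if PyTorch is installed and reports a usable GPU."""
    try:
        import torch  # assumption: the model runs on PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def should_use_cuda(num_rows, min_rows=100_000, available=None):
    """Decide whether to request CUDA for a dataset with `num_rows` rows.

    `min_rows` is an arbitrary illustrative cutoff, not a measured value.
    `available` can be passed explicitly to skip the hardware check.
    """
    if available is None:
        available = cuda_available()
    return bool(available) and num_rows >= min_rows


use_cuda = should_use_cuda(num_rows=500)
print("Request CUDA:", use_cuda)
# The result could then be passed to the model, as in the quickstart:
# model = PARModel(cuda=use_cuda)
```

A sensible cutoff depends on the hardware and the model settings, so timing a short training run with and without CUDA is a more reliable guide than any fixed number.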
Once we have created our instance, we are ready to learn the data and generate new synthetic data that resembles it:
# Learn a model from the data
model.fit(
    data=data,
    entity_columns=['store_id'],
    context_columns=['region'],
    data_types=data_types,
    sequence_index='date'
)
# Sample new data
model.sample(num_entities=5)
The output will be a table of synthetic time series data with properties similar to those of the demo data that we used as input.
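To make the roles of entity_columns, context_columns and sequence_index more concrete, the following sketch groups a small long-format table into one sequence per entity, ordered by the sequence index, with the context stored once per entity. It is a conceptual illustration in plain Python with made-up rows. It is not DeepEcho's internal implementation, and the function name assemble_sequences is hypothetical.

```python
from collections import defaultdict


def assemble_sequences(rows, entity_columns, context_columns, sequence_index):
    """Group rows by entity and order each group by the sequence index."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in entity_columns)
        groups[key].append(row)

    # Columns that identify, describe or order a sequence are kept out of
    # the per-step data.
    skip = set(entity_columns) | set(context_columns) | {sequence_index}
    sequences = []
    for key, group in groups.items():
        group = sorted(group, key=lambda row: row[sequence_index])
        sequences.append({
            "entity": key,
            # Context values are assumed constant within an entity,
            # so they are read from the first row only.
            "context": {col: group[0][col] for col in context_columns},
            "data": [
                {col: value for col, value in row.items() if col not in skip}
                for row in group
            ],
        })
    return sequences


# Made-up rows for illustration only.
rows = [
    {"store_id": 2, "date": "2020-06-01", "region": "south", "total_sales": 80.0},
    {"store_id": 1, "date": "2020-06-02", "region": "north", "total_sales": 120.5},
    {"store_id": 1, "date": "2020-06-01", "region": "north", "total_sales": 100.0},
]
sequences = assemble_sequences(rows, ["store_id"], ["region"], "date")
for sequence in sequences:
    print(sequence)
```

In this view, each entity (here a store) contributes one time series, the context columns hold values that do not change over that series, and the sequence index fixes the order of its rows.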
What's next?
For more details about DeepEcho and all its possibilities and features, please check and run the tutorials.
If you want to see how we evaluate the performance and quality of our models, please have a look at the DeepEcho Benchmarking framework or explore the obtained results.
Also, please feel welcome to visit our contributing guide to help us develop new features or cool ideas!
Related Projects
SDV
SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using DeepEcho through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.
CTGAN
CTGAN is a GAN-based model for synthesizing tabular data. It is also developed by MIT's Data to AI Lab and is under active development.
History
0.1.2 (2020-09-15)
Add BasicGAN Model and additional benchmarking results.
0.1.1 (2020-08-15)
This release includes a few new features to make DeepEcho work on more types of datasets, as well as to make it easier to add new datasets to the benchmarking framework.
- Add segment_size and sequence_index arguments to fit method.
- Add sequence_length as an optional argument to sample and sample_sequence methods.
- Update the Dataset storage format to add sequence_index and versioning.
- Separate the sequence assembling process in its own deepecho.sequences module.
- Add function make_dataset to create a dataset from a dataframe and just a few column names.
- Add notebook tutorial to show how to create datasets and use them.
0.1.0 (2020-08-11)
First release.
Included Features:
- PARModel
- Demo dataset and tutorials
- Benchmarking Framework
- Support and instructions for benchmarking on a Kubernetes cluster.
Hashes for deepecho-0.1.3.dev0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | fa1fb1b3840a8ae74e77f3a4cc87bfab71648285adcce08ac943ae3537e205ad
MD5 | 46eb04cec910baf28dd887d076bc21a5
BLAKE2b-256 | 545c7316e1508f3cece2b5405c209b3b0bc8f441f3365326e425c948443eddff