Embedding Tool
An embedding toolkit that performs multiple embedding processes: low-dimensional embedding (dimension reduction), categorical variable embedding, and financial time-series embedding.
Install
pip install embedding-tool
from embedding_tool.core import *
How to use
Dimension Reduction: the dimensionReducer class
The class performs dimensionality reduction, pre-processing the data and comparing the reconstruction error of PCA against autoencoders.
Input data: The input matrix has a size of 863 $\times$ 768.
print ("Data's size: ", testing_data.shape)
Data's size: (863, 768)
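If you don't have a matrix of this shape at hand, a synthetic one is enough to try the toolkit (a sketch; the `testing_data` in the original walkthrough is presumably a real embedding matrix, e.g. sentence embeddings):

```python
import numpy as np

# Synthetic stand-in for the 863 x 768 input matrix used in this walkthrough.
rng = np.random.default_rng(seed=0)
testing_data = rng.normal(size=(863, 768))

print("Data's size: ", testing_data.shape)  # Data's size:  (863, 768)
```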
Performing dimension reduction: we will reduce the number of dimensions from 768 to 2. A learning rate of 0.002 will be used by the Adam optimizer when fitting the autoencoder models.
dim_reducer = dimensionReducer(testing_data, 2, 0.002)
dim_reducer.fit()
Calculating the MSE of the reconstructed vectors
dim_reducer.rmse_result
| | PCA | 1AE | 2AE |
|---|---|---|---|
| MSE | 0.740122 | 0.741265 | 0.651680 |
dim_reducer.rmse_result.T.sort_values('MSE').head(1).values[0][0]
0.6516801665399286
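The chained call above transposes the result frame so each method becomes a row, sorts by MSE, and takes the best entry. The same selection on a toy frame shaped like `rmse_result` (a sketch, assuming the attribute is a one-row DataFrame indexed by `'MSE'` with one column per method):

```python
import pandas as pd

# Toy stand-in for dim_reducer.rmse_result: one row ('MSE'), one column per method.
rmse_result = pd.DataFrame(
    {"PCA": [0.740122], "1AE": [0.741265], "2AE": [0.65168]}, index=["MSE"]
)

# Transpose so methods become rows, sort ascending by MSE, keep the best one.
best = rmse_result.T.sort_values("MSE").head(1)
print(best.index[0], best.values[0][0])  # 2AE 0.65168
```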
Here we can see that the two-layer autoencoder has the best performance, with the lowest MSE of about 0.65.
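For intuition, the PCA reconstruction MSE that the class reports can be reproduced with a plain SVD (a sketch of the idea on a small random matrix, not the package's actual internals):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 20))

# Center, project onto the top-2 principal directions, and reconstruct.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rec = Xc @ Vt[:2].T @ Vt[:2]

# Mean squared reconstruction error, analogous to the MSE row in the table above.
mse = float(np.mean((Xc - X_rec) ** 2))
print(mse)
```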
Observing the loss for each epoch: if the MSE doesn't converge fast enough, adjust the learning rate parameter (default 0.002). Try increasing it to 0.005 if the loss fails to converge, or decreasing it to 0.001 if it converges too quickly and oscillates.
dim_reducer.plot_autoencoder_performance()
Result (Reduced Dimension Output): There are three outputs from three different methods: PCA, 1-layer AE, and 2-layer AE.
### Embedding from PCA
dim_reducer.dfLowDimPCA.head()
| | 0 | 1 |
|---|---|---|
| 0 | -16.078718 | -6.701481 |
| 1 | -8.858150 | 9.354204 |
| 2 | 4.305739 | -0.464707 |
| 3 | -11.514311 | -0.687461 |
| 4 | 1.212006 | 6.537965 |
### Embedding from 1-layer autoencoder
dim_reducer.dfLowDim1AE.head()
| | 0 | 1 |
|---|---|---|
| 0 | -6.178097 | 4.734626 |
| 1 | 2.075333 | 5.529111 |
| 2 | 0.953502 | -1.667776 |
| 3 | -2.488155 | 4.001960 |
| 4 | 3.183654 | 0.589496 |
### Embedding from 2-layer autoencoder
dim_reducer.dfLowDim2AE.head()
| | 0 | 1 |
|---|---|---|
| 0 | 32.622066 | 54.652271 |
| 1 | 35.649811 | 40.493984 |
| 2 | 15.314294 | 5.869064 |
| 3 | 19.667603 | 37.821194 |
| 4 | 36.183212 | 25.429262 |
Plotting the embedding
### Embedding from 2-layer autoencoder
plot_output(dim_reducer.dfLowDim2AE)
### Embedding from 1-layer autoencoder
plot_output(dim_reducer.dfLowDim1AE)
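The source of `plot_output` is not shown here; a minimal equivalent scatter plot of a two-column embedding frame might look like the following (a sketch assuming matplotlib; `plot_embedding` is a hypothetical name and the package's actual styling may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_embedding(df: pd.DataFrame) -> plt.Figure:
    """Scatter-plot the two embedding dimensions stored in columns 0 and 1."""
    fig, ax = plt.subplots()
    ax.scatter(df[0], df[1], s=10, alpha=0.6)
    ax.set_xlabel("dim 0")
    ax.set_ylabel("dim 1")
    return fig

# Demo with the first rows of the PCA embedding shown above.
demo = pd.DataFrame({0: [-16.078718, -8.858150, 4.305739],
                     1: [-6.701481, 9.354204, -0.464707]})
fig = plot_embedding(demo)
fig.savefig("embedding.png")
```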
References:
- https://towardsdatascience.com/dimensionality-reduction-pca-versus-autoencoders-338fcaf3297d
- https://towardsdatascience.com/autoencoders-vs-pca-when-to-use-which-73de063f5d7
Hashes for embedding_tool-0.1.1-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ad8d2744db7b2ceb5a07648897f34bf89f4550d93e8c988eda8011c4d37dbaf4 |
| MD5 | 03fe0cea66a3e38cdd5793820071e85b |
| BLAKE2b-256 | 32676592b842d94e5adefae7bc6ceb1e270194cfafb90377a2ac9d5f960bbf3d |