Griddify high-dimensional tabular data for easy visualization and deep learning
Griddify
Redistribute tabular data into a grid for easy visualization and image-based deep learning. This library is greatly inspired by the excellent MolMap library.
Installation
git clone https://github.com/ersilia-os/griddify.git
cd griddify
pip install -e .
Note that you may have to install a C++ compiler. You can just use conda for that:
conda install -c conda-forge cxx-compiler
Step by step
Get a multidimensional dataset and preprocess it
In this example, we will use a dataset of 200 physicochemical descriptors calculated for about 10k compounds. You can get these data with the following command.
from griddify import datasets
data = datasets.get_compound_descriptors()
It is important that you preprocess your data (impute missing values, normalize, etc.). We provide functionality to do so.
from griddify import Preprocessing
pp = Preprocessing()
pp.fit(data)
data = pp.transform(data)
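To make "impute missing values, normalize" concrete, here is a minimal NumPy sketch of that kind of preprocessing. It is not the implementation behind `Preprocessing`; the function name `impute_and_scale` and the choice of median imputation followed by standardization are assumptions made for illustration.

```python
import numpy as np

def impute_and_scale(X):
    """Median-impute missing values, then standardize each column.

    Illustrative only: the steps griddify's Preprocessing actually
    applies may differ.
    """
    X = np.asarray(X, dtype=float).copy()
    # Column medians, computed while ignoring NaNs
    medians = np.nanmedian(X, axis=0)
    # Replace every NaN with the median of its column
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    # Standardize to zero mean and unit variance per column
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mean) / std

# Toy table: 3 samples, 2 features, one missing value
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0]])
scaled = impute_and_scale(toy)
```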
Create a 2D cloud of data features
Start by calculating distances between features.
from griddify import FeatureDistances
fd = FeatureDistances(metric="cosine").calculate(data)
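Note that distances are computed between features (columns), not between samples (rows). The NumPy sketch below shows one way to obtain a feature-by-feature cosine distance matrix. The helper `feature_cosine_distances` is hypothetical, and the object returned by `FeatureDistances.calculate` may be structured differently.

```python
import numpy as np

def feature_cosine_distances(X):
    """Return an (n_features, n_features) matrix of cosine distances."""
    X = np.asarray(X, dtype=float)
    F = X.T  # one row per feature
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero features: avoid division by zero
    U = F / norms
    # Cosine similarity, clipped to guard against rounding error
    sim = np.clip(U @ U.T, -1.0, 1.0)
    return 1.0 - sim

# Toy table: 3 samples, 3 features; features 0 and 1 are proportional
X = np.array([[1.0, 2.0, -1.0],
              [2.0, 4.0, -2.0],
              [3.0, 6.0,  0.5]])
D = feature_cosine_distances(X)
```

Proportional features end up at distance 0, so they will tend to sit next to each other in the later steps.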
You can now obtain a 2D cloud of your data features. By default, UMAP is used.
from griddify import Tabular2Cloud
tc = Tabular2Cloud()
tc.fit(fd)
Xc = tc.transform(fd)
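To illustrate what "a 2D cloud from a distance matrix" means, the sketch below uses classical multidimensional scaling (MDS) as a stand-in. This is a deliberately swapped-in technique, chosen because it needs only NumPy; it is not UMAP and is not what `Tabular2Cloud` does by default. The function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points in n_components dimensions from a distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to get a Gram matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigh returns eigenvalues in ascending order; keep the largest ones
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can appear for non-Euclidean distances
    lam = np.clip(eigvals[idx], 0.0, None)
    return eigvecs[:, idx] * np.sqrt(lam)

# Toy check: four corners of a unit square, described only by distances
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
cloud = classical_mds(D)
```

The recovered coordinates match the original ones only up to rotation and reflection, which is enough here because only the relative positions of features matter.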
It is always good to inspect the resulting projection. The cloud contains one point per feature in your dataset.
from griddify.plots import cloud_plot
cloud_plot(Xc)
Rearrange the 2D cloud onto a grid
Distribute cloud points on a grid using a linear assignment algorithm.
from griddify import Cloud2Grid
cg = Cloud2Grid()
cg.fit(Xc)
Xg = cg.transform(Xc)
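The linear assignment step can be sketched with SciPy's `linear_sum_assignment`: every cloud point is matched to a distinct grid cell so that the total squared displacement is minimal. This is a minimal sketch under stated assumptions, not the code inside `Cloud2Grid`. The helper `cloud_to_grid`, the choice of the smallest square grid that fits all points, and the (row, column) output format are all hypothetical.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def cloud_to_grid(cloud):
    """Assign each 2D point to a distinct cell of a square grid."""
    cloud = np.asarray(cloud, dtype=float)
    n = cloud.shape[0]
    # Assumption: smallest side such that side * side >= n
    side = math.isqrt(n - 1) + 1
    # Rescale the cloud to the unit square
    lo = cloud.min(axis=0)
    span = cloud.max(axis=0) - lo
    span[span == 0] = 1.0
    unit = (cloud - lo) / span
    # Cell centers, flattened so that cell c sits at row c // side, column c % side
    ticks = (np.arange(side) + 0.5) / side
    gx, gy = np.meshgrid(ticks, ticks)
    cells = np.column_stack([gx.ravel(), gy.ravel()])
    # Cost of moving each point to each cell; rectangular matrices are allowed
    cost = cdist(unit, cells, metric="sqeuclidean")
    point_idx, cell_idx = linear_sum_assignment(cost)
    cells_rc = np.empty((n, 2), dtype=int)
    cells_rc[point_idx, 0] = cell_idx // side
    cells_rc[point_idx, 1] = cell_idx % side
    return cells_rc, side

rng = np.random.default_rng(0)
toy_cloud = rng.normal(size=(10, 2))
cells_rc, grid_side = cloud_to_grid(toy_cloud)
```

With 10 points the grid is 4 x 4, so six cells remain empty. No two points share a cell.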
You can check the rearrangement with an arrows plot.
from griddify.plots import arrows_plot
arrows_plot(Xc, Xg)
To continue with the next steps, it is actually more convenient to get mappings as integers. The following method gives you the size of the grid as well.
mappings, side = cg.get_mappings(Xc)
Rearrange your flat data points into grids
Let's go back to the original tabular data. We want to transform the input data, where each sample is represented as a one-dimensional array, into output data where each sample is represented as an image (i.e. a two-dimensional grid). Please ensure that the data are normalized or scaled.
from griddify import Flat2Grid
fg = Flat2Grid(mappings, side)
Xi = fg.transform(data)
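Conceptually, this step scatters each feature value into the grid cell assigned to that feature. The NumPy sketch below assumes the mappings are an array of integer (row, column) pairs, one per feature; the source does not document the actual format returned by `get_mappings`, so treat both that layout and the helper `flat_to_grid` as hypothetical.

```python
import numpy as np

def flat_to_grid(X, cells_rc, side, fill=0.0):
    """Turn (n_samples, n_features) rows into (n_samples, side, side) images."""
    X = np.asarray(X, dtype=float)
    cells_rc = np.asarray(cells_rc, dtype=int)
    # Cells without an assigned feature keep the fill value
    images = np.full((X.shape[0], side, side), fill, dtype=float)
    # Write feature j of every sample into its assigned (row, column) cell
    images[:, cells_rc[:, 0], cells_rc[:, 1]] = X
    return images

# Toy data: 2 samples, 3 features, placed on a 2 x 2 grid
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
toy_cells = np.array([[0, 0], [1, 1], [0, 1]])  # assumed (row, column) per feature
images = flat_to_grid(X, toy_cells, side=2)
```

Because the same cell layout is used for every sample, a given pixel always carries the same feature, which is what makes the images comparable across samples.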
Explore one sample.
from griddify.plots import grid_plot
grid_plot(Xi[0])
Full pipeline
You can run the full pipeline described above in only a few lines of code.
from griddify import datasets
from griddify import Griddify
data = datasets.get_compound_descriptors()
gf = Griddify(preprocess=True)
gf.fit(data)
Xi = gf.transform(data)
You can find more examples as Jupyter Notebooks in the notebooks folder.
Learn more
The Ersilia Open Source Initiative is on a mission to strengthen research capacity in low-income countries. Please reach out to us if you want to contribute: hello@ersilia.io