Functions to help data scientists work more effectively with the DWH
cads-sdk: Functions to help data scientists work more effectively with unstructured data and the data lake
What is it?
cads-sdk provides functions that help data scientists work more effectively with unstructured data. It includes functions for working with images and audio.
Main Features
Here are just a few of the things that cads-sdk does well:
- Data pre-processing: up to 25% faster than cv2.imread()
- Image pre-processing: converts an image folder or zip file to Parquet/Delta, ready for training
- Audio pre-processing: converts an audio folder to Parquet, ready for training
- Storage optimization: optimizes storage by 12% without reducing image quality
- Reduces the number of small files
- Takes advantage of zstd compression in Parquet
- View images and listen to audio without bringing the system down (for example, the browser running out of memory)
Install
Binary installers for the latest released version are available at the Python Package Index (PyPI).
# with PyPI
pip install cads-sdk
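To confirm that the install succeeded, the standard library can report the installed version of a distribution. This is a minimal sketch; `installed_version` is a hypothetical helper name, not part of cads-sdk.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "cads-sdk" is the distribution name used in the pip command above.
version = installed_version("cads-sdk")
print("cads-sdk:", version or "not installed")
```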
Dependencies
- spark-sdk - Add-on functions for PySpark and PyArrow
- opencv-python - Wrapper package for OpenCV python bindings
- petastorm - Petastorm is a library enabling the use of Parquet storage from Tensorflow, Pytorch, and other Python-based ML training frameworks
- pandas - Powerful data structures for data analysis, time series, and statistics
Installation from sources
To install cads-sdk from source you need Cython in addition to the normal dependencies above. Cython can be installed from PyPI:
pip install cython
In the cads-sdk directory (the same one where you found this file after cloning the git repo), execute:
python setup.py install
Documentation
The official documentation is hosted on PyData.org: https://pandas.pydata.org/pandas-docs/stable
Image
Convert an image folder to Parquet
from cads_sdk.nosql.converter import ConvertFromFolderImage

converter = ConvertFromFolderImage(
    # input settings
    input_path="/path/to/folder/**/*.jpg",
    input_type="jpg",  # 'jpg' | ('jpg', 'png')
    input_recursive=True,
    # output settings
    output_path="file:/output/path/image_storage",
    # converter settings
    image_type="jpg",
    image_color=3,
    resize_mode=None,  # None | 'padding' | 'resize'
    size=[(212, 212), (597, 597)],
)
converter.execute()
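The `input_path` above is a glob pattern containing `**`. How cads-sdk expands it internally is not documented here; the sketch below only shows how Python's standard `glob` module treats such a pattern with and without recursion, which is useful for checking which files a pattern would select before running a conversion. `list_inputs` is a hypothetical helper, not a cads-sdk function.

```python
import glob
import os
import tempfile


def list_inputs(pattern, extensions, recursive=True):
    """Return sorted file paths matching `pattern` with one of `extensions`.

    `extensions` may be a single string ('jpg') or a tuple (('jpg', 'png')).
    """
    if isinstance(extensions, str):
        extensions = (extensions,)
    suffixes = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    matches = glob.glob(pattern, recursive=recursive)
    return sorted(
        {p for p in matches if os.path.isfile(p) and p.lower().endswith(suffixes)}
    )


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.jpg", "a/1.jpg", "a/b/2.jpg", "a/b/3.png"):
        open(os.path.join(root, *rel.split("/")), "wb").close()

    pattern = os.path.join(root, "**", "*.jpg")
    # With recursive=True, '**' matches zero or more directory levels.
    print(len(list_inputs(pattern, "jpg", recursive=True)))
    # With recursive=False, '**' behaves like a single '*'.
    print(len(list_inputs(pattern, "jpg", recursive=False)))
```

With recursion enabled the pattern reaches files at any depth; without it, `**` matches exactly one directory level.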
# convert directly from a .zip file to Parquet
from cads_sdk.nosql.converter import ConvertFromZipImage

converter = ConvertFromZipImage(
    # input settings
    input_path="/path/to/image_storage/ETHZ.zip",
    input_recursive=True,  # walk through sub-folders to collect every matching file
    input_type="jpg",  # 'jpg' | ('jpg', 'png')
    # output settings
    output_path="file:/output/path/img_ethz.parquet",
    table_name="img_ethz",
    database="default",
    file_format="parquet",  # 'delta' | 'parquet'
    compression="zstd",  # 'zstd' | 'snappy'
    # converter settings
    image_type="png",
    image_color=3,
    resize_mode=None,  # None | 'padding' | 'resize'
    size=[(1080, 1920)],
    debug=False,
)
converter.execute()
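Before converting a large archive, it can help to check which members an `input_type` such as `'jpg'` or `('jpg', 'png')` would select. The sketch below uses only the standard `zipfile` module and is an illustration of the idea, not the converter's actual implementation; `image_members` is a hypothetical helper.

```python
import io
import zipfile


def image_members(zip_source, extensions=("jpg",)):
    """Yield (member_name, raw_bytes) for archive members with a matching extension.

    `zip_source` is a path or a binary file-like object; `extensions` is a
    string or a tuple of strings, compared case-insensitively.
    """
    if isinstance(extensions, str):
        extensions = (extensions,)
    suffixes = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(suffixes):
                yield info.filename, archive.read(info)


# Build a tiny in-memory archive to demonstrate the filter.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ETHZ/seq1/p001.jpg", b"jpg-bytes")
    zf.writestr("ETHZ/seq1/p002.PNG", b"png-bytes")
    zf.writestr("ETHZ/readme.txt", b"text")

for name, data in image_members(buffer, ("jpg", "png")):
    print(name, len(data))
```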
Convert an image Parquet file back to an image folder
from cads_sdk.nosql.converter import ConvertToFolderImage

converter = ConvertToFolderImage(
    input_path="/user/username/image/img_user_device_jpg_212_212.parquet",
    raw_input_path="/home/username/image_storage/device_images/**/*.jpg",
    output_path="./abc/",
)
converter.execute()
Read images from Parquet
from cads_sdk.nosql import display
import cads_sdk as ss

# load at most 50 rows into a pandas DataFrame that renders images
df = ss.sql("""select * from parquet.`/user/duyvnc/image/img_images_jpg_212_212.parquet`""")
pdf = df.toPandasImage(limit=50)
pdf

# limit the rows in SQL and request BGR channel order
pdf = ss.sql("""
select *
from parquet.`file:/home/duyvnc/image_storage/img_mot17_1080_1920.parquet`
limit 100
""").toPandasImage(mode='BGR')
PyTorch API
from cads_sdk.nosql import codec
from petastorm import make_reader, TransformSpec
from petastorm.pytorch import DataLoader

# dataset_url, transform (a TransformSpec), model, device, optimizer and train()
# are not defined in this snippet; they come from your own training script.
num_epochs = 10
with DataLoader(
    reader=make_reader(
        "{}/train".format(dataset_url),
        reader_pool_type="dummy",
        num_epochs=num_epochs,
        transform_spec=transform,
    ),
    batch_size=32,
) as train_loader:
    train(model, device, train_loader, 2000, optimizer, num_epochs)
Audio
Supports PCM, MP3, and WAV formats.
Convert an audio folder to Parquet
from cads_sdk.nosql.converter import ConvertFromFolderAudio

converter = ConvertFromFolderAudio(
    input_path="/path/to/audio_wav/*.wav",
    input_type="wav",  # 'wav' | 'mp3' | 'pcm' | ('wav', 'mp3')
    input_recursive=False,
    output_path="file:/output/path/audio_wav.parquet",
)
converter.execute()
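The layout of the Parquet rows that cads-sdk writes for audio is not described in this README. As a rough illustration of what converting a WAV file into a table row can involve, the sketch below reads the header of a WAV file with the standard `wave` module and keeps the raw bytes next to the metadata. The field names and the helpers `wav_record` and `make_silence` are assumptions made for this example.

```python
import io
import wave


def wav_record(path_label, wav_bytes):
    """Return a dict with WAV metadata plus the original bytes (illustrative schema)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "path": path_label,
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width": wav.getsampwidth(),  # bytes per sample
            "frames": wav.getnframes(),
            "audio": wav_bytes,
        }


def make_silence(seconds=1, sample_rate=8000):
    """Build a mono 16-bit WAV file of silence in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


record = wav_record("audio_wav/silence.wav", make_silence())
print({k: v for k, v in record.items() if k != "audio"})
```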
Convert a Parquet file back to an audio folder
# import path assumed to match the other converters
from cads_sdk.nosql.converter import ConvertToFolderAudio

converter = ConvertToFolderAudio(
    input_path="file:/path/to/audio_wav.parquet",
    raw_input_path="/path/to/audio_wav/*.wav",  # the '/path/to/audio_wav/' prefix is replaced with ''
    output_path="./output/path",
    write_mode="recovery",
)
converter.execute()
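The comment on `raw_input_path` says that the directory prefix of the original pattern is replaced with an empty string, so restored files land under `output_path` with their relative layout intact. The following is a minimal sketch of that path arithmetic under this reading; `static_prefix` and `restored_path` are hypothetical helpers and may differ from what the library does.

```python
import posixpath

WILDCARDS = "*?["


def static_prefix(pattern):
    """Return the directory part of a glob pattern before its first wildcard."""
    parts = pattern.split("/")
    kept = []
    for part in parts:
        if any(ch in part for ch in WILDCARDS):
            break
        kept.append(part)
    else:
        # No wildcard at all: treat the last component as a file name.
        kept = parts[:-1]
    return "/".join(kept) + "/"


def restored_path(stored_path, raw_input_pattern, output_dir):
    """Map a path recorded at conversion time to its location under output_dir."""
    prefix = static_prefix(raw_input_pattern)
    if stored_path.startswith(prefix):
        relative = stored_path[len(prefix):]
    else:
        # Fall back to the bare file name when the prefix does not match.
        relative = posixpath.basename(stored_path)
    return posixpath.normpath(posixpath.join(output_dir, relative))


print(static_prefix("/path/to/audio_wav/*.wav"))
print(restored_path("/path/to/audio_wav/a.wav", "/path/to/audio_wav/*.wav", "./output/path"))
```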
Listen to audio stored in Parquet
from cads_sdk.nosql.display import Audio
Audio('file:/path/to/audio_mp3.parquet')
Video
from cads_sdk.nosql.converter import ConvertFromVideo2Image

converter = ConvertFromVideo2Image(
    input_path="/home/username/vid/palawan1.mp4",
    output_path="file:/home/username/vid_image.parquet",
)
converter.execute()
For more information, read the docstring of a converter class:
ConvertFromFolderImage.__doc__
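Raw `__doc__` strings keep their source indentation. The standard `inspect` module can clean that up and also show the constructor signature. The sketch below uses a hypothetical stand-in class, `ExampleConverter`, so that it runs without cads-sdk installed; the same `describe` helper (also hypothetical) can be applied to any class.

```python
import inspect


class ExampleConverter:
    """Hypothetical stand-in for a converter class.

    Parameters
    ----------
    input_path : str
        Glob pattern of the files to convert.
    """

    def __init__(self, input_path):
        self.input_path = input_path


def describe(cls):
    """Return the class name, constructor signature, and cleaned docstring."""
    doc = inspect.getdoc(cls) or "(no docstring)"
    return "{}{}\n{}".format(cls.__name__, inspect.signature(cls), doc)


print(describe(ExampleConverter))
```

In an interactive session, `help(ConvertFromFolderImage)` gives a similar view.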