
Functions to help data scientists work more effectively with a DWH

Project description


cads-sdk: Functions to help data scientists work more effectively with unstructured data and data lakes


What is it?

cads-sdk provides functions that help data scientists work more effectively with unstructured data. It includes functions for working with images and audio.

Main Features

Here are just a few of the things that cads-sdk does well:

  • Data pre-processing: up to 25% faster than reading with cv2.imread()
    • Image pre-processing: converts an image folder or zip file to Parquet/Delta, ready for training
    • Audio pre-processing: converts an audio folder to Parquet, ready for training
  • Storage optimization: reduces memory usage by 12% without reducing image quality
    • Decreases the number of small files
    • Takes advantage of zstd compression in Parquet
  • View images and audio without bringing the system down (for example, when the browser runs out of memory)

Install

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# with PyPI
pip install cads-sdk

Dependencies

Installation from sources

To install cads-sdk from source you need Cython in addition to the normal dependencies above. Cython can be installed from PyPI:

pip install cython

In the cads-sdk directory (same one where you found this file after cloning the git repo), execute:

python setup.py install

Documentation

The official documentation is hosted on PyData.org: https://pandas.pydata.org/pandas-docs/stable

Image

Convert an image folder to Parquet

from cads_sdk.nosql.converter import ConvertFromFolderImage

converter = ConvertFromFolderImage(
              # setting input
              input_path="/path/to/folder/**/*.jpg",
              input_type='jpg',  # 'jpg' | ('jpg', 'png')
              input_recursive=True,

              # setting output
              output_path="file:/output/path/image_storage",

              # setting converter
              image_type='jpg',
              image_color=3,
              resize_mode=None,  # None | 'padding' | 'resize'
              size=[(212, 212),
                    (597, 597)],
             )

converter.execute()
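
The `input_path` values in these examples look like glob patterns. As a rough illustration of what such a pattern selects, here is a minimal sketch using only the standard library. It assumes the converter resolves patterns the way Python's `glob` module does, which the source does not confirm; `list_inputs` is a hypothetical helper, not part of cads-sdk.

```python
import glob
import os
import tempfile


def list_inputs(pattern, recursive=True):
    """Return the sorted file paths matched by a glob pattern (hypothetical helper)."""
    return sorted(glob.glob(pattern, recursive=recursive))


# Build a small throwaway directory tree to demonstrate the matching rules.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "cats", "indoor"))
for rel in ("a.jpg", "cats/b.jpg", "cats/indoor/c.jpg", "cats/notes.txt"):
    open(os.path.join(root, *rel.split("/")), "wb").close()

# '**' matches any number of nested directories when recursive=True.
recursive_matches = list_inputs(os.path.join(root, "**", "*.jpg"), recursive=True)
# A plain '*.jpg' only matches files directly inside the folder.
flat_matches = list_inputs(os.path.join(root, "*.jpg"), recursive=False)

print(len(recursive_matches), len(flat_matches))
```

Running a check like this on a real folder before calling `converter.execute()` is a cheap way to confirm that the pattern picks up the files you expect.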

# convert directly from .zip file to parquet
from cads_sdk.nosql.converter import ConvertFromZipImage

converter = ConvertFromZipImage(
              # setting input
              input_path="/path/to/image_storage/ETHZ.zip",
              input_recursive=True,  # loop through folders to find every file matching the pattern
              input_type='jpg',  # 'jpg' | ('jpg', 'png')

              # setting output
              output_path="file:/output/path/img_ethz.parquet",
              table_name='img_ethz',
              database='default',
              file_format='parquet',  # 'delta' | 'parquet'
              compression='zstd',  # 'zstd' | 'snappy'

              # setting converter
              image_type='png',
              image_color=3,
              resize_mode=None,  # None | 'padding' | 'resize'
              size=[(1080, 1920)],
              debug=False
             )

converter.execute()
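
Before converting a large archive it can help to check which members match the chosen `input_type`. The sketch below does that with the standard `zipfile` module. It is only an illustration and makes no claim about how `ConvertFromZipImage` reads the archive internally; `list_zip_images` is a hypothetical helper.

```python
import io
import zipfile


def list_zip_images(zip_file, extensions=(".jpg",)):
    """Return names of archive members whose suffix is in `extensions` (hypothetical helper)."""
    suffixes = tuple(ext.lower() for ext in extensions)
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(
            info.filename
            for info in archive.infolist()
            if not info.is_dir() and info.filename.lower().endswith(suffixes)
        )


# Build a small in-memory archive so the example runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("seq1/0001.jpg", b"fake-jpeg-bytes")
    archive.writestr("seq1/0002.png", b"fake-png-bytes")
    archive.writestr("readme.txt", b"not an image")

buffer.seek(0)
jpg_only = list_zip_images(buffer, extensions=(".jpg",))
buffer.seek(0)
jpg_and_png = list_zip_images(buffer, extensions=(".jpg", ".png"))

print(jpg_only)
print(jpg_and_png)
```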

Convert an image Parquet file back to an image folder

from cads_sdk.nosql.converter import ConvertToFolderImage

converter = ConvertToFolderImage(
    input_path = '/user/username/image/img_user_device_jpg_212_212.parquet',
    raw_input_path = "/home/username/image_storage/device_images/**/*.jpg",
    output_path = './abc/'
)

converter.execute()

Read images

from cads_sdk.nosql import display
import cads_sdk as ss

df = ss.sql("""select * from parquet.`/user/duyvnc/image/img_images_jpg_212_212.parquet`""")
pdf = df.toPandasImage(limit=50)
pdf

pdf = ss.sql("""
select *
from parquet.`file:/home/duyvnc/image_storage/img_mot17_1080_1920.parquet`
limit 100
""").toPandasImage(mode='BGR')

PyTorch API

from cads_sdk.nosql import codec
from petastorm import make_reader, TransformSpec
from petastorm.pytorch import DataLoader

# dataset_url, transform (a TransformSpec), model, device, optimizer and train()
# must be defined by the user; they are not shown here.
num_epochs = 10
with DataLoader(reader=make_reader('{}/train'.format(dataset_url),
                                   reader_pool_type='dummy',
                                   num_epochs=num_epochs,
                                   transform_spec=transform),
                batch_size=32) as train_loader:
    train(model, device, train_loader, 2000, optimizer, num_epochs)

Audio

Supports the pcm, mp3, and wav formats

Convert an audio folder to Parquet

from cads_sdk.nosql.converter import ConvertFromFolderAudio

converter = ConvertFromFolderAudio(
              input_path='/path/to/audio_wav/*.wav',  # (1)
              input_type='wav',  # 'wav' | 'mp3' | 'pcm'
              input_recursive=False,
              output_path="file:/output/path/audio_wav.parquet",
             )

converter.execute()

Convert a Parquet file back to an audio folder

# import path assumed to match the other converters
from cads_sdk.nosql.converter import ConvertToFolderAudio

converter = ConvertToFolderAudio(
    input_path='file:/path/to/audio_wav.parquet',
    raw_input_path='/path/to/audio_wav/*.wav',  # (1) '/path/to/audio_wav/' is replaced with '' automatically
    output_path='./output/path',
    write_mode="recovery"
)

converter.execute()
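
The comment on `raw_input_path` says the original folder prefix is replaced with an empty string, so restored files keep their relative layout under `output_path`. The sketch below illustrates that idea with plain string handling. It assumes POSIX-style paths, and both `static_prefix` and `restore_path` are hypothetical helpers, not functions from cads-sdk.

```python
import posixpath

GLOB_CHARS = "*?["


def static_prefix(pattern):
    """Return the leading directories of a glob pattern that contain no wildcards."""
    parts = []
    for part in pattern.split("/"):
        if any(ch in part for ch in GLOB_CHARS):
            break
        parts.append(part)
    return "/".join(parts)


def restore_path(original_path, raw_input_path, output_path):
    """Map a stored original path to its location under output_path."""
    prefix = static_prefix(raw_input_path)
    relative = posixpath.relpath(original_path, prefix)
    return posixpath.join(output_path, relative)


prefix = static_prefix("/path/to/audio_wav/*.wav")
restored = restore_path(
    "/path/to/audio_wav/speaker1/a.wav",
    "/path/to/audio_wav/*.wav",
    "./output/path",
)
print(prefix)
print(restored)
```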

Listen to audio stored in Parquet

from cads_sdk.nosql.display import Audio
Audio('file:/path/to/audio_mp3.parquet')

Video

from cads_sdk.nosql.converter import ConvertFromVideo2Image
converter = ConvertFromVideo2Image(
              input_path='/home/username/vid/palawan1.mp4',
              output_path = f"file:/home/username/vid_image.parquet",
             )

converter.execute()

For more information, read the docstring of each class

ConvertFromFolderImage.__doc__
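
Besides reading `__doc__` directly, the standard `inspect` module can tidy up a docstring and list constructor parameters. The sketch below uses a hypothetical stand-in class, `ExampleConverter`, so that it runs without cads-sdk installed. The same calls should work on any Python class, but the actual docstrings and parameters of the cads-sdk converters are not shown in this document.

```python
import inspect


class ExampleConverter:
    """Hypothetical stand-in for a converter class.

    Parameters
    ----------
    input_path : str
        Glob pattern of the input files.
    output_path : str
        Destination path.
    """

    def __init__(self, input_path, output_path, debug=False):
        self.input_path = input_path
        self.output_path = output_path
        self.debug = debug


# getdoc() returns the docstring with its indentation cleaned up.
doc = inspect.getdoc(ExampleConverter)
# signature() on a class reports the __init__ parameters without 'self'.
params = list(inspect.signature(ExampleConverter).parameters)

print(doc.splitlines()[0])
print(params)
```

In an interactive session, `help(ConvertFromFolderImage)` prints the same information in a paged view.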


Source Distributions

No source distribution files available for this release.

Built Distribution

cads_sdk-0.0.27-py3-none-any.whl (58.9 kB)
