
Functions to help data scientists work more effectively with a data warehouse (DWH)

cads-sdk: Functions to help data scientists work more effectively with unstructured data and data lakes

What is it?

cads-sdk is a set of functions that helps data scientists work more effectively with unstructured data. It includes functions for working with images and audio.

Main Features

Here are just a few of the things that cads-sdk does well:

  • Data pre-processing: up to 25% faster than cv2.imread()
    • Image pre-processing: converts an image folder or zip file to Parquet/Delta, ready for training
    • Audio pre-processing: converts an audio folder to Parquet, ready for training
  • Storage optimization: reduces storage use by 12% without reducing image quality
    • Reduces the number of small files
    • Takes advantage of zstd compression in Parquet
  • Viewing images and audio without bringing the system down (for example, the browser running out of memory)
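
The storage points above come down to packing many small files into one container without altering their bytes. As a rough illustration only, the sketch below uses the standard-library zipfile module as a stand-in for Parquet (cads-sdk itself writes Parquet/Delta, and the helper names pack_folder, unpack_bytes and demo are hypothetical, not part of the library).

```python
import tempfile
import zipfile
from pathlib import Path


def pack_folder(folder, archive_path):
    """Pack every file under `folder` into one archive; return the file count."""
    folder = Path(folder)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store the path relative to the folder so it can be restored later.
                zf.write(path, arcname=path.relative_to(folder).as_posix())
                count += 1
    return count


def unpack_bytes(archive_path):
    """Return {relative_path: original_bytes} for every packed file."""
    with zipfile.ZipFile(archive_path) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def demo():
    """Pack two fake image files and check that their bytes survive unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "images"
        (src / "cam1").mkdir(parents=True)
        originals = {
            "a.jpg": b"\xff\xd8fake-jpeg-a",
            "cam1/b.jpg": b"\xff\xd8fake-jpeg-b",
        }
        for name, data in originals.items():
            (src / name).write_bytes(data)
        archive = Path(tmp) / "images.zip"  # kept outside `src` on purpose
        count = pack_folder(src, archive)
        return count, unpack_bytes(archive) == originals
```

Two input files become one container file, and the bytes read back are identical to the originals, which is the sense in which packing does not reduce image quality.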

Install

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# with PyPI
pip install cads-sdk

Dependencies

Installation from sources

To install cads-sdk from source you need Cython in addition to the normal dependencies above. Cython can be installed from PyPI:

pip install cython

In the cads-sdk directory (same one where you found this file after cloning the git repo), execute:

python setup.py install

Documentation

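The documentation link given for this project, https://pandas.pydata.org/pandas-docs/stable, points to the pandas documentation hosted on PyData.org. For cads-sdk itself, see the class docstrings, for example ConvertFromFolderImage.__doc__.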

Image

Convert an image folder to Parquet

from cads_sdk.nosql.converter import ConvertFromFolderImage

converter = ConvertFromFolderImage(
              # input settings
              input_path="/path/to/folder/**/*.jpg",
              input_type='jpg',  # 'jpg' | ('jpg', 'png')
              input_recursive=True,

              # output settings
              output_path="file:/output/path/image_storage",

              # converter settings
              image_type='jpg',
              image_color=3,
              resize_mode=None,  # None | padding | resize
              size=[(212, 212),
                    (597, 597)],
             )

converter.execute()
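
The input_path above is a glob pattern. How cads-sdk expands it is not documented here; assuming input_recursive behaves like the recursive flag of the standard-library glob module, the effect of the `**` segment can be sketched as follows (find_images and demo are hypothetical helpers, not part of cads-sdk).

```python
import glob
import os
import tempfile


def find_images(root, recursive):
    """Return paths matching <root>/**/*.jpg, relative to root, sorted."""
    # glob.escape protects against special characters in the root path itself.
    pattern = os.path.join(glob.escape(root), "**", "*.jpg")
    matches = glob.glob(pattern, recursive=recursive)
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


def demo():
    """Build a small nested folder and glob it with and without recursion."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "cam1", "day1"))
        for rel in ("top.jpg", "cam1/mid.jpg", "cam1/day1/deep.jpg", "cam1/notes.txt"):
            with open(os.path.join(root, *rel.split("/")), "wb") as fh:
                fh.write(b"")
        return find_images(root, recursive=True), find_images(root, recursive=False)
```

With recursive=True, `**` matches zero or more directory levels, so files at every depth are found. With recursive=False, `**` behaves like a single `*` and only files exactly one folder below the root match.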

# Convert directly from a .zip file to Parquet
from cads_sdk.nosql.converter import ConvertFromZipImage

converter = ConvertFromZipImage(
              # input settings
              input_path="/path/to/image_storage/ETHZ.zip",
              input_recursive=True,  # loop through folders to collect every file matching the pattern
              input_type='jpg',  # 'jpg' | ('jpg', 'png')

              # output settings
              output_path="file:/output/path/img_ethz.parquet",
              table_name='img_ethz',
              database='default',
              file_format='parquet',  # delta | parquet
              compression='zstd',  # zstd | snappy
              # converter settings
              image_type='png',
              image_color=3,
              resize_mode=None,  # None | padding | resize
              size=[(1080, 1920)],
              debug=False
             )

converter.execute()
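
The exact behavior of resize_mode is not spelled out here. Assuming that "resize" stretches the image to the target size while "padding" scales it with its aspect ratio preserved and then pads the remainder, the padding geometry could be computed as in this sketch (padding_geometry is a hypothetical helper, and the width-before-height argument order is a choice made for the example, not a statement about the size parameter).

```python
def padding_geometry(src_w, src_h, dst_w, dst_h):
    """Scale (src_w, src_h) to fit inside (dst_w, dst_h), then centre it with padding.

    Returns the scaled size and the padding as (left, top, right, bottom).
    """
    # The smaller ratio guarantees that both dimensions fit in the target.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, max(1, round(src_w * scale)))
    new_h = min(dst_h, max(1, round(src_h * scale)))
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return {
        "scaled": (new_w, new_h),
        "pad": (pad_left, pad_top, dst_w - new_w - pad_left, dst_h - new_h - pad_top),
    }
```

For a 1920 by 1080 frame and a 212 by 212 target, this keeps the full width, scales the height to 119 pixels, and splits the remaining 93 rows between the top and bottom borders.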

Convert an image Parquet file back to an image folder

from cads_sdk.nosql.converter import ConvertToFolderImage

converter = ConvertToFolderImage(
    input_path = '/user/username/image/img_user_device_jpg_212_212.parquet',
    raw_input_path = "/home/username/image_storage/device_images/**/*.jpg",
    output_path = './abc/'
)

converter.execute()

Read images

from cads_sdk.nosql import display
import cads_sdk as ss

df = ss.sql("""select * from parquet.`/user/duyvnc/image/img_images_jpg_212_212.parquet`""")
pdf = df.toPandasImage(limit=50)
pdf

pdf = ss.sql("""
select *
from parquet.`file:/home/duyvnc/image_storage/img_mot17_1080_1920.parquet`
limit 100
""").toPandasImage(mode='BGR')

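The mode='BGR' argument above suggests that images can be returned in either channel order. OpenCV uses BGR by default while most other libraries expect RGB, and converting between the two only reverses the channel axis. A minimal, library-free sketch (swap_channels is a hypothetical helper, not part of cads-sdk):

```python
def swap_channels(image):
    """Reverse the channel order of an image given as rows of (c0, c1, c2) pixels.

    Applied to an RGB image it yields BGR, and vice versa.
    """
    return [[tuple(reversed(pixel)) for pixel in row] for row in image]


# One row with a pure-red pixel and a mixed pixel, in RGB order.
rgb = [[(255, 0, 0), (0, 128, 64)]]
bgr = swap_channels(rgb)
```

Applying the swap twice returns the original image, so the operation is lossless.
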
PyTorch API

from cads_sdk.nosql import codec
from petastorm import make_reader, TransformSpec
from petastorm.pytorch import DataLoader

# dataset_url, transform (a TransformSpec), model, device, optimizer and train()
# are not defined in this snippet; they must be supplied by the caller.
num_epochs = 10
reader = make_reader('{}/train'.format(dataset_url), reader_pool_type='dummy',
                     num_epochs=num_epochs, transform_spec=transform)
with DataLoader(reader=reader, batch_size=32) as train_loader:
    train(model, device, train_loader, 2000, optimizer, num_epochs)

Audio

Supports the pcm, mp3, and wav formats.
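
Of these formats, pcm is headerless raw sample data, while a wav file typically holds the same PCM samples behind a header that records the channel count, sample width, and sample rate. The sketch below illustrates that relationship with the standard-library wave module only; it does not show how cads-sdk reads audio, and sine_pcm, pcm_to_wav and wav_info are hypothetical helpers.

```python
import io
import math
import struct
import wave


def sine_pcm(freq_hz, n_frames, rate=16000):
    """Return raw 16-bit little-endian mono PCM samples of a sine tone."""
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_frames)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def pcm_to_wav(pcm, rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


def wav_info(wav_bytes):
    """Return (channels, sample_width, rate, n_frames, pcm_bytes) from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        n = r.getnframes()
        return r.getnchannels(), r.getsampwidth(), r.getframerate(), n, r.readframes(n)
```

Raw pcm input therefore needs the sample rate, width, and channel count supplied from elsewhere, whereas a wav file carries them itself.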

Convert an audio folder to Parquet

from cads_sdk.nosql.converter import ConvertFromFolderAudio

converter = ConvertFromFolderAudio(
              input_path='/path/to/audio_wav/*.wav',
              input_type='wav',  # 'wav' | 'mp3' | 'pcm' | ('wav', 'mp3')
              input_recursive=False,
              output_path="file:/output/path/audio_wav.parquet",
             )

converter.execute()

Convert Parquet back to an audio folder

# Import path assumed to match the other converters
from cads_sdk.nosql.converter import ConvertToFolderAudio

converter = ConvertToFolderAudio(
    input_path='file:/path/to/audio_wav.parquet',
    raw_input_path='/path/to/audio_wav/*.wav',  # the '/path/to/audio_wav/' prefix is automatically replaced with ''
    output_path='./output/path',
    write_mode="recovery"
)

converter.execute()
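
The comment on raw_input_path says that the fixed prefix of the original glob pattern is stripped from each stored path before files are written under output_path. The library's actual implementation is not shown here; a minimal sketch of that idea, with hypothetical helpers static_prefix and restore_path, might look like this.

```python
import posixpath

WILDCARDS = set("*?[")


def static_prefix(pattern):
    """Return the leading part of a glob pattern that contains no wildcards."""
    kept = []
    for part in pattern.split("/"):
        if any(ch in WILDCARDS for ch in part):
            break
        kept.append(part)
    prefix = "/".join(kept)
    # Keep a trailing slash so that stripping leaves a clean relative path.
    return prefix + "/" if prefix and not prefix.endswith("/") else prefix


def restore_path(stored_path, raw_input_pattern, output_dir):
    """Map a stored source path to its destination under output_dir."""
    prefix = static_prefix(raw_input_pattern)
    if prefix and stored_path.startswith(prefix):
        relative = stored_path[len(prefix):]
    else:
        # Fall back to the bare file name when the prefix does not match.
        relative = posixpath.basename(stored_path)
    return posixpath.join(output_dir, relative)
```

Stripping only the wildcard-free prefix preserves any subfolder structure below it, so nested source folders are recreated in the output folder.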

Listen to audio stored in Parquet

from cads_sdk.nosql.display import Audio
Audio('file:/path/to/audio_mp3.parquet')

Video

from cads_sdk.nosql.converter import ConvertFromVideo2Image
converter = ConvertFromVideo2Image(
              input_path='/home/username/vid/palawan1.mp4',
              output_path="file:/home/username/vid_image.parquet",
             )

converter.execute()

For more information, read the class docstring:

print(ConvertFromFolderImage.__doc__)
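
Beyond the raw docstring, the standard-library inspect module can list a class's constructor parameters. The sketch below uses a hypothetical stand-in class, ExampleConverter, rather than a real cads-sdk converter, so it runs without the library installed; the same describe helper (also hypothetical) could be pointed at any class.

```python
import inspect


class ExampleConverter:
    """Hypothetical stand-in for a converter class (not part of cads-sdk).

    input_path: glob pattern of the files to read.
    output_path: where the converted table is written.
    """

    def __init__(self, input_path, output_path, debug=False):
        self.input_path = input_path
        self.output_path = output_path
        self.debug = debug


def describe(cls):
    """Return (cleaned docstring, list of constructor parameter names)."""
    doc = inspect.getdoc(cls) or ""
    params = list(inspect.signature(cls).parameters)
    return doc, params


doc, params = describe(ExampleConverter)
```

inspect.getdoc also strips the docstring's indentation, which makes it easier to read than printing __doc__ directly.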

Built Distribution

cads_sdk-0.0.17rc2-py3-none-any.whl (55.3 kB, uploaded for Python 3)
