Skip to main content

A collection of tools for machine learning.

Project description

cylearn

cylearn contains a collection of tools for machine learning. The 'cy' prefix is just the abbreviation of my given name.

Warning

This package is still under development, use with caution.

Installation

You can install cylearn through PyPI:

$ pip install cylearn

Or just copy the files you need to your project, but don't forget the LICENSE.

Dependencies

Submodules

There is only one submodule so far.

  • Data
    • Dataset
    • Loader

Data loading with cylearn.Data

cylearn.Data provides two classes Dataset and Loader and several functions shuffle(), split(), and get_loader().

We first use some examples for a quick start before going into details. These examples are presented in the jupyter style.

Example 1

import numpy as np
from cylearn.Data import Dataset, Loader
x = Dataset(np.arange(5)).map(lambda i: i ** 2)
for _ in x: print(_)
0
1
4
9
16
loader_x = Loader(x, batch_size=2, shuffle=False)
for _ in loader_x: print(_)
[0, 1]
[4, 9]
[16]

Example 2

+- data/
|  +- 1.png
|  +- 1.txt
|  +- 2.png
|  +- 2.txt
|  |  ...
|  +- 5.png
|  +- 5.txt
|
+- demo.py

The above is the structure of an imaginary folder, where x.png is an image and x.txt stores an integer which is the label of x.png.

Assume the memory can only store two images once at a time because these images are too large. We now demonstrate how to make batches for these images and labels with Dataset and Loader.

import os
import numpy
from cylearn.Data import Dataset, get_loader
# Filter the correct files by their extension and recover their dirname.
images = Dataset(os.listdir('./data/')).filter(lambda f: '.png' in f).map(lambda f: './data/' + f)
labels = Dataset(os.listdir('./data/')).filter(lambda f: '.txt' in f).map(lambda f: './data/' + f)
for i, l in zip(images, labels): print(i, l)
./data/1.png ./data/1.txt
./data/2.png ./data/2.txt
./data/3.png ./data/3.txt
./data/4.png ./data/4.txt
./data/5.png ./data/5.txt
# Define functions for reading images and labels.
# Import statements must be inside a function to make multiprocessing work.
# This makes sure the name of the imported module is inside the local symbol table.
def read_image(path):
    '''
    Returns a numpy array.
    '''
    import numpy as np
    from PIL import Image
    return np.asarray(Image.open(path))

def read_label(path):
    '''
    Returns an integer.
    '''
    with open(path, 'r') as f:
        return int(f.readline())
# The original data, which is a list of strings here, won't change after lazy_map() is called.
# If using map(), a list of strings will be transformed into a list of numpy arrays or integers.
# This is the key to solve the memory issue.
images = images.lazy_map(read_image)
labels = labels.lazy_map(read_label)
# Dataset.get() is a method to retrieve the stored data.
# We can see the mapping occurs when __getitem__() is invoked.
# But the stored data won't changed before and after invoking __getitem__().
for i in range(len(images)): print(images.get(i), type(images[i]))
for i in range(len(labels)): print(labels.get(i), type(labels[i]))
./data/1.png <class 'numpy.ndarray'>
./data/2.png <class 'numpy.ndarray'>
./data/3.png <class 'numpy.ndarray'>
./data/4.png <class 'numpy.ndarray'>
./data/5.png <class 'numpy.ndarray'>
./data/1.txt <class 'int'>
./data/2.txt <class 'int'>
./data/3.txt <class 'int'>
./data/4.txt <class 'int'>
./data/5.txt <class 'int'>
# Use two workers to read data.
# An error will occur if 'multiprocess' is not installed.
# Fix it by installing 'multiprocess' or not passing `parallel`.
images_loader, labels_loader = get_loader(images, labels, batch_size=2, parallel=2)
for X, y in zip(images_loader, labels_loader): print(len(X), len(y))
2 2
2 2
1 1

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cylearn-0.1.4.tar.gz (22.1 kB view hashes)

Uploaded Source

Built Distribution

cylearn-0.1.4-py3-none-any.whl (21.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page