
csv-dataset

CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.

CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.

Install

$ pip install csv-dataset

Usage

Suppose we have a csv file whose absolute path is filepath:

open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column (open_time) and only pick the following
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)

The first print shows the following output.

[[[7145.99,  7150.0,   7141.01,  7142.33,   21.094283]
  [7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]]

 [[7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]
  [7123.74,  7128.06,  7117.12,  7126.57,   39.885367]]]

...

Dataset(reader: AbstractReader)

dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride.

The default window size is 1, which means the dataset is not windowed.

Parameter explanation

Suppose we have a raw data set

[ 1  2  3  4  5  6  7  8  9 ... ]

And the following is a window of (size=4, shift=3, stride=2)

          |-------------- size:4 --------------|
          |- stride:2 -|                       |
          |            |                       |
win 0:  [ 1            3           5           7  ] --------|-----
                                                       shift:3
win 1:  [ 4            6           8           10 ] --------|-----

win 2:  [ 7            9           11          13 ]

...
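The window arithmetic above can be sketched in plain Python. This is an illustration of the indexing, not the library's internal implementation, and the assumption that `shift` defaults to `size` when unset is mine:

```python
def windows(data, size, shift=None, stride=1):
    """Yield windows of `size` elements: each window takes every
    `stride`-th element, and consecutive windows start `shift`
    elements apart (assumed to default to `size` when unset)."""
    shift = size if shift is None else shift
    span = (size - 1) * stride + 1  # raw elements covered by one window
    start = 0
    while start + span <= len(data):
        yield data[start:start + span:stride]
        start += shift

for w in windows(list(range(1, 14)), size=4, shift=3, stride=2):
    print(w)
# [1, 3, 5, 7]
# [4, 6, 8, 10]
# [7, 9, 11, 13]
```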

dataset.batch(batch: int) -> self

Defines batch size.

The default batch size is 1, which means each batch contains a single window.

If batch is 2

batch 0:  [[ 1            3           5           7  ]
           [ 4            6           8           10 ]]

batch 1:  [[ 7            9           11          13 ]
           [ 10           12          14          16 ]]

...
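In the same sketching spirit, batching just groups consecutive windows. Whether a trailing incomplete batch is dropped is an assumption here:

```python
def batches(windows, batch):
    """Group consecutive windows into batches of `batch` windows,
    dropping a trailing incomplete batch (assumed behavior)."""
    for i in range(0, len(windows) - batch + 1, batch):
        yield windows[i:i + batch]

wins = [[1, 3, 5, 7], [4, 6, 8, 10], [7, 9, 11, 13], [10, 12, 14, 16]]
for b in batches(wins, 2):
    print(b)
# [[1, 3, 5, 7], [4, 6, 8, 10]]
# [[7, 9, 11, 13], [10, 12, 14, 16]]
```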

dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch

dataset.reset() -> self

Resets dataset

dataset.read(amount: int, reset_buffer: bool = False)

  • amount the maximum length of data the dataset will read
  • reset_buffer if True, the dataset will reset the data of the previous window in the buffer

Reads multiple batches at a time

If reset_buffer is True, the next read will not reuse data already in the buffer, so its result will have no overlap with the previous read.
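A sketch of what this means for where the next window starts, using the earlier example parameters (size=4, shift=3, stride=2). The exact buffer semantics below are inferred from the description above, not taken from the package's source:

```python
def next_window_start(last_start, size, shift, stride, reset_buffer):
    """Assumed semantics: overlapping windows share raw lines, which
    the reader buffers between reads unless the buffer is reset."""
    span = (size - 1) * stride + 1  # raw lines covered by one window
    if reset_buffer:
        # Discard buffered lines: start after the last line consumed.
        return last_start + span
    # Keep the buffer: the next window overlaps the previous one.
    return last_start + shift

print(next_window_start(4, 4, 3, 2, reset_buffer=False))  # 7
print(next_window_start(4, 4, 3, 2, reset_buffer=True))   # 11
```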

dataset.reset_buffer() -> None

Resets the buffer, so that the next read will have no overlap with the previous one

dataset.lines_need(reads: int) -> int

Calculates and returns how many lines of the underlying data are needed to read reads times
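As a sketch of the arithmetic presumably involved (the formula below is inferred from the window/batch semantics above, not taken from the source): one read consumes batch windows, the first window spans (size - 1) * stride + 1 lines, and each subsequent window starts shift lines later.

```python
def lines_need(reads, size, shift, stride, batch):
    """Estimated raw csv lines needed for `reads` sequential reads,
    assuming overlapping lines are buffered rather than re-read."""
    n_windows = reads * batch
    span = (size - 1) * stride + 1  # lines covered by one window
    return span + (n_windows - 1) * shift

# With size=4, shift=3, stride=2, batch=2, a single read covers the
# windows [1, 3, 5, 7] and [4, 6, 8, 10], i.e. lines 1..10:
print(lines_need(1, size=4, shift=3, stride=2, batch=2))  # 10
```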

dataset.max_reads(max_lines: int) -> int | None

Calculates how many reads max_lines lines could afford

dataset.max_reads() -> int | None

Calculates how many reads the current reader could afford.

If max_lines of the current reader is unset, it returns None

CsvReader(filepath, dtype, indexes, **kwargs)

  • filepath str absolute path of the csv file
  • dtype Callable data type. We should only use float or int for this argument.
  • indexes List[int] column indexes to pick from the lines of the csv file
  • kwargs
    • header bool = False whether we should skip reading the header line.
    • splitter str = ',' the column splitter of the csv file
    • normalizer List[NormalizerProtocol] list of normalizers to normalize each column of data. A NormalizerProtocol should contain two methods: normalize(float) -> float to normalize the given datum, and restore(float) -> float to restore the normalized datum.
    • max_lines int = -1 max lines of the csv file to be read. Defaults to -1 which means no limit.
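For example, a minimal object satisfying NormalizerProtocol might look like this (ScaleNormalizer is a hypothetical illustration, not a class shipped with the package):

```python
class ScaleNormalizer:
    """Hypothetical normalizer: divides each datum by a fixed
    factor; restore() inverts the operation."""
    def __init__(self, factor: float):
        self._factor = factor

    def normalize(self, datum: float) -> float:
        return datum / self._factor

    def restore(self, datum: float) -> float:
        return datum * self._factor

# One normalizer per picked column, e.g. for the 5 columns above:
normalizers = [ScaleNormalizer(10000.)] * 4 + [ScaleNormalizer(100.)]
```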

reader.reset()

Resets the reader position

property reader.max_lines

Gets max_lines

setter reader.max_lines = lines

Changes max_lines

reader.readline() -> list

Returns the converted value of the next line

property reader.lines

Returns the number of lines that have been read

License

MIT
