
csv-dataset

CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.

CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.

Install

$ pip install csv-dataset

Usage

Suppose we have a csv file whose absolute path is filepath:

open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column (open_time) and pick the remaining five
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)

One printed element looks like this:

[[[7145.99,  7150.0,   7141.01,  7142.33,   21.094283]
  [7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]]

 [[7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]
  [7123.74,  7128.06,  7117.12,  7126.57,   39.885367]]]

...

Dataset(reader: AbstractReader)

dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride.

The default window size is 1, which means the dataset is not windowed.

Parameter explanation

Suppose we have a raw data set

[ 1  2  3  4  5  6  7  8  9 ... ]

And the following is a window of (size=4, shift=3, stride=2)

          |-------------- size:4 --------------|
          |- stride:2 -|                       |
          |            |                       |
win 0:  [ 1            3           5           7  ] --------|-----
                                                       shift:3
win 1:  [ 4            6           8           10 ] --------|-----

win 2:  [ 7            9           11          13 ]

...
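To make these semantics concrete, here is a minimal pure-Python sketch of the same windowing arithmetic. It is an illustration, not the library's implementation, and it assumes shift defaults to the window size when omitted:

def windows(data, size, shift=None, stride=1):
    # Assumption: shift defaults to the window size when omitted
    shift = size if shift is None else shift
    # A window of `size` elements spaced `stride` apart spans
    # (size - 1) * stride + 1 consecutive raw elements
    span = (size - 1) * stride + 1
    start = 0
    while start + span <= len(data):
        yield data[start:start + span:stride]
        start += shift

data = list(range(1, 14))
print(list(windows(data, size=4, shift=3, stride=2)))
# [[1, 3, 5, 7], [4, 6, 8, 10], [7, 9, 11, 13]]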

dataset.batch(batch: int) -> self

Defines the batch size.

The default batch size is 1, which means each element of the dataset contains a single window.

If batch is 2:

batch 0:  [[ 1            3           5           7  ]
           [ 4            6           8           10 ]]

batch 1:  [[ 7            9           11          13 ]
           [ 10           12          14          16 ]]

...
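Continuing the pure-Python sketch from the window section (it reuses the windows helper defined there), batching simply groups consecutive windows. Whether the library drops a trailing partial batch is an assumption here:

def batches(data, size, shift=None, stride=1, batch=1):
    ws = list(windows(data, size, shift, stride))
    # Group consecutive windows; drop a trailing partial batch
    # (assumption: the library yields full batches only)
    return [ws[i:i + batch] for i in range(0, len(ws) - batch + 1, batch)]

print(batches(list(range(1, 17)), size=4, shift=3, stride=2, batch=2))
# [[[1, 3, 5, 7], [4, 6, 8, 10]], [[7, 9, 11, 13], [10, 12, 14, 16]]]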

dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch
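A typical consumption loop, assuming a return value of None signals that the reader is exhausted (implied by the Optional return type):

while (element := dataset.get()) is not None:
    # With .window(3, 1).batch(2) over 5 picked columns,
    # each element has shape (2, 3, 5)
    print(element.shape)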

dataset.reset() -> self

Resets dataset

dataset.read(amount: int, reset_buffer: bool = False)

Reads multiple batches at a time.

  • amount the maximum length of data the dataset will read
  • reset_buffer if True, the dataset resets the data of the previous window in the buffer before reading

If we reset_buffer, the next read will not use existing data in the buffer, so the result will have no overlap with the last read.

dataset.reset_buffer() -> None

Resets the buffer so that the next read has no overlap with the last one.
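A usage sketch; the exact shape of the data read() returns is not pinned down by the docs above, so treat the comments as assumptions:

# Read up to 100 rows' worth of data in one call
chunk = dataset.read(100)

# Discard the buffered window data, then read again;
# the second result shares no rows with the first
dataset.reset_buffer()
fresh = dataset.read(100)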

dataset.lines_need(reads: int) -> int

Calculates and returns how many lines of the underlying data are needed to read reads times.

dataset.max_reads(max_lines: int) -> int | None

Calculates how many reads max_lines lines could afford.

dataset.max_reads() -> int | None

Calculates how many reads the current reader could afford.

If max_lines of the current reader is unset, it returns None.
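As a rough sketch of the arithmetic involved (an illustration, not the library's actual implementation): the last window of the final batch starts after (reads * batch - 1) shifts and spans (size - 1) * stride + 1 lines:

def lines_need(reads, size, shift, stride, batch):
    last_window_start = (reads * batch - 1) * shift
    window_span = (size - 1) * stride + 1
    return last_window_start + window_span

print(lines_need(1, size=4, shift=3, stride=2, batch=2))
# 10: windows [1, 3, 5, 7] and [4, 6, 8, 10] need lines 1..10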

CsvReader(filepath, dtype, indexes, **kwargs)

  • filepath str absolute path of the csv file
  • dtype Callable data type of the values; only float or int should be used for this argument
  • indexes List[int] column indexes to pick from each line of the csv file
  • kwargs
    • header bool = False whether the csv file has a header line that should be skipped
    • splitter str = ',' the column splitter of the csv file
    • normalizer List[NormalizerProtocol] list of normalizers used to normalize each column of data. A NormalizerProtocol should contain two methods: normalize(float) -> float to normalize the given datum, and restore(float) -> float to restore a normalized datum. See the sketch after this list.
    • max_lines int = -1 max lines of the csv file to be read. Defaults to -1, which means no limit.
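For example, a hypothetical min-max normalizer satisfying the protocol (the class name and its bounds are illustrative; only the normalize/restore pair is required):

class MinMaxNormalizer:
    """Maps values from [lo, hi] onto [0, 1] and back."""

    def __init__(self, lo: float, hi: float):
        self._lo = lo
        self._span = hi - lo

    def normalize(self, datum: float) -> float:
        return (datum - self._lo) / self._span

    def restore(self, normalized: float) -> float:
        return normalized * self._span + self._lo

With five picked columns, normalizer would then be a list of five such objects, one per column.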

reader.reset()

Resets the reader position

property reader.max_lines

Gets max_lines

setter reader.max_lines = lines

Changes max_lines

reader.readline() -> list

Returns the converted values of the next line

property reader.lines

Returns the number of lines that have been read
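Putting the reader methods together (a sketch; the values shown assume the csv file from the Usage section):

reader = CsvReader(
    filepath,
    float,
    indexes=[1, 2, 3, 4, 5],
    header=True
)

print(reader.readline())
# [7145.99, 7150.0, 7141.01, 7142.33, 21.094283]

print(reader.lines)  # 1, assuming lines counts data lines already read

reader.reset()  # rewind to the beginning of the data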

License

MIT
