csv-dataset
CsvDataset helps to read a CSV file and create descriptive and efficient input pipelines for deep learning.
CsvDataset iterates over the records of the CSV file in a streaming fashion, so the full dataset does not need to fit into memory.
Install
$ pip install csv-dataset
Usage
Suppose we have a CSV file whose absolute path is filepath:
open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column (open_time) and only pick the following
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)
The following shows the output of the first print:
[[[7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
[7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]]
[[7142.89, 7142.99, 7120.7, 7125.73, 118.279931]
[7125.76, 7134.46, 7123.12, 7123.12, 41.03628 ]
[7123.74, 7128.06, 7117.12, 7126.57, 39.885367]]]
...
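The shape of each element follows from the pipeline: window(3, 1) groups 3 consecutive rows, and batch(2) stacks 2 such windows, giving a (2, 3, 5) array. Here is a standalone numpy sketch of the same windowing-plus-batching arithmetic (an illustration of the semantics, not the library's internals):

```python
import numpy as np

# The six picked rows (columns 1-5 of the sample CSV above)
rows = np.array([
    [7145.99, 7150.0, 7141.01, 7142.33, 21.094283],
    [7142.89, 7142.99, 7120.7, 7125.73, 118.279931],
    [7125.76, 7134.46, 7123.12, 7123.12, 41.03628],
    [7123.74, 7128.06, 7117.12, 7126.57, 39.885367],
    [7127.34, 7137.84, 7126.71, 7134.99, 25.138154],
    [7134.99, 7144.13, 7132.84, 7141.64, 26.467308],
])

size, shift, batch = 3, 1, 2

# window(3, 1): each window holds 3 consecutive rows, advancing by 1 row
windows = np.stack([rows[i * shift : i * shift + size]
                    for i in range((len(rows) - size) // shift + 1)])

# batch(2): each element of the dataset stacks 2 windows
element = windows[:batch]
print(element.shape)  # (2, 3, 5)
```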
Dataset(reader: AbstractReader)
dataset.window(size: int, shift: int = None, stride: int = 1) -> self
Defines the window size, shift and stride.
The default window size is 1, which means the dataset has no window.
Parameter explanation
Suppose we have a raw dataset:

[ 1 2 3 4 5 6 7 8 9 ... ]

The following is a window of (size=4, shift=3, stride=2):

       |------------ size:4 ------------|
       |- stride:2 -|
win 0: [ 1    3    5    7 ]  ---+
                                | shift:3
win 1: [ 4    6    8   10 ]  ---+
win 2: [ 7    9   11   13 ]
...
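The diagram above can be reproduced with a few lines of plain Python: window i starts at index i * shift and picks size elements that are stride apart (this is a sketch of the semantics only; the assumption that shift defaults to size when unset is mine, not stated by the library):

```python
def window(data, size, shift=None, stride=1):
    """Yield windows: window i starts at data[i * shift] and
    picks `size` elements that are `stride` apart."""
    shift = shift if shift is not None else size  # assumed default
    i = 0
    while i * shift + (size - 1) * stride < len(data):
        start = i * shift
        yield data[start : start + size * stride : stride]
        i += 1

data = list(range(1, 14))  # [1 .. 13]
for i, win in enumerate(window(data, size=4, shift=3, stride=2)):
    print(f"win {i}: {win}")
# win 0: [1, 3, 5, 7]
# win 1: [4, 6, 8, 10]
# win 2: [7, 9, 11, 13]
```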
dataset.batch(batch: int) -> self
Defines the batch size.
The default batch size of the dataset is 1, which means each element contains a single window.
If batch is 2:
batch 0: [[ 1 3 5 7 ]
[ 4 6 8 10 ]]
batch 1: [[ 7 9 11 13 ]
[ 10 12 14 16 ]]
...
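Batching just groups consecutive windows. A minimal sketch of the grouping (independent of the library; that a trailing incomplete batch is dropped here is an assumption of this sketch):

```python
def batch(windows, n):
    """Group consecutive windows into batches of n, dropping a
    trailing incomplete batch."""
    return [windows[i : i + n] for i in range(0, len(windows) - n + 1, n)]

wins = [[1, 3, 5, 7], [4, 6, 8, 10], [7, 9, 11, 13], [10, 12, 14, 16]]
print(batch(wins, 2))
# [[[1, 3, 5, 7], [4, 6, 8, 10]], [[7, 9, 11, 13], [10, 12, 14, 16]]]
```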
dataset.get() -> Optional[np.ndarray]
Gets the data of the next batch.
dataset.reset() -> self
Resets the dataset.
dataset.read(amount: int, reset_buffer: bool = False)
Reads multiple batches at a time.
- amount: the maximum amount of data the dataset will read
- reset_buffer: if True, the dataset resets the data of the previous window kept in the buffer

If reset_buffer is True, the next read will not use existing data in the buffer, and the result will have no overlap with the last read.
dataset.reset_buffer() -> None
Resets the buffer, so that the next read will have no overlap with the last one.
dataset.lines_need(reads: int) -> int
Calculates and returns how many lines of the underlying data are needed to perform reads read operations.
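Assuming the windowing semantics illustrated above (window i starts at line i * shift and spans (size - 1) * stride + 1 lines, with batch windows consumed per read), the line count works out as follows. This is a hypothetical re-derivation of the arithmetic, not the library's code:

```python
def lines_need(reads, size, shift, stride, batch):
    # Hypothetical re-derivation; assumes the window semantics above.
    windows = reads * batch             # windows consumed in total
    span = (size - 1) * stride + 1      # lines covered by one window
    return (windows - 1) * shift + span

# One read of the (size=4, shift=3, stride=2) example with batch=2
# consumes win 0 and win 1, whose last element sits on line 10:
print(lines_need(1, size=4, shift=3, stride=2, batch=2))  # 10
```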
dataset.max_reads(max_lines: int) -> int | None
Calculates how many reads max_lines lines could afford.
dataset.max_reads() -> int | None
Calculates how many reads the current reader could afford.
If max_lines of the current reader is unset, it returns None.
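max_reads is essentially the inverse of lines_need. Under the same assumed semantics (a sketch of the arithmetic, not the library's code):

```python
def max_reads(max_lines, size, shift, stride, batch):
    # Hypothetical inverse of the lines_need arithmetic; an assumption.
    if max_lines is None or max_lines < 0:
        return None                      # unset limit: no answer
    span = (size - 1) * stride + 1       # lines covered by one window
    if max_lines < span:
        return 0                         # not even one full window fits
    windows = (max_lines - span) // shift + 1
    return windows // batch

print(max_reads(10, size=4, shift=3, stride=2, batch=2))  # 1
```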
CsvReader(filepath, dtype, indexes, **kwargs)
- filepath str: absolute path of the csv file
- dtype Callable: data type; only float or int should be used for this argument
- indexes List[int]: column indexes to pick from the lines of the csv file
- kwargs
  - header bool = False: whether the header line should be skipped
  - splitter str = ',': the column splitter of the csv file
  - normalizer List[NormalizerProtocol]: list of normalizers to normalize each column of data. A NormalizerProtocol should contain two methods: normalize(float) -> float to normalize the given datum, and restore(float) -> float to restore the normalized datum
  - max_lines int = -1: max lines of the csv file to be read; defaults to -1, which means no limit
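As a sketch of the normalizer parameter, here is a minimal class satisfying the NormalizerProtocol shape described above; the class name, the fixed scale factor, and its assignment to a column are illustrative assumptions:

```python
class ScaleNormalizer:
    """Illustrative NormalizerProtocol implementation: divide by a
    fixed factor to normalize, multiply by it to restore."""

    def __init__(self, factor: float):
        self.factor = factor

    def normalize(self, datum: float) -> float:
        return datum / self.factor

    def restore(self, datum: float) -> float:
        return datum * self.factor

# A power-of-two factor keeps the float round-trip exact
price = ScaleNormalizer(1024.0)
x = 7145.99
assert price.restore(price.normalize(x)) == x
```

One such normalizer per picked column would then be passed as normalizer=[...] to CsvReader.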
reader.reset()
Resets the reader position.
property reader.max_lines
Gets max_lines.
setter reader.max_lines = lines
Changes max_lines.
reader.readline() -> list
Returns the converted values of the next line.
property reader.lines
Returns the number of lines that have been read.