Lines Dataset
Dead simple standard for storing/loading datasets as lines of text. Supports zstd compression.
pip install lines-dataset
Format
A dataset folder looks like this:
my-dataset/
    meta.json
    my-inputs.txt
    my-compressed-labels.txt.zst
    other-labels.txt.zst
    ...
meta.json (shown with // comments as annotations):
{
    "lines_dataset": {
        "inputs": {
            "file": "my-inputs.txt",
            "num_lines": 3000 // optional: the number of lines in the file
        },
        "labels": {
            "file": "my-compressed-labels.txt.zst",
            "compression": "zstd",
            "num_lines": 3000
        },
        "other-labels": {
            "file": "other-labels.txt.zst",
            "compression": "zstd",
            // Files may differ in length, as long as samples match line by line.
            // The shortest file determines the length of the dataset.
            "num_lines": 2000
        }
    }
    // you can add other keys here if you want to
}
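As a rough illustration of the layout above, the following sketch writes a dataset folder by hand using only the Python standard library. It does not use the lines_dataset package; the helper name write_lines_dataset and the "<name>.txt" file naming are assumptions made for this example, and compression is left out to keep it dependency-free.

```python
import json
from pathlib import Path


def write_lines_dataset(folder, columns):
    """Hypothetical helper: write `columns` (name -> list of one-line strings)
    as an uncompressed dataset folder following the layout described above."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    entries = {}
    for name, lines in columns.items():
        # The format is line-based, so a sample must not contain a newline.
        if any("\n" in line for line in lines):
            raise ValueError(f"{name}: each sample must fit on one line")
        filename = f"{name}.txt"  # naming scheme assumed for this sketch
        with open(root / filename, "w", encoding="utf-8", newline="\n") as f:
            f.writelines(line + "\n" for line in lines)
        entries[name] = {"file": filename, "num_lines": len(lines)}
    # Written as strict JSON, without the annotation comments.
    with open(root / "meta.json", "w", encoding="utf-8") as f:
        json.dump({"lines_dataset": entries}, f, indent=2)
    return root
```

Calling write_lines_dataset("my-dataset", {"inputs": [...], "labels": [...]}) would produce a meta.json with one entry per column and a num_lines field for each file.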
Usage
import lines_dataset as lds

ds = lds.Dataset.read('path/to/my-dataset')
num_samples = ds.len('inputs', 'labels')  # int | None
for x in ds.samples('inputs', 'labels'):
    x['inputs']  # "the first line of my-inputs.txt\n"
    x['labels']  # "the decompressed first line of my-compressed-labels.txt.zst\n"
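To make the line-by-line pairing concrete, here is a minimal standard-library sketch of how samples could be read from such a folder. It is not the package's implementation: iter_samples and dataset_len are hypothetical names, the sketch assumes meta.json is strict JSON, it skips compressed files, and the rule that the length is None when a file omits num_lines is a guess at what the "int | None" annotation means.

```python
import json
from pathlib import Path


def dataset_len(meta, *keys):
    """Smallest declared num_lines across `keys`, or None if any key omits it
    (assumed semantics, not confirmed by the package)."""
    counts = [meta["lines_dataset"][k].get("num_lines") for k in keys]
    return None if any(c is None for c in counts) else min(counts)


def iter_samples(folder, *keys):
    """Yield one dict per sample, pairing the selected files line by line."""
    root = Path(folder)
    meta = json.loads((root / "meta.json").read_text(encoding="utf-8"))
    files = []
    try:
        for key in keys:
            entry = meta["lines_dataset"][key]
            if "compression" in entry:
                raise NotImplementedError("compressed files are out of scope for this sketch")
            files.append(open(root / entry["file"], encoding="utf-8", newline="\n"))
        # zip stops at the shortest file, so that file sets the dataset length.
        for lines in zip(*files):
            yield dict(zip(keys, lines))
    finally:
        for f in files:
            f.close()
```

Each yielded line keeps its trailing newline, matching the values shown in the usage example.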
A common convenience is to glob several dataset folders and chain their samples:
import lines_dataset as lds

datasets = lds.glob('path/to/datasets/*')  # list[lds.Dataset]
for x in lds.chain(datasets, 'inputs', 'labels'):
    ...
And that's it! Simple.