Lines Dataset
Dead simple standard for storing/loading datasets as lines of text. Supports zstd compression.
pip install lines-dataset
Format
A dataset folder looks like this:
my-dataset/
    meta.json
    my-inputs.txt
    my-compressed-labels.txt.zst
    other-labels.txt.zst
    ...
meta.json:
{
    "lines_dataset": {
        "inputs": {
            "file": "my-inputs.txt",
            "num_lines": 3000 // optionally specify the number of lines
        },
        "labels": {
            "file": "my-compressed-labels.txt.zst",
            "compression": "zstd",
            "num_lines": 3000
        },
        "other-labels": {
            "file": "other-labels.txt.zst",
            "compression": "zstd",
            // Files do not all need the same number of lines, as long as
            // samples match line by line. The shortest file determines the
            // length of the dataset.
            "num_lines": 2000
        }
    }
    // you can add other top-level keys if you want to
}
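To make the layout concrete, here is a minimal sketch that writes an uncompressed dataset folder by hand, using only the standard library. `write_lines_dataset` is a hypothetical helper, not part of `lines-dataset`, and it assumes that a plain-text entry needs only the `file` and `num_lines` keys shown above.

```python
import json
import tempfile
from pathlib import Path


def write_lines_dataset(folder, columns):
    """Write {name: list of single-line strings} as a dataset folder.

    Hypothetical helper for illustration; not part of lines-dataset.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    meta = {"lines_dataset": {}}
    for name, lines in columns.items():
        filename = f"{name}.txt"
        # One sample per line, so a sample must not contain a newline itself.
        with open(folder / filename, "w", encoding="utf-8", newline="\n") as f:
            for line in lines:
                f.write(line + "\n")
        meta["lines_dataset"][name] = {"file": filename, "num_lines": len(lines)}
    # json.dump emits strict JSON, so the annotations shown above are left out.
    with open(folder / "meta.json", "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
    return meta


demo_dir = Path(tempfile.mkdtemp()) / "my-dataset"
demo_meta = write_lines_dataset(
    demo_dir,
    {"inputs": ["hello", "world"], "labels": ["greeting", "noun"]},
)
```

Compressed entries would additionally need a zstd encoder and the `"compression": "zstd"` key; that part is left out here to keep the sketch dependency-free.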
Usage
import lines_dataset as lds

ds = lds.Dataset.read('path/to/my-dataset')
num_samples = ds.len('inputs', 'labels')  # int | None
for x in ds.samples('inputs', 'labels'):
    x['inputs']  # "the first line of my-inputs.txt\n"
    x['labels']  # "the decompressed first line of my-compressed-labels.txt.zst\n"
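The pairing rule from the Format section (samples match line by line, and the shortest file sets the length) behaves like Python's built-in `zip`. The sketch below illustrates that rule on in-memory text streams. It is not how `lines_dataset` is implemented, and `iter_samples` is a hypothetical name.

```python
import io


def iter_samples(streams):
    """Yield one dict per line index, stopping at the shortest stream.

    `streams` maps a column name to an open text file (or any iterable of lines).
    """
    names = list(streams)
    for row in zip(*(streams[name] for name in names)):
        yield dict(zip(names, row))


inputs = io.StringIO("a\nb\nc\n")
labels = io.StringIO("1\n2\n")  # shorter stream: it limits the number of samples
samples = list(iter_samples({"inputs": inputs, "labels": labels}))
```

Each value keeps its trailing newline, matching the strings shown in the usage example.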
A common convenience is to glob several dataset folders and iterate over them as one:
import lines_dataset as lds

datasets = lds.glob('path/to/datasets/*')  # list[lds.Dataset]
for x in lds.chain(datasets, 'inputs', 'labels'):
    ...
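As a rough mental model, and assuming `lds.chain` yields the samples of each dataset in turn (the name suggests this, but the README does not spell it out), the behavior resembles `itertools.chain.from_iterable`. The sketch below uses plain lists as stand-ins for per-dataset sample streams; `chain_samples` is a hypothetical name.

```python
from itertools import chain


def chain_samples(sample_iterables):
    """Yield every sample of the first iterable, then the second, and so on."""
    return chain.from_iterable(sample_iterables)


# Stand-ins for two datasets' sample streams (made-up data).
ds_a = [{"inputs": "a\n", "labels": "1\n"}]
ds_b = [{"inputs": "b\n", "labels": "2\n"}, {"inputs": "c\n", "labels": "3\n"}]
chained = list(chain_samples([ds_a, ds_b]))
```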
And that's it! Simple.