Simple tools for storing inputs + labels datasets as one sample per line
Lines Dataset
A dead-simple standard for storing and loading datasets as lines of text, one sample per line. Supports zstd compression.
pip install lines-dataset
Format
A dataset folder looks like this:
my-dataset/
    meta.json
    my-inputs.txt
    my-compressed-labels.txt.zst
    other-labels.txt.zst
    ...
meta.json:
{
  "lines-dataset": {
    "inputs": {
      "file": "my-inputs.txt",
      "samples": 3000 // optional: the number of lines in the file
    },
    "labels": {
      "file": "my-compressed-labels.txt.zst",
      "compression": "zstd",
      "samples": 3000
    },
    "other-labels": {
      "file": "other-labels.txt.zst",
      "compression": "zstd",
      "samples": 2000
      // Files may have different line counts, as long as samples match line by line.
      // The shortest file determines the length of the dataset.
    }
  }
  // you can add other keys here if you want to
}
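To make the layout concrete, here is a minimal sketch that writes and reads such a folder using only the Python standard library. It is not the lines-dataset implementation: `write_dataset` and `read_samples` are hypothetical helper names, the files are left uncompressed, and the `meta.json` it produces is strict JSON without comments, because Python's `json` module does not accept `//` comments.

```python
import json
from pathlib import Path


def write_dataset(folder, columns):
    """Write each column as one sample per line and record it in meta.json.

    `columns` maps a key (e.g. "inputs") to a list of single-line strings.
    Hypothetical helper for illustration; not part of lines-dataset.
    """
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    meta = {"lines-dataset": {}}
    for key, samples in columns.items():
        # One sample per line only works if no sample contains a newline.
        if any("\n" in s for s in samples):
            raise ValueError(f"samples in {key!r} must not contain newlines")
        filename = f"{key}.txt"
        text = "".join(s + "\n" for s in samples)
        (root / filename).write_text(text, encoding="utf-8")
        meta["lines-dataset"][key] = {"file": filename, "samples": len(samples)}
    (root / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


def read_samples(folder, *keys):
    """Yield dicts of raw lines for the given keys, stopping at the shortest file."""
    root = Path(folder)
    meta = json.loads((root / "meta.json").read_text(encoding="utf-8"))
    entries = meta["lines-dataset"]
    files = [open(root / entries[k]["file"], encoding="utf-8") for k in keys]
    try:
        # zip() stops as soon as any file runs out of lines.
        for lines in zip(*files):
            yield dict(zip(keys, lines))
    finally:
        for f in files:
            f.close()
```

Pairing the files with `zip` is one simple way to get the "shortest file determines the length" behavior described in the comment above: iteration ends when the first file is exhausted.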
import lines_dataset as lds

ds = lds.Dataset.read('path/to/my-dataset')
num_samples = ds.len('inputs', 'labels')  # int | None
for x in ds.samples('inputs', 'labels'):
    x['inputs']  # "the first line of my-inputs.txt\n"
    x['labels']  # "the decompressed first line of my-compressed-labels.txt.zst\n"
And that's it! Simple.
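Each sample value keeps its trailing newline, as the comments in the example show. If you want clean strings, a small wrapper can strip the line endings. This is a sketch, not part of lines-dataset: `strip_newlines` is a hypothetical helper, and it assumes only that each sample is a dict of strings, as in the loop above.

```python
def strip_newlines(samples):
    """Yield a copy of each sample dict with trailing line endings removed."""
    for sample in samples:
        yield {key: line.rstrip("\r\n") for key, line in sample.items()}


# Stand-in data shaped like the samples shown above.
raw = [
    {"inputs": "hello world\n", "labels": "greeting\n"},
    {"inputs": "bye\n", "labels": "farewell\n"},
]
cleaned = list(strip_newlines(raw))
```

Because the helper is a generator, it can wrap any iterable of samples without loading the whole dataset into memory.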