Lines Dataset

Dead simple standard for storing/loading datasets as lines of text. Supports zstd compression.

pip install lines-dataset

Format

A dataset folder looks like this:

my-dataset/
  meta.json
  my-inputs.txt
  my-compressed-labels.txt.zst
  other-labels.txt.zst
  ...

meta.json:

{
  "lines_dataset": {
    "inputs": {
      "file": "my-inputs.txt",
      "num_lines": 3000
    },
    "labels": {
      "file": "my-compressed-labels.txt.zst",
      "compression": "zstd",
      "num_lines": 3000
    },
    "other-labels": {
      "file": "other-labels.txt.zst",
      "compression": "zstd",
      "num_lines": 2000
    }
  }
}

"num_lines" is optional. Not all files need to have the same number of lines, as long as samples match line by line; the shortest file determines the length of the dataset. You can also add other keys next to "lines_dataset" if you want to.
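
To make the layout concrete, here is a minimal sketch that writes an uncompressed dataset folder using only the standard library. The helper name `write_plain_dataset` and the `<name>.txt` file naming are assumptions made for illustration; they are not part of lines_dataset. Compression is left out because zstd needs a third-party package.

```python
import json
from pathlib import Path


def write_plain_dataset(folder, columns):
    """Write uncompressed text files plus a matching meta.json.

    `columns` maps a column name to a list of single-line strings.
    Hypothetical helper for illustration, not part of lines_dataset.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    entries = {}
    for name, lines in columns.items():
        # One sample per line, so a sample must not contain a newline itself.
        if any("\n" in line for line in lines):
            raise ValueError(f"column {name!r} has a sample spanning several lines")
        filename = f"{name}.txt"
        with open(folder / filename, "w", encoding="utf-8", newline="\n") as f:
            f.writelines(line + "\n" for line in lines)
        entries[name] = {"file": filename, "num_lines": len(lines)}
    meta = {"lines_dataset": entries}
    (folder / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

Calling `write_plain_dataset('my-dataset', {'inputs': [...], 'labels': [...]})` produces a folder with `inputs.txt`, `labels.txt` and a `meta.json` in the shape described above.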

Usage

import lines_dataset as lds

ds = lds.Dataset.read('path/to/my-dataset')
num_samples = ds.len('inputs', 'labels') # int | None

for x in ds.samples('inputs', 'labels'):
  x['inputs'] # "the first line of my-inputs.txt\n"
  x['labels'] # "the decompressed first line of my-compressed-labels.txt.zst\n"
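
Conceptually, reading samples amounts to walking the listed files in lockstep. The sketch below shows that idea with the standard library only. It is not the library's implementation: `iter_samples` and `declared_len` are hypothetical names, compressed files are not handled, and returning `None` when any "num_lines" is missing is an assumption about why the length can be `None`.

```python
import json
from pathlib import Path


def _entries(folder):
    """Load the "lines_dataset" section of meta.json (must be strict JSON)."""
    meta = json.loads((Path(folder) / "meta.json").read_text(encoding="utf-8"))
    return meta["lines_dataset"]


def iter_samples(folder, *names):
    """Yield one dict per sample, reading the named files line by line."""
    folder = Path(folder)
    entries = _entries(folder)
    for name in names:
        if entries[name].get("compression"):
            raise NotImplementedError("this sketch only reads uncompressed files")
    handles = [open(folder / entries[n]["file"], encoding="utf-8") for n in names]
    try:
        # zip stops at the shortest file, so the shortest file sets the length.
        for lines in zip(*handles):
            yield dict(zip(names, lines))
    finally:
        for handle in handles:
            handle.close()


def declared_len(folder, *names):
    """Smallest declared num_lines, or None if any is missing (assumed behavior)."""
    entries = _entries(folder)
    counts = [entries[n].get("num_lines") for n in names]
    return None if any(c is None for c in counts) else min(counts)
```

Each yielded line keeps its trailing newline, matching the strings shown in the usage example.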

A common convenience is to load several datasets with a glob pattern and iterate over them as one:

import lines_dataset as lds

datasets = lds.glob('path/to/datasets/*') # list[lds.Dataset]
for x in lds.chain(datasets, 'inputs', 'labels'):
  ...
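
As a rough mental model, globbing and chaining can be pictured as finding every folder that contains a meta.json and concatenating their sample streams. The sketch below uses only the standard library. The names `find_dataset_folders` and `chain_samples` are hypothetical, the sorted folder order is an assumption, and compressed files are not handled, so this is an analogue rather than what `lds.glob` and `lds.chain` actually do.

```python
import glob
import json
from contextlib import ExitStack
from itertools import chain
from pathlib import Path


def find_dataset_folders(pattern):
    """Return matching folders that contain a meta.json, in sorted order."""
    return sorted(Path(p) for p in glob.glob(pattern) if (Path(p) / "meta.json").is_file())


def _read(folder, names):
    """Yield samples from one folder of uncompressed files, line by line."""
    meta = json.loads((folder / "meta.json").read_text(encoding="utf-8"))
    entries = meta["lines_dataset"]
    with ExitStack() as stack:
        files = [
            stack.enter_context(open(folder / entries[n]["file"], encoding="utf-8"))
            for n in names
        ]
        for lines in zip(*files):
            yield dict(zip(names, lines))


def chain_samples(folders, *names):
    """Iterate over the samples of each folder in turn, as one stream."""
    return chain.from_iterable(_read(Path(f), names) for f in folders)
```

Because `chain.from_iterable` is lazy, each folder's files are only opened once iteration reaches that folder.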

And that's it! Simple.
