Lines Dataset

Dead simple standard for storing/loading datasets as lines of text. Supports zstd compression.

pip install lines-dataset

Format

A dataset folder looks like this:

my-dataset/
  meta.json
  my-inputs.txt
  my-compressed-labels.txt.zst
  other-labels.txt.zst
  ...

meta.json:

{
  "lines_dataset": {
    "inputs": {
      "file": "my-inputs.txt",
      "num_lines": 3000
    },
    "labels": {
      "file": "my-compressed-labels.txt.zst",
      "compression": "zstd",
      "num_lines": 3000
    },
    "other-labels": {
      "file": "other-labels.txt.zst",
      "compression": "zstd",
      "num_lines": 2000
    }
  }
}

Specifying "num_lines" is optional. Not all files need to have the same number of lines, as long as samples match line by line; the shortest file determines the length of the dataset. You can add other keys next to "lines_dataset" if you want to.
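To make the layout concrete, here is a minimal sketch that writes such a folder by hand using only the standard library. It is not the package's own writer: `write_plain_dataset` is a hypothetical helper, it names each file after its key, and it only produces uncompressed files (a zstd-compressed file would also need the "compression": "zstd" entry and a zstd writer, which is not shown).

```python
import json
import tempfile
from pathlib import Path


def write_plain_dataset(folder, columns):
    """Write each column as a text file (one sample per line) plus meta.json.

    `columns` maps a key such as "inputs" to a list of single-line strings.
    Hypothetical helper for illustration, not part of lines_dataset.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    entries = {}
    for key, lines in columns.items():
        # One sample per line only works if no sample contains a newline.
        if any("\n" in line for line in lines):
            raise ValueError(f"samples in {key!r} must not contain newlines")
        filename = f"{key}.txt"  # assumed naming; the format allows any file name
        with open(folder / filename, "w", encoding="utf-8", newline="\n") as f:
            f.writelines(line + "\n" for line in lines)
        entries[key] = {"file": filename, "num_lines": len(lines)}
    meta = {"lines_dataset": entries}
    (folder / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


demo_dir = Path(tempfile.mkdtemp()) / "my-dataset"
demo_meta = write_plain_dataset(
    demo_dir,
    {"inputs": ["a", "b", "c"], "labels": ["0", "1", "0"]},
)
```

Writing "num_lines" at creation time is cheap here because the line count is already known; whether the library requires it for anything beyond reporting a length is not stated above.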

Usage

import lines_dataset as lds

ds = lds.Dataset.read('path/to/my-dataset')
num_samples = ds.len('inputs', 'labels') # int | None

for x in ds.samples('inputs', 'labels'):
  x['inputs'] # "the first line of my-inputs.txt\n"
  x['labels'] # "the decompressed first line of my-compressed-labels.txt.zst\n"
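For intuition about what matching samples line by line means, the following standard-library sketch pairs lines from uncompressed files and stops at the shortest one. It is an illustration, not the package's implementation: `iter_samples` and `effective_length` are hypothetical names, and returning `None` when any "num_lines" is missing is an assumption made here to mirror the `int | None` annotation above.

```python
import json
import tempfile
from pathlib import Path


def effective_length(meta, *keys):
    """Smallest num_lines among the requested keys, or None if any is unspecified."""
    counts = [meta["lines_dataset"][k].get("num_lines") for k in keys]
    if any(c is None for c in counts):
        return None
    return min(counts)


def iter_samples(folder, *keys):
    """Yield one dict per line position, stopping at the shortest file."""
    folder = Path(folder)
    meta = json.loads((folder / "meta.json").read_text(encoding="utf-8"))
    files = []
    try:
        for key in keys:
            entry = meta["lines_dataset"][key]
            if entry.get("compression"):
                raise NotImplementedError("this sketch only reads uncompressed files")
            files.append(open(folder / entry["file"], encoding="utf-8"))
        # zip() advances all files together and ends when any one runs out.
        for lines in zip(*files):
            yield dict(zip(keys, lines))
    finally:
        for f in files:
            f.close()


# Small demo folder: three inputs but only two labels.
demo = Path(tempfile.mkdtemp())
(demo / "inputs.txt").write_text("a\nb\nc\n", encoding="utf-8")
(demo / "labels.txt").write_text("0\n1\n", encoding="utf-8")
demo_meta = {
    "lines_dataset": {
        "inputs": {"file": "inputs.txt", "num_lines": 3},
        "labels": {"file": "labels.txt", "num_lines": 2},
    }
}
(demo / "meta.json").write_text(json.dumps(demo_meta), encoding="utf-8")
rows = list(iter_samples(demo, "inputs", "labels"))
```

Each yielded string keeps its trailing newline, as in the usage comments above, so callers that want bare text need to strip it themselves.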

A common convenience is to load several datasets with a glob and iterate over them as one chain:

import lines_dataset as lds

datasets = lds.glob('path/to/datasets/*') # list[lds.Dataset]
for x in lds.chain(datasets, 'inputs', 'labels'):
  ...
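Conceptually, chaining means iterating the samples of one dataset after another. The sketch below shows that idea with `itertools.chain` over plain folders. It is a simplification under stated assumptions: `read_pairs` and `chain_folders` are hypothetical helpers, each folder is assumed to hold uncompressed `inputs.txt` and `labels.txt` files (meta.json is ignored), and visiting folders in sorted path order is a choice made here, not documented behavior of `lds.glob` or `lds.chain`.

```python
import itertools
import tempfile
from pathlib import Path


def read_pairs(folder):
    """Yield samples from one folder holding inputs.txt and labels.txt (assumed names)."""
    with open(folder / "inputs.txt", encoding="utf-8") as xs, \
         open(folder / "labels.txt", encoding="utf-8") as ys:
        for x, y in zip(xs, ys):
            yield {"inputs": x, "labels": y}


def chain_folders(root, pattern):
    """Concatenate the samples of every matching folder, in sorted path order."""
    folders = sorted(p for p in Path(root).glob(pattern) if p.is_dir())
    # chain.from_iterable is lazy: a folder is only opened once the previous one is exhausted.
    return itertools.chain.from_iterable(read_pairs(f) for f in folders)


# Demo: two tiny datasets under one root.
root = Path(tempfile.mkdtemp())
for name, inputs, labels in [("a", "x1\nx2\n", "0\n1\n"), ("b", "x3\n", "1\n")]:
    (root / name).mkdir()
    (root / name / "inputs.txt").write_text(inputs, encoding="utf-8")
    (root / name / "labels.txt").write_text(labels, encoding="utf-8")

chained = list(chain_folders(root, "*"))
```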

And that's it! Simple.
