Utility functions for processing TinyStories dataset by Eldan & Li

tinytok

DISCLAIMER: This README.md was written by GPT Grok | The docstrings for the functions were written by GPT Grok.

Simple utility functions for processing the TinyStories dataset (Eldan & Li), training a Byte-Pair Encoding (BPE) tokenizer, and creating tokenized sequences for training tiny transformer models.

Primarily made for personal use.
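
For readers new to BPE, the core idea of tokenizer training can be sketched in a few lines of plain Python: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. This is a toy illustration only, not how tinytok or the tokenizers library implements training; the function names below are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    # Start from single characters; each word maps to its frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


merges = train_toy_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

A production tokenizer additionally handles pre-tokenization, special tokens such as an EOS marker, and a target vocabulary size rather than a fixed merge count.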

Features

  • Read and concatenate .parquet text datasets
  • Optionally append EOS tokens and return raw text
  • Train a new BPE tokenizer with the tokenizers library
  • Tokenize text into PyTorch tensors using the trained tokenizer
  • Generate sequences for transformer model training
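
As a rough illustration of the last feature, next-token training pairs are commonly built by cutting a flat token tensor into fixed-length windows and using the same windows shifted by one position as targets. The sketch below assumes non-overlapping windows and a 1-D input tensor; `make_next_token_pairs` is a hypothetical helper, and tinytok's create_sequences may differ in its details.

```python
import torch


def make_next_token_pairs(tokens: torch.Tensor, context_len: int):
    """Split a 1-D token tensor into non-overlapping (input, target) windows.

    Each target row is the matching input row shifted one token to the right.
    Trailing tokens that do not fill a complete window are dropped.
    """
    # One extra token is needed so the last input position has a target.
    n_windows = (tokens.numel() - 1) // context_len
    usable = n_windows * context_len
    X = tokens[:usable].view(n_windows, context_len)
    y = tokens[1:usable + 1].view(n_windows, context_len)
    return X, y


tokens = torch.arange(10)  # stand-in for real token ids
X, y = make_next_token_pairs(tokens, context_len=4)
print(X.shape, y.shape)
```

Non-overlapping windows keep the dataset small; a sliding window with a smaller stride would produce more, but highly correlated, training examples.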

Installation

pip install tinytok

Example Usage

import torch
from tinytok import data_process, tokenize, train_new_tokenizer_bpe, create_sequences

model_tokenizer_name = 'EleutherAI/gpt-neo-1.3B'

file_1 = 'data/train1.parquet'
file_2 = 'data/train2.parquet'
file_3 = 'data/train3.parquet'
file_4 = 'data/train4.parquet'
file_val = 'data/validation.parquet'

# Train on a single shard for a quick run; uncomment the next line to use all four.
#files = [file_1, file_2, file_3, file_4]
files = [file_1]
file_val = [file_val]

# PARAMS -----------------
 
return_single_str = True
vocab_size = 10000
special_tokens = ['<|endoftext|>']
save_path = 'data/tokenizer.json'
return_freqs = False
return_flat_tnsr = True
create_train_test = True
context_len = 512
processes = 4

if __name__ == "__main__":

    data, data_str = data_process(
        files, 
        eos_str = special_tokens[0],
        return_single_str = return_single_str,
        processes = processes
        ) # data.shape -> (2119719, 1)

    tokenizer = train_new_tokenizer_bpe(
        data = data_str,
        vocab_size = vocab_size,
        special_tokens = special_tokens,
        save_path = save_path
    ) # tokenizer object

    data_tensor = tokenize(
        data = data,
        tokenizer = tokenizer,
        flat_tensor = True
    ) # List[torch.Tensor]

    X_train, y_train = create_sequences(
        data_tensor = data_tensor, 
        context_len = context_len,
        create_train_test = create_train_test,
        )

    torch.save(X_train, f = 'data/tensors/X_train')
    torch.save(y_train, f = 'data/tensors/y_train')

    # Validation split: read the validation file, not the training shards.
    data, data_str = data_process(
        file_val,
        eos_str = special_tokens[0],
        return_single_str = return_single_str
        )

    # Tokenize with the same settings as the training split.
    data_tensor = tokenize(
        data = data,
        tokenizer = tokenizer,
        flat_tensor = True
    )

    X_val, y_val = create_sequences(
        data_tensor = data_tensor, 
        context_len = context_len,
        create_train_test = create_train_test,
        )

    torch.save(X_val, f = 'data/tensors/X_val')
    torch.save(y_val, f = 'data/tensors/y_val')
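
Once the tensors are saved, they can be loaded back and batched for training. The sketch below is a minimal example that assumes the saved objects are 2-D tensors of shape (num_sequences, context_len); that shape is an assumption, not something this README states. To stay self-contained it writes random stand-in tensors to a temporary directory instead of reading data/tensors/.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, TensorDataset

context_len = 512
vocab_size = 10000

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_train"
    y_path = Path(tmp) / "y_train"

    # Stand-ins for the tensors produced by the example above (assumed shape).
    torch.save(torch.randint(0, vocab_size, (8, context_len)), x_path)
    torch.save(torch.randint(0, vocab_size, (8, context_len)), y_path)

    X_train = torch.load(x_path)
    y_train = torch.load(y_path)

# Pair inputs with targets and iterate in shuffled mini-batches.
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

In a real run, replace the temporary paths with 'data/tensors/X_train' and 'data/tensors/y_train', and make sure the data/tensors directory exists before calling torch.save.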

Requirements

  • torch
  • pandas
  • tqdm
  • tokenizers
