A continuously updates repository of utility functions.

These details have not been verified by PyPI

Project description

A vast range of python utils

This continuously updates with more utils that I add.

Installation

pip install ykutil
Or clone this repo and pip install -e .

Overview of all implemented utilities

Here is an overview of what utilities are implemented at this point.

Basic python tools

List Utilities

list_rindex(li, x): Find the last index of x in li.
list_split(li, max_len, min_len=None): Split li into sublists of length max_len.
split_multi(lists, max_len, progress=False, min_len=None): Split multiple lists into sublists.
list_multiply(elem_list, mul_list): Multiply elements of elem_list by corresponding elements in mul_list.
list_squeeze(l): Recursively squeeze single-element lists.
list_flip(lst): Flip list values based on max and min values.
chunk_list(lst, n): Yield consecutive chunks of size n from lst.
flatten(li): Flatten a list of lists.
unique_n_times(lst, n, invalid_filter=set(), verbose=False, comboer=None, shuffle=False): Get indices of the first n occurrences of each unique element in lst.
make_list_unique(seq): Remove duplicates from list.
all_sublist_matches(lst, sublst): Find all sublist matches.
removesuffixes(lst, suffix): Remove suffixes from a list.
approx_list_split(lst, n_splits): Split a list in n_splits parts of about equal length.

String Utilities

multify_text(text, roles): Format text with multiple roles.
naive_regex_escape(some_str): Escape regex metacharacters in a string.
str_find_all(string, sub): Find all occurrences of sub in string.
re_line_matches(string, regex): Find line matches for a regex pattern.

Dictionary Utilities

transpose_li_of_dict(lidic): Transpose a list of dictionaries.
transpose_dict_of_li(d): Transpose a dictionary of lists.
dict_percentages(d): Convert dictionary values to percentages.
recursed_dict_percentages(d): Recursively convert dictionary values to percentages.
recursed_merge_percent_stats(lst, weights=None): Merge percentage statistics recursively.
recursed_sum_up_stats(lst): Sum up statistics recursively.
dict_without(d, without): Return a dictionary without specified keys.

General Utilities

identity(x): Return x.
index_of_sublist_match(haystack, needle): Find the index of a sublist match.
nth_index(lst, value, n): Find the nth occurrence of a value in a list.
update_running_avg(old_avg, old_weight, new_avg, new_weight=1): Update a running average.
all_equal(iterable, force_value=None): Check if all elements in an iterable are equal.
approx_number_split(n, n_splits): Split a number into a list of close integers that sum ut to the number.
anyin(haystack, needles): Predicate to check if any element from needles is in haystack

Huggingface datasets utilities

Dataset Description

describe_dataset(ds, tokenizer=None, show_rows=(0, 3)): Print metadata, columns, number of rows, and example rows of a dataset.

Dataset Visualization

colorcode_dataset(dd, tk, num_start=5, num_end=6, data_key="train", fname=None, beautify=True): Color-code and print dataset entries with optional beautification.
colorcode_entry(token_ds_path, fname=None, tokenizer_path="mistralai/Mistral-7B-v0.1", num_start=0, num_end=1, beautify=True): Load a dataset from disk and color-code its entries.

Huggingface transformers utilities

Tokenization

batch_tokenization(tk, texts, batch_size, include_offsets=False, **tk_args): Tokenize large batches of texts.
tokenize_instances(tokenizer, instances): Tokenize a sequence of instances.
flat_encode(tokenizer, inputs, add_special_tokens=False): Flatten and encode inputs.
get_token_begins(encoding): Get the beginning positions of tokens.
tokenize(tk_name, text): Tokenize a text using a specified tokenizer.
untokenize(tk_name, tokens): Decode tokens using a specified tokenizer.

Generation

generate_different_sequences(model, context, sequence_bias_add, sequence_bias_decay, generation_args, num_generations): Generate different sequences with bias adjustments.
TokenStoppingCriteria(delimiter_token): Stopping criteria based on a delimiter token.

Offsets and Spans

obtain_offsets(be, str_lengths=None): Obtain offsets from a batch encoding.
transform_with_offsets(offsets, spans, include_left=True, include_right=True): Transform spans using offsets.
regex_tokens_using_offsets(offsets, text, regex, include_left=True, include_right=True): Find tokens matching a regex using offsets.

Tokenizer Utilities

find_tokens_with_str(tokenizer, string): Find tokens containing a specific string.
load_tk_with_pad_tk(model_path): Load a tokenizer and set the pad token if not set.

Data Collation

DataCollatorWithPadding: A data collator that pads sequences to the same length.

Chat Template

dict_from_chat_template(chat_template_str, tk_type="llama3"): Convert a chat template string to a dictionary.

Training and Evaluation

train_eval_and_get_metrics(trainer, checkpoint=None): Train a model, evaluate it, and return the metrics.

Model Utilities

print_trainable_parameters(model_args, model): Print the number of trainable parameters in the model.
print_parameters_by_dtype(model): Print the number of parameters by data type.
smart_tokenizer_and_embedding_resize(special_tokens_dict, tokenizer, model): Resize tokenizer and embedding.
find_all_linear_names(model, bits=32): Find all linear layer names in the model.

Torch utilities

Tensor Operations

rolling_window(a, size): Create a rolling window view of the input tensor.
find_all_subarray_poses(arr, subarr, end=False, roll_window=None): Find all positions of a subarray within an array.
tensor_in(needles, haystack): Check if elements of one tensor are in another tensor.
pad_along_dimension(tensors, dim, pad_value=0): Pad a list of tensors along a specified dimension.

Model Utilities

disable_gradients(model): Disable gradients for all parameters in a model.

Memory Management

print_memory_info(): Print CUDA memory information.
free_cuda_memory(): Free CUDA memory by collecting garbage and emptying the cache.

Device Management

get_max_memory_and_device_map(max_memory_mb): Get the maximum memory and device map for distributed settings.

Transformer Heads Utilities

Sequence Log Probability

compute_seq_log_probability(model, pre_seq_tokens, post_seq_tokens): Compute the log probability of a sequence given a model and token sequences.

LLM API Utilities

Image Utilities

local_image_to_data_url(image_path): Encode a local image into a data URL.

Message Utilities

human_readable_parse(messages): Parse messages into a human-readable format.

Model Wrappers

ModelWrapper: A wrapper class for models to handle completions and structured completions.
AzureModelWrapper: A specialized wrapper for Azure OpenAI models with cost computation.

Executable

Bulk Rename

do_bulk_rename(): Execute bulk renaming of files.

JSON Beautification

beautify_json(json_str): Beautify a JSON string.
do_beautify_json(): Command-line interface for beautifying a JSON string.

Dataset Description

describe_dataset(ds_name, tokenizer_name=None, show_rows=(0, 3)): Describe a dataset with optional tokenization.
do_describe_dataset(): Command-line interface for describing a dataset.

Tokenization

tokenize(tk, text): Tokenize a text using a specified tokenizer.
do_tokenize(): Command-line interface for tokenizing a text.

Untokenization

untokenize(tk, tokens): Decode tokens using a specified tokenizer.
do_untokenize(): Command-line interface for untokenizing tokens.

Dataset Color Coding

colorcode_entry(token_ds_path, fname=None, tokenizer_path="mistralai/Mistral-7B-v0.1", num_start=0, num_end=1, beautify=True): Load a dataset from disk and color-code its entries.
do_colorcode_dataset(): Command-line interface for color-coding a dataset.

Python Data Modelling Utilities

summed_stat_dc(datas, avg_keys=(), weight_attr="num_examples"): Summarize statistics from a list of dataclass instances.
undefaultdict(d, do_copy=False): Convert defaultdict to dict recursively.
stringify_tuple_keys(d): Convert tuple keys in a dictionary to strings.
Serializable: A base class to make dataclasses JSON serializable and hashable.
sortedtuple(sort_fun, fixed_len=None): Create a sorted tuple type with a custom sorting function.

Statistics tools

Statlogger: A class for logging and updating statistics.
Welfords: A class for Welford's online algorithm for computing mean and variance.

Pretty print tools

Object description

describe_recursive(l, types, lengths, arrays, dict_keys, depth=0): Recursively describe the structure of a list or tuple.
describe_list(l, no_empty=True): Describe the structure of a list or tuple.
describe_array(arr): Describe the properties of a numpy array or torch tensor.

Logging Utilities

add_file_handler(file_path): Add a file handler to the logger.
log(*messages, level=logging.INFO): Log messages with a specified logging level.

Miscellaneous

PEFT Utilities

load_maybe_peft_model_tokenizer(model_path, device_map="auto", quantization_config=None, flash_attn="flash_attention_2", only_inference=True): Load a PEFT model and tokenizer, with optional quantization and flash attention.

Pandera Utilities

empty_dataframe_from_model(Model): Create an empty DataFrame from a Pandera DataFrameModel.

Multiprocessing Utilities

starmap_with_kwargs(pool, fn, args_iter, kwargs_iter): Apply a function to arguments and keyword arguments in parallel using a pool.
apply_args_and_kwargs(fn, args, kwargs): Apply a function to arguments and keyword arguments.
run_in_parallel(func, list_ordered_kwargs, num_workers, extra_kwargs={}): Run a function in parallel with specified arguments and number of workers.

Constants

IGNORE_INDEX: Constant for ignore index.
DEFAULT_PAD_TOKEN: Default padding token.
SPACE_TOKENIZERS: Tuple of space tokenizers.

Configuration Utilities

from_file(cls, config_file, **argmod): Load a configuration from a file and apply modifications.

Accelerate Tools

gather_dict(d, strict=False): Gather a dictionary across multiple processes.

Types Utilities

describe_type(o): Pretty print the type of an object
T and U: TypeVars ready to use

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.0.20

Feb 19, 2026

0.0.19

Nov 20, 2025

0.0.18

Jun 10, 2025

0.0.17

Jun 8, 2025

0.0.16

Jun 8, 2025

0.0.14

Apr 8, 2025

0.0.13

Apr 8, 2025

0.0.12

Apr 7, 2025

0.0.11

Mar 11, 2025

0.0.10

Feb 26, 2025

0.0.8

Feb 12, 2025

This version

0.0.7

Jan 29, 2025

0.0.6

Jan 28, 2025

0.0.5

Jan 8, 2025

0.0.4

Nov 30, 2024

0.0.3

Nov 29, 2024

0.0.2

Nov 29, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ykutil-0.0.7.tar.gz (28.0 kB view details)

Uploaded Jan 29, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ykutil-0.0.7-py3-none-any.whl (34.3 kB view details)

Uploaded Jan 29, 2025 Python 3

File details

Details for the file ykutil-0.0.7.tar.gz.

File metadata

Download URL: ykutil-0.0.7.tar.gz
Upload date: Jan 29, 2025
Size: 28.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for ykutil-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`7d0ba695b82b3f715b2b9398bd187e2698f449cc34187d18c486e1a157804aeb`
MD5	`b4c90483e1f9408c4dd009c6cfa8e128`
BLAKE2b-256	`ed4da3b9316f50067a11b8f814142a157bd0418b3983cb81d29757ce364f47ed`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ykutil-0.0.7.tar.gz:

Publisher: publish_to_pypi.yml on yannikkellerde/ykutil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ykutil-0.0.7.tar.gz
- Subject digest: 7d0ba695b82b3f715b2b9398bd187e2698f449cc34187d18c486e1a157804aeb
- Sigstore transparency entry: 166637187
- Sigstore integration time: Jan 29, 2025
Source repository:
- Permalink: yannikkellerde/ykutil@7f59489ee11ce9c08aa83a0a2d0ac4e5ab2df605
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yannikkellerde
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@7f59489ee11ce9c08aa83a0a2d0ac4e5ab2df605
- Trigger Event: workflow_dispatch

File details

Details for the file ykutil-0.0.7-py3-none-any.whl.

File metadata

Download URL: ykutil-0.0.7-py3-none-any.whl
Upload date: Jan 29, 2025
Size: 34.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for ykutil-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e7cf93580e02acd4301782fe267d68b36081b186f661a446fe715e77d8de580`
MD5	`f101c638171d6a282c205ab88314bc67`
BLAKE2b-256	`e7b3a47d14225b1897743b4dfbd1334faabc40fae15a1cd43bf564e7590c44d2`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ykutil-0.0.7-py3-none-any.whl:

Publisher: publish_to_pypi.yml on yannikkellerde/ykutil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ykutil-0.0.7-py3-none-any.whl
- Subject digest: 8e7cf93580e02acd4301782fe267d68b36081b186f661a446fe715e77d8de580
- Sigstore transparency entry: 166637189
- Sigstore integration time: Jan 29, 2025
Source repository:
- Permalink: yannikkellerde/ykutil@7f59489ee11ce9c08aa83a0a2d0ac4e5ab2df605
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yannikkellerde
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@7f59489ee11ce9c08aa83a0a2d0ac4e5ab2df605
- Trigger Event: workflow_dispatch

ykutil 0.0.7

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

A vast range of python utils

Installation

Overview of all implemented utilities

Basic python tools

List Utilities

String Utilities

Dictionary Utilities

General Utilities

Huggingface datasets utilities

Dataset Description

Dataset Visualization

Huggingface transformers utilities

Tokenization

Generation

Offsets and Spans

Tokenizer Utilities

Data Collation

Chat Template

Training and Evaluation

Model Utilities

Torch utilities

Tensor Operations

Model Utilities

Memory Management

Device Management

Transformer Heads Utilities

Sequence Log Probability

LLM API Utilities

Image Utilities

Message Utilities

Model Wrappers

Executable

Bulk Rename

JSON Beautification

Dataset Description

Tokenization

Untokenization

Dataset Color Coding

Python Data Modelling Utilities

Statistics tools

Pretty print tools

Object description

Logging Utilities

Miscellaneous

PEFT Utilities

Pandera Utilities

Multiprocessing Utilities

Constants

Configuration Utilities

Accelerate Tools

Types Utilities

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance