TensorFlow libs: layers, metrics, ops, etc.
MLable
Installation
The package is available on PyPI:
pip install -U mlable
Layers
Divide
A relative reshaping layer that divides a given axis and multiplies another by the same factor:
import tensorflow as tf
import mlable.layers.reshaping

__x = tf.ones(shape=(2, 4, 6, 8))
__l = mlable.layers.reshaping.Divide(
    input_axis=2, # relative to the NEW shape / rank
    output_axis=-1, # same
    factor=3,
    insert=False,) # whether to create a new axis
list(__l(__x).shape)
# [2, 4, 2, 24]
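The shape arithmetic behind the example above can be sketched in plain Python. This is only an illustration of the case `insert=False`, not the layer's actual implementation; the helper name `divide_shape` is hypothetical.

```python
def divide_shape(shape, input_axis, output_axis, factor):
    """Shape after dividing one axis and multiplying another by `factor`."""
    dims = list(shape)
    rank = len(dims)
    # normalize negative axes, e.g. -1 -> rank - 1
    src = input_axis % rank
    dst = output_axis % rank
    if dims[src] % factor != 0:
        raise ValueError("the input axis must be divisible by the factor")
    dims[src] //= factor  # 6 -> 2 in the example
    dims[dst] *= factor   # 8 -> 24 in the example
    return dims

print(divide_shape((2, 4, 6, 8), input_axis=2, output_axis=-1, factor=3))
# [2, 4, 2, 24]
```

The total number of elements is unchanged, which is why the operation can be expressed as a reshape.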
Merge
A relative reshaping layer that merges two axes:
import tensorflow as tf
import mlable.layers.reshaping

__x = tf.ones(shape=(2, 4, 6, 8))
__l = mlable.layers.reshaping.Merge(
    left_axis=1,
    right_axis=-1,
    left=False,) # whether to merge into the left axis
list(__l(__x).shape)
# [2, 6, 32]
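As a rough illustration of the resulting shape, the merge can be mimicked on a plain list of dimensions. The helper `merge_shape` is hypothetical and only tracks the shape; it says nothing about how the layer orders the data when the two axes are not adjacent.

```python
def merge_shape(shape, left_axis, right_axis, left=True):
    """Shape after merging two axes into the left or the right one."""
    dims = list(shape)
    rank = len(dims)
    # normalize negative axes and order them from left to right
    a, b = sorted((left_axis % rank, right_axis % rank))
    if a == b:
        raise ValueError("the two axes must be different")
    merged = dims[a] * dims[b]
    # the kept axis receives the product, the other one disappears
    keep, drop = (a, b) if left else (b, a)
    dims[keep] = merged
    del dims[drop]
    return dims

print(merge_shape((2, 4, 6, 8), left_axis=1, right_axis=-1, left=False))
# [2, 6, 32]
```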
TokunEmbedding
These embeddings are made from the combination of elementary embeddings.
The layer inherits from keras.layers.Embedding.
It expects a tensor with a shape following the structure:
- axis -2: sequence axis, with dimension S / T
- axis -1: token axis, with dimension T
The T values in the token axis are the indexes of the embeddings to be combined.
Typically, these are byte values:
import tensorflow as tf
import mlable.layers.embedding

__x = tf.random.uniform((128, 1024, 16), minval=0, maxval=256, dtype=tf.int32)
__l = mlable.layers.embedding.TokunEmbedding(
    input_dim=256,
    output_dim=128,)
list(__l(__x).shape)
# [128, 1024, 2048]
The output tensor has a shape (..., S / T, T * E), where T * E = H is the embedding dimension inside the LLM.
In the above example, T = 16 and E = 128, so H is 2048.
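The shape relation can be reproduced with a small pure-Python sketch. It assumes that the elementary embeddings are combined by concatenation, which is consistent with the T * E output dimension but is not confirmed here; `composite_embedding` and the toy sizes are hypothetical.

```python
import random

def composite_embedding(tokens, table):
    """Concatenate the embeddings of the T values of each token.

    tokens: list of S / T tokens, each a list of T integer indexes.
    table: list of `input_dim` vectors, each of length E.
    Returns a list of S / T vectors, each of length T * E.
    """
    return [[x for index in token for x in table[index]] for token in tokens]

random.seed(0)
E, T = 4, 3  # toy sizes, much smaller than in the example above
table = [[random.random() for _ in range(E)] for _ in range(256)]
tokens = [[0, 0, 97], [0, 0, 98]]  # two tokens of T byte values
outputs = composite_embedding(tokens, table)
print(len(outputs), len(outputs[0]))
# 2 12
```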
RotaryPositionalEmbedding
TensorFlow implementation of RoPE:
import tensorflow as tf
import mlable.layers.embedding

__x = tf.ones(shape=(2, 3, 5))
__l = mlable.layers.embedding.RotaryPositionalEmbedding(
    sequence_axis=1, # position along this axis
    feature_axis=-1, # output axis
    max_wavelength=10_000, # see the paper
    scaling_factor=1.) # see the paper
__l(inputs=__x, offset=2) # the offset is typically used to perform iterative decoding during inference
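For readers unfamiliar with RoPE, here is a minimal pure-Python sketch of the rotation applied to a single feature vector. It assumes the "split halves" pairing of features and an even feature dimension, and it leaves out `scaling_factor`; the layer may use a different convention. The function names are hypothetical.

```python
import math

def rope(vector, position, max_wavelength=10_000.0):
    """Rotate pairs of features by an angle that depends on the position."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("this sketch needs an even feature dimension")
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # frequencies decrease geometrically with the pair index
        angle = position / (max_wavelength ** (2 * i / dim))
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = vector[i], vector[i + half]
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x1 * sin + x2 * cos
    return out

def rope_sequence(sequence, offset=0):
    """Apply the rotation to each vector, positions starting at `offset`."""
    return [rope(v, p + offset) for p, v in enumerate(sequence)]

rotated = rope_sequence([[1.0, 2.0, 3.0, 4.0]], offset=2)
```

Each pair is rotated without changing its norm, and a non-zero `offset` places the first vector at a later position, as happens when decoding one token at a time.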
CachedMultiHeadAttention
This layer subclasses the regular MultiHeadAttention and adds a cache.
It has the same parameters:
import mlable.layers.transformer

mlable.layers.transformer.CachedMultiHeadAttention(
    num_heads,
    key_dim,
    value_dim=None,
    dropout=0.0,
    use_bias=True,
    output_shape=None,
    attention_axes=None,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs)
And its call function has the following arguments:
mlable.layers.transformer.CachedMultiHeadAttention.call(
    query,
    value,
    key=None,
    cache=None,
    step=None,
    training=False,
    attention_mask=None,
    return_attention_scores=False,
    use_causal_mask=True,)
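The role of `cache` and `step` can be pictured with a toy example. The sketch below is an assumption about how such a cache is commonly used (a fixed-length buffer where the entry for the current position is written at index `step`); it does not reproduce the layer's internals, and `update_cache` is a hypothetical helper.

```python
def update_cache(cache, step, new_entry):
    """Write the entry for position `step` into a fixed-length cache."""
    updated = list(cache)  # copy, so the caller's cache is untouched
    updated[step] = new_entry
    return updated

# hypothetical decoding loop: one key vector per generated position
max_length = 4
key_cache = [None] * max_length
for step in range(3):
    new_key = [float(step)] * 2  # stand-in for the projected key at `step`
    key_cache = update_cache(key_cache, step, new_key)

print(key_cache)
# [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], None]
```

With such a cache, each decoding step only needs to compute the key and value for the newest position instead of the whole sequence.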
FeedForwardGate
A typical feed-forward layer with GELU activation:
import tensorflow as tf
import mlable.layers.transformer

__x = tf.ones(shape=(2, 3, 5), dtype=tf.dtypes.float32)
__l = mlable.layers.transformer.FeedForwardGate(
    input_dim=256,
    hidden_dim=1024)
__l(__x)
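As a hedged illustration, a gated feed-forward block with GELU is often written as an output projection applied to `gelu(W_gate x) * (W_linear x)`. The sketch below assumes this formulation, without biases, because the layer name suggests a gate; the actual layer may differ. All names and weights are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU, using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def feed_forward_gate(x, w_gate, w_linear, w_output):
    gate = [gelu(h) for h in matvec(w_gate, x)]     # non-linear branch
    linear = matvec(w_linear, x)                    # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]  # element-wise gating
    return matvec(w_output, hidden)                 # project back

# toy sizes: input dimension 2, hidden dimension 3
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_linear = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_output = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
y = feed_forward_gate([1.0, 2.0], w_gate, w_linear, w_output)
print(len(y))
# 2
```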
Metrics
Hierarchical models should not be scored on individual predictions but on their combination.
For example, tokun is a byte-level autoencoder.
It predicts probabilities for each byte of the output, like 0 in the UTF-32-BE encoding of "a" (0, 0, 0, 97).
A prediction of (0, 0, 0, 98) for "a" has 3 correct bytes out of 4, but the prediction is actually "b".
In this case the byte accuracy is 75% while the character accuracy is 0%. Having several hierarchies of scoring helps with training and evaluation.
The individual predictions are evaluated in groups forming logical entities. These predictions can be in binary, categorical or raw formats. Each of these formats has a dedicated metric.
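The byte versus character example can be checked with a few lines of plain Python. This is a sketch of the idea on already decoded values, not the code of the metrics; `group_accuracy` is a hypothetical helper.

```python
def group_accuracy(y_true, y_pred, group):
    """Fraction of groups of `group` consecutive values that match exactly."""
    assert len(y_true) == len(y_pred) and len(y_true) % group == 0
    starts = range(0, len(y_true), group)
    hits = [y_true[i:i + group] == y_pred[i:i + group] for i in starts]
    return sum(hits) / len(hits)

y_true = list("a".encode("utf-32-be"))  # [0, 0, 0, 97]
y_pred = list("b".encode("utf-32-be"))  # [0, 0, 0, 98]
print(group_accuracy(y_true, y_pred, group=1))  # 0.75, byte accuracy
print(group_accuracy(y_true, y_pred, group=4))  # 0.0, character accuracy
```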
BinaryGroupAccuracy
Arguments:
- group: the number of elementary predictions that must be correct to predict a higher level entity
- depth: the dimension of the binary embedding for each predicted value
- threshold: probabilities below the threshold are scored as 0 and above 1
import mlable.metrics
byte_accuracy = mlable.metrics.BinaryGroupAccuracy(group=1, depth=8, threshold=0.6, name='byte_accuracy')
character_accuracy = mlable.metrics.BinaryGroupAccuracy(group=4, depth=8, threshold=0.6, name='character_accuracy')
token_accuracy = mlable.metrics.BinaryGroupAccuracy(group=64, depth=8, threshold=0.6, name='token_accuracy')
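To make the three arguments concrete, here is a pure-Python sketch of how binary predictions could be scored. It assumes each value is encoded on `depth` bits and that an entity spans `group * depth` consecutive bits; the helpers are hypothetical and the metric itself may handle shapes differently.

```python
def to_bits(value, depth=8):
    """Big-endian binary encoding of an integer on `depth` bits."""
    return [(value >> i) & 1 for i in reversed(range(depth))]

def binary_group_accuracy(y_true, y_pred, group=4, depth=8, threshold=0.5):
    """y_true: flat list of bits, y_pred: flat list of probabilities."""
    bits = [1 if p > threshold else 0 for p in y_pred]
    size = group * depth  # number of bits per scored entity
    starts = range(0, len(y_true), size)
    hits = [y_true[i:i + size] == bits[i:i + size] for i in starts]
    return sum(hits) / len(hits)

# target "a" = (0, 0, 0, 97), confident prediction of "b" = (0, 0, 0, 98)
true_bits = [b for v in (0, 0, 0, 97) for b in to_bits(v)]
pred_probs = [0.9 if b else 0.1 for v in (0, 0, 0, 98) for b in to_bits(v)]

print(binary_group_accuracy(true_bits, pred_probs, group=1, threshold=0.6))  # 0.75
print(binary_group_accuracy(true_bits, pred_probs, group=4, threshold=0.6))  # 0.0
```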
CategoricalGroupAccuracy
Arguments:
- group: the number of elementary predictions that must be correct to predict a higher level entity
import mlable.metrics
byte_accuracy = mlable.metrics.CategoricalGroupAccuracy(group=1, name='byte_accuracy')
character_accuracy = mlable.metrics.CategoricalGroupAccuracy(group=4, name='character_accuracy')
token_accuracy = mlable.metrics.CategoricalGroupAccuracy(group=64, name='token_accuracy')
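For categorical predictions, each elementary prediction is a probability vector. A minimal sketch, assuming the predicted class is the argmax of that vector and that the targets are given as class indexes (both are assumptions, and the helper name is hypothetical):

```python
def categorical_group_accuracy(y_true, y_pred, group=4):
    """y_true: list of class indexes, y_pred: list of probability vectors."""
    # decode each probability vector into its most likely class
    decoded = [max(range(len(p)), key=p.__getitem__) for p in y_pred]
    starts = range(0, len(y_true), group)
    hits = [y_true[i:i + group] == decoded[i:i + group] for i in starts]
    return sum(hits) / len(hits)

# toy problem with 4 classes and 4 elementary predictions
y_true = [0, 1, 2, 3]
y_pred = [
    [0.7, 0.1, 0.1, 0.1],  # -> 0
    [0.1, 0.6, 0.2, 0.1],  # -> 1
    [0.2, 0.2, 0.5, 0.1],  # -> 2
    [0.1, 0.1, 0.6, 0.2],  # -> 2, wrong
]
print(categorical_group_accuracy(y_true, y_pred, group=1))  # 0.75
print(categorical_group_accuracy(y_true, y_pred, group=2))  # 0.5
```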
RawGroupAccuracy
Arguments:
- group: the number of elementary predictions that must be correct to predict a higher level entity
- factor: scaling factor, typically from a probability distribution to a numeric value
import mlable.metrics
byte_accuracy = mlable.metrics.RawGroupAccuracy(group=1, factor=256.0, name='byte_accuracy')
character_accuracy = mlable.metrics.RawGroupAccuracy(group=4, factor=256.0, name='character_accuracy')
token_accuracy = mlable.metrics.RawGroupAccuracy(group=64, factor=256.0, name='token_accuracy')
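The role of `factor` can be sketched as follows, assuming that raw predictions lie in [0, 1] and are scaled then rounded to the nearest integer before comparison, and that the targets are already integers. These are assumptions about the convention; the helper name is hypothetical.

```python
def raw_group_accuracy(y_true, y_pred, group=4, factor=256.0):
    """y_true: list of integers, y_pred: list of floats in [0, 1]."""
    decoded = [round(p * factor) for p in y_pred]  # e.g. 0.379 * 256 -> 97
    starts = range(0, len(y_true), group)
    hits = [y_true[i:i + group] == decoded[i:i + group] for i in starts]
    return sum(hits) / len(hits)

# targets "a" then "b"; the model outputs "a" twice
y_true = [0, 0, 0, 97, 0, 0, 0, 98]
y_pred = [0.0, 0.0, 0.0, 0.379, 0.0, 0.0, 0.0, 0.379]
print(raw_group_accuracy(y_true, y_pred, group=1))  # 0.875
print(raw_group_accuracy(y_true, y_pred, group=4))  # 0.5
```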
Credits
Andrej Karpathy reconnected my ML synapses with micrograd.
License
Licensed under the aGPLv3.