A simple neural network library written in Python

Enilnets Library Documentation

A pure NumPy-based deep learning library with support for dense, convolutional, pooling, batch normalization, dropout, layer normalization, embedding, upsampling, global pooling, and sparse layers. Includes multiple optimizers, loss functions, activation functions, weight initialization methods, learning rate schedulers, reinforcement learning (REINFORCE, PPO, Actor-Critic), and a full generative AI framework.

Quick Start
Installation & Project Structure
Core Architecture
Layer Types
Forward Pass
Backward Pass
Optimizers
Loss Functions
Activation Functions
Weight Initialization
Training Utilities
Model Utilities
Reinforcement Learning
Generative AI Framework
Sampling Utilities
Model I/O
Known Limitations
Version History

Quick Start

Discriminative Example

Build and train a classifier on flat data:

from Enilnets import NeuralNet, LRScheduler
import numpy as np

model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01)
model.add_dense(784, 256, activation="relu")
model.add_dropout(0.3)
model.add_dense(256, 10, activation="softmax")

X_train = np.random.randn(1000, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 1000)]

# With learning rate scheduler
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=50)
history = model.Train(X_train, Y_train, epochs=50, batch_size=32, scheduler=scheduler)

Generative Example (VAE)

Train a Variational Autoencoder on image-like data:

from Enilnets import VAE
import numpy as np

vae = VAE(input_dim=784, latent_dim=32,
          encoder_hidden=[512, 256], decoder_hidden=[256, 512],
          learning_rate=0.001, optimizer="adam")

X_train = np.random.rand(1000, 784)
history = vae.Train(X_train, epochs=20, batch_size=64)
generated = vae.generate(n_samples=16)

Reinforcement Learning (PPO)

Train a policy network with Proximal Policy Optimization:

from Enilnets import NeuralNet
import numpy as np

policy = NeuralNet(learning_rate=3e-4, optimizer="adam")
policy.add_dense(4, 64, activation="tanh")
policy.add_dense(64, 2, activation="softmax")

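# Placeholder rollout data; in practice these come from interacting with an environment.
# The shapes below are assumptions for a 4-feature state and 2 discrete actions.
states = np.random.randn(32, 4)
actions = np.random.randint(0, 2, size=32)
old_log_probs = np.log(np.full(32, 0.5))
advantages = np.random.randn(32)
policy.PPO(states, actions, old_log_probs, advantages, action_type="discrete")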

Installation & Project Structure

The library is organized as a Python package with the following module layout:

Enilnets/
|-- __init__.py          # Package entry point: exports NeuralNet, LRScheduler, generative classes
|-- base.py              # NeuralNet class definition + method binding
|-- layers.py            # Layer factory functions (add_dense, add_conv2d, etc.)
|-- forward.py           # Forward pass implementation + im2col + normalization
|-- backward.py          # Backpropagation for all layer types
|-- optimizer.py         # Gradient update rules (SGD, Adam, RMSprop, Adagrad)
|-- loss.py              # Loss function implementations
|-- activations.py       # Activation functions and their derivatives
|-- weight_init.py       # Weight initialization strategies
|-- train.py             # Training loop, metrics, LRScheduler
|-- io.py                # Model save/load (JSON & Pickle)
|-- reinforce.py         # RL algorithms: Evolve, REINFORCE, PPO, ActorCritic
|-- generative/
|   |-- __init__.py      # Exports all generative classes and utilities
|   |-- vae.py           # Variational Autoencoder
|   |-- gan.py           # Generative Adversarial Network
|   |-- diffusion.py     # Denoising Diffusion Probabilistic Model
|   |-- autoregressive.py # MADE-style autoregressive model
|   |-- flows.py         # RealNVP normalizing flow
|   |-- ebm.py           # Energy-Based Model
|   |-- unet.py          # UNet architecture for diffusion
|   |-- sampling.py      # Sampling utilities (reparameterize, Gumbel, etc.)
|   |-- generative_loss.py # Loss functions for generative models

All layer addition methods, forward/backward passes, optimizers, loss functions, training methods, I/O, and RL methods are dynamically bound to the NeuralNet class at import time via monkey-patching in base.py. This allows each submodule to remain focused while the user interacts with a single unified API.
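The binding pattern described above can be sketched in a few lines. This is only an illustration of the general technique, not the contents of base.py: `TinyNet`, `bind_methods`, and this simplified `add_dense` are hypothetical names introduced for the example.

```python
class TinyNet:
    """Hypothetical stand-in for a class whose methods live in other modules."""

    def __init__(self):
        self.layers = []


def add_dense(self, n_in, n_out, activation="relu"):
    # In a real package this function would live in its own module.
    self.layers.append({"type": "dense", "n_in": n_in, "n_out": n_out,
                        "activation": activation})
    return self


def bind_methods(cls, functions):
    # Attach plain functions to the class so they behave like regular methods.
    for fn in functions:
        setattr(cls, fn.__name__, fn)


bind_methods(TinyNet, [add_dense])

net = TinyNet().add_dense(4, 8).add_dense(8, 2, activation="softmax")
layer_types = [layer["type"] for layer in net.layers]
```

Because the functions are attached with `setattr`, they receive the instance as `self` exactly like methods defined in the class body.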

Core Architecture

NeuralNet Class Overview

The NeuralNet class in base.py is the central abstraction. It stores everything needed to define, train, and evaluate a neural network entirely in NumPy.

Attribute	Type	Description
`layers`	`list[dict]`	Layer definitions with weights, biases, and hyperparameters. Each layer is a dictionary containing its type-specific parameters.
`learning_rate`	`float`	Global learning rate used by all optimizers.
`optimizer_type`	`str`	Optimizer name: `"sgd"`, `"rmsprop"`, `"adagrad"`, `"adam"`.
`l2_lambda`	`float`	L2 regularization coefficient applied to weight gradients.
`momentum`	`float`	Momentum coefficient for SGD optimizer.
`outputs`	`list[ndarray]`	Cached layer outputs during the most recent forward pass. `outputs[0]` is the input, `outputs[i]` is the output of layer `i-1`.
`pre_activations`	`list[ndarray]`	Cached pre-activation values (z = Wx + b) for layers that have activations. Used during backprop for computing derivatives.
`batchnorm_cache`	`list`	BatchNorm statistics cache storing `(x, x_norm, mean, var, gamma, epsilon, axes)` for each BatchNorm layer during training.
`layernorm_cache`	`list`	LayerNorm statistics cache storing `(x, x_norm, mean, var, gamma, epsilon, axes)` for each LayerNorm layer.
`deltas`	`list[ndarray]`	Gradient error terms per layer, computed during backpropagation.
`opt_state`	`list[dict]`	Optimizer state (momentum, velocity, squared gradients) for each trainable layer. Lazily initialized on first `update()` call.
`t`	`int`	Global timestep counter, incremented on every `update()` call. Used for Adam bias correction.
`training`	`bool`	Training mode flag. Affects BatchNorm, Dropout, and layer behavior.

Internal Data Flow

Layer Definition: The user calls add_* methods which append dictionaries to self.layers. Each dictionary stores weights, biases, and type-specific metadata.
Forward Pass: Forward(inputs) iterates through self.layers, computes each layer's output, and caches results in self.outputs, self.pre_activations, self.batchnorm_cache, and self.layernorm_cache.
Loss Computation: ComputeLoss(output, target) computes the scalar loss value.
Backward Pass: Backward(targets) computes error gradients (self.deltas) by propagating from the output layer back to the input, using cached pre-activations and layer-specific backward functions.
Parameter Update: update() computes weight/bias gradients from self.deltas and self.outputs, applies L2 regularization, and updates parameters using the chosen optimizer.

Training vs Evaluation Mode

Training mode (training=True, set via .train()): BatchNorm uses batch statistics and updates running averages. Dropout randomly zeros neurons. All caches are populated.
Evaluation mode (training=False, set via .eval()): BatchNorm uses running statistics (no cache). Dropout is disabled (identity pass). Caches are not populated.

Layer Types

Dense Layer

A fully connected (affine) layer: output = activation(W @ input + b).

model.add_dense(n_in, n_out, activation="relu", init_method="xavier_uniform", use_bias=True)

Parameter	Type	Default	Description
`n_in`	`int`	required	Number of input features.
`n_out`	`int`	required	Number of output features (neurons).
`activation`	`str`	`"relu"`	Activation function name (see Activation Functions).
`init_method`	`str`	`"xavier_uniform"`	Weight initialization strategy (see Weight Initialization).
`use_bias`	`bool`	`True`	Whether to include a bias vector. If `False`, bias is zeros and not updated.

Stored in layer dict: "type": "dense", "weights" (shape (n_out, n_in)), "bias" (shape (n_out,)), "activation", "use_bias".

Sparse Layer

A dense layer with a fixed random connectivity mask. Only a fraction of weights are non-zero and trainable.

model.add_sparse(n_in, n_out, connectivity=0.5, activation="relu", init_method="xavier_uniform")

Parameter	Type	Default	Description
`n_in`	`int`	required	Number of input features.
`n_out`	`int`	required	Number of output features.
`connectivity`	`float`	`0.5`	Fraction of weights to keep (0 to 1). A mask is generated randomly and fixed for the layer's lifetime.
`activation`	`str`	`"relu"`	Activation function name.
`init_method`	`str`	`"xavier_uniform"`	Weight initialization strategy.

Stored in layer dict: "type": "sparse", "weights", "bias", "mask" (binary matrix, same shape as weights), "activation". During forward and backward passes, the mask is applied to zero out masked weights. During updates, gradients are also masked.
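A minimal NumPy sketch of how a fixed connectivity mask can be applied in the forward pass and to the weight gradient. It assumes the `(n_out, n_in)` weight layout used for dense layers; the variable names and the plain SGD step are illustrative, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, connectivity = 6, 4, 0.5

# Fixed binary mask: 1 keeps a connection, 0 removes it for the layer's lifetime.
mask = (rng.random((n_out, n_in)) < connectivity).astype(np.float64)
weights = rng.standard_normal((n_out, n_in)) * mask
bias = np.zeros(n_out)

x = rng.standard_normal((3, n_in))        # batch of 3 samples
z = x @ (weights * mask).T + bias         # forward pass uses masked weights

delta = rng.standard_normal((3, n_out))   # stand-in for the layer's error term
grad_w = (delta.T @ x) * mask             # masked gradient: pruned weights get 0

weights -= 0.1 * grad_w                   # plain SGD step, for illustration only
```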

Convolutional Layer (Conv2D)

A 2D convolution with no padding (valid convolution). Uses im2col for efficient matrix multiplication.

model.add_conv2d(in_ch, out_ch, k, activation="relu", init_method="he_normal", stride=1)

Parameter	Type	Default	Description
`in_ch`	`int`	required	Number of input channels.
`out_ch`	`int`	required	Number of output channels (filters).
`k`	`int`	required	Kernel size (square kernel `k x k`).
`activation`	`str`	`"relu"`	Activation function name.
`init_method`	`str`	`"he_normal"`	Weight initialization strategy.
`stride`	`int`	`1`	Stride (stored but currently only stride=1 is fully supported by `im2col`).

Input shape: (batch, in_ch, H, W)
Output shape: (batch, out_ch, H-k+1, W-k+1)
Stored in layer dict: "type": "conv2d", "weights" (shape (out_ch, in_ch, k, k)), "bias" (shape (out_ch,)), "in_ch", "out_ch", "k", "activation", "stride".

Note: There is no padding support. Each conv2d layer shrinks each spatial dimension by k-1 in total (H becomes H-k+1 and W becomes W-k+1).
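Since spatial size shrinks with every valid convolution, it helps to compute the feature-map size before stacking layers. The helper below is a hypothetical utility (not part of the library) that applies the H-k+1 rule repeatedly, assuming stride 1.

```python
def conv2d_output_hw(h, w, k, n_layers=1):
    """Spatial size after n_layers valid (no-padding), stride-1 convolutions."""
    for _ in range(n_layers):
        h, w = h - k + 1, w - k + 1
        if h <= 0 or w <= 0:
            raise ValueError("kernel is larger than the remaining feature map")
    return h, w


one_layer = conv2d_output_hw(28, 28, k=3)                 # (26, 26)
three_layers = conv2d_output_hw(28, 28, k=5, n_layers=3)  # (16, 16)
```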

Flatten Layer

Reshapes a multi-dimensional tensor into a 2D matrix (batch, -1).

model.add_flatten()

Stored in layer dict: "type": "flatten". No parameters. Used to transition from conv layers to dense layers.

Max Pooling 2D

Downsamples by taking the maximum value in each p x p non-overlapping window.

model.add_maxpool2d(pool_size=2)

Parameter	Type	Default	Description
`pool_size`	`int`	`2`	Size of the pooling window.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p) (dimensions are truncated to multiples of p).
Stored in layer dict: "type": "maxpool2d", "p".

Backward pass: Uses a strided view to identify maxima and distributes gradients only to the max positions within each window.
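The gradient routing can be illustrated with a small NumPy sketch. For simplicity it reshapes the input into blocks rather than using the strided view mentioned above, and the function names are assumptions for this example. If several elements tie for the maximum, this sketch sends the gradient to all of them.

```python
import numpy as np

def maxpool2d_forward(x, p):
    n, c, h, w = x.shape
    h_out, w_out = h // p, w // p
    x = x[:, :, :h_out * p, :w_out * p]            # truncate to multiples of p
    blocks = x.reshape(n, c, h_out, p, w_out, p)
    return blocks.max(axis=(3, 5)), blocks

def maxpool2d_backward(grad_out, blocks, out):
    # True where an element equals its window's maximum.
    is_max = blocks == out[:, :, :, None, :, None]
    grad_blocks = is_max * grad_out[:, :, :, None, :, None]
    n, c, h_out, p, w_out, _ = blocks.shape
    return grad_blocks.reshape(n, c, h_out * p, w_out * p)

x = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
out, blocks = maxpool2d_forward(x, p=2)
grad_in = maxpool2d_backward(np.ones_like(out), blocks, out)
```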

Average Pooling 2D

Downsamples by taking the mean value in each p x p non-overlapping window.

model.add_avgpool2d(pool_size=2)

Parameter	Type	Default	Description
`pool_size`	`int`	`2`	Size of the pooling window.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p).
Stored in layer dict: "type": "avgpool2d", "p".

Backward pass: Distributes gradient evenly across all positions in each p x p window.

Global Average Pooling 2D

Computes the mean across spatial dimensions (H, W), reducing (batch, C, H, W) to (batch, C, 1, 1).

model.add_global_avgpool2d()

Stored in layer dict: "type": "globalavgpool2d". No parameters.

Backward pass: Distributes the incoming gradient evenly across all spatial positions.

Upsampling 2D

Nearest-neighbor upsampling by repeating rows and columns.

model.add_upsample2d(scale_factor=2)

Parameter	Type	Default	Description
`scale_factor`	`int`	`2`	Factor by which to repeat each spatial dimension.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H*scale, W*scale).
Stored in layer dict: "type": "upsample2d", "scale_factor".

Backward pass: Sums gradients from the repeated positions back to the original positions.

Batch Normalization

Normalizes activations across the batch dimension. Supports 2D (batch, features) and 4D (batch, C, H, W) inputs.

model.add_batchnorm(num_features, epsilon=1e-5, momentum=0.1)

Parameter	Type	Default	Description
`num_features`	`int`	required	Number of features/channels to normalize.
`epsilon`	`float`	`1e-5`	Small constant for numerical stability.
`momentum`	`float`	`0.1`	Momentum for updating running statistics: `running_stat = (1 - momentum) * running_stat + momentum * batch_stat`.

Stored in layer dict: "type": "batchnorm", "num_features", "epsilon", "momentum", "running_mean", "running_var", "gamma" (scale, initialized to 1), "beta" (shift, initialized to 0).

Training: Computes batch mean and variance, normalizes, applies gamma/beta, updates running statistics, and stores a cache for backward.
Evaluation: Uses running mean and variance, no cache stored.
Backward: Computes gradients w.r.t. input, gamma, and beta using the cached statistics.
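The training-mode computation for a 2D input can be sketched as follows. This is an illustrative function, not the library's implementation; it assumes the biased (population) variance that `np.var` returns by default and the momentum convention given in the parameter table.

```python
import numpy as np

def batchnorm_train_forward(x, gamma, beta, running_mean, running_var,
                            epsilon=1e-5, momentum=0.1):
    # 2D input (batch, features): statistics are taken over the batch axis.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + epsilon)
    out = gamma * x_norm + beta
    # Exponential moving average of the batch statistics.
    new_running_mean = (1 - momentum) * running_mean + momentum * mean
    new_running_var = (1 - momentum) * running_var + momentum * var
    return out, new_running_mean, new_running_var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))
out, run_mean, run_var = batchnorm_train_forward(
    x, gamma=np.ones(5), beta=np.zeros(5),
    running_mean=np.zeros(5), running_var=np.ones(5))
```

With `gamma=1` and `beta=0`, each output feature has approximately zero mean and unit variance over the batch, while the running statistics move only 10% of the way toward the batch statistics.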

Layer Normalization

Normalizes across the feature dimension(s) independently for each sample. Supports 2D and 4D inputs.

model.add_layernorm(normalized_shape, epsilon=1e-5)

Parameter	Type	Default	Description
`normalized_shape`	`int` or `tuple`	required	Shape of the features to normalize. For 2D: an int (number of features). For 4D: a tuple like `(C, H, W)`.
`epsilon`	`float`	`1e-5`	Small constant for numerical stability.

Stored in layer dict: "type": "layernorm", "normalized_shape", "epsilon", "gamma", "beta".

Unlike BatchNorm, LayerNorm has no running statistics. It always computes mean and variance on the fly. The backward pass computes dx, dgamma, and dbeta.

Dropout

Randomly zeros a fraction of activations during training for regularization.

model.add_dropout(rate=0.5)

Parameter	Type	Default	Description
`rate`	`float`	`0.5`	Fraction of neurons to drop (set to 0). Must be in `[0, 1)`.

Stored in layer dict: "type": "dropout", "rate", "mask" (binary mask created during forward pass in training mode).

Training: Each element is kept with probability 1-rate, and surviving elements are scaled by 1/(1-rate) (inverted dropout). The mask is stored for backward.
Evaluation: Identity pass, no masking.
Backward: Multiplies incoming gradient by the stored mask and the same scaling factor.
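Inverted dropout as described above, in a minimal NumPy sketch. The function names and the explicit `rng` argument are assumptions made for this example.

```python
import numpy as np

def dropout_forward(x, rate, rng):
    keep = 1.0 - rate
    mask = (rng.random(x.shape) < keep).astype(x.dtype)
    # Scaling by 1/keep keeps the expected activation unchanged.
    return x * mask / keep, mask

def dropout_backward(grad_out, mask, rate):
    return grad_out * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out, mask = dropout_forward(x, rate=0.3, rng=rng)
grad_in = dropout_backward(np.ones_like(out), mask, rate=0.3)
```

Because of the 1/(1-rate) scaling, the mean of `out` stays close to the mean of `x`, which is why no rescaling is needed in evaluation mode.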

Embedding Layer

A lookup table that maps integer token indices to dense vectors.

model.add_embedding(vocab_size, embed_dim, init_method="normal")

Parameter	Type	Default	Description
`vocab_size`	`int`	required	Number of unique tokens in the vocabulary.
`embed_dim`	`int`	required	Dimension of each embedding vector.
`init_method`	`str`	`"normal"`	Initialization strategy.

Stored in layer dict: "type": "embedding", "weights" (shape (vocab_size, embed_dim)), "vocab_size", "embed_dim".

Input: Integer array of shape (batch, seq_len) or (batch,) (1D is reshaped to (batch, 1)).
Output: Embedding vectors of shape (batch, seq_len, embed_dim).
Backward: Sparse gradient -- only the rows corresponding to seen indices are updated. The _last_input key stores the input indices for gradient computation.
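The lookup and its sparse gradient can be sketched with NumPy fancy indexing. This is an illustration under the documented shapes, not the library's code; `np.add.at` is used so that indices appearing more than once accumulate their gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
weights = rng.normal(0.0, 0.1, size=(vocab_size, embed_dim))

indices = np.array([[1, 3, 3], [7, 1, 0]])   # (batch, seq_len)
out = weights[indices]                        # (batch, seq_len, embed_dim)

grad_out = np.ones_like(out)                  # stand-in upstream gradient
grad_w = np.zeros_like(weights)
# Unbuffered accumulation: repeated indices add up instead of overwriting.
np.add.at(grad_w, indices, grad_out)
```

Rows for tokens that never appear in the batch keep a zero gradient, which is what makes the update sparse.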

Forward Pass

Input Handling

The Forward(self, inputs, training=False, dropout_rate=0.0) method handles input normalization:

1D input (features,) -> reshaped to (1, features) (single sample batch).
3D input (C, H, W) -> reshaped to (1, C, H, W) (single image batch).
2D input (batch, features) and 4D input (batch, C, H, W) are used as-is.

All inputs are cast to np.float64 for numerical stability.

Layer-by-Layer Computation

The forward pass iterates through self.layers in order. For each layer:

Dense/Sparse: z = x @ W.T + b, then x = activation(z). Pre-activation z is cached.
Conv2D: Uses im2col to unfold the input into columns, performs matrix multiplication with flattened kernels, reshapes back, adds bias, then applies activation.
Flatten: Reshapes to (batch, -1).
MaxPool2D: Strided view into p x p blocks, takes max along the block axes.
AvgPool2D: Same strided view, takes mean.
GlobalAvgPool2D: Mean over axes (2, 3) with keepdims=True.
Upsample2D: x.repeat(scale, axis=2).repeat(scale, axis=3).
BatchNorm: Normalizes using batch stats (training) or running stats (eval), then applies gamma * x_norm + beta.
LayerNorm: Normalizes per-sample using feature stats, then applies gamma * x_norm + beta.
Dropout: Random mask with inverted scaling during training; identity during eval.
Embedding: Integer index lookup into the weight matrix.

After each layer, the output is appended to self.outputs, pre-activations to self.pre_activations, and normalization caches to self.batchnorm_cache / self.layernorm_cache.

im2col for Convolutions

The im2col(input_data, filter_h, filter_w, stride=1, pad=0) function converts image batches into column matrices suitable for efficient matrix multiplication:

Pads the input with zeros if pad > 0.
Uses NumPy's as_strided to create a view where each receptive field is a row.
Transposes and reshapes to (N * out_h * out_w, C * filter_h * filter_w).

This allows convolutions to be computed as a single large matrix multiplication: output = col @ W_flat.T, which is significantly faster than nested loops in pure NumPy.
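A compact sketch of the im2col idea for stride 1 and no padding. It substitutes NumPy's `sliding_window_view` (available from NumPy 1.20) for a hand-written `as_strided` call, and `im2col_sketch` and `conv2d_forward` are hypothetical names, so treat it as an illustration rather than the library's function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def im2col_sketch(x, k):
    # x: (N, C, H, W) -> (N * out_h * out_w, C * k * k)
    n, c, h, w = x.shape
    windows = sliding_window_view(x, (k, k), axis=(2, 3))  # (N, C, out_h, out_w, k, k)
    out_h, out_w = windows.shape[2], windows.shape[3]
    cols = windows.transpose(0, 2, 3, 1, 4, 5).reshape(n * out_h * out_w, c * k * k)
    return cols, out_h, out_w

def conv2d_forward(x, weights, bias):
    out_ch, _, k, _ = weights.shape
    cols, out_h, out_w = im2col_sketch(x, k)
    out = cols @ weights.reshape(out_ch, -1).T + bias       # one large matmul
    return out.reshape(x.shape[0], out_h, out_w, out_ch).transpose(0, 3, 1, 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 6, 6))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
y = conv2d_forward(x, w, b)

# Reference value for one output element, computed directly.
ref = np.sum(x[1, :, 2:5, 1:4] * w[2]) + b[2]
```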

Backward Pass

Automatic Delta Computation

Backward(self, targets=None, output_delta=None) supports two modes:

Mode 1: Automatic (targets provided)

model.Backward(ys)

If the last layer uses "softmax" activation, the delta is computed as (out - targets) / batch_size (the combined softmax + cross-entropy gradient simplification).
Otherwise, delta = (out - targets) * derivative(activation, pre_activation) / batch_size.
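The softmax shortcut can be checked numerically. The sketch below compares `(out - targets) / batch_size` against a finite-difference gradient of the mean cross-entropy with respect to the logits; it is a standalone verification and does not call the library.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(z, targets):
    return -np.sum(targets * np.log(softmax(z))) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 3))                   # pre-activations (logits)
targets = np.eye(3)[rng.integers(0, 3, size=5)]   # one-hot labels

analytic = (softmax(z) - targets) / z.shape[0]    # the combined gradient

# Central finite differences of the loss with respect to each logit.
numeric = np.zeros_like(z)
eps = 1e-6
for idx in np.ndindex(*z.shape):
    step = np.zeros_like(z)
    step[idx] = eps
    numeric[idx] = (mean_cross_entropy(z + step, targets)
                    - mean_cross_entropy(z - step, targets)) / (2 * eps)
```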

Mode 2: Manual (output_delta provided)

model.Backward(None, output_delta=custom_delta)

Used in reinforcement learning and generative models where the output gradient is computed externally.
output_delta is reshaped to (batch, features) if 1D.

Per-Layer Gradient Propagation

After computing the output delta, the backward pass iterates from the second-to-last layer back to the first:

For each layer l (current) and l+1 (next), it computes the error err flowing into layer l based on the next layer's type:

Next Layer Type	Error Computation
`dense` / `sparse`	`err = next_delta @ W_next`
`flatten`	`err = next_delta.reshape(outputs[l+1].shape)`
`conv2d`	`err = conv2d_backward_input(next_delta, W_next, outputs[l+1].shape)` -- transposed convolution via im2col
`maxpool2d`	`err = maxpool2d_backward(next_delta, outputs[l+1], pool_size)` -- routes gradient to max positions
`avgpool2d`	`err = avgpool2d_backward(next_delta, outputs[l+1], pool_size)` -- distributes gradient evenly
`globalavgpool2d`	`err = globalavgpool2d_backward(next_delta, outputs[l+1])` -- distributes over spatial dims
`upsample2d`	`err = upsample2d_backward(next_delta, outputs[l+1], scale)` -- sums repeated positions
`dropout`	`err = next_delta * mask / (1 - rate)` (or identity if mask is None)
`batchnorm`	`err, dgamma, dbeta = batchnorm_backward(next_delta, cache)` -- stores `d_gamma`, `d_beta` on the layer dict
`layernorm`	`err, dgamma, dbeta = layernorm_backward(next_delta, cache)` -- stores `d_gamma`, `d_beta` on the layer dict
`embedding`	`err = next_delta` (gradient flows back to previous layer)

Then, if the current layer has an activation (dense, sparse, conv2d), the error is multiplied by the activation derivative evaluated at the pre-activation: self.deltas[l] = err * derivative(activation, pre_activation).

Important: For BatchNorm and LayerNorm, the backward pass requires that Forward(training=True) was called first to populate the caches. If a cache is None, a ValueError is raised.

Optimizers

All optimizers are implemented in optimizer.py and applied in the update() method. The optimizer is selected via the optimizer parameter in NeuralNet.__init__().

SGD with Momentum

model = NeuralNet(optimizer="sgd", learning_rate=0.01, momentum=0.9)

Update rule for weights (same structure for biases, gamma, beta):

velocity = momentum * velocity - learning_rate * gradient
weight += velocity

Momentum accumulates velocity in the direction of persistent gradients, helping escape shallow local minima and accelerating convergence in consistent gradient directions.

RMSprop

model = NeuralNet(optimizer="rmsprop", learning_rate=0.001)

Update rule:

v = 0.999 * v + 0.001 * gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)

RMSprop adapts the learning rate per parameter by dividing by a running average of squared gradients. This helps with non-stationary objectives and sparse gradients.

Adagrad

model = NeuralNet(optimizer="adagrad", learning_rate=0.01)

Update rule:

v += gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)

Adagrad accumulates all historical squared gradients. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Note that the learning rate naturally decays over time.

Adam with Bias Correction

model = NeuralNet(optimizer="adam", learning_rate=0.001)

Adam combines momentum (first moment) and RMSprop (second moment) with bias correction:

m = 0.9 * m + 0.1 * gradient        # first moment
v = 0.999 * v + 0.001 * gradient^2  # second moment
m_hat = m / (1 - 0.9^t)              # bias-corrected first moment
v_hat = v / (1 - 0.999^t)            # bias-corrected second moment
weight -= learning_rate * m_hat / (sqrt(v_hat) + 1e-8)

Here t is the global timestep, incremented on each update() call. Because m and v start at zero, the raw estimates are biased toward zero during early steps; dividing by (1 - 0.9^t) and (1 - 0.999^t) corrects for this.
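A single Adam step written out in NumPy, using the constants from the update rule above. `adam_step` is a hypothetical helper for illustration. One useful consequence of bias correction is visible at t=1: the step size is close to `learning_rate` for every parameter, regardless of gradient magnitude.

```python
import numpy as np

def adam_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    weight = weight - lr * m_hat / (np.sqrt(v_hat) + eps)
    return weight, m, v

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -4.0, 1e-3])
w1, m1, v1 = adam_step(w, g, m=np.zeros(3), v=np.zeros(3), t=1)
first_step = w - w1                   # roughly lr * sign(g)
```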

L2 Regularization

L2 regularization (weight decay) is applied to all weight gradients before the optimizer step:

grad_w = grad_w + l2_lambda * weights * mask

The mask term ensures that sparse layers only regularize their active connections. L2 regularization penalizes large weights, encouraging simpler models and reducing overfitting.

Optimizer state initialization: On the first call to update(), self.opt_state is created lazily, with zero-filled momentum/velocity buffers matching the shape of each layer's trainable parameters.

Loss Functions

All loss functions are implemented in loss.py via ComputeLoss(self, output, target, function="mse", reduction="mean", **kwargs).

Regression Losses

MSE (Mean Squared Error)

model.ComputeLoss(output, target, function="mse", reduction="mean")

loss = (output - target)^2

Standard regression loss. Penalizes large errors quadratically.

MAE (Mean Absolute Error)

model.ComputeLoss(output, target, function="mae", reduction="mean")

loss = |output - target|

More robust to outliers than MSE since errors are not squared.

Huber Loss

model.ComputeLoss(output, target, function="huber", delta=1.0, reduction="mean")

diff = |output - target|
loss = 0.5 * diff^2                if diff <= delta
loss = delta * (diff - 0.5*delta)  otherwise

Combines MSE for small errors and MAE for large errors. delta controls the transition point.
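A small NumPy sketch of the Huber formula with mean reduction, where the difference is taken as an absolute value. The `huber` function is a standalone illustration, not the library's ComputeLoss.

```python
import numpy as np

def huber(output, target, delta=1.0):
    diff = np.abs(output - target)
    quadratic = 0.5 * diff ** 2               # used where diff <= delta
    linear = delta * (diff - 0.5 * delta)     # used where diff > delta
    return np.mean(np.where(diff <= delta, quadratic, linear))

target = np.zeros(4)
small = huber(np.array([0.1, -0.2, 0.3, 0.0]), target)       # all errors below delta
large = huber(np.array([10.0, -10.0, 10.0, -10.0]), target)  # all errors above delta
```

For small errors the result equals half the MSE (0.0175 here); for large errors it grows linearly (9.5 here) instead of quadratically.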

Smooth L1 Loss

model.ComputeLoss(output, target, function="smooth_l1", reduction="mean")

Huber loss with delta=1.0 hardcoded. Commonly used in object detection.

Classification Losses

Binary Cross-Entropy

model.ComputeLoss(output, target, function="binary_cross_entropy", reduction="mean")

loss = -(target * log(output) + (1 - target) * log(1 - output))

For binary classification with sigmoid output. Output is clipped to [1e-12, 1-1e-12] for numerical stability.

Cross-Entropy / Categorical Cross-Entropy

model.ComputeLoss(output, target, function="cross_entropy", reduction="mean")

loss = -target * log(output)

For multi-class classification with softmax output. Output clipped to [1e-12, 1.0]. Supports "mean", "sum", and "none" (per-element) reduction.

Note: When reduction="mean", the loss is divided by the batch size. When "sum", it is summed. When "none", the raw per-element loss array is returned.

Focal Loss

model.ComputeLoss(output, target, function="focal", alpha=0.25, gamma=2.0, reduction="mean")

Down-weights easy examples and focuses on hard examples:

pt = output * target + (1 - output) * (1 - target)
loss = -(alpha * target * (1-pt)^gamma * log(output) + (1-alpha) * (1-target) * (1-pt)^gamma * log(1-output))

Useful for imbalanced datasets. alpha balances positive/negative examples, gamma focuses on hard examples.

Hinge Loss

model.ComputeLoss(output, target, function="hinge", reduction="mean")

loss = max(0, 1 - target * output)

For SVM-style classification. Target should be +1 or -1.

BCE with Logits (Numerically Stable)

model.ComputeLoss(output, target, function="bce_logits", reduction="mean")

loss = max(output, 0) - output * target + log(1 + exp(-|output|))

Computes binary cross-entropy directly from logits (pre-sigmoid values) without explicitly computing sigmoid, avoiding numerical issues for extreme values.
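The stable formula can be compared against the naive sigmoid-then-log computation. This is a standalone NumPy sketch with illustrative function names; `np.log1p(x)` is used as an equivalent of `log(1 + x)`.

```python
import numpy as np

def bce_with_logits(logits, target):
    return (np.maximum(logits, 0) - logits * target
            + np.log1p(np.exp(-np.abs(logits))))

def bce_naive(logits, target):
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

logits = np.array([-3.0, 0.0, 2.5])
target = np.array([0.0, 1.0, 1.0])
stable = bce_with_logits(logits, target)
naive = bce_naive(logits, target)

# At an extreme logit the stable form stays finite.
extreme = bce_with_logits(np.array([1000.0]), np.array([0.0]))
```

For moderate logits both versions agree. At a logit of 1000 with target 0, the naive version would evaluate `log(0)`, while the stable form returns 1000.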

Advanced Losses

Wasserstein Loss

model.ComputeLoss(output, target, function="wasserstein", reduction="mean")

loss = -output * target

Used in Wasserstein GANs. Target is +1 for real and -1 for fake.

Cosine Similarity Loss

model.ComputeLoss(output, target, function="cosine_similarity", reduction="mean")

loss = 1 - cos(output, target)

Measures the cosine distance between vectors. Useful for embedding learning and contrastive tasks.

Triplet Loss

model.ComputeLoss(output, target, function="triplet", margin=1.0, negative=neg_samples, reduction="mean")

d_pos = ||anchor - positive||^2
d_neg = ||anchor - negative||^2
loss = max(0, d_pos - d_neg + margin)

output is the anchor, target is the positive, and negative kwarg provides the negative samples. Used in metric learning (e.g., FaceNet).

NT-Xent (Normalized Temperature-scaled Cross Entropy)

model.ComputeLoss(output, target, function="ntxent", temperature=0.5, reduction="mean")

SimCLR contrastive loss. Computes pairwise cosine similarities, applies temperature scaling, and uses a cross-entropy formulation where positive pairs are on the diagonal.

Generative Losses

KL Divergence (for VAE)

model.ComputeLoss(output, target, function="kl_divergence", mu=mu, logvar=logvar, reduction="mean")

loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar), axis=-1)

KL divergence between the approximate posterior q(z|x) and the prior N(0, I). Used as a regularization term in VAEs.
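The closed-form KL term can be checked on two simple cases. The function below is a standalone sketch of the formula above and returns one value per sample.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Per-sample KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    return -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=-1)

# Posterior equal to the prior: the divergence is zero.
kl_zero = kl_to_standard_normal(np.zeros((2, 3)), np.zeros((2, 3)))

# Mean shifted by 1 in one dimension, unit variance: KL = 0.5 * 1^2.
kl_shifted = kl_to_standard_normal(np.array([[1.0, 0.0, 0.0]]), np.zeros((1, 3)))
```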

Activation Functions

All activations and their derivatives are implemented in activations.py.

ReLU Family

Name	Forward	Derivative	Notes
`relu`	`max(0, x)`	`1 if x > 0 else 0`	Most common default.
`leakyrelu`	`x if x > 0 else 0.01*x`	`1 if x > 0 else 0.01`	Small negative slope prevents dying ReLU.
`elu`	`x if x > 0 else exp(x)-1`	`1 if x > 0 else exp(x)`	Smooth negative region, mean closer to zero.
`selu`	`scale * (x if x>0 else alpha*(exp(x)-1))`	`scale * (1 if x>0 else alpha*exp(x))`	Self-normalizing; `alpha=1.6733`, `scale=1.0507`.

Sigmoid Family

Name	Forward	Derivative	Notes
`sigmoid`	`1 / (1 + exp(-x))`	`sigmoid(x) * (1 - sigmoid(x))`	Input is clipped to `[-500, 500]` to prevent overflow in `exp`.
`tanh`	`tanh(x)`	`1 - tanh(x)^2`	Zero-centered output.
`softmax`	`exp(x - max(x)) / sum(exp(x - max(x)))`	Handled specially in backward	Numerically stable via max subtraction.
`softplus`	`log(1 + exp(x))`	`sigmoid(x)`	Smooth approximation of ReLU.

Advanced Activations

Name	Forward	Derivative	Notes
`gelu`	`0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))`	`CDF + x*PDF`	Used in Transformer architectures.
`swish`	`x * sigmoid(x)`	`sigmoid(x) + x*sigmoid(x)*(1-sigmoid(x))`	Self-gated, smooth.
`mish`	`x * tanh(log(1 + exp(x)))`	`tanh(sp) + x*sigmoid(x)*(1-tanh(sp)^2)`	`sp = softplus(x)`. Smooth and self-regularizing.
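The swish and mish derivative formulas can be verified against central finite differences using only the standard library. The scalar functions below are written for this check and are not the vectorized versions in activations.py.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    s = sigmoid(x)
    return s + x * s * (1 - s)

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

def mish_grad(x):
    t = math.tanh(math.log1p(math.exp(x)))   # tanh(softplus(x))
    return t + x * sigmoid(x) * (1 - t * t)

def numeric_grad(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

points = [-2.0, -0.5, 0.0, 0.7, 3.0]
swish_err = max(abs(swish_grad(x) - numeric_grad(swish, x)) for x in points)
mish_err = max(abs(mish_grad(x) - numeric_grad(mish, x)) for x in points)
```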

Linear / Identity

Name	Forward	Derivative	Notes
`linear`	`x`	`1`	No transformation. Used for output layers before softmax/sigmoid.

Weight Initialization

All initializers are in weight_init.py and automatically called by layer addition methods.

Dense & Sparse Layer Initializers

For a layer with n_in inputs and n_out outputs:

Method	Formula	Best For
`xavier_uniform`	`U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))`	Tanh/sigmoid activations
`xavier_normal`	`N(0, sqrt(2/(n_in+n_out)))`	Tanh/sigmoid activations
`he_uniform`	`U(-sqrt(6/n_in), sqrt(6/n_in))`	ReLU activations
`he_normal`	`N(0, sqrt(2/n_in))`	ReLU activations (default for conv)
`normal`	`N(0, 0.1)`	General purpose, small initial values
`orthogonal`	SVD-based orthogonal matrix	RNNs, preserving gradient norms
`zeros`	All zeros	Biases, or when you want to start from zero
`ones`	All ones	Special cases

Convolutional Layer Initializers

Same methods as dense, but fan-in is computed as in_ch * k * k (number of input connections per filter):

Method	Formula
`xavier_uniform`	`U(-sqrt(6/(in_ch*k*k + out_ch)), ...)`
`xavier_normal`	`N(0, sqrt(2/(in_ch*k*k + out_ch)))`
`he_uniform`	`U(-sqrt(6/(in_ch*k*k)), ...)`
`he_normal`	`N(0, sqrt(2/(in_ch*k*k)))` (default)
`normal`	`N(0, 0.1)`
`orthogonal`	SVD on `(out_ch, in_ch*k*k)`, reshaped to `(out_ch, in_ch, k, k)`
`zeros`	All zeros
`ones`	All ones

Embedding Layer Initializers

Method	Formula
`normal`	`N(0, 0.1)` (default)
`xavier_uniform`	`U(-sqrt(6/(vocab_size + embed_dim)), ...)`
`xavier_normal`	`N(0, sqrt(2/(vocab_size + embed_dim)))`
`zeros`	All zeros

Note: All weights and biases are stored as np.float64 for maximum numerical precision.

Training Utilities

Train Method

The Train method in train.py provides a complete training loop with validation support, metric tracking, and learning rate scheduling.

history = model.Train(
    X_train, Y_train,
    epochs=10, batch_size=32,
    X_val=None, Y_val=None,
    loss_function=None,
    verbose=True,
    scheduler=None,
    **loss_kwargs
)

Parameter	Type	Default	Description
`X_train`	`ndarray`	required	Training inputs.
`Y_train`	`ndarray`	required	Training targets.
`epochs`	`int`	`10`	Number of training epochs.
`batch_size`	`int`	`32`	Batch size for mini-batch gradient descent.
`X_val`	`ndarray` or `None`	`None`	Validation inputs. If provided, validation metrics are computed each epoch.
`Y_val`	`ndarray` or `None`	`None`	Validation targets.
`loss_function`	`str` or `None`	`None`	Loss function name. If `None`, auto-detects based on last layer activation (`"cross_entropy"` for softmax, `"mse"` otherwise).
`verbose`	`bool`	`True`	Whether to print progress.
`scheduler`	`LRScheduler` or `None`	`None`	Learning rate scheduler instance.
`**loss_kwargs`			Additional arguments passed to `ComputeLoss` (e.g., `delta` for Huber loss).

Returns: history dict with keys "loss", "accuracy", "val_loss", "val_accuracy", "lr".

Training loop details:

If a scheduler is provided, scheduler.step(epoch) is called at the start of each epoch to update the learning rate.
Training data is shuffled each epoch.
Batches are processed sequentially. For each batch:
- TrainBatch is called (forward, loss, backward, update).
- Loss and accuracy are weighted by actual batch size (handles last incomplete batch).
Epoch averages are computed and stored in history.
If validation data is provided, validation runs in evaluation mode: Forward is called with training=False, so no caches are populated and no parameters are updated.

TrainBatch Method

loss, out = model.TrainBatch(xs, ys, loss_function=None, **loss_kwargs)

A single training step that:

Calls Forward(xs, training=True)
Auto-detects loss function if not provided
Computes loss via ComputeLoss
Calls Backward(ys)
Calls update() to apply gradients

Returns the scalar loss and the network output.

Learning Rate Schedulers

The LRScheduler class in train.py supports multiple decay strategies:

from Enilnets import LRScheduler

# Step decay: halve LR every 10 epochs
scheduler = LRScheduler(initial_lr=0.001, mode="step", drop=0.5, epochs_drop=10)

# Exponential decay: multiply by 0.95 each epoch
scheduler = LRScheduler(initial_lr=0.001, mode="exponential", decay=0.95)

# Cosine annealing
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=100)

# Warmup + cosine
scheduler = LRScheduler(initial_lr=0.001, mode="warmup_cosine", max_epochs=100, warmup_epochs=5)

Mode	Formula	Parameters
`"step"`	`lr * drop^(epoch // epochs_drop)`	`drop=0.5`, `epochs_drop=10`
`"exponential"`	`lr * decay^epoch`	`decay=0.95`
`"cosine"`	`lr * 0.5 * (1 + cos(pi * epoch / max_epochs))`	`max_epochs=100`
`"warmup_cosine"`	Linear warmup then cosine	`max_epochs=100`, `warmup_epochs=5`
`"plateau"`	Returns initial_lr (placeholder)	None

The scheduler's step(epoch) method returns the learning rate for that epoch. The Train method calls self.set_lr(lr) before each epoch.
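The step, exponential, and cosine formulas from the table can be written as plain functions to see how the rate evolves. These are standalone helpers for illustration, not the LRScheduler class; the warmup variant is omitted because its exact formula is not given here.

```python
import math

def step_lr(initial_lr, epoch, drop=0.5, epochs_drop=10):
    return initial_lr * drop ** (epoch // epochs_drop)

def exponential_lr(initial_lr, epoch, decay=0.95):
    return initial_lr * decay ** epoch

def cosine_lr(initial_lr, epoch, max_epochs=100):
    return initial_lr * 0.5 * (1 + math.cos(math.pi * epoch / max_epochs))

lr0 = 0.001
step_values = [step_lr(lr0, e) for e in (0, 9, 10, 25)]
exp_value = exponential_lr(lr0, 2)
cosine_values = [cosine_lr(lr0, e) for e in (0, 50, 100)]
```

Step decay holds the rate constant within each 10-epoch block, while cosine annealing falls smoothly from `initial_lr` at epoch 0 to half of it at the midpoint and to zero at `max_epochs`.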

Metrics Computation

Accuracy

acc = model.compute_accuracy(predictions, targets)

Multi-class (predictions.shape[-1] > 1): Compares argmax of predictions vs targets.
Binary (predictions.shape[-1] == 1): Thresholds at 0.5.

Returns mean accuracy as a float.
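The two cases can be sketched in NumPy as follows. `accuracy_sketch` is a hypothetical stand-in, not the library method; it assumes one-hot targets in the multi-class case and treats a prediction of exactly 0.5 as positive.

```python
import numpy as np

def accuracy_sketch(predictions, targets):
    """Illustrative accuracy; assumes one-hot targets in the multi-class case."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    if predictions.shape[-1] > 1:
        # Multi-class: compare predicted class index with target class index.
        return float(np.mean(predictions.argmax(axis=-1) == targets.argmax(axis=-1)))
    # Binary: threshold the single output column at 0.5 (assumed inclusive).
    return float(np.mean((predictions >= 0.5) == (targets >= 0.5)))

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]])
labels = np.eye(3)[[0, 2, 0]]
print(accuracy_sketch(probs, labels))  # two of the three rows match
```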

Precision, Recall, F1

metrics = model.compute_precision_recall_f1(predictions, targets)
# Returns: {"precision": float, "recall": float, "f1": float}

Binary classification metrics. Predictions are converted to labels using the same multi-class/binary detection as compute_accuracy. A 1e-12 epsilon is added to the denominators to prevent division by zero.
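As a rough illustration of how the epsilon guards the divisions, the sketch below computes the three metrics from 0/1 label arrays. `binary_prf_sketch` is a hypothetical helper; where exactly the library adds the epsilon is an assumption.

```python
import numpy as np

def binary_prf_sketch(pred_labels, true_labels, eps=1e-12):
    """Precision, recall and F1 for 0/1 label arrays (illustrative only)."""
    pred_labels = np.asarray(pred_labels).ravel().astype(bool)
    true_labels = np.asarray(true_labels).ravel().astype(bool)
    tp = np.sum(pred_labels & true_labels)    # true positives
    fp = np.sum(pred_labels & ~true_labels)   # false positives
    fn = np.sum(~pred_labels & true_labels)   # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}

metrics_example = binary_prf_sketch([1, 1, 0, 0], [1, 0, 1, 0])
print(metrics_example)
```

With no positive predictions and no positive targets, every metric evaluates to 0.0 instead of raising a division error.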

Model Utilities

Training / Evaluation Mode

model.train()   # Set training=True, returns self (for chaining)
model.eval()    # Set training=False, returns self

These affect BatchNorm (batch vs running stats) and Dropout (active vs identity).

Learning Rate Control

model.set_lr(0.0001)   # Set learning rate
lr = model.get_lr()    # Get current learning rate

Gradient Clipping

model.clip_gradients(max_norm=1.0)

Clips the L2 norm of all deltas across all layers:

Computes total_norm = sqrt(sum(||delta||^2)) over all non-None deltas.
If total_norm > max_norm, scales all deltas by max_norm / total_norm.

Call this after Backward() and before update() to prevent exploding gradients.
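The clipping rule can be sketched on a plain list of arrays. `clip_by_global_norm` is a hypothetical helper that mirrors the two steps above; it is not the library's clip_gradients and does not touch any layer state.

```python
import numpy as np

def clip_by_global_norm(deltas, max_norm=1.0):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    present = [d for d in deltas if d is not None]
    total_norm = np.sqrt(sum(float(np.sum(d ** 2)) for d in present))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        deltas = [None if d is None else d * scale for d in deltas]
    return deltas, total_norm

grads = [np.array([3.0, 0.0]), None, np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Because every array is scaled by the same factor, the direction of the overall gradient is preserved and only its length changes.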

Layer Freezing / Unfreezing

model.freeze()           # Freeze all layers
model.freeze(2)          # Freeze layer index 2 only
model.unfreeze()         # Unfreeze all layers
model.unfreeze(2)        # Unfreeze layer index 2 only

Frozen layers are skipped during update(). The _frozen flag is checked in the optimizer loop. This is useful for transfer learning and fine-tuning.

Weight Get / Set

weights = model.get_weights()   # Returns list of dicts, one per layer
model.set_weights(weights)      # Restores weights from list of dicts

get_weights() copies "weights", "bias", "gamma", "beta", and "mask" for each layer. set_weights() restores them. Useful for checkpointing, model averaging, and transfer.

Model Copying

net_copy = model.copy()

Creates a deep copy of the network including layers and optimizer state. The new network has the same architecture, weights, and optimizer buffers but is an independent object.

Optimizer State Reset

model.reset_optimizer_state()

Clears all optimizer momentum/velocity buffers and resets the timestep t to 0. Useful when starting training on a new dataset or after significant hyperparameter changes.

NaN / Inf Detection

issues = model.check_nan_inf()
# Returns list of strings like ["Layer 2 weights has NaN/Inf", "Delta 5 has NaN/Inf"]

Checks all weights, biases, gamma, beta, and deltas for non-finite values. Returns an empty list if everything is clean. Call this periodically during training to catch numerical instability early.

Model Summary

model.summary()

Prints a formatted table showing:

Optimizer, learning rate, L2 lambda
Per-layer information: type, input/output shapes, parameter counts
Total parameter count

Example output:

Model Summary
======================================================================
Optimizer: ADAM | LR: 0.001 | L2: 0.01
======================================================================
Layer 0: DENSE        Input:    784 Output:    256 Params: 200960
Layer 1: DROPOUT
Layer 2: DENSE        Input:    256 Output:     10 Params: 2570
Total Parameters: 203530
======================================================================
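The parameter counts in the example output are consistent with one weight per input-output pair plus one bias per output unit. The arithmetic can be checked with a small sketch; `dense_params` is a hypothetical helper, and dropout is assumed to contribute no trainable parameters.

```python
def dense_params(n_in, n_out, use_bias=True):
    """Weights plus (optionally) one bias per output unit."""
    return n_in * n_out + (n_out if use_bias else 0)

layer_0 = dense_params(784, 256)
layer_2 = dense_params(256, 10)
total = layer_0 + layer_2  # dropout is assumed to add no trainable parameters
print(layer_0, layer_2, total)
```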

Reinforcement Learning

All RL methods are in reinforce.py and bound to NeuralNet.

Evolutionary Strategy (Evolve)

best_score = model.Evolve(inputs, score_fn, noise=0.05, tries=10, sigma=1.0)

A black-box optimization method that perturbs network weights with Gaussian noise and keeps the best variant.

Parameter	Type	Default	Description
`inputs`	`ndarray`	required	Input data to evaluate the network on.
`score_fn`	`callable`	required	Function that takes network output and returns a scalar score (higher is better).
`noise`	`float`	`0.05`	Standard deviation scale for weight perturbations.
`tries`	`int`	`10`	Number of candidate networks to try.
`sigma`	`float`	`1.0`	Additional scaling factor for noise.

Algorithm:

Evaluate the current network on inputs to get a baseline score.
For each try, create a deep copy of the network, add Gaussian noise to all weights and biases (respecting sparse masks), evaluate the candidate.
If the candidate scores higher, keep it as the new best.
Restore the best network.

Returns the best score achieved.
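The steps above amount to a keep-the-best random search. The sketch below works on a plain list of weight arrays instead of a NeuralNet. `evolve_sketch` and `forward_fn` are hypothetical names, and two details are assumptions: candidates are perturbed around the best weights found so far, and the noise standard deviation is `noise * sigma`. Sparse masks are not modeled.

```python
import numpy as np

def evolve_sketch(weights, forward_fn, inputs, score_fn,
                  noise=0.05, tries=10, sigma=1.0, rng=None):
    """Keep-the-best random search over a list of weight arrays (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    best_weights = [w.copy() for w in weights]
    best_score = score_fn(forward_fn(best_weights, inputs))
    for _ in range(tries):
        # Perturb a copy of the current best weights with Gaussian noise.
        candidate = [w + rng.normal(0.0, noise * sigma, size=w.shape)
                     for w in best_weights]
        score = score_fn(forward_fn(candidate, inputs))
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score

# Toy problem: recover a linear map, scoring by negative mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])
w0 = [np.zeros((3, 1))]
forward = lambda ws, x: x @ ws[0]
score = lambda out: -float(np.mean((out - y) ** 2))
baseline = score(forward(w0, X))
best_w, best = evolve_sketch(w0, forward, X, score, tries=50, rng=rng)
print(baseline, best)
```

The returned score can never be lower than the baseline, because a candidate only replaces the current best when it scores strictly higher.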

REINFORCE (Policy Gradient)

mean_return = model.Reinforce(
    states, actions, returns,
    action_type="discrete", std=1.0, normalize_returns=True
)

Monte-Carlo policy gradient method.

Parameter	Type	Default	Description
`states`	`ndarray`	required	Observed states, shape `(N, features)`.
`actions`	`ndarray`	required	Discrete: `(N,)` integer indices. Continuous: `(N, action_dim)`.
`returns`	`ndarray`	required	Discounted returns for each state-action pair, shape `(N,)` or `(N, 1)`.
`action_type`	`str`	`"discrete"`	`"discrete"` (categorical) or `"continuous"` (Gaussian).
`std`	`float`	`1.0`	Standard deviation for continuous Gaussian policy.
`normalize_returns`	`bool`	`True`	Whether to z-score normalize returns before computing gradients.

Discrete actions: Network output is treated as action probabilities. The gradient is (out - one_hot(actions)) * returns / batch_size.
Continuous actions: Network output is treated as action means. The gradient is -(actions - means) / std^2 * returns / batch_size.

Returns the mean of the raw (un-normalized) returns.
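The discrete-action formula matches the gradient of `-return * log pi(a|s)` with respect to the pre-softmax logits. A minimal NumPy sketch of that formula follows; `reinforce_output_grad` is a hypothetical helper, and the `1e-8` term in the normalization is an assumption.

```python
import numpy as np

def reinforce_output_grad(probs, actions, returns, normalize_returns=True):
    """(probs - one_hot(actions)) * returns / batch_size, as in the text (illustrative)."""
    returns = np.asarray(returns, dtype=np.float64).reshape(-1, 1)
    if normalize_returns:
        # z-score normalization; the small constant avoids division by zero (assumed)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    one_hot = np.eye(probs.shape[1])[actions]
    return (probs - one_hot) * returns / probs.shape[0]

probs = np.array([[0.5, 0.5], [0.9, 0.1]])
actions = np.array([0, 1])
grad = reinforce_output_grad(probs, actions, [1.0, 1.0], normalize_returns=False)
print(grad)
```

Each row of the result sums to zero, since both the probabilities and the one-hot vector sum to one.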

Proximal Policy Optimization (PPO)

policy_loss = model.PPO(
    states, actions, old_log_probs, advantages,
    action_type="discrete", epsilon=0.2, std=1.0,
    value_targets=None, value_coeff=0.5, entropy_coeff=0.01
)

PPO is a policy gradient method that clips the objective to prevent overly large policy updates.

Parameter	Type	Default	Description
`states`	`ndarray`	required	Observed states, shape `(N, features)`.
`actions`	`ndarray`	required	Discrete: `(N,)` integers. Continuous: `(N, action_dim)`.
`old_log_probs`	`ndarray`	required	Log probabilities under the old policy, shape `(N, 1)`.
`advantages`	`ndarray`	required	Advantage estimates, shape `(N, 1)`.
`action_type`	`str`	`"discrete"`	`"discrete"` or `"continuous"`.
`epsilon`	`float`	`0.2`	Clipping parameter.
`std`	`float`	`1.0`	Fixed std for continuous Gaussian policy.
`value_targets`	`ndarray` or `None`	`None`	Target values for value head (not yet implemented in gradient).
`value_coeff`	`float`	`0.5`	Coefficient for value loss (reserved).
`entropy_coeff`	`float`	`0.01`	Coefficient for entropy bonus (encourages exploration).

Discrete PPO:

Computes action probabilities from network output.
Computes log probabilities for the taken actions.
Computes probability ratio ratio = exp(new_log_prob - old_log_prob).
Computes clipped surrogate objective: min(ratio * advantage, clip(ratio, 1-eps, 1+eps) * advantage).
Approximates policy gradient: for each sample, if not clipped, gradient flows through the taken action proportional to -advantage / prob.
Adds entropy gradient: (1 + log_probs) * entropy_coeff / batch_size.

Continuous PPO: Uses Gaussian log probabilities and computes gradient as -(actions - means) / std^2 * advantages / batch_size.

Returns the mean policy loss (negative of the clipped objective).
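The ratio and clipped surrogate steps of the discrete variant can be sketched directly from log probabilities. `ppo_clipped_loss` is a hypothetical helper covering only the loss value; the gradient approximation and the entropy term described above are not included.

```python
import numpy as np

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean negative clipped surrogate objective (illustrative)."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

old_lp = np.log(np.array([[0.5], [0.5]]))
new_lp = np.log(np.array([[0.75], [0.25]]))
adv = np.array([[1.0], [-1.0]])
loss = ppo_clipped_loss(new_lp, old_lp, adv)
print(loss)
```

Here the ratios are 1.5 and 0.5. The first is capped at 1.2 because its advantage is positive, and the second is floored at 0.8 because its advantage is negative, so neither sample is rewarded for moving further than the clip range allows.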

Actor-Critic

value_loss = model.ActorCritic(
    states, actions, returns, values,
    action_type="discrete", std=1.0
)

Combines policy gradient with a value function baseline.

Parameter	Type	Default	Description
`states`	`ndarray`	required	Observed states.
`actions`	`ndarray`	required	Taken actions.
`returns`	`ndarray`	required	Discounted returns, shape `(N, 1)`.
`values`	`ndarray`	required	Predicted values from value network, shape `(N, 1)`.
`action_type`	`str`	`"discrete"`	`"discrete"` or `"continuous"`.
`std`	`float`	`1.0`	Std for continuous actions.

Algorithm:

Computes advantages: advantages = returns - values.
Uses the same policy gradient as REINFORCE but weighted by advantages instead of raw returns.
Returns the mean squared advantage (a proxy for value function error).

Note: This implementation uses a single network. In practice, you may want separate actor and critic networks or a network with two output heads.

RL Utility Functions

compute_returns

from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)

Computes discounted returns for a single episode:

G_t = reward_t + gamma * G_{t+1}

Iterates backwards through the reward array. Returns an array of the same shape.
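The recursion can be written as a short backward loop. `discounted_returns` is a hypothetical stand-in for compute_returns, assuming a 1-D reward array for a single episode.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed back to front (illustrative)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0  # return beyond the end of the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # [1.75 1.5  1.  ]
```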

gae (Generalized Advantage Estimation)

from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)

Parameter	Type	Default	Description
`rewards`	`ndarray`	required	Step rewards, shape `(T,)`.
`values`	`ndarray`	required	Value estimates including bootstrap, shape `(T+1,)`.
`gamma`	`float`	`0.99`	Discount factor.
`lambda_`	`float`	`0.95`	GAE lambda (0 = high bias, 1 = high variance).

Computes TD-residuals delta_t = reward_t + gamma * V(s_{t+1}) - V(s_t), then accumulates them with exponential decay: A_t = delta_t + gamma * lambda * A_{t+1}. Returns (advantages, returns) where returns = advantages + values[:T].

Generative AI Framework

All generative models live in the generative/ subpackage and can be imported from the top-level Enilnets package.

Variational Autoencoder (VAE)

File: generative/vae.py

A VAE learns a probabilistic latent representation of data. It consists of an encoder (maps data to latent distribution parameters) and a decoder (maps latent samples back to data).

from Enilnets import VAE

vae = VAE(
    input_dim=784, latent_dim=32,
    encoder_hidden=[512, 256],
    decoder_hidden=[256, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`input_dim`	`int`	required	Dimensionality of input data (flattened).
`latent_dim`	`int`	required	Dimensionality of the latent space.
`encoder_hidden`	`list[int]`	`[512, 256]`	Hidden layer sizes for the encoder.
`decoder_hidden`	`list[int]`	`[256, 512]`	Hidden layer sizes for the decoder.
`activation`	`str`	`"swish"`	Activation for hidden layers.
`learning_rate`	`float`	`0.001`	Learning rate for both encoder and decoder.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Architecture:

Encoder: Dense layers with specified activation, final layer outputs latent_dim * 2 values (mu and logvar) with linear activation.
Decoder: Dense layers with specified activation, final layer outputs input_dim values with sigmoid activation (assumes data in [0, 1]).

Methods:

Method	Signature	Description
`encode`	`encode(x)` -> `(mu, logvar)`	Maps input to latent distribution parameters.
`decode`	`decode(z)` -> `recon`	Maps latent samples to reconstructed data.
`forward`	`forward(x)` -> `(recon, mu, logvar, z)`	Full forward pass through encoder + reparameterization + decoder.
`loss`	`loss(x, recon=None, mu=None, logvar=None)` -> `float`	Computes reconstruction loss (binary cross-entropy) + KL divergence.
`train_step`	`train_step(x)` -> `float`	One training step: forward, backward through decoder, backward through encoder, update both networks.
`Train`	`Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]`	Full training loop. Returns list of average losses per epoch.
`generate`	`generate(n_samples=1)` -> `ndarray`	Samples from N(0, I) in latent space and decodes.
`reconstruct`	`reconstruct(x)` -> `ndarray`	Encodes and decodes input (reconstruction).
`interpolate`	`interpolate(x1, x2, n_steps=10)` -> `ndarray`	Linear interpolation in latent space between two inputs.

Training details:

Forward: encode -> reparameterize -> decode.
Decoder backward: computes d_recon = (recon - x) / batch_size, multiplies by sigmoid derivative recon * (1 - recon), backpropagates through decoder, updates decoder weights.
Encoder backward: takes the gradient of the loss w.r.t. the decoder input (the latent sample z), combines it with the reparameterization gradients to get d_mu and d_logvar, backpropagates through the encoder, and updates the encoder weights.
Loss = binary cross-entropy reconstruction + KL(q(z|x) || N(0, I)).
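The loss can be sketched with the standard closed-form KL term for a diagonal Gaussian. `vae_loss_sketch` is a hypothetical helper; the reduction (sum over features, mean over the batch) and the clipping constant are assumptions and may differ from what VAE.loss does.

```python
import numpy as np

def vae_loss_sketch(x, recon, mu, logvar, eps=1e-12):
    """Per-sample BCE reconstruction plus KL(q(z|x) || N(0, I)), averaged over the batch."""
    recon = np.clip(recon, eps, 1.0 - eps)  # keep log() finite
    bce = -np.sum(x * np.log(recon) + (1.0 - x) * np.log(1.0 - recon), axis=1)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(bce + kl))

x = np.full((2, 4), 0.5)
recon = np.full((2, 4), 0.5)
zero_kl = vae_loss_sketch(x, recon, np.zeros((2, 3)), np.zeros((2, 3)))
shifted = vae_loss_sketch(x, recon, np.ones((2, 3)), np.zeros((2, 3)))
print(zero_kl, shifted)
```

When mu is zero and logvar is zero the posterior equals the prior, so the KL term vanishes and only the reconstruction term remains.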

Generative Adversarial Network (GAN)

File: generative/gan.py

A GAN trains a generator to produce realistic data and a discriminator to distinguish real from fake.

from Enilnets import GAN

gan = GAN(
    latent_dim=100, data_dim=784,
    generator_hidden=[256, 512],
    discriminator_hidden=[512, 256],
    g_activation="swish", d_activation="leakyrelu",
    loss_type="bce",
    learning_rate=0.0002, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`latent_dim`	`int`	required	Dimensionality of the noise vector.
`data_dim`	`int`	required	Dimensionality of generated data.
`generator_hidden`	`list[int]`	`[256, 512]`	Hidden layer sizes for generator.
`discriminator_hidden`	`list[int]`	`[512, 256]`	Hidden layer sizes for discriminator.
`g_activation`	`str`	`"swish"`	Generator hidden activation.
`d_activation`	`str`	`"leakyrelu"`	Discriminator hidden activation.
`loss_type`	`str`	`"bce"`	`"bce"`, `"bce_logits"`, or `"wasserstein"`.
`learning_rate`	`float`	`0.0002`	LR for both networks.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Architecture:

Generator: Dense layers with g_activation, final layer with tanh activation.
Discriminator: Dense layers with d_activation, final layer with sigmoid (for BCE) or linear (for Wasserstein).

Methods:

Method	Signature	Description
`generate`	`generate(n_samples)` -> `ndarray`	Samples noise and runs generator forward.
`discriminate`	`discriminate(x)` -> `ndarray`	Runs discriminator forward.
`Train`	`Train(X_train, epochs=10, batch_size=64, d_steps=1, g_steps=1, verbose=True)` -> `dict`	Alternates discriminator and generator training.
`sample`	`sample(n_samples=16)` -> `ndarray`	Alias for `generate`.

Loss types:

Type	Discriminator Target	Generator Gradient
`"bce"`	Real=1, Fake=0	`-1 / D(fake)`
`"bce_logits"`	Real=1, Fake=0	Logits-based stable gradient
`"wasserstein"`	Real=1, Fake=-1	`-1` (constant)

Training loop:

For each batch, train discriminator for d_steps iterations on real + fake data.
Train generator for g_steps iterations by backpropagating through the discriminator to get gradients w.r.t. the generator's output (the discriminator's input), which are then backpropagated through the generator.
Track and report D_loss and G_loss per epoch.

Diffusion Model (DDPM)

File: generative/diffusion.py

Implements Denoising Diffusion Probabilistic Models (DDPM). The model learns to reverse a gradual noising process.

from Enilnets import DiffusionModel

diffusion = DiffusionModel(
    data_shape=(784,), time_steps=1000,
    beta_schedule="linear", beta_start=1e-4, beta_end=0.02,
    denoiser_type="mlp", denoiser_hidden=[512, 512, 512],
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`data_shape`	`tuple`	required	Shape of data. `(D,)` for flattened, `(C, H, W)` for images.
`time_steps`	`int`	`1000`	Number of diffusion timesteps.
`beta_schedule`	`str`	`"linear"`	`"linear"` or `"cosine"` noise schedule.
`beta_start`	`float`	`1e-4`	Starting beta value (linear schedule).
`beta_end`	`float`	`0.02`	Ending beta value (linear schedule).
`denoiser_type`	`str`	`"mlp"`	`"mlp"` or `"conv"` denoiser architecture.
`denoiser_hidden`	`list[int]`	`[512, 512, 512]`	Hidden sizes for MLP denoiser.
`learning_rate`	`float`	`0.001`	Learning rate.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Noise schedules:

Linear: betas = linspace(beta_start, beta_end, time_steps)
Cosine: Uses a cosine-squared schedule with offset s=0.008 for smoother noise addition.

Precomputed constants (computed in __init__):

alphas = 1 - betas
alphas_cumprod = cumprod(alphas) -- cumulative product of alphas
sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod -- for forward diffusion
sqrt_recip_alphas, posterior_variance -- for reverse diffusion

Methods:

Method	Signature	Description
`train_step`	`train_step(x_0)` -> `float`	One training step: sample timestep t, add noise, predict noise, compute MSE loss, backpropagate.
`Train`	`Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]`	Full training loop.
`sample`	`sample(n_samples=16, shape=None, clip=True)` -> `ndarray`	Generate samples by iteratively denoising from pure noise.
`denoise`	`denoise(x_noisy, t_start, t_end=0)` -> `ndarray`	Denoise a partially noised input from timestep `t_start` down to `t_end`.

Denoiser architectures:

MLP: Concatenates flattened input with sinusoidal time embedding, passes through dense layers.
Conv: Stack of conv2d layers. Time embedding is broadcast spatially and added to feature maps.

Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise

Reverse diffusion: At each step, predict noise, compute mean of p(x_{t-1} | x_t), add scaled noise (except at t=0).
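The linear schedule and the forward diffusion formula can be sketched in a few lines of NumPy. `linear_schedule` and `q_sample` are hypothetical helpers rather than DiffusionModel methods; they only reproduce the precomputed constants and the forward noising step listed above.

```python
import numpy as np

def linear_schedule(time_steps=1000, beta_start=1e-4, beta_end=0.02):
    """betas, alphas and the cumulative product of alphas for a linear schedule."""
    betas = np.linspace(beta_start, beta_end, time_steps)
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    return betas, alphas, alphas_cumprod

def q_sample(x_0, t, alphas_cumprod, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise."""
    # One abar value per sample, reshaped so it broadcasts over the data dimensions.
    abar = alphas_cumprod[t].reshape(-1, *([1] * (x_0.ndim - 1)))
    return np.sqrt(abar) * x_0 + np.sqrt(1.0 - abar) * noise

rng = np.random.default_rng(0)
betas, alphas, abar = linear_schedule()
x_0 = rng.random((4, 784))
t = rng.integers(0, 1000, size=4)      # one random timestep per sample
noise = rng.standard_normal(x_0.shape)
x_t = q_sample(x_0, t, abar, noise)
print(x_t.shape, abar[0], abar[-1])
```

The cumulative product decreases monotonically, so later timesteps keep less of the original signal and more of the noise.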

Autoregressive Model (MADE)

File: generative/autoregressive.py

A masked autoregressive model that enforces causality -- each output dimension only depends on previous dimensions.

from Enilnets import AutoregressiveModel

ar = AutoregressiveModel(
    data_dim=784, hidden_dims=[512, 512],
    data_shape=(28, 28), activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`data_dim`	`int`	required	Total number of dimensions.
`hidden_dims`	`list[int]`	`[512, 512]`	Hidden layer sizes.
`data_shape`	`tuple` or `None`	`None`	Original shape for reshaping output (e.g., `(28, 28)`).
`activation`	`str`	`"swish"`	Hidden activation.
`learning_rate`	`float`	`0.001`	Learning rate.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Architecture: Standard MLP with linear output activation.

Causality enforcement: A strictly lower-triangular mask (zeros on and above the diagonal) is applied to the input before feeding it to the network: x_masked[i] = sum_{j < i} mask[i,j] * x[j]. This ensures the i-th output only sees dimensions 0 through i-1.
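The masking formula can be illustrated with NumPy. `causal_input_mask` and `apply_mask` are hypothetical helpers, and the mask is assumed to contain ones below the diagonal; the library may build or apply it differently.

```python
import numpy as np

def causal_input_mask(data_dim):
    """Strictly lower-triangular mask: ones below the diagonal, zeros elsewhere."""
    return np.tril(np.ones((data_dim, data_dim)), k=-1)

def apply_mask(x, mask):
    # Batched form of x_masked[i] = sum_{j < i} mask[i, j] * x[j].
    return x @ mask.T

mask = causal_input_mask(4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])
x_masked = apply_mask(x, mask)
print(x_masked)  # [[0. 1. 3. 6.]]
```

Position 0 receives nothing, and position i receives only contributions from dimensions before it.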

Methods:

Method	Signature	Description
`forward`	`forward(x, training=True)` -> `ndarray`	Causal forward pass.
`loss`	`loss(x)` -> `float`	MSE between predictions and targets.
`train_step`	`train_step(x)` -> `float`	One training step with custom backpropagation.
`Train`	`Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]`	Full training loop.
`generate`	`generate(n_samples=1, shape=None)` -> `ndarray`	Autoregressive sampling: predict dim 0, use it to predict dim 1, etc.
`complete`	`complete(partial_x, n_dims=None)` -> `ndarray`	Complete a partial sample by autoregressively filling remaining dimensions.

Generation: Starts with zeros, iteratively predicts each dimension and adds small Gaussian noise for diversity.

Normalizing Flows (RealNVP)

File: generative/flows.py

Real-valued Non-Volume Preserving (RealNVP) flow for density estimation and sampling.

from Enilnets import RealNVP

flow = RealNVP(
    data_dim=784, n_coupling=4, hidden_dim=256,
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`data_dim`	`int`	required	Dimensionality of data.
`n_coupling`	`int`	`4`	Number of coupling layers.
`hidden_dim`	`int`	`256`	Hidden dimension for s and t networks.
`activation`	`str`	`"swish"`	Activation for coupling networks.
`learning_rate`	`float`	`0.001`	Learning rate.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Architecture: Each coupling layer has two networks:

s_net (scale): Predicts log-scale factors. Output activation is tanh.
t_net (translation): Predicts translation. Output activation is linear.

Both are 3-layer MLPs: data_dim//2 -> hidden_dim -> hidden_dim -> data_dim - data_dim//2.

Coupling transform: For input split into x1 and x2:

y2 = x2 * exp(s(x1)) + t(x1)
output = concat(x1, y2)
log_det += sum(s(x1))

Alternating masks (even/odd splits) ensure all dimensions are transformed.
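A single coupling layer and its inverse can be sketched as below. The names are hypothetical, the split is a simple first-half/second-half split rather than the alternating masks, and `s_fn` / `t_fn` are fixed toy functions standing in for s_net and t_net.

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn):
    """Affine coupling: x1 passes through, x2 is scaled and shifted using x1."""
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    s, t = s_fn(x1), t_fn(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2], axis=1), s.sum(axis=1)

def coupling_inverse(y, s_fn, t_fn):
    """Undo the transform; possible because x1 (and so s and t) is left unchanged."""
    d = y.shape[1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    x2 = (y2 - t_fn(y1)) * np.exp(-s_fn(y1))
    return np.concatenate([y1, x2], axis=1)

# Toy stand-ins for s_net and t_net (hypothetical, fixed functions).
s_fn = lambda x1: np.tanh(x1)
t_fn = lambda x1: 0.5 * x1

x = np.random.default_rng(0).standard_normal((5, 4))
y, log_det = coupling_forward(x, s_fn, t_fn)
x_back = coupling_inverse(y, s_fn, t_fn)
print(np.allclose(x_back, x), log_det.shape)
```

The Jacobian of the transform is triangular with exp(s) on the diagonal for the transformed half, which is why the log-determinant reduces to the sum of s.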

Methods:

Method	Signature	Description
`forward`	`forward(x)` -> `(z, log_det)`	Maps data to latent space. Returns transformed data and log determinant.
`inverse`	`inverse(z)` -> `x`	Maps latent samples back to data space.
`log_prob`	`log_prob(x)` -> `ndarray`	Computes log probability: `log p(z) + log_det` where `p(z)` is N(0, I).
`loss`	`loss(x)` -> `float`	Negative mean log probability.
`train_step`	`train_step(x)` -> `float`	Returns loss (training uses Evolve, see below).
`Train`	`Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]`	Trains each coupling layer sequentially using evolutionary strategy.
`sample`	`sample(n_samples=1)` -> `ndarray`	Samples from base distribution and applies inverse transform.
`interpolate`	`interpolate(x1, x2, n_steps=10)` -> `ndarray`	Linear interpolation in latent space.

Training: Uses Evolve (evolutionary strategy) rather than analytical backprop through the log-determinant Jacobian. Each coupling layer is trained while keeping previous layers fixed.

Energy-Based Model (EBM)

File: generative/ebm.py

An energy-based model that assigns low energy to real data and high energy to generated samples.

from Enilnets import EnergyBasedModel

ebm = EnergyBasedModel(
    data_dim=784, hidden_dims=[512, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)

Parameter	Type	Default	Description
`data_dim`	`int`	required	Dimensionality of data.
`hidden_dims`	`list[int]`	`[512, 512]`	Hidden layer sizes for energy network.
`activation`	`str`	`"swish"`	Hidden activation.
`learning_rate`	`float`	`0.001`	Learning rate.
`optimizer`	`str`	`"adam"`	Optimizer type.
`l2_lambda`	`float`	`0.0`	L2 regularization.

Architecture: MLP mapping data to a scalar energy value. Final layer has linear activation.

Methods:

Method	Signature	Description
`energy`	`energy(x)` -> `ndarray`	Computes scalar energy for input data, shape `(batch, 1)`.
`_energy_grad`	`_energy_grad(x)` -> `(energy, grad)`	Finite-difference gradient of energy w.r.t. input (eps=1e-4).
`train_step`	`train_step(x_data, n_cd_steps=10, step_size=0.1, noise_scale=0.005)` -> `float`	One contrastive divergence step.
`Train`	`Train(X_train, epochs=10, batch_size=64, n_cd_steps=10, step_size=0.1, noise_scale=0.005, verbose=True)` -> `list[float]`	Full training loop.
`sample`	`sample(n_samples=1, n_steps=100, step_size=0.1, noise_scale=0.005)` -> `ndarray`	Langevin dynamics sampling from random initialization.
`score`	`score(x)` -> `ndarray`	Returns the energy gradient (score function).
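A finite-difference input gradient of the kind `_energy_grad` is described as computing can be sketched as follows. `finite_difference_grad` is a hypothetical helper; using central differences and perturbing one dimension at a time are assumptions about the scheme.

```python
import numpy as np

def finite_difference_grad(energy_fn, x, eps=1e-4):
    """Central-difference estimate of d energy / d x, one input dimension at a time."""
    grad = np.zeros_like(x)
    for j in range(x.shape[1]):
        step = np.zeros_like(x)
        step[:, j] = eps
        diff = energy_fn(x + step) - energy_fn(x - step)  # shape (batch, 1)
        grad[:, j] = diff[:, 0] / (2.0 * eps)
    return grad

# Toy energy with a known gradient: E(x) = 0.5 * ||x||^2, so dE/dx = x.
quadratic = lambda x: 0.5 * np.sum(x ** 2, axis=1, keepdims=True)
x = np.array([[1.0, -2.0, 0.5]])
fd_grad = finite_difference_grad(quadratic, x)
print(fd_grad)
```

This needs two energy evaluations per input dimension, which is one reason finite-difference sampling becomes slow for high-dimensional data.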

Training (Contrastive Divergence):

Generate negative samples by running Langevin dynamics from random noise for n_cd_steps iterations.
Compute energy on real data and negative samples.
Update network to push down energy on real data (target=1) and push up on negative samples (target=-1).
Loss = mean(energy(data) - energy(negative_samples)).

Sampling: Langevin dynamics iterates x = x - step_size * grad(energy) + noise_scale * N(0, I).

UNet Denoiser

File: generative/unet.py

A UNet architecture for spatial denoising, designed for use with diffusion models.

from Enilnets import UNetDenoiser, time_embedding

unet = UNetDenoiser(
    in_ch=1, base_ch=64, time_emb_dim=128,
    ch_mult=(1, 2, 4)
)

Parameter	Type	Default	Description
`in_ch`	`int`	required	Number of input channels.
`base_ch`	`int`	`64`	Base number of channels.
`time_emb_dim`	`int`	`128`	Dimensionality of time embedding.
`ch_mult`	`tuple`	`(1, 2, 4)`	Channel multipliers for each encoder level.

Architecture:

Time embedding MLP: time_emb_dim -> time_emb_dim*4 -> time_emb_dim*4 with swish activation.
Encoder path: len(ch_mult) levels. Each level has 2 conv2d layers (k=1) with swish activation. Time embedding is added to features (broadcast spatially if channel dims match).
Downsampling: Average pooling by factor 2 between encoder levels.
Bottleneck: 2 conv2d layers at the deepest level.
Decoder path: Mirrors encoder with skip connections. Upsamples, concatenates with skip, applies 2 conv2d layers.
Output: Conv2d (k=1) mapping back to in_ch channels with linear activation.

Methods:

Method	Signature	Description
`forward`	`forward(x, t)` -> `ndarray`	Full UNet forward pass with time conditioning.
`backward`	`backward(grad_output)`	Raises `NotImplementedError`.
`get_params`	`get_params()` -> `list[NeuralNet]`	Returns all sub-networks for external optimization.

Important: The UNet uses k=1 convolutions to avoid spatial dimension changes (since the library has no padding support). The backward method is not implemented; use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation for the UNet.

time_embedding function:

emb = time_embedding(t, dim, max_period=10000)

Sinusoidal time embedding used in diffusion models:

freqs = exp(-log(max_period) * arange(dim//2) / (dim//2))
emb = [sin(t * freqs), cos(t * freqs)]

If dim is odd, a zero column is appended.
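The two formulas and the odd-dimension rule can be put together in NumPy. `time_embedding_sketch` is a hypothetical re-implementation for illustration, assuming `t` is a 1-D array of timesteps and `dim` is at least 2.

```python
import numpy as np

def time_embedding_sketch(t, dim, max_period=10000):
    """Sinusoidal embedding following the two formulas above (illustrative)."""
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs  # shape (N, half)
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=1)
    if dim % 2 == 1:
        # Pad with a zero column so the output has exactly dim columns.
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=1)
    return emb

emb = time_embedding_sketch([0, 1, 10], dim=8)
odd = time_embedding_sketch([3], dim=7)
print(emb.shape, odd.shape)
```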

Sampling Utilities

File: generative/sampling.py

VAE Reparameterization

from Enilnets import reparameterize

z = reparameterize(mu, logvar)

The reparameterization trick: z = mu + exp(0.5 * logvar) * eps where eps ~ N(0, I).

Enables backpropagation through stochastic nodes by separating the randomness from the parameters.
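A minimal sketch of the formula, with a hypothetical `reparameterize_sketch` that accepts an optional random generator for reproducibility (the library function takes only `mu` and `logvar`):

```python
import numpy as np

def reparameterize_sketch(mu, logvar, rng=None):
    """z = mu + exp(0.5 * logvar) * eps with eps ~ N(0, I) (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)  # randomness is independent of mu and logvar
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu = np.full((100000, 1), 2.0)
logvar = np.full((100000, 1), np.log(4.0))  # variance 4, standard deviation 2
z = reparameterize_sketch(mu, logvar, rng=rng)
print(z.mean(), z.std())
```

Because `eps` is drawn independently, the derivatives of z are simply 1 with respect to mu and `0.5 * exp(0.5 * logvar) * eps` with respect to logvar.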

Langevin Dynamics

from Enilnets import langevin_dynamics

x_sampled = langevin_dynamics(energy_fn, x_init, n_steps=20, step_size=0.1, noise_scale=0.005)

Langevin Monte Carlo for sampling from energy-based models:

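import numpy as np

x = x_init.copy()
for step in range(n_steps):
    energy, grad = energy_fn(x)
    # Step toward lower energy, then add Gaussian noise N(0, I) scaled by noise_scale
    x = x - step_size * grad + noise_scale * np.random.randn(*x.shape)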

Parameter	Type	Default	Description
`energy_fn`	`callable`	required	Function taking `x` and returning `(energy, grad_energy)`.
`x_init`	`ndarray`	required	Initial samples.
`n_steps`	`int`	`20`	Number of Langevin steps.
`step_size`	`float`	`0.1`	Gradient descent step size.
`noise_scale`	`float`	`0.005`	Standard deviation of injected noise.

Gaussian & Uniform Sampling

from Enilnets import gaussian_sample, uniform_sample

# Gaussian: N(mean, std^2)
samples = gaussian_sample(mean, std, shape=None)
# If shape is None, uses mean.shape

# Uniform: U(low, high)
samples = uniform_sample(low, high, shape)

Gumbel-Softmax

from Enilnets import gumbel_softmax_sample

samples = gumbel_softmax_sample(logits, temperature=1.0, hard=False)

Differentiable sampling from categorical distributions:

Add Gumbel noise: y = logits + Gumbel(0, 1)
Apply temperature-scaled softmax: y_soft = softmax(y / temperature)
If hard=True, use the straight-through estimator: the forward pass returns the one-hot y_hard (the expression y_hard - y_soft + y_soft evaluates to y_hard), while the backward pass uses the gradient of y_soft.

Useful for training models with discrete latent variables.
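The forward computation can be sketched in NumPy. `gumbel_softmax_sketch` is a hypothetical helper for 2-D logits; since NumPy has no automatic differentiation, the hard branch simply returns the one-hot vector and leaves the straight-through gradient to a hand-written backward pass.

```python
import numpy as np

def gumbel_softmax_sketch(logits, temperature=1.0, hard=False, rng=None):
    """Gumbel-Softmax forward pass for logits of shape (N, K) (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip the uniform draws away from 0 and 1 so both logarithms stay finite.
    u = np.clip(rng.random(logits.shape), 1e-12, 1.0 - 1e-12)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)      # stabilize the softmax
    y_soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return y_soft
    # One-hot vector at the argmax of the soft sample.
    return np.eye(logits.shape[-1])[y_soft.argmax(axis=-1)]

rng = np.random.default_rng(0)
logits = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
soft = gumbel_softmax_sketch(logits, temperature=0.5, rng=rng)
hard = gumbel_softmax_sketch(logits, hard=True, rng=rng)
print(soft.sum(axis=-1), hard)
```

Lower temperatures push the soft samples closer to one-hot vectors, at the cost of higher-variance gradients.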

Random Masking

from Enilnets import random_mask

mask = random_mask(shape, ratio)

Generates a binary mask where each element is 1 with probability ratio (keep ratio), 0 otherwise. Returns np.float64 array.

Top-p (Nucleus) Sampling

from Enilnets.generative.sampling import top_p_sampling

samples = top_p_sampling(logits, p=0.9, temperature=1.0)

Nucleus sampling for text generation:

Convert logits to probabilities with temperature scaling.
Sort probabilities in descending order.
Find the smallest set of tokens whose cumulative probability exceeds p.
Sample from this truncated distribution.

Always keeps at least the top token. Returns one-hot encoded samples.
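The four steps can be sketched for a single logits vector. `top_p_sketch` is a hypothetical helper; handling only 1-D input and the exact tie-breaking at the cutoff are assumptions.

```python
import numpy as np

def top_p_sketch(logits, p=0.9, temperature=1.0, rng=None):
    """Nucleus sampling for a 1-D logits vector; returns a one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # indices by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one that pushes the total past p.
    cutoff = int(np.searchsorted(cumulative, p, side="right")) + 1
    keep = order[:min(cutoff, len(order))]
    truncated = probs[keep] / probs[keep].sum()
    choice = rng.choice(keep, p=truncated)
    one_hot = np.zeros_like(probs)
    one_hot[choice] = 1.0
    return one_hot

rng = np.random.default_rng(0)
peaked = top_p_sketch([5.0, 0.0, 0.0], p=0.5, rng=rng)
flat = top_p_sketch([0.0, 0.0, 0.0, 0.0], p=0.9, rng=rng)
print(peaked, flat)
```

In the peaked example the top token alone already exceeds p, so it is the only candidate and is always selected.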

Discounted Returns

from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)

Computes cumulative discounted returns for a single episode:

G_t = reward_t + gamma * G_{t+1}

Iterates backward through the reward array. Returns array of same shape.

Generalized Advantage Estimation (GAE)

from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)

Parameter	Type	Default	Description
`rewards`	`ndarray`	required	Step rewards, shape `(T,)`.
`values`	`ndarray`	required	Value estimates including the bootstrap value for the state after the last step (`values[T]`), shape `(T+1,)`.
`gamma`	`float`	`0.99`	Discount factor.
`lambda_`	`float`	`0.95`	GAE lambda parameter.

Algorithm:

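T = len(rewards)
advantages = np.zeros(T)
gae_t = 0.0  # running advantage; zero beyond the end of the episode
for t in reversed(range(T)):
    delta = rewards[t] + gamma * values[t+1] - values[t]  # TD residual
    gae_t = delta + gamma * lambda_ * gae_t
    advantages[t] = gae_t
returns = advantages + values[:T]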

GAE provides a bias-variance tradeoff controlled by lambda_: lambda_=0 gives high-bias TD(0), lambda_=1 gives high-variance Monte Carlo.

Model I/O

File: io.py

JSON Serialization

model.Save("model.json")

Saves the model to a human-readable JSON file. All NumPy arrays are converted to Python lists via a custom encoder. Stores:

version: 3
layers: All layer dictionaries
optimizer: Optimizer type string
learning_rate, l2_lambda, momentum: Hyperparameters
t: Global timestep

Note: JSON does not preserve NumPy array types; arrays are stored as nested lists.

Pickle Serialization

model.Save("model.pkl")

Saves the model to a binary pickle file. Preserves NumPy arrays exactly. More compact and faster than JSON. Detected automatically by .pkl extension.

Loading Models

model = NeuralNet()  # Create a new instance
model.Load("model.json")  # or "model.pkl"

The Load method:

Detects file format by extension (.pkl for pickle, otherwise JSON).
Loads the payload.
Reconstructs layers, converting list data back to np.float64 arrays for all weight-related keys (weights, bias, mask, gamma, beta, running_mean, running_var).
Restores hyperparameters (learning_rate, optimizer_type, l2_lambda, momentum, t).
Resets optimizer state (opt_state = []) -- you may want to call reset_optimizer_state() after loading.

Important: The loaded model does not restore optimizer momentum buffers. If you need to resume training exactly, you would need to save and load opt_state separately (not currently supported).
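The JSON round trip described in this section can be illustrated with a small custom encoder. `NumpyEncoder` and the payload layout below are hypothetical, not the library's actual encoder or file schema; the sketch only shows arrays being written as nested lists and converted back to np.float64 on load.

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Hypothetical encoder: turns NumPy arrays and scalars into plain Python values."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)

layer = {"type": "dense", "weights": np.arange(6, dtype=np.float64).reshape(2, 3)}
text = json.dumps({"version": 3, "layers": [layer]}, cls=NumpyEncoder)

# Loading: nested lists come back as Python lists and must be converted explicitly.
restored = json.loads(text)
weights = np.array(restored["layers"][0]["weights"], dtype=np.float64)
print(type(restored["layers"][0]["weights"]), weights.shape)
```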

Known Limitations

No padding in Conv2D: All convolutions use pad=0, so spatial dimensions shrink by k-1 per layer. The UNet works around this by using k=1 convolutions.
UNet backward not implemented: The UNetDenoiser.backward() method raises NotImplementedError. Use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation.
Flows use evolutionary training: RealNVP uses Evolve (black-box optimization) rather than analytical backprop through the log-determinant Jacobian. This is slower and less precise than gradient-based training.
GAN training can be unstable: As with all GANs, convergence depends heavily on architecture, learning rates, and data. The library provides three loss types but no spectral normalization or other advanced stabilization techniques.
No GPU acceleration: Pure NumPy implementation runs on CPU only. Large models and datasets will be slow.
Stride support is limited: While stride is stored in conv2d layers, the im2col implementation only fully supports stride=1.
No recurrent layers: No LSTM, GRU, or vanilla RNN support. For sequence modeling, use the embedding layer with dense layers or implement custom recurrence.
No automatic differentiation: Gradients are hand-coded for each layer type. Adding new layers requires implementing both forward and backward passes.
Optimizer state not saved: Save()/Load() do not persist optimizer momentum/velocity buffers. Training resumes from scratch in terms of optimizer state.
Single precision not supported: All computations use np.float64. This provides numerical stability but uses more memory than float32.

Version History

v2.0.0

Major update including:

New layer types: LayerNorm, Embedding, GlobalAvgPool2D, Upsample2D, Sparse
Learning rate schedulers: Step decay, exponential decay, cosine annealing, warmup+cosine, plateau
Reinforcement learning: REINFORCE, PPO, Actor-Critic, Evolutionary Strategy
Gradient clipping: L2 norm clipping across all layers
Layer freezing/unfreezing: Fine-grained control over which layers train
NaN/Inf detection: check_nan_inf() for debugging numerical issues
Improved BatchNorm: Full 2D and 4D support with proper running statistics
New loss functions: Cosine similarity, triplet margin, NT-Xent (SimCLR), focal loss, BCE with logits, Wasserstein loss
Generative AI framework: VAE, GAN, Diffusion Model, Autoregressive Model, RealNVP, Energy-Based Model, UNet Denoiser
Sampling utilities: Reparameterization, Langevin dynamics, Gumbel-Softmax, top-p sampling, GAE
Perceptual loss utilities: Placeholder functions for VGG-based perceptual loss
Model copying and state reset: copy() and reset_optimizer_state()
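
As an illustration of the gradient clipping item above, the sketch below rescales a list of gradient arrays so that their combined L2 norm does not exceed `max_norm`. It assumes a single global norm is computed over all layers. The helper name `clip_by_global_norm` is hypothetical, and this is not the library's `clip_gradients` code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # The scale never exceeds 1, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.full((2, 2), 3.0), np.full((4,), 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Using one shared scale factor preserves the direction of the overall update, which per-layer clipping would not.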

Complete API Reference

NeuralNet Methods

Method	Description
`__init__(learning_rate, optimizer, l2_lambda, momentum)`	Constructor
`summary()`	Print architecture summary
`add_dense(n_in, n_out, activation, init_method, use_bias)`	Add fully connected layer
`add_sparse(n_in, n_out, connectivity, activation, init_method)`	Add sparse connected layer
`add_conv2d(in_ch, out_ch, k, activation, init_method, stride)`	Add 2D convolution
`add_flatten()`	Add flatten layer
`add_maxpool2d(pool_size)`	Add max pooling
`add_avgpool2d(pool_size)`	Add average pooling
`add_global_avgpool2d()`	Add global average pooling
`add_upsample2d(scale_factor)`	Add 2D upsampling (by `scale_factor`)
`add_batchnorm(num_features, epsilon, momentum)`	Add batch normalization
`add_layernorm(normalized_shape, epsilon)`	Add layer normalization
`add_dropout(rate)`	Add dropout regularization
`add_embedding(vocab_size, embed_dim, init_method)`	Add embedding lookup layer
`Forward(inputs, training, dropout_rate)`	Forward pass
`predict(inputs)`	Alias for Forward
`train()`	Set training mode
`eval()`	Set evaluation mode
`set_lr(lr)`	Set learning rate
`get_lr()`	Get learning rate
`freeze(layer_idx)`	Freeze layer(s)
`unfreeze(layer_idx)`	Unfreeze layer(s)
`clip_gradients(max_norm)`	Clip gradient norms
`get_weights()`	Copy all weights
`set_weights(weights)`	Restore weights
`copy()`	Deep copy network
`reset_optimizer_state()`	Clear optimizer buffers
`check_nan_inf()`	Check for NaN/Inf
`Backward(targets, output_delta)`	Backpropagation
`update()`	Apply parameter updates
`TrainBatch(xs, ys, loss_function, **kwargs)`	Train one batch
`Train(X, Y, epochs, batch_size, X_val, Y_val, loss_function, verbose, scheduler, **kwargs)`	Full training loop
`ComputeLoss(out, tgt, function, reduction, **kwargs)`	Compute loss
`compute_accuracy(pred, tgt)`	Compute classification accuracy
`compute_precision_recall_f1(pred, tgt)`	Compute precision, recall, F1
`Evolve(inputs, score_fn, noise, tries, sigma)`	Evolutionary strategy
`Reinforce(states, actions, returns, action_type, std, normalize_returns)`	Policy gradient
`PPO(states, actions, old_log_probs, advantages, action_type, epsilon, std, value_targets, value_coeff, entropy_coeff)`	Proximal Policy Optimization
`ActorCritic(states, actions, returns, values, action_type, std)`	Actor-Critic
`Save(file)`	Save model to file
`Load(file)`	Load model from file

Generative Classes

Class	Module	Description
`VAE`	`generative.vae`	Variational Autoencoder
`GAN`	`generative.gan`	Generative Adversarial Network
`DiffusionModel`	`generative.diffusion`	DDPM diffusion model
`AutoregressiveModel`	`generative.autoregressive`	MADE-style autoregressive model
`RealNVP`	`generative.flows`	RealNVP normalizing flow
`EnergyBasedModel`	`generative.ebm`	Energy-based model
`UNetDenoiser`	`generative.unet`	UNet for diffusion denoising
`LRScheduler`	`train`	Learning rate scheduler
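
To make the RealNVP entry above more concrete, here is a small NumPy sketch of one affine coupling layer, the building block of a RealNVP flow. The first half of the input passes through unchanged and parameterizes a scale and shift for the second half, so the Jacobian is triangular and its log-determinant is just the sum of the log-scales. The weight matrices `W_s` and `W_t` are hypothetical stand-ins for the scale and shift networks, and none of this reflects the internals of the library's `RealNVP` class.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                 # data dimension (even, so it splits in half)
half = d // 2

# Hypothetical tiny "networks": one linear map each for log-scale and shift.
W_s = rng.normal(scale=0.1, size=(half, half))
W_t = rng.normal(scale=0.1, size=(half, half))

def coupling_forward(x):
    x1, x2 = x[:, :half], x[:, half:]
    s = np.tanh(x1 @ W_s)        # log-scale, bounded for stability
    t = x1 @ W_t                 # shift
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=1)      # log|det J| of the triangular Jacobian
    return np.concatenate([x1, y2], axis=1), log_det

def coupling_inverse(y):
    y1, y2 = y[:, :half], y[:, half:]
    s = np.tanh(y1 @ W_s)        # y1 == x1, so s and t can be recomputed
    t = y1 @ W_t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=1)

x = rng.normal(size=(5, d))
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(np.allclose(x, x_rec), log_det.shape)
```

The per-sample `log_det` is the term that a flow negative log-likelihood adds to the base density's log-probability.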

Generative Loss Functions

Function	Module	Description
`kl_divergence_gaussian(mu, logvar, reduction)`	`generative_loss`	KL(q(z|x) || N(0, I))
`adversarial_loss_discriminator(real_logits, fake_logits, loss_type)`	`generative_loss`	Discriminator loss (BCE/BCE_logits/Wasserstein)
`adversarial_loss_generator(fake_logits, loss_type)`	`generative_loss`	Generator loss
`diffusion_loss(pred_noise, true_noise, reduction)`	`generative_loss`	MSE noise prediction
`nll_loss(log_px, log_det_jacobian, reduction)`	`generative_loss`	Flow negative log-likelihood
`energy_loss(data_energy, sample_energy, margin)`	`generative_loss`	EBM contrastive loss
`perceptual_loss(x, y, feature_extractor)`	`generative_loss`	Perceptual loss (falls back to MSE)
`vgg_loss(x, y)`	`generative_loss`	Placeholder for VGG perceptual loss
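
For reference, the closed-form KL term listed above for a diagonal Gaussian q(z|x) = N(mu, diag(exp(logvar))) against N(0, I) is -0.5 * sum(1 + logvar - mu^2 - exp(logvar)). The sketch below implements that formula in NumPy. The function name `kl_gaussian` and the accepted `reduction` values ("mean", "sum", "none") are assumptions, not the library's documented behavior.

```python
import numpy as np

def kl_gaussian(mu, logvar, reduction="mean"):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over the latent dimensions."""
    per_sample = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    if reduction == "mean":
        return per_sample.mean()
    if reduction == "sum":
        return per_sample.sum()
    return per_sample            # "none": one value per sample

mu = np.array([[0.0, 0.0], [1.0, -1.0]])
logvar = np.zeros((2, 2))        # unit variance in every dimension
print(kl_gaussian(mu, logvar, reduction="none"))
```

With unit variance the KL term reduces to half the squared norm of `mu`, so the first sample contributes 0 and the second contributes 1.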

Sampling Functions

Function	Module	Description
`reparameterize(mu, logvar)`	`sampling`	VAE reparameterization trick
`langevin_dynamics(energy_fn, x_init, n_steps, step_size, noise_scale)`	`sampling`	MCMC sampling for EBMs
`gaussian_sample(mean, std, shape)`	`sampling`	Gaussian sampling
`uniform_sample(low, high, shape)`	`sampling`	Uniform sampling
`gumbel_softmax_sample(logits, temperature, hard)`	`sampling`	Differentiable categorical sampling
`random_mask(shape, ratio)`	`sampling`	Random boolean mask
`top_p_sampling(logits, p, temperature)`	`sampling`	Nucleus (top-p) sampling
`compute_returns(rewards, gamma)`	`sampling`	Discounted returns
`gae(rewards, values, gamma, lambda_)`	`sampling`	Generalized Advantage Estimation
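
The last two entries above can be sketched in a few lines of NumPy. The version below assumes a single episode that terminates after the final reward (no `done` flags) and a bootstrap value of 0 after the last step. The names `discounted_returns` and `gae_advantages` are hypothetical and are not the library's `compute_returns` or `gae` functions.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """A_t = delta_t + gamma * lam * A_{t+1}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = np.zeros(len(rewards))
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5])
returns = discounted_returns(rewards, gamma=0.5)
adv = gae_advantages(rewards, values, gamma=0.5, lam=1.0)
print(returns, adv)
```

With `lam=1.0` the advantages equal the discounted returns minus the value estimates. Lower values of `lam` trade variance for bias.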

Weight Initialization Functions

Function	Description
`init_weights(n_in, n_out, method)`	Dense/sparse weight initialization
`init_conv_weights(in_ch, out_ch, k, method)`	Conv2D weight initialization
`init_embedding_weights(vocab_size, embed_dim, method)`	Embedding weight initialization
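
As a rough illustration of what such an initializer computes, the sketch below implements two common schemes for a dense layer: He normal (standard deviation sqrt(2 / fan_in)) and Xavier/Glorot uniform (limit sqrt(6 / (fan_in + fan_out))). The method strings "he" and "xavier", the `(n_in, n_out)` weight shape, and the helper name `init_dense` are assumptions. The values the library actually accepts for `method` are not listed in this section.

```python
import numpy as np

def init_dense(n_in, n_out, method="he", rng=None):
    """Return an (n_in, n_out) weight matrix. Hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "he":
        # He normal: keeps activation variance roughly constant under ReLU.
        std = np.sqrt(2.0 / n_in)
        return rng.normal(0.0, std, size=(n_in, n_out))
    if method == "xavier":
        # Glorot uniform: balances forward and backward signal variance.
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))
    raise ValueError(f"unknown method: {method}")

W_he = init_dense(512, 256, "he", np.random.default_rng(0))
W_xavier = init_dense(512, 256, "xavier", np.random.default_rng(0))
print(W_he.std(), np.abs(W_xavier).max())
```

He initialization is usually paired with ReLU-family activations, and Xavier with sigmoid or tanh.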

Utility Functions

Function	Description
`im2col(input_data, filter_h, filter_w, stride, pad)`	Convert images to column matrix for efficient convolution
`time_embedding(t, dim, max_period)`	Sinusoidal time embedding for diffusion models
`activate(name, x)`	Apply activation function
`derivative(name, x)`	Compute activation derivative
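
The sinusoidal time embedding listed above can be sketched as follows. This is one common formulation, with geometrically spaced frequencies and cosine features followed by sine features. The exact ordering and frequency spacing used by the library's `time_embedding` are not stated here, so treat the function below, including its name `sinusoidal_time_embedding`, as an assumption.

```python
import numpy as np

def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map integer or float timesteps of shape (B,) to embeddings of shape (B, dim)."""
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs[None, :]
    emb = np.concatenate([np.cos(args), np.sin(args)], axis=1)
    if dim % 2 == 1:
        # Pad with a zero column so odd dimensions are supported.
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=1)
    return emb

emb = sinusoidal_time_embedding([0, 10, 500], dim=8)
print(emb.shape)  # (3, 8)
```

At `t = 0` the cosine half is all ones and the sine half is all zeros, which makes a quick sanity check.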

Project details

Release history

Version	Release date
2.1.0 (this version)	Jun 29, 2026
2.0.0	Jun 29, 2026
1.1.2	Jun 28, 2026
1.0.1	Jun 25, 2026
1.0.0	Jun 25, 2026

Download files

File	Type	Size	Uploaded
enilnets-2.1.0.tar.gz	Source distribution	95.8 kB	Jun 29, 2026
enilnets-2.1.0-py3-none-any.whl	Built distribution (Python 3)	57.7 kB	Jun 29, 2026

File details

Details for the file enilnets-2.1.0.tar.gz.

File metadata

Download URL: enilnets-2.1.0.tar.gz
Upload date: Jun 29, 2026
Size: 95.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for enilnets-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ccf3348443f28e724a7777ca69ad5bc47e20ce3bf233998a3252d11c80ebf684`
MD5	`9e022ca5618743c173deb82a8f4d9d06`
BLAKE2b-256	`2cb0c57282b1a914cea629dacc67fe61275d3827780df3f5317e47b4cc4d500d`

File details

Details for the file enilnets-2.1.0-py3-none-any.whl.

File metadata

Download URL: enilnets-2.1.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 57.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for enilnets-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2498d0a6065dbe2ddb631ea3217f1f37349c922f1532659ce10b8d4df769c305`
MD5	`7ef685e61115f5b578601b9ed9179401`
BLAKE2b-256	`2617e0120a4073357d46595edceb36c8d189b843fcc2203d32375b8c507bfb4c`

Enilnets Library Documentation

Table of Contents

Quick Start

Discriminative Example

Generative Example (VAE)

Reinforcement Learning (PPO)

Installation & Project Structure

Core Architecture

NeuralNet Class Overview

Internal Data Flow

Training vs Evaluation Mode

Layer Types

Dense Layer

Sparse Layer

Convolutional Layer (Conv2D)

Flatten Layer

Max Pooling 2D

Average Pooling 2D

Global Average Pooling 2D

Upsampling 2D

Batch Normalization

Layer Normalization

Dropout

Embedding Layer

Forward Pass

Input Handling

Layer-by-Layer Computation

im2col for Convolutions

Backward Pass

Automatic Delta Computation

Per-Layer Gradient Propagation

Optimizers

SGD with Momentum

RMSprop

Adagrad

Adam with Bias Correction

L2 Regularization

Loss Functions

Regression Losses

MSE (Mean Squared Error)

MAE (Mean Absolute Error)

Huber Loss

Smooth L1 Loss

Classification Losses

Binary Cross-Entropy

Cross-Entropy / Categorical Cross-Entropy

Focal Loss

Hinge Loss

BCE with Logits (Numerically Stable)

Advanced Losses

Wasserstein Loss

Cosine Similarity Loss

Triplet Loss

NT-Xent (Normalized Temperature-scaled Cross Entropy)

Generative Losses

KL Divergence (for VAE)

Activation Functions

ReLU Family

Sigmoid Family

Advanced Activations

Linear / Identity

Weight Initialization

Dense & Sparse Layer Initializers

Convolutional Layer Initializers

Embedding Layer Initializers

Training Utilities

Train Method

TrainBatch Method

Learning Rate Schedulers

Metrics Computation

Accuracy

Precision, Recall, F1