Skip to main content

A simple neural network library written in Python

Project description

Enilnets Library Documentation

A pure NumPy-based deep learning library with support for dense, convolutional, pooling, batch normalization, dropout, layer normalization, embedding, upsampling, global pooling, and sparse layers. Includes multiple optimizers, loss functions, activation functions, weight initialization methods, learning rate schedulers, reinforcement learning (REINFORCE, PPO, Actor-Critic), and a full generative AI framework.


Table of Contents

  1. Quick Start
  2. Installation & Project Structure
  3. Core Architecture
  4. Layer Types
  5. Forward Pass
  6. Backward Pass
  7. Optimizers
  8. Loss Functions
  9. Activation Functions
  10. Weight Initialization
  11. Training Utilities
  12. Model Utilities
  13. Reinforcement Learning
  14. Generative AI Framework
  15. Sampling Utilities
  16. Model I/O
  17. Known Limitations
  18. Version History

Quick Start

Discriminative Example

Build and train a classifier on flat data:

from Enilnets import NeuralNet, LRScheduler
import numpy as np

model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01)
model.add_dense(784, 256, activation="relu")
model.add_dropout(0.3)
model.add_dense(256, 10, activation="softmax")

X_train = np.random.randn(1000, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 1000)]

# With learning rate scheduler
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=50)
history = model.Train(X_train, Y_train, epochs=50, batch_size=32, scheduler=scheduler)

Generative Example (VAE)

Train a Variational Autoencoder on image-like data:

from Enilnets import VAE
import numpy as np

vae = VAE(input_dim=784, latent_dim=32,
          encoder_hidden=[512, 256], decoder_hidden=[256, 512],
          learning_rate=0.001, optimizer="adam")

X_train = np.random.rand(1000, 784)
history = vae.Train(X_train, epochs=20, batch_size=64)
generated = vae.generate(n_samples=16)

Reinforcement Learning (PPO)

Train a policy network with Proximal Policy Optimization:

from Enilnets import NeuralNet
import numpy as np

policy = NeuralNet(learning_rate=3e-4, optimizer="adam")
policy.add_dense(4, 64, activation="tanh")
policy.add_dense(64, 2, activation="softmax")

# states, actions, old_log_probs, advantages from environment
policy.PPO(states, actions, old_log_probs, advantages, action_type="discrete")

Installation & Project Structure

The library is organized as a Python package with the following module layout:

Enilnets/
|-- __init__.py          # Package entry point: exports NeuralNet, LRScheduler, generative classes
|-- base.py              # NeuralNet class definition + method binding
|-- layers.py            # Layer factory functions (add_dense, add_conv2d, etc.)
|-- forward.py           # Forward pass implementation + im2col + normalization
|-- backward.py          # Backpropagation for all layer types
|-- optimizer.py         # Gradient update rules (SGD, Adam, RMSprop, Adagrad)
|-- loss.py              # Loss function implementations
|-- activations.py       # Activation functions and their derivatives
|-- weight_init.py       # Weight initialization strategies
|-- train.py             # Training loop, metrics, LRScheduler
|-- io.py                # Model save/load (JSON & Pickle)
|-- reinforce.py         # RL algorithms: Evolve, REINFORCE, PPO, ActorCritic
|-- generative/
|   |-- __init__.py      # Exports all generative classes and utilities
|   |-- vae.py           # Variational Autoencoder
|   |-- gan.py           # Generative Adversarial Network
|   |-- diffusion.py     # Denoising Diffusion Probabilistic Model
|   |-- autoregressive.py # MADE-style autoregressive model
|   |-- flows.py         # RealNVP normalizing flow
|   |-- ebm.py           # Energy-Based Model
|   |-- unet.py          # UNet architecture for diffusion
|   |-- sampling.py      # Sampling utilities (reparameterize, Gumbel, etc.)
|   |-- generative_loss.py # Loss functions for generative models

All layer addition methods, forward/backward passes, optimizers, loss functions, training methods, I/O, and RL methods are dynamically bound to the NeuralNet class at import time via monkey-patching in base.py. This allows each submodule to remain focused while the user interacts with a single unified API.


Core Architecture

NeuralNet Class Overview

The NeuralNet class in base.py is the central abstraction. It stores everything needed to define, train, and evaluate a neural network entirely in NumPy.

Attribute Type Description
layers list[dict] Layer definitions with weights, biases, and hyperparameters. Each layer is a dictionary containing its type-specific parameters.
learning_rate float Global learning rate used by all optimizers.
optimizer_type str Optimizer name: "sgd", "rmsprop", "adagrad", "adam".
l2_lambda float L2 regularization coefficient applied to weight gradients.
momentum float Momentum coefficient for SGD optimizer.
outputs list[ndarray] Cached layer outputs during the most recent forward pass. outputs[0] is the input, outputs[i] is the output of layer i-1.
pre_activations list[ndarray] Cached pre-activation values (z = Wx + b) for layers that have activations. Used during backprop for computing derivatives.
batchnorm_cache list BatchNorm statistics cache storing (x, x_norm, mean, var, gamma, epsilon, axes) for each BatchNorm layer during training.
layernorm_cache list LayerNorm statistics cache storing (x, x_norm, mean, var, gamma, epsilon, axes) for each LayerNorm layer.
deltas list[ndarray] Gradient error terms per layer, computed during backpropagation.
opt_state list[dict] Optimizer state (momentum, velocity, squared gradients) for each trainable layer. Lazily initialized on first update() call.
t int Global timestep counter, incremented on every update() call. Used for Adam bias correction.
training bool Training mode flag. Affects BatchNorm, Dropout, and layer behavior.

Internal Data Flow

  1. Layer Definition: The user calls add_* methods which append dictionaries to self.layers. Each dictionary stores weights, biases, and type-specific metadata.
  2. Forward Pass: Forward(inputs) iterates through self.layers, computes each layer's output, and caches results in self.outputs, self.pre_activations, self.batchnorm_cache, and self.layernorm_cache.
  3. Loss Computation: ComputeLoss(output, target) computes the scalar loss value.
  4. Backward Pass: Backward(targets) computes error gradients (self.deltas) by propagating from the output layer back to the input, using cached pre-activations and layer-specific backward functions.
  5. Parameter Update: update() computes weight/bias gradients from self.deltas and self.outputs, applies L2 regularization, and updates parameters using the chosen optimizer.

Training vs Evaluation Mode

  • Training mode (training=True, set via .train()): BatchNorm uses batch statistics and updates running averages. Dropout randomly zeros neurons. All caches are populated.
  • Evaluation mode (training=False, set via .eval()): BatchNorm uses running statistics (no cache). Dropout is disabled (identity pass). Caches are not populated.

Layer Types

Dense Layer

A fully connected (affine) layer: output = activation(W @ input + b).

model.add_dense(n_in, n_out, activation="relu", init_method="xavier_uniform", use_bias=True)
Parameter Type Default Description
n_in int required Number of input features.
n_out int required Number of output features (neurons).
activation str "relu" Activation function name (see Activation Functions).
init_method str "xavier_uniform" Weight initialization strategy (see Weight Initialization).
use_bias bool True Whether to include a bias vector. If False, bias is zeros and not updated.

Stored in layer dict: "type": "dense", "weights" (shape (n_out, n_in)), "bias" (shape (n_out,)), "activation", "use_bias".

Sparse Layer

A dense layer with a fixed random connectivity mask. Only a fraction of weights are non-zero and trainable.

model.add_sparse(n_in, n_out, connectivity=0.5, activation="relu", init_method="xavier_uniform")
Parameter Type Default Description
n_in int required Number of input features.
n_out int required Number of output features.
connectivity float 0.5 Fraction of weights to keep (0 to 1). A mask is generated randomly and fixed for the layer's lifetime.
activation str "relu" Activation function name.
init_method str "xavier_uniform" Weight initialization strategy.

Stored in layer dict: "type": "sparse", "weights", "bias", "mask" (binary matrix, same shape as weights), "activation". During forward and backward passes, the mask is applied to zero out masked weights. During updates, gradients are also masked.

Convolutional Layer (Conv2D)

A 2D convolution with no padding (valid convolution). Uses im2col for efficient matrix multiplication.

model.add_conv2d(in_ch, out_ch, k, activation="relu", init_method="he_normal", stride=1)
Parameter Type Default Description
in_ch int required Number of input channels.
out_ch int required Number of output channels (filters).
k int required Kernel size (square kernel k x k).
activation str "relu" Activation function name.
init_method str "he_normal" Weight initialization strategy.
stride int 1 Stride (stored but currently only stride=1 is fully supported by im2col).

Input shape: (batch, in_ch, H, W)
Output shape: (batch, out_ch, H-k+1, W-k+1)
Stored in layer dict: "type": "conv2d", "weights" (shape (out_ch, in_ch, k, k)), "bias" (shape (out_ch,)), "in_ch", "out_ch", "k", "activation", "stride".

Note: There is no padding support. Each conv2d layer reduces spatial dimensions by k-1 on each side.

Flatten Layer

Reshapes a multi-dimensional tensor into a 2D matrix (batch, -1).

model.add_flatten()

Stored in layer dict: "type": "flatten". No parameters. Used to transition from conv layers to dense layers.

Max Pooling 2D

Downsamples by taking the maximum value in each p x p non-overlapping window.

model.add_maxpool2d(pool_size=2)
Parameter Type Default Description
pool_size int 2 Size of the pooling window.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p) (dimensions are truncated to multiples of p).
Stored in layer dict: "type": "maxpool2d", "p".

Backward pass: Uses a strided view to identify maxima and distributes gradients only to the max positions within each window.

Average Pooling 2D

Downsamples by taking the mean value in each p x p non-overlapping window.

model.add_avgpool2d(pool_size=2)
Parameter Type Default Description
pool_size int 2 Size of the pooling window.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p).
Stored in layer dict: "type": "avgpool2d", "p".

Backward pass: Distributes gradient evenly across all positions in each p x p window.

Global Average Pooling 2D

Computes the mean across spatial dimensions (H, W), reducing (batch, C, H, W) to (batch, C, 1, 1).

model.add_global_avgpool2d()

Stored in layer dict: "type": "globalavgpool2d". No parameters.

Backward pass: Distributes the incoming gradient evenly across all spatial positions.

Upsampling 2D

Nearest-neighbor upsampling by repeating rows and columns.

model.add_upsample2d(scale_factor=2)
Parameter Type Default Description
scale_factor int 2 Factor by which to repeat each spatial dimension.

Input shape: (batch, C, H, W)
Output shape: (batch, C, H*scale, W*scale).
Stored in layer dict: "type": "upsample2d", "scale_factor".

Backward pass: Sums gradients from the repeated positions back to the original positions.

Batch Normalization

Normalizes activations across the batch dimension. Supports 2D (batch, features) and 4D (batch, C, H, W) inputs.

model.add_batchnorm(num_features, epsilon=1e-5, momentum=0.1)
Parameter Type Default Description
num_features int required Number of features/channels to normalize.
epsilon float 1e-5 Small constant for numerical stability.
momentum float 0.1 Momentum for updating running statistics. running_stat = (1-momentum)*running_stat + momentum*batch_stat.

Stored in layer dict: "type": "batchnorm", "num_features", "epsilon", "momentum", "running_mean", "running_var", "gamma" (scale, initialized to 1), "beta" (shift, initialized to 0).

Training: Computes batch mean and variance, normalizes, applies gamma/beta, updates running statistics, and stores a cache for backward.
Evaluation: Uses running mean and variance, no cache stored.
Backward: Computes gradients w.r.t. input, gamma, and beta using the cached statistics.

Layer Normalization

Normalizes across the feature dimension(s) independently for each sample. Supports 2D and 4D inputs.

model.add_layernorm(normalized_shape, epsilon=1e-5)
Parameter Type Default Description
normalized_shape int or tuple required Shape of the features to normalize. For 2D: an int (number of features). For 4D: a tuple like (C, H, W).
epsilon float 1e-5 Small constant for numerical stability.

Stored in layer dict: "type": "layernorm", "normalized_shape", "epsilon", "gamma", "beta".

Unlike BatchNorm, LayerNorm has no running statistics. It always computes mean and variance on the fly. The backward pass computes dx, dgamma, and dbeta.

Dropout

Randomly zeros a fraction of activations during training for regularization.

model.add_dropout(rate=0.5)
Parameter Type Default Description
rate float 0.5 Fraction of neurons to drop (set to 0). Must be in [0, 1).

Stored in layer dict: "type": "dropout", "rate", "mask" (binary mask created during forward pass in training mode).

Training: Each element is kept with probability 1-rate, and surviving elements are scaled by 1/(1-rate) (inverted dropout). The mask is stored for backward.
Evaluation: Identity pass, no masking.
Backward: Multiplies incoming gradient by the stored mask and the same scaling factor.

Embedding Layer

A lookup table that maps integer token indices to dense vectors.

model.add_embedding(vocab_size, embed_dim, init_method="normal")
Parameter Type Default Description
vocab_size int required Number of unique tokens in the vocabulary.
embed_dim int required Dimension of each embedding vector.
init_method str "normal" Initialization strategy.

Stored in layer dict: "type": "embedding", "weights" (shape (vocab_size, embed_dim)), "vocab_size", "embed_dim".

Input: Integer array of shape (batch, seq_len) or (batch,) (1D is reshaped to (batch, 1)).
Output: Embedding vectors of shape (batch, seq_len, embed_dim).
Backward: Sparse gradient -- only the rows corresponding to seen indices are updated. The _last_input key stores the input indices for gradient computation.


Forward Pass

Input Handling

The Forward(self, inputs, training=False, dropout_rate=0.0) method handles input normalization:

  • 1D input (features,) -> reshaped to (1, features) (single sample batch).
  • 3D input (C, H, W) -> reshaped to (1, C, H, W) (single image batch).
  • 2D input (batch, features) and 4D input (batch, C, H, W) are used as-is.

All inputs are cast to np.float64 for numerical stability.

Layer-by-Layer Computation

The forward pass iterates through self.layers in order. For each layer:

  1. Dense/Sparse: z = x @ W.T + b, then x = activation(z). Pre-activation z is cached.
  2. Conv2D: Uses im2col to unfold the input into columns, performs matrix multiplication with flattened kernels, reshapes back, adds bias, then applies activation.
  3. Flatten: Reshapes to (batch, -1).
  4. MaxPool2D: Strided view into p x p blocks, takes max along the block axes.
  5. AvgPool2D: Same strided view, takes mean.
  6. GlobalAvgPool2D: Mean over axes (2, 3) with keepdims=True.
  7. Upsample2D: x.repeat(scale, axis=2).repeat(scale, axis=3).
  8. BatchNorm: Normalizes using batch stats (training) or running stats (eval), then applies gamma * x_norm + beta.
  9. LayerNorm: Normalizes per-sample using feature stats, then applies gamma * x_norm + beta.
  10. Dropout: Random mask with inverted scaling during training; identity during eval.
  11. Embedding: Integer index lookup into the weight matrix.

After each layer, the output is appended to self.outputs, pre-activations to self.pre_activations, and normalization caches to self.batchnorm_cache / self.layernorm_cache.

im2col for Convolutions

The im2col(input_data, filter_h, filter_w, stride=1, pad=0) function converts image batches into column matrices suitable for efficient matrix multiplication:

  1. Pads the input with zeros if pad > 0.
  2. Uses NumPy's as_strided to create a view where each receptive field is a row.
  3. Transposes and reshapes to (N * out_h * out_w, C * filter_h * filter_w).

This allows convolutions to be computed as a single large matrix multiplication: output = col @ W_flat.T, which is significantly faster than nested loops in pure NumPy.


Backward Pass

Automatic Delta Computation

Backward(self, targets=None, output_delta=None) supports two modes:

Mode 1: Automatic (targets provided)

model.Backward(ys)
  • If the last layer uses "softmax" activation, the delta is computed as (out - targets) / batch_size (the combined softmax + cross-entropy gradient simplification).
  • Otherwise, delta = (out - targets) * derivative(activation, pre_activation) / batch_size.

Mode 2: Manual (output_delta provided)

model.Backward(None, output_delta=custom_delta)
  • Used in reinforcement learning and generative models where the output gradient is computed externally.
  • output_delta is reshaped to (batch, features) if 1D.

Per-Layer Gradient Propagation

After computing the output delta, the backward pass iterates from the second-to-last layer back to the first:

For each layer l (current) and l+1 (next), it computes the error err flowing into layer l based on the next layer's type:

Next Layer Type Error Computation
dense / sparse err = next_delta @ W_next
flatten err = next_delta.reshape(outputs[l+1].shape)
conv2d err = conv2d_backward_input(next_delta, W_next, outputs[l+1].shape) -- transposed convolution via im2col
maxpool2d err = maxpool2d_backward(next_delta, outputs[l+1], pool_size) -- routes gradient to max positions
avgpool2d err = avgpool2d_backward(next_delta, outputs[l+1], pool_size) -- distributes gradient evenly
globalavgpool2d err = globalavgpool2d_backward(next_delta, outputs[l+1]) -- distributes over spatial dims
upsample2d err = upsample2d_backward(next_delta, outputs[l+1], scale) -- sums repeated positions
dropout err = next_delta * mask / (1 - rate) (or identity if mask is None)
batchnorm err, dgamma, dbeta = batchnorm_backward(next_delta, cache) -- stores d_gamma, d_beta on the layer dict
layernorm err, dgamma, dbeta = layernorm_backward(next_delta, cache) -- stores d_gamma, d_beta on the layer dict
embedding err = next_delta (gradient flows back to previous layer)

Then, if the current layer has an activation (dense, sparse, conv2d), the error is multiplied by the activation derivative evaluated at the pre-activation: self.deltas[l] = err * derivative(activation, pre_activation).

Important: For BatchNorm and LayerNorm, the backward pass requires that Forward(training=True) was called first to populate the caches. If a cache is None, a ValueError is raised.


Optimizers

All optimizers are implemented in optimizer.py and applied in the update() method. The optimizer is selected via the optimizer parameter in NeuralNet.__init__().

SGD with Momentum

model = NeuralNet(optimizer="sgd", learning_rate=0.01, momentum=0.9)

Update rule for weights (same structure for biases, gamma, beta):

velocity = momentum * velocity - learning_rate * gradient
weight += velocity

Momentum accumulates velocity in the direction of persistent gradients, helping escape shallow local minima and accelerating convergence in consistent gradient directions.

RMSprop

model = NeuralNet(optimizer="rmsprop", learning_rate=0.001)

Update rule:

v = 0.999 * v + 0.001 * gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)

RMSprop adapts the learning rate per parameter by dividing by a running average of squared gradients. This helps with non-stationary objectives and sparse gradients.

Adagrad

model = NeuralNet(optimizer="adagrad", learning_rate=0.01)

Update rule:

v += gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)

Adagrad accumulates all historical squared gradients. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Note that the learning rate naturally decays over time.

Adam with Bias Correction

model = NeuralNet(optimizer="adam", learning_rate=0.001)

Adam combines momentum (first moment) and RMSprop (second moment) with bias correction:

m = 0.9 * m + 0.1 * gradient        # first moment
v = 0.999 * v + 0.001 * gradient^2  # second moment
m_hat = m / (1 - 0.9^t)              # bias-corrected first moment
v_hat = v / (1 - 0.999^t)            # bias-corrected second moment
weight -= learning_rate * m_hat / (sqrt(v_hat) + 1e-8)

Where t is the global timestep incremented on each update() call. Bias correction prevents the initial estimates from being biased toward zero.

L2 Regularization

L2 regularization (weight decay) is applied to all weight gradients before the optimizer step:

grad_w = grad_w + l2_lambda * weights * mask

The mask term ensures that sparse layers only regularize their active connections. L2 regularization penalizes large weights, encouraging simpler models and reducing overfitting.

Optimizer state initialization: On the first call to update(), self.opt_state is lazily initialized with zero-initialized momentum/velocity buffers matching the shape of each layer's trainable parameters.


Loss Functions

All loss functions are implemented in loss.py via ComputeLoss(self, output, target, function="mse", reduction="mean", **kwargs).

Regression Losses

MSE (Mean Squared Error)

model.ComputeLoss(output, target, function="mse", reduction="mean")

loss = (output - target)^2

Standard regression loss. Penalizes large errors quadratically.

MAE (Mean Absolute Error)

model.ComputeLoss(output, target, function="mae", reduction="mean")

loss = |output - target|

More robust to outliers than MSE since errors are not squared.

Huber Loss

model.ComputeLoss(output, target, function="huber", delta=1.0, reduction="mean")
loss = 0.5 * diff^2              if diff < delta
loss = delta * (diff - 0.5*delta)  otherwise

Combines MSE for small errors and MAE for large errors. delta controls the transition point.

Smooth L1 Loss

model.ComputeLoss(output, target, function="smooth_l1", reduction="mean")

Huber loss with delta=1.0 hardcoded. Commonly used in object detection.

Classification Losses

Binary Cross-Entropy

model.ComputeLoss(output, target, function="binary_cross_entropy", reduction="mean")

loss = -(target * log(output) + (1 - target) * log(1 - output))

For binary classification with sigmoid output. Output is clipped to [1e-12, 1-1e-12] for numerical stability.

Cross-Entropy / Categorical Cross-Entropy

model.ComputeLoss(output, target, function="cross_entropy", reduction="mean")

loss = -target * log(output)

For multi-class classification with softmax output. Output clipped to [1e-12, 1.0]. Supports "mean", "sum", and "none" (per-element) reduction.

Note: When reduction="mean", the loss is divided by the batch size. When "sum", it is summed. When "none", the raw per-element loss array is returned.

Focal Loss

model.ComputeLoss(output, target, function="focal", alpha=0.25, gamma=2.0, reduction="mean")

Down-weights easy examples and focuses on hard examples:

pt = output * target + (1 - output) * (1 - target)
loss = -(alpha * target * (1-pt)^gamma * log(output) + (1-alpha) * (1-target) * pt^gamma * log(1-output))

Useful for imbalanced datasets. alpha balances positive/negative examples, gamma focuses on hard examples.

Hinge Loss

model.ComputeLoss(output, target, function="hinge", reduction="mean")

loss = max(0, 1 - target * output)

For SVM-style classification. Target should be +1 or -1.

BCE with Logits (Numerically Stable)

model.ComputeLoss(output, target, function="bce_logits", reduction="mean")

loss = max(output, 0) - output * target + log(1 + exp(-|output|))

Computes binary cross-entropy directly from logits (pre-sigmoid values) without explicitly computing sigmoid, avoiding numerical issues for extreme values.

Advanced Losses

Wasserstein Loss

model.ComputeLoss(output, target, function="wasserstein", reduction="mean")

loss = -output * target

Used in Wasserstein GANs. Target is +1 for real and -1 for fake.

Cosine Similarity Loss

model.ComputeLoss(output, target, function="cosine_similarity", reduction="mean")

loss = 1 - cos(output, target)

Measures the cosine distance between vectors. Useful for embedding learning and contrastive tasks.

Triplet Loss

model.ComputeLoss(output, target, function="triplet", margin=1.0, negative=neg_samples, reduction="mean")
d_pos = ||anchor - positive||^2
d_neg = ||anchor - negative||^2
loss = max(0, d_pos - d_neg + margin)

output is the anchor, target is the positive, and negative kwarg provides the negative samples. Used in metric learning (e.g., FaceNet).

NT-Xent (Normalized Temperature-scaled Cross Entropy)

model.ComputeLoss(output, target, function="ntxent", temperature=0.5, reduction="mean")

SimCLR contrastive loss. Computes pairwise cosine similarities, applies temperature scaling, and uses a cross-entropy formulation where positive pairs are on the diagonal.

Generative Losses

KL Divergence (for VAE)

model.ComputeLoss(output, target, function="kl_divergence", mu=mu, logvar=logvar, reduction="mean")

loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar), axis=-1)

KL divergence between the approximate posterior q(z|x) and the prior N(0, I). Used as a regularization term in VAEs.


Activation Functions

All activations and their derivatives are implemented in activations.py.

ReLU Family

Name Forward Derivative Notes
relu max(0, x) 1 if x > 0 else 0 Most common default.
leakyrelu x if x > 0 else 0.01*x 1 if x > 0 else 0.01 Small negative slope prevents dying ReLU.
elu x if x > 0 else exp(x)-1 1 if x > 0 else exp(x) Smooth negative region, mean closer to zero.
selu scale * (x if x>0 else alpha*(exp(x)-1)) scale * (1 if x>0 else alpha*exp(x)) Self-normalizing; alpha=1.6733, scale=1.0507.

Sigmoid Family

Name Forward Derivative Notes
sigmoid 1 / (1 + exp(-x)) sigmoid(x) * (1 - sigmoid(x)) Clipped to [-500, 500] to prevent overflow.
tanh tanh(x) 1 - tanh(x)^2 Zero-centered output.
softmax exp(x - max(x)) / sum(exp(x - max(x))) Handled specially in backward Numerically stable via max subtraction.
softplus log(1 + exp(x)) sigmoid(x) Smooth approximation of ReLU.

Advanced Activations

Name Forward Derivative Notes
gelu 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))) CDF + x*PDF Used in Transformer architectures.
swish x * sigmoid(x) sigmoid(x) + x*sigmoid(x)*(1-sigmoid(x)) Self-gated, smooth.
mish x * tanh(log(1 + exp(x))) tanh(sp) + x*sigmoid(x)*(1-tanh(sp)^2) sp = softplus(x). Smooth and self-regularizing.

Linear / Identity

Name Forward Derivative Notes
linear x 1 No transformation. Used for output layers before softmax/sigmoid.

Weight Initialization

All initializers are in weight_init.py and automatically called by layer addition methods.

Dense & Sparse Layer Initializers

For a layer with n_in inputs and n_out outputs:

Method Formula Best For
xavier_uniform U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))) Tanh/sigmoid activations
xavier_normal N(0, sqrt(2/(n_in+n_out))) Tanh/sigmoid activations
he_uniform U(-sqrt(6/n_in), sqrt(6/n_in)) ReLU activations
he_normal N(0, sqrt(2/n_in)) ReLU activations (default for conv)
normal N(0, 0.1) General purpose, small initial values
orthogonal SVD-based orthogonal matrix RNNs, preserving gradient norms
zeros All zeros Biases, or when you want to start from zero
ones All ones Special cases

Convolutional Layer Initializers

Same methods as dense, but fan-in is computed as in_ch * k * k (number of input connections per filter):

Method Formula
xavier_uniform U(-sqrt(6/(in_ch*k*k + out_ch)), ...)
xavier_normal N(0, sqrt(2/(in_ch*k*k + out_ch)))
he_uniform U(-sqrt(6/(in_ch*k*k)), ...)
he_normal N(0, sqrt(2/(in_ch*k*k))) (default)
normal N(0, 0.1)
orthogonal SVD on (out_ch, in_ch*k*k), reshaped to (out_ch, in_ch, k, k)
zeros All zeros
ones All ones

Embedding Layer Initializers

Method Formula
normal N(0, 0.1) (default)
xavier_uniform U(-sqrt(6/(vocab_size + embed_dim)), ...)
xavier_normal N(0, sqrt(2/(vocab_size + embed_dim)))
zeros All zeros

Note: All weights and biases are stored as np.float64 for maximum numerical precision.


Training Utilities

Train Method

The Train method in train.py provides a complete training loop with validation support, metric tracking, and learning rate scheduling.

history = model.Train(
    X_train, Y_train,
    epochs=10, batch_size=32,
    X_val=None, Y_val=None,
    loss_function=None,
    verbose=True,
    scheduler=None,
    **loss_kwargs
)
Parameter Type Default Description
X_train ndarray required Training inputs.
Y_train ndarray required Training targets.
epochs int 10 Number of training epochs.
batch_size int 32 Batch size for mini-batch gradient descent.
X_val ndarray or None None Validation inputs. If provided, validation metrics are computed each epoch.
Y_val ndarray or None None Validation targets.
loss_function str or None None Loss function name. If None, auto-detects based on last layer activation ("cross_entropy" for softmax, "mse" otherwise).
verbose bool True Whether to print progress.
scheduler LRScheduler or None None Learning rate scheduler instance.
**loss_kwargs Additional arguments passed to ComputeLoss (e.g., delta for Huber loss).

Returns: history dict with keys "loss", "accuracy", "val_loss", "val_accuracy", "lr".

Training loop details:

  1. If a scheduler is provided, scheduler.step(epoch) is called at the start of each epoch to update the learning rate.
  2. Training data is shuffled each epoch.
  3. Batches are processed sequentially. For each batch:
    • TrainBatch is called (forward, loss, backward, update).
    • Loss and accuracy are weighted by actual batch size (handles last incomplete batch).
  4. Epoch averages are computed and stored in history.
  5. If validation data is provided, the model runs in eval mode for validation (though Forward is called with training=False, the caches are not used for updates).

TrainBatch Method

loss, out = model.TrainBatch(xs, ys, loss_function=None, **loss_kwargs)

A single training step that:

  1. Calls Forward(xs, training=True)
  2. Auto-detects loss function if not provided
  3. Computes loss via ComputeLoss
  4. Calls Backward(ys)
  5. Calls update() to apply gradients

Returns the scalar loss and the network output.

Learning Rate Schedulers

The LRScheduler class in train.py supports multiple decay strategies:

from Enilnets import LRScheduler

# Step decay: halve LR every 10 epochs
scheduler = LRScheduler(initial_lr=0.001, mode="step", drop=0.5, epochs_drop=10)

# Exponential decay: multiply by 0.95 each epoch
scheduler = LRScheduler(initial_lr=0.001, mode="exponential", decay=0.95)

# Cosine annealing
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=100)

# Warmup + cosine
scheduler = LRScheduler(initial_lr=0.001, mode="warmup_cosine", max_epochs=100, warmup_epochs=5)
Mode Formula Parameters
"step" lr * drop^(epoch // epochs_drop) drop=0.5, epochs_drop=10
"exponential" lr * decay^epoch decay=0.95
"cosine" lr * 0.5 * (1 + cos(pi * epoch / max_epochs)) max_epochs=100
"warmup_cosine" Linear warmup then cosine max_epochs=100, warmup_epochs=5
"plateau" Returns initial_lr (placeholder) None

The scheduler's step(epoch) method returns the learning rate for that epoch. The Train method calls self.set_lr(lr) before each epoch.

Metrics Computation

Accuracy

acc = model.compute_accuracy(predictions, targets)
  • Multi-class (predictions.shape[-1] > 1): Compares argmax of predictions vs targets.
  • Binary (predictions.shape[-1] == 1): Thresholds at 0.5.

Returns mean accuracy as a float.

Precision, Recall, F1

metrics = model.compute_precision_recall_f1(predictions, targets)
# Returns: {"precision": float, "recall": float, "f1": float}

Binary classification metrics. Uses the same multi-class/binary detection as accuracy. Computed with 1e-12 epsilon to prevent division by zero.


Model Utilities

Training / Evaluation Mode

model.train()   # Set training=True, returns self (for chaining)
model.eval()    # Set training=False, returns self

These affect BatchNorm (batch vs running stats) and Dropout (active vs identity).

Learning Rate Control

model.set_lr(0.0001)   # Set learning rate
lr = model.get_lr()    # Get current learning rate

Gradient Clipping

model.clip_gradients(max_norm=1.0)

Clips the L2 norm of all deltas across all layers:

  1. Computes total_norm = sqrt(sum(||delta||^2)) over all non-None deltas.
  2. If total_norm > max_norm, scales all deltas by max_norm / total_norm.

Call this after Backward() and before update() to prevent exploding gradients.

Layer Freezing / Unfreezing

model.freeze()           # Freeze all layers
model.freeze(2)          # Freeze layer index 2 only
model.unfreeze()         # Unfreeze all layers
model.unfreeze(2)        # Unfreeze layer index 2 only

Frozen layers are skipped during update(). The _frozen flag is checked in the optimizer loop. This is useful for transfer learning and fine-tuning.

Weight Get / Set

weights = model.get_weights()   # Returns list of dicts, one per layer
model.set_weights(weights)      # Restores weights from list of dicts

get_weights() copies "weights", "bias", "gamma", "beta", and "mask" for each layer. set_weights() restores them. Useful for checkpointing, model averaging, and transfer.

Model Copying

net_copy = model.copy()

Creates a deep copy of the network including layers and optimizer state. The new network has the same architecture, weights, and optimizer buffers but is an independent object.

Optimizer State Reset

model.reset_optimizer_state()

Clears all optimizer momentum/velocity buffers and resets the timestep t to 0. Useful when starting training on a new dataset or after significant hyperparameter changes.

NaN / Inf Detection

issues = model.check_nan_inf()
# Returns list of strings like ["Layer 2 weights has NaN/Inf", "Delta 5 has NaN/Inf"]

Checks all weights, biases, gamma, beta, and deltas for non-finite values. Returns an empty list if everything is clean. Call this periodically during training to catch numerical instability early.

Model Summary

model.summary()

Prints a formatted table showing:

  • Optimizer, learning rate, L2 lambda
  • Per-layer information: type, input/output shapes, parameter counts
  • Total parameter count

Example output:

Model Summary
======================================================================
Optimizer: ADAM | LR: 0.001 | L2: 0.01
======================================================================
Layer 0: DENSE        Input:    784 Output:    256 Params: 200960
Layer 1: DROPOUT
Layer 2: DENSE        Input:    256 Output:     10 Params: 2570
Total Parameters: 203530
======================================================================

Reinforcement Learning

All RL methods are in reinforce.py and bound to NeuralNet.

Evolutionary Strategy (Evolve)

best_score = model.Evolve(inputs, score_fn, noise=0.05, tries=10, sigma=1.0)

A black-box optimization method that perturbs network weights with Gaussian noise and keeps the best variant.

Parameter Type Default Description
inputs ndarray required Input data to evaluate the network on.
score_fn callable required Function that takes network output and returns a scalar score (higher is better).
noise float 0.05 Standard deviation scale for weight perturbations.
tries int 10 Number of candidate networks to try.
sigma float 1.0 Additional scaling factor for noise.

Algorithm:

  1. Evaluate the current network on inputs to get a baseline score.
  2. For each try, create a deep copy of the network, add Gaussian noise to all weights and biases (respecting sparse masks), evaluate the candidate.
  3. If the candidate scores higher, keep it as the new best.
  4. Restore the best network.

Returns the best score achieved.

REINFORCE (Policy Gradient)

mean_return = model.Reinforce(
    states, actions, returns,
    action_type="discrete", std=1.0, normalize_returns=True
)

Monte-Carlo policy gradient method.

Parameter Type Default Description
states ndarray required Observed states, shape (N, features).
actions ndarray required Discrete: (N,) integer indices. Continuous: (N, action_dim).
returns ndarray required Discounted returns for each state-action pair, shape (N,) or (N, 1).
action_type str "discrete" "discrete" (categorical) or "continuous" (Gaussian).
std float 1.0 Standard deviation for continuous Gaussian policy.
normalize_returns bool True Whether to z-score normalize returns before computing gradients.

Discrete actions: Network output is treated as action probabilities. The gradient is (out - one_hot(actions)) * returns / batch_size.
Continuous actions: Network output is treated as action means. The gradient is -(actions - means) / std^2 * returns / batch_size.

Returns the mean of the raw (un-normalized) returns.

Proximal Policy Optimization (PPO)

policy_loss = model.PPO(
    states, actions, old_log_probs, advantages,
    action_type="discrete", epsilon=0.2, std=1.0,
    value_targets=None, value_coeff=0.5, entropy_coeff=0.01
)

PPO is a policy gradient method that clips the objective to prevent overly large policy updates.

Parameter Type Default Description
states ndarray required Observed states, shape (N, features).
actions ndarray required Discrete: (N,) integers. Continuous: (N, action_dim).
old_log_probs ndarray required Log probabilities under the old policy, shape (N, 1).
advantages ndarray required Advantage estimates, shape (N, 1).
action_type str "discrete" "discrete" or "continuous".
epsilon float 0.2 Clipping parameter.
std float 1.0 Fixed std for continuous Gaussian policy.
value_targets ndarray or None None Target values for value head (not yet implemented in gradient).
value_coeff float 0.5 Coefficient for value loss (reserved).
entropy_coeff float 0.01 Coefficient for entropy bonus (encourages exploration).

Discrete PPO:

  1. Computes action probabilities from network output.
  2. Computes log probabilities for the taken actions.
  3. Computes probability ratio ratio = exp(new_log_prob - old_log_prob).
  4. Computes clipped surrogate objective: min(ratio * advantage, clip(ratio, 1-eps, 1+eps) * advantage).
  5. Approximates policy gradient: for each sample, if not clipped, gradient flows through the taken action proportional to -advantage / prob.
  6. Adds entropy gradient: (1 + log_probs) * entropy_coeff / batch_size.

Continuous PPO: Uses Gaussian log probabilities and computes gradient as -(actions - means) / std^2 * advantages / batch_size.

Returns the mean policy loss (negative of the clipped objective).

Actor-Critic

value_loss = model.ActorCritic(
    states, actions, returns, values,
    action_type="discrete", std=1.0
)

Combines policy gradient with a value function baseline.

Parameter Type Default Description
states ndarray required Observed states.
actions ndarray required Taken actions.
returns ndarray required Discounted returns, shape (N, 1).
values ndarray required Predicted values from value network, shape (N, 1).
action_type str "discrete" "discrete" or "continuous".
std float 1.0 Std for continuous actions.

Algorithm:

  1. Computes advantages: advantages = returns - values.
  2. Uses the same policy gradient as REINFORCE but weighted by advantages instead of raw returns.
  3. Returns the mean squared advantage (a proxy for value function error).

Note: This implementation uses a single network. In practice, you may want separate actor and critic networks or a network with two output heads.

RL Utility Functions

compute_returns

from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)

Computes discounted returns for a single episode:

G_t = reward_t + gamma * G_{t+1}

Iterates backwards through the reward array. Returns an array of the same shape.

gae (Generalized Advantage Estimation)

from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
Parameter Type Default Description
rewards ndarray required Step rewards, shape (T,).
values ndarray required Value estimates including bootstrap, shape (T+1,).
gamma float 0.99 Discount factor.
lambda_ float 0.95 GAE lambda (0 = high bias, 1 = high variance).

Computes TD-residuals delta_t = reward_t + gamma * V(s_{t+1}) - V(s_t), then accumulates them with exponential decay: A_t = delta_t + gamma * lambda * A_{t+1}. Returns (advantages, returns) where returns = advantages + values[:T].


Generative AI Framework

All generative models are in the generative/ subpackage and imported from the top-level Enilnets package.

Variational Autoencoder (VAE)

File: generative/vae.py

A VAE learns a probabilistic latent representation of data. It consists of an encoder (maps data to latent distribution parameters) and a decoder (maps latent samples back to data).

from Enilnets import VAE

vae = VAE(
    input_dim=784, latent_dim=32,
    encoder_hidden=[512, 256],
    decoder_hidden=[256, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
input_dim int required Dimensionality of input data (flattened).
latent_dim int required Dimensionality of the latent space.
encoder_hidden list[int] [512, 256] Hidden layer sizes for the encoder.
decoder_hidden list[int] [256, 512] Hidden layer sizes for the decoder.
activation str "swish" Activation for hidden layers.
learning_rate float 0.001 Learning rate for both encoder and decoder.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Architecture:

  • Encoder: Dense layers with specified activation, final layer outputs latent_dim * 2 values (mu and logvar) with linear activation.
  • Decoder: Dense layers with specified activation, final layer outputs input_dim values with sigmoid activation (assumes data in [0, 1]).

Methods:

Method Signature Description
encode encode(x) -> (mu, logvar) Maps input to latent distribution parameters.
decode decode(z) -> recon Maps latent samples to reconstructed data.
forward forward(x) -> (recon, mu, logvar, z) Full forward pass through encoder + reparameterization + decoder.
loss loss(x, recon=None, mu=None, logvar=None) -> float Computes reconstruction loss (binary cross-entropy) + KL divergence.
train_step train_step(x) -> float One training step: forward, backward through decoder, backward through encoder, update both networks.
Train Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] Full training loop. Returns list of average losses per epoch.
generate generate(n_samples=1) -> ndarray Samples from N(0, I) in latent space and decodes.
reconstruct reconstruct(x) -> ndarray Encodes and decodes input (reconstruction).
interpolate interpolate(x1, x2, n_steps=10) -> ndarray Linear interpolation in latent space between two inputs.

Training details:

  1. Forward: encode -> reparameterize -> decode.
  2. Decoder backward: computes d_recon = (recon - x) / batch_size, multiplies by sigmoid derivative recon * (1 - recon), backpropagates through decoder, updates decoder weights.
  3. Encoder backward: computes gradient of latent samples w.r.t. decoder input, combines with reparameterization gradients to get d_mu and d_logvar, backpropagates through encoder, updates encoder weights.
  4. Loss = binary cross-entropy reconstruction + KL(q(z|x) || N(0, I)).

Generative Adversarial Network (GAN)

File: generative/gan.py

A GAN trains a generator to produce realistic data and a discriminator to distinguish real from fake.

from Enilnets import GAN

gan = GAN(
    latent_dim=100, data_dim=784,
    generator_hidden=[256, 512],
    discriminator_hidden=[512, 256],
    g_activation="swish", d_activation="leakyrelu",
    loss_type="bce",
    learning_rate=0.0002, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
latent_dim int required Dimensionality of the noise vector.
data_dim int required Dimensionality of generated data.
generator_hidden list[int] [256, 512] Hidden layer sizes for generator.
discriminator_hidden list[int] [512, 256] Hidden layer sizes for discriminator.
g_activation str "swish" Generator hidden activation.
d_activation str "leakyrelu" Discriminator hidden activation.
loss_type str "bce" "bce", "bce_logits", or "wasserstein".
learning_rate float 0.0002 LR for both networks.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Architecture:

  • Generator: Dense layers with g_activation, final layer with tanh activation.
  • Discriminator: Dense layers with d_activation, final layer with sigmoid (for BCE) or linear (for Wasserstein).

Methods:

Method Signature Description
generate generate(n_samples) -> ndarray Samples noise and runs generator forward.
discriminate discriminate(x) -> ndarray Runs discriminator forward.
Train Train(X_train, epochs=10, batch_size=64, d_steps=1, g_steps=1, verbose=True) -> dict Alternates discriminator and generator training.
sample sample(n_samples=16) -> ndarray Alias for generate.

Loss types:

Type Discriminator Target Generator Gradient
"bce" Real=1, Fake=0 -1 / D(fake)
"bce_logits" Real=1, Fake=0 Logits-based stable gradient
"wasserstein" Real=1, Fake=-1 -1 (constant)

Training loop:

  1. For each batch, train discriminator for d_steps iterations on real + fake data.
  2. Train generator for g_steps iterations by backpropagating through the discriminator to get gradients w.r.t. generator input.
  3. Track and report D_loss and G_loss per epoch.

Diffusion Model (DDPM)

File: generative/diffusion.py

Implements Denoising Diffusion Probabilistic Models (DDPM). The model learns to reverse a gradual noising process.

from Enilnets import DiffusionModel

diffusion = DiffusionModel(
    data_shape=(784,), time_steps=1000,
    beta_schedule="linear", beta_start=1e-4, beta_end=0.02,
    denoiser_type="mlp", denoiser_hidden=[512, 512, 512],
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
data_shape tuple required Shape of data. (D,) for flattened, (C, H, W) for images.
time_steps int 1000 Number of diffusion timesteps.
beta_schedule str "linear" "linear" or "cosine" noise schedule.
beta_start float 1e-4 Starting beta value (linear schedule).
beta_end float 0.02 Ending beta value (linear schedule).
denoiser_type str "mlp" "mlp" or "conv" denoiser architecture.
denoiser_hidden list[int] [512, 512, 512] Hidden sizes for MLP denoiser.
learning_rate float 0.001 Learning rate.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Noise schedules:

  • Linear: betas = linspace(beta_start, beta_end, time_steps)
  • Cosine: Uses a cosine-squared schedule with offset s=0.008 for smoother noise addition.

Precomputed constants (computed in __init__):

  • alphas = 1 - betas
  • alphas_cumprod = cumprod(alphas) -- cumulative product of alphas
  • sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod -- for forward diffusion
  • sqrt_recip_alphas, posterior_variance -- for reverse diffusion

Methods:

Method Signature Description
train_step train_step(x_0) -> float One training step: sample timestep t, add noise, predict noise, compute MSE loss, backpropagate.
Train Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] Full training loop.
sample sample(n_samples=16, shape=None, clip=True) -> ndarray Generate samples by iteratively denoising from pure noise.
denoise denoise(x_noisy, t_start, t_end=0) -> ndarray Denoise a partially noised input from timestep t_start down to t_end.

Denoiser architectures:

  • MLP: Concatenates flattened input with sinusoidal time embedding, passes through dense layers.
  • Conv: Stack of conv2d layers. Time embedding is broadcast spatially and added to feature maps.

Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise

Reverse diffusion: At each step, predict noise, compute mean of p(x_{t-1} | x_t), add scaled noise (except at t=0).

Autoregressive Model (MADE)

File: generative/autoregressive.py

A masked autoregressive model that enforces causality -- each output dimension only depends on previous dimensions.

from Enilnets import AutoregressiveModel

ar = AutoregressiveModel(
    data_dim=784, hidden_dims=[512, 512],
    data_shape=(28, 28), activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
data_dim int required Total number of dimensions.
hidden_dims list[int] [512, 512] Hidden layer sizes.
data_shape tuple or None None Original shape for reshaping output (e.g., (28, 28)).
activation str "swish" Hidden activation.
learning_rate float 0.001 Learning rate.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Architecture: Standard MLP with linear output activation.

Causality enforcement: A lower-triangular mask (with zeros on and above diagonal) is applied to the input before feeding it to the network: x_masked[i] = sum_{j < i} mask[i,j] * x[j]. This ensures the i-th output only sees dimensions 0 through i-1.

Methods:

Method Signature Description
forward forward(x, training=True) -> ndarray Causal forward pass.
loss loss(x) -> float MSE between predictions and targets.
train_step train_step(x) -> float One training step with custom backpropagation.
Train Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] Full training loop.
generate generate(n_samples=1, shape=None) -> ndarray Autoregressive sampling: predict dim 0, use it to predict dim 1, etc.
complete complete(partial_x, n_dims=None) -> ndarray Complete a partial sample by autoregressively filling remaining dimensions.

Generation: Starts with zeros, iteratively predicts each dimension and adds small Gaussian noise for diversity.

Normalizing Flows (RealNVP)

File: generative/flows.py

Real-valued Non-Volume Preserving (RealNVP) flow for density estimation and sampling.

from Enilnets import RealNVP

flow = RealNVP(
    data_dim=784, n_coupling=4, hidden_dim=256,
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
data_dim int required Dimensionality of data.
n_coupling int 4 Number of coupling layers.
hidden_dim int 256 Hidden dimension for s and t networks.
activation str "swish" Activation for coupling networks.
learning_rate float 0.001 Learning rate.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Architecture: Each coupling layer has two networks:

  • s_net (scale): Predicts log-scale factors. Output activation is tanh.
  • t_net (translation): Predicts translation. Output activation is linear.

Both are 3-layer MLPs: data_dim//2 -> hidden_dim -> hidden_dim -> data_dim - data_dim//2.

Coupling transform: For input split into x1 and x2:

y2 = x2 * exp(s(x1)) + t(x1)
output = concat(x1, y2)
log_det += sum(s(x1))

Alternating masks (even/odd splits) ensure all dimensions are transformed.

Methods:

Method Signature Description
forward forward(x) -> (z, log_det) Maps data to latent space. Returns transformed data and log determinant.
inverse inverse(z) -> x Maps latent samples back to data space.
log_prob log_prob(x) -> ndarray Computes log probability: log p(z) + log_det where p(z) is N(0, I).
loss loss(x) -> float Negative mean log probability.
train_step train_step(x) -> float Returns loss (training uses Evolve, see below).
Train Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] Trains each coupling layer sequentially using evolutionary strategy.
sample sample(n_samples=1) -> ndarray Samples from base distribution and applies inverse transform.
interpolate interpolate(x1, x2, n_steps=10) -> ndarray Linear interpolation in latent space.

Training: Uses Evolve (evolutionary strategy) rather than analytical backprop through the log-determinant Jacobian. Each coupling layer is trained while keeping previous layers fixed.

Energy-Based Model (EBM)

File: generative/ebm.py

An energy-based model that assigns low energy to real data and high energy to generated samples.

from Enilnets import EnergyBasedModel

ebm = EnergyBasedModel(
    data_dim=784, hidden_dims=[512, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
Parameter Type Default Description
data_dim int required Dimensionality of data.
hidden_dims list[int] [512, 512] Hidden layer sizes for energy network.
activation str "swish" Hidden activation.
learning_rate float 0.001 Learning rate.
optimizer str "adam" Optimizer type.
l2_lambda float 0.0 L2 regularization.

Architecture: MLP mapping data to a scalar energy value. Final layer has linear activation.

Methods:

Method Signature Description
energy energy(x) -> ndarray Computes scalar energy for input data, shape (batch, 1).
_energy_grad _energy_grad(x) -> (energy, grad) Finite-difference gradient of energy w.r.t. input (eps=1e-4).
train_step train_step(x_data, n_cd_steps=10, step_size=0.1, noise_scale=0.005) -> float One contrastive divergence step.
Train Train(X_train, epochs=10, batch_size=64, n_cd_steps=10, step_size=0.1, noise_scale=0.005, verbose=True) -> list[float] Full training loop.
sample sample(n_samples=1, n_steps=100, step_size=0.1, noise_scale=0.005) -> ndarray Langevin dynamics sampling from random initialization.
score score(x) -> ndarray Returns the energy gradient (score function).

Training (Contrastive Divergence):

  1. Generate negative samples by running Langevin dynamics from random noise for n_cd_steps iterations.
  2. Compute energy on real data and negative samples.
  3. Update network to push down energy on real data (target=1) and push up on negative samples (target=-1).
  4. Loss = mean(energy(data) - energy(negative_samples)).

Sampling: Langevin dynamics iterates x = x - step_size * grad(energy) + noise_scale * N(0, I).

UNet Denoiser

File: generative/unet.py

A UNet architecture for spatial denoising, designed for use with diffusion models.

from Enilnets import UNetDenoiser, time_embedding

unet = UNetDenoiser(
    in_ch=1, base_ch=64, time_emb_dim=128,
    ch_mult=(1, 2, 4)
)
Parameter Type Default Description
in_ch int required Number of input channels.
base_ch int 64 Base number of channels.
time_emb_dim int 128 Dimensionality of time embedding.
ch_mult tuple (1, 2, 4) Channel multipliers for each encoder level.

Architecture:

  • Time embedding MLP: time_emb_dim -> time_emb_dim*4 -> time_emb_dim*4 with swish activation.
  • Encoder path: len(ch_mult) levels. Each level has 2 conv2d layers (k=1) with swish activation. Time embedding is added to features (broadcast spatially if channel dims match).
  • Downsampling: Average pooling by factor 2 between encoder levels.
  • Bottleneck: 2 conv2d layers at the deepest level.
  • Decoder path: Mirrors encoder with skip connections. Upsamples, concatenates with skip, applies 2 conv2d layers.
  • Output: Conv2d (k=1) mapping back to in_ch channels with linear activation.

Methods:

Method Signature Description
forward forward(x, t) -> ndarray Full UNet forward pass with time conditioning.
backward backward(grad_output) Raises NotImplementedError.
get_params get_params() -> list[NeuralNet] Returns all sub-networks for external optimization.

Important: The UNet uses k=1 convolutions to avoid spatial dimension changes (since the library has no padding support). The backward method is not implemented; use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation for the UNet.

time_embedding function:

emb = time_embedding(t, dim, max_period=10000)

Sinusoidal time embedding used in diffusion models:

freqs = exp(-log(max_period) * arange(dim//2) / (dim//2))
emb = [sin(t * freqs), cos(t * freqs)]

If dim is odd, a zero column is appended.


Sampling Utilities

File: generative/sampling.py

VAE Reparameterization

from Enilnets import reparameterize

z = reparameterize(mu, logvar)

The reparameterization trick: z = mu + exp(0.5 * logvar) * eps where eps ~ N(0, I).

Enables backpropagation through stochastic nodes by separating the randomness from the parameters.

Langevin Dynamics

from Enilnets import langevin_dynamics

x_sampled = langevin_dynamics(energy_fn, x_init, n_steps=20, step_size=0.1, noise_scale=0.005)

Langevin Monte Carlo for sampling from energy-based models:

for step in range(n_steps):
    energy, grad = energy_fn(x)
    x = x - step_size * grad + noise_scale * N(0, I)
Parameter Type Default Description
energy_fn callable required Function taking x and returning (energy, grad_energy).
x_init ndarray required Initial samples.
n_steps int 20 Number of Langevin steps.
step_size float 0.1 Gradient descent step size.
noise_scale float 0.005 Standard deviation of injected noise.

Gaussian & Uniform Sampling

from Enilnets import gaussian_sample, uniform_sample

# Gaussian: N(mean, std^2)
samples = gaussian_sample(mean, std, shape=None)
# If shape is None, uses mean.shape

# Uniform: U(low, high)
samples = uniform_sample(low, high, shape)

Gumbel-Softmax

from Enilnets import gumbel_softmax_sample

samples = gumbel_softmax_sample(logits, temperature=1.0, hard=False)

Differentiable sampling from categorical distributions:

  1. Add Gumbel noise: y = logits + Gumbel(0, 1)
  2. Apply temperature-scaled softmax: y_soft = softmax(y / temperature)
  3. If hard=True, use straight-through estimator: y_hard - y_soft + y_soft (discrete forward, continuous backward).

Useful for training models with discrete latent variables.

Random Masking

from Enilnets import random_mask

mask = random_mask(shape, ratio)

Generates a binary mask where each element is 1 with probability ratio (keep ratio), 0 otherwise. Returns np.float64 array.

Top-p (Nucleus) Sampling

from Enilnets.generative.sampling import top_p_sampling

samples = top_p_sampling(logits, p=0.9, temperature=1.0)

Nucleus sampling for text generation:

  1. Convert logits to probabilities with temperature scaling.
  2. Sort probabilities in descending order.
  3. Find the smallest set of tokens whose cumulative probability exceeds p.
  4. Sample from this truncated distribution.

Always keeps at least the top token. Returns one-hot encoded samples.

Discounted Returns

from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)

Computes cumulative discounted returns for a single episode:

G_t = reward_t + gamma * G_{t+1}

Iterates backward through the reward array. Returns array of same shape.

Generalized Advantage Estimation (GAE)

from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
Parameter Type Default Description
rewards ndarray required Step rewards, shape (T,).
values ndarray required Value estimates including bootstrap V(s_{T+1}), shape (T+1,).
gamma float 0.99 Discount factor.
lambda_ float 0.95 GAE lambda parameter.

Algorithm:

for t in reversed(range(T)):
    delta = rewards[t] + gamma * values[t+1] - values[t]
    gae_t = delta + gamma * lambda_ * gae_t
    advantages[t] = gae_t
returns = advantages + values[:T]

GAE provides a bias-variance tradeoff controlled by lambda_: lambda_=0 gives high-bias TD(0), lambda_=1 gives high-variance Monte Carlo.


Model I/O

File: io.py

JSON Serialization

model.Save("model.json")

Saves the model to a human-readable JSON file. All NumPy arrays are converted to Python lists via a custom encoder. Stores:

  • version: 3
  • layers: All layer dictionaries
  • optimizer: Optimizer type string
  • learning_rate, l2_lambda, momentum: Hyperparameters
  • t: Global timestep

Note: JSON does not preserve NumPy array types; arrays are stored as nested lists.

Pickle Serialization

model.Save("model.pkl")

Saves the model to a binary pickle file. Preserves NumPy arrays exactly. More compact and faster than JSON. Detected automatically by .pkl extension.

Loading Models

model = NeuralNet()  # Create a new instance
model.Load("model.json")  # or "model.pkl"

The Load method:

  1. Detects file format by extension (.pkl for pickle, otherwise JSON).
  2. Loads the payload.
  3. Reconstructs layers, converting list data back to np.float64 arrays for all weight-related keys (weights, bias, mask, gamma, beta, running_mean, running_var).
  4. Restores hyperparameters (learning_rate, optimizer_type, l2_lambda, momentum, t).
  5. Resets optimizer state (opt_state = []) -- you may want to call reset_optimizer_state() after loading.

Important: The loaded model does not restore optimizer momentum buffers. If you need to resume training exactly, you would need to save and load opt_state separately (not currently supported).


Known Limitations

  1. No padding in Conv2D: All convolutions use pad=0, so spatial dimensions shrink by k-1 per layer. The UNet works around this by using k=1 convolutions.

  2. UNet backward not implemented: The UNetDenoiser.backward() method raises NotImplementedError. Use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation.

  3. Flows use evolutionary training: RealNVP uses Evolve (black-box optimization) rather than analytical backprop through the log-determinant Jacobian. This is slower and less precise than gradient-based training.

  4. GAN training can be unstable: As with all GANs, convergence depends heavily on architecture, learning rates, and data. The library provides three loss types but no spectral normalization or other advanced stabilization techniques.

  5. No GPU acceleration: Pure NumPy implementation runs on CPU only. Large models and datasets will be slow.

  6. Stride support is limited: While stride is stored in conv2d layers, the im2col implementation only fully supports stride=1.

  7. No recurrent layers: No LSTM, GRU, or vanilla RNN support. For sequence modeling, use the embedding layer with dense layers or implement custom recurrence.

  8. No automatic differentiation: Gradients are hand-coded for each layer type. Adding new layers requires implementing both forward and backward passes.

  9. Optimizer state not saved: Save()/Load() do not persist optimizer momentum/velocity buffers. Training resumes from scratch in terms of optimizer state.

  10. Single precision not supported: All computations use np.float64. This provides numerical stability but uses more memory than float32.


Version History

v2.0.0

Major update including:

  • New layer types: LayerNorm, Embedding, GlobalAvgPool2D, Upsample2D, Sparse
  • Learning rate schedulers: Step decay, exponential decay, cosine annealing, warmup+cosine, plateau
  • Reinforcement learning: REINFORCE, PPO, Actor-Critic, Evolutionary Strategy
  • Gradient clipping: L2 norm clipping across all layers
  • Layer freezing/unfreezing: Fine-grained control over which layers train
  • NaN/Inf detection: check_nan_inf() for debugging numerical issues
  • Improved BatchNorm: Full 2D and 4D support with proper running statistics
  • New loss functions: Cosine similarity, triplet margin, NT-Xent (SimCLR), focal loss, BCE with logits, Wasserstein loss
  • Generative AI framework: VAE, GAN, Diffusion Model, Autoregressive Model, RealNVP, Energy-Based Model, UNet Denoiser
  • Sampling utilities: Reparameterization, Langevin dynamics, Gumbel-Softmax, top-p sampling, GAE
  • Perceptual loss utilities: Placeholder functions for VGG-based perceptual loss
  • Model copying and state reset: copy() and reset_optimizer_state()

Complete API Reference

NeuralNet Methods

Method Description
__init__(learning_rate, optimizer, l2_lambda, momentum) Constructor
summary() Print architecture summary
add_dense(n_in, n_out, activation, init_method, use_bias) Add fully connected layer
add_sparse(n_in, n_out, connectivity, activation, init_method) Add sparse connected layer
add_conv2d(in_ch, out_ch, k, activation, init_method, stride) Add 2D convolution
add_flatten() Add flatten layer
add_maxpool2d(pool_size) Add max pooling
add_avgpool2d(pool_size) Add average pooling
add_global_avgpool2d() Add global average pooling
add_upsample2d(scale_factor) Add 2x upsampling
add_batchnorm(num_features, epsilon, momentum) Add batch normalization
add_layernorm(normalized_shape, epsilon) Add layer normalization
add_dropout(rate) Add dropout regularization
add_embedding(vocab_size, embed_dim, init_method) Add embedding lookup layer
Forward(inputs, training, dropout_rate) Forward pass
predict(inputs) Alias for Forward
train() Set training mode
eval() Set evaluation mode
set_lr(lr) Set learning rate
get_lr() Get learning rate
freeze(layer_idx) Freeze layer(s)
unfreeze(layer_idx) Unfreeze layer(s)
clip_gradients(max_norm) Clip gradient norms
get_weights() Copy all weights
set_weights(weights) Restore weights
copy() Deep copy network
reset_optimizer_state() Clear optimizer buffers
check_nan_inf() Check for NaN/Inf
Backward(targets, output_delta) Backpropagation
update() Apply parameter updates
TrainBatch(xs, ys, loss_function, **kwargs) Train one batch
Train(X, Y, epochs, batch_size, X_val, Y_val, loss_function, verbose, scheduler, **kwargs) Full training loop
ComputeLoss(out, tgt, function, reduction, **kwargs) Compute loss
compute_accuracy(pred, tgt) Compute classification accuracy
compute_precision_recall_f1(pred, tgt) Compute precision, recall, F1
Evolve(inputs, score_fn, noise, tries, sigma) Evolutionary strategy
Reinforce(states, actions, returns, action_type, std, normalize_returns) Policy gradient
PPO(states, actions, old_log_probs, advantages, action_type, epsilon, std, value_targets, value_coeff, entropy_coeff) Proximal Policy Optimization
ActorCritic(states, actions, returns, values, action_type, std) Actor-Critic
Save(file) Save model to file
Load(file) Load model from file

Generative Classes

Class Module Description
VAE generative.vae Variational Autoencoder
GAN generative.gan Generative Adversarial Network
DiffusionModel generative.diffusion DDPM diffusion model
AutoregressiveModel generative.autoregressive MADE-style autoregressive model
RealNVP generative.flows RealNVP normalizing flow
EnergyBasedModel generative.ebm Energy-based model
UNetDenoiser generative.unet UNet for diffusion denoising
LRScheduler train Learning rate scheduler

Generative Loss Functions

Function Module Description
kl_divergence_gaussian(mu, logvar, reduction) generative_loss KL(q(z|x) || N(0,I))
adversarial_loss_discriminator(real_logits, fake_logits, loss_type) generative_loss Discriminator loss (BCE/BCE_logits/Wasserstein)
adversarial_loss_generator(fake_logits, loss_type) generative_loss Generator loss
diffusion_loss(pred_noise, true_noise, reduction) generative_loss MSE noise prediction
nll_loss(log_px, log_det_jacobian, reduction) generative_loss Flow negative log-likelihood
energy_loss(data_energy, sample_energy, margin) generative_loss EBM contrastive loss
perceptual_loss(x, y, feature_extractor) generative_loss Perceptual loss (falls back to MSE)
vgg_loss(x, y) generative_loss Placeholder for VGG perceptual loss

Sampling Functions

Function Module Description
reparameterize(mu, logvar) sampling VAE reparameterization trick
langevin_dynamics(energy_fn, x_init, n_steps, step_size, noise_scale) sampling MCMC sampling for EBMs
gaussian_sample(mean, std, shape) sampling Gaussian sampling
uniform_sample(low, high, shape) sampling Uniform sampling
gumbel_softmax_sample(logits, temperature, hard) sampling Differentiable categorical sampling
random_mask(shape, ratio) sampling Random boolean mask
top_p_sampling(logits, p, temperature) sampling Nucleus (top-p) sampling
compute_returns(rewards, gamma) sampling Discounted returns
gae(rewards, values, gamma, lambda_) sampling Generalized Advantage Estimation

Weight Initialization Functions

Function Description
init_weights(n_in, n_out, method) Dense/sparse weight initialization
init_conv_weights(in_ch, out_ch, k, method) Conv2D weight initialization
init_embedding_weights(vocab_size, embed_dim, method) Embedding weight initialization

Utility Functions

Function Description
im2col(input_data, filter_h, filter_w, stride, pad) Convert images to column matrix for efficient convolution
time_embedding(t, dim, max_period) Sinusoidal time embedding for diffusion models
activate(name, x) Apply activation function
derivative(name, x) Compute activation derivative

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

enilnets-2.1.0.tar.gz (95.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

enilnets-2.1.0-py3-none-any.whl (57.7 kB view details)

Uploaded Python 3

File details

Details for the file enilnets-2.1.0.tar.gz.

File metadata

  • Download URL: enilnets-2.1.0.tar.gz
  • Upload date:
  • Size: 95.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for enilnets-2.1.0.tar.gz
Algorithm Hash digest
SHA256 ccf3348443f28e724a7777ca69ad5bc47e20ce3bf233998a3252d11c80ebf684
MD5 9e022ca5618743c173deb82a8f4d9d06
BLAKE2b-256 2cb0c57282b1a914cea629dacc67fe61275d3827780df3f5317e47b4cc4d500d

See more details on using hashes here.

File details

Details for the file enilnets-2.1.0-py3-none-any.whl.

File metadata

  • Download URL: enilnets-2.1.0-py3-none-any.whl
  • Upload date:
  • Size: 57.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for enilnets-2.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2498d0a6065dbe2ddb631ea3217f1f37349c922f1532659ce10b8d4df769c305
MD5 7ef685e61115f5b578601b9ed9179401
BLAKE2b-256 2617e0120a4073357d46595edceb36c8d189b843fcc2203d32375b8c507bfb4c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page