A simple neural network library written in Python
Enilnets Library Documentation
A pure NumPy-based deep learning library with support for dense, convolutional, pooling, batch normalization, dropout, layer normalization, embedding, upsampling, global pooling, and sparse layers. Includes multiple optimizers, loss functions, activation functions, weight initialization methods, learning rate schedulers, reinforcement learning (REINFORCE, PPO, Actor-Critic), and a full generative AI framework.
Table of Contents
- Quick Start
- Installation & Project Structure
- Core Architecture
- Layer Types
- Forward Pass
- Backward Pass
- Optimizers
- Loss Functions
- Activation Functions
- Weight Initialization
- Training Utilities
- Model Utilities
- Reinforcement Learning
- Generative AI Framework
- Sampling Utilities
- Model I/O
- Known Limitations
- Version History
Quick Start
Discriminative Example
Build and train a classifier on flat data:
from Enilnets import NeuralNet, LRScheduler
import numpy as np
model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01)
model.add_dense(784, 256, activation="relu")
model.add_dropout(0.3)
model.add_dense(256, 10, activation="softmax")
X_train = np.random.randn(1000, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 1000)]
# With learning rate scheduler
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=50)
history = model.Train(X_train, Y_train, epochs=50, batch_size=32, scheduler=scheduler)
Generative Example (VAE)
Train a Variational Autoencoder on image-like data:
from Enilnets import VAE
import numpy as np
vae = VAE(input_dim=784, latent_dim=32,
encoder_hidden=[512, 256], decoder_hidden=[256, 512],
learning_rate=0.001, optimizer="adam")
X_train = np.random.rand(1000, 784)
history = vae.Train(X_train, epochs=20, batch_size=64)
generated = vae.generate(n_samples=16)
Reinforcement Learning (PPO)
Train a policy network with Proximal Policy Optimization:
from Enilnets import NeuralNet
import numpy as np
policy = NeuralNet(learning_rate=3e-4, optimizer="adam")
policy.add_dense(4, 64, activation="tanh")
policy.add_dense(64, 2, activation="softmax")
# states, actions, old_log_probs, advantages from environment
policy.PPO(states, actions, old_log_probs, advantages, action_type="discrete")
Installation & Project Structure
The library is organized as a Python package with the following module layout:
Enilnets/
|-- __init__.py # Package entry point: exports NeuralNet, LRScheduler, generative classes
|-- base.py # NeuralNet class definition + method binding
|-- layers.py # Layer factory functions (add_dense, add_conv2d, etc.)
|-- forward.py # Forward pass implementation + im2col + normalization
|-- backward.py # Backpropagation for all layer types
|-- optimizer.py # Gradient update rules (SGD, Adam, RMSprop, Adagrad)
|-- loss.py # Loss function implementations
|-- activations.py # Activation functions and their derivatives
|-- weight_init.py # Weight initialization strategies
|-- train.py # Training loop, metrics, LRScheduler
|-- io.py # Model save/load (JSON & Pickle)
|-- reinforce.py # RL algorithms: Evolve, REINFORCE, PPO, ActorCritic
|-- generative/
| |-- __init__.py # Exports all generative classes and utilities
| |-- vae.py # Variational Autoencoder
| |-- gan.py # Generative Adversarial Network
| |-- diffusion.py # Denoising Diffusion Probabilistic Model
| |-- autoregressive.py # MADE-style autoregressive model
| |-- flows.py # RealNVP normalizing flow
| |-- ebm.py # Energy-Based Model
| |-- unet.py # UNet architecture for diffusion
| |-- sampling.py # Sampling utilities (reparameterize, Gumbel, etc.)
| |-- generative_loss.py # Loss functions for generative models
All layer addition methods, forward/backward passes, optimizers, loss functions, training methods, I/O, and RL methods are dynamically bound to the NeuralNet class at import time via monkey-patching in base.py. This allows each submodule to remain focused while the user interacts with a single unified API.
Core Architecture
NeuralNet Class Overview
The NeuralNet class in base.py is the central abstraction. It stores everything needed to define, train, and evaluate a neural network entirely in NumPy.
| Attribute | Type | Description |
|---|---|---|
| layers | list[dict] | Layer definitions with weights, biases, and hyperparameters. Each layer is a dictionary containing its type-specific parameters. |
| learning_rate | float | Global learning rate used by all optimizers. |
| optimizer_type | str | Optimizer name: "sgd", "rmsprop", "adagrad", "adam". |
| l2_lambda | float | L2 regularization coefficient applied to weight gradients. |
| momentum | float | Momentum coefficient for SGD optimizer. |
| outputs | list[ndarray] | Cached layer outputs during the most recent forward pass. outputs[0] is the input, outputs[i] is the output of layer i-1. |
| pre_activations | list[ndarray] | Cached pre-activation values (z = Wx + b) for layers that have activations. Used during backprop for computing derivatives. |
| batchnorm_cache | list | BatchNorm statistics cache storing (x, x_norm, mean, var, gamma, epsilon, axes) for each BatchNorm layer during training. |
| layernorm_cache | list | LayerNorm statistics cache storing (x, x_norm, mean, var, gamma, epsilon, axes) for each LayerNorm layer. |
| deltas | list[ndarray] | Gradient error terms per layer, computed during backpropagation. |
| opt_state | list[dict] | Optimizer state (momentum, velocity, squared gradients) for each trainable layer. Lazily initialized on first update() call. |
| t | int | Global timestep counter, incremented on every update() call. Used for Adam bias correction. |
| training | bool | Training mode flag. Affects BatchNorm, Dropout, and layer behavior. |
Internal Data Flow
- Layer Definition: The user calls add_* methods, which append dictionaries to self.layers. Each dictionary stores weights, biases, and type-specific metadata.
- Forward Pass: Forward(inputs) iterates through self.layers, computes each layer's output, and caches results in self.outputs, self.pre_activations, self.batchnorm_cache, and self.layernorm_cache.
- Loss Computation: ComputeLoss(output, target) computes the scalar loss value.
- Backward Pass: Backward(targets) computes error gradients (self.deltas) by propagating from the output layer back to the input, using cached pre-activations and layer-specific backward functions.
- Parameter Update: update() computes weight/bias gradients from self.deltas and self.outputs, applies L2 regularization, and updates parameters using the chosen optimizer.
Training vs Evaluation Mode
- Training mode (training=True, set via .train()): BatchNorm uses batch statistics and updates running averages. Dropout randomly zeros neurons. All caches are populated.
- Evaluation mode (training=False, set via .eval()): BatchNorm uses running statistics (no cache). Dropout is disabled (identity pass). Caches are not populated.
Layer Types
Dense Layer
A fully connected (affine) layer: output = activation(W @ input + b).
model.add_dense(n_in, n_out, activation="relu", init_method="xavier_uniform", use_bias=True)
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_in | int | required | Number of input features. |
| n_out | int | required | Number of output features (neurons). |
| activation | str | "relu" | Activation function name (see Activation Functions). |
| init_method | str | "xavier_uniform" | Weight initialization strategy (see Weight Initialization). |
| use_bias | bool | True | Whether to include a bias vector. If False, bias is zeros and not updated. |
Stored in layer dict: "type": "dense", "weights" (shape (n_out, n_in)), "bias" (shape (n_out,)), "activation", "use_bias".
Sparse Layer
A dense layer with a fixed random connectivity mask. Only a fraction of weights are non-zero and trainable.
model.add_sparse(n_in, n_out, connectivity=0.5, activation="relu", init_method="xavier_uniform")
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_in | int | required | Number of input features. |
| n_out | int | required | Number of output features. |
| connectivity | float | 0.5 | Fraction of weights to keep (0 to 1). A mask is generated randomly and fixed for the layer's lifetime. |
| activation | str | "relu" | Activation function name. |
| init_method | str | "xavier_uniform" | Weight initialization strategy. |
Stored in layer dict: "type": "sparse", "weights", "bias", "mask" (binary matrix, same shape as weights), "activation". During forward and backward passes, the mask is applied to zero out masked weights. During updates, gradients are also masked.
Convolutional Layer (Conv2D)
A 2D convolution with no padding (valid convolution). Uses im2col for efficient matrix multiplication.
model.add_conv2d(in_ch, out_ch, k, activation="relu", init_method="he_normal", stride=1)
| Parameter | Type | Default | Description |
|---|---|---|---|
| in_ch | int | required | Number of input channels. |
| out_ch | int | required | Number of output channels (filters). |
| k | int | required | Kernel size (square kernel k x k). |
| activation | str | "relu" | Activation function name. |
| init_method | str | "he_normal" | Weight initialization strategy. |
| stride | int | 1 | Stride (stored but currently only stride=1 is fully supported by im2col). |
Input shape: (batch, in_ch, H, W)
Output shape: (batch, out_ch, H-k+1, W-k+1)
Stored in layer dict: "type": "conv2d", "weights" (shape (out_ch, in_ch, k, k)), "bias" (shape (out_ch,)), "in_ch", "out_ch", "k", "activation", "stride".
Note: There is no padding support. Each conv2d layer reduces each spatial dimension by k-1 in total (H becomes H-k+1 and W becomes W-k+1), so stacked conv layers shrink the feature map steadily.
Flatten Layer
Reshapes a multi-dimensional tensor into a 2D matrix (batch, -1).
model.add_flatten()
Stored in layer dict: "type": "flatten". No parameters. Used to transition from conv layers to dense layers.
Max Pooling 2D
Downsamples by taking the maximum value in each p x p non-overlapping window.
model.add_maxpool2d(pool_size=2)
| Parameter | Type | Default | Description |
|---|---|---|---|
| pool_size | int | 2 | Size of the pooling window. |
Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p) (dimensions are truncated to multiples of p).
Stored in layer dict: "type": "maxpool2d", "p".
Backward pass: Uses a strided view to identify maxima and distributes gradients only to the max positions within each window.
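The forward and backward behaviour described above can be sketched in plain NumPy. This is an illustration only, not the library's code: the function names are hypothetical, and it uses a reshape into p x p blocks instead of the strided view mentioned above. It also assumes that tied maxima all receive the gradient, which the library may handle differently.

```python
import numpy as np

def maxpool2d_forward_sketch(x, p):
    """Non-overlapping p x p max pooling on (batch, C, H, W).

    H and W are truncated to multiples of p before pooling.
    """
    n, c, h, w = x.shape
    h_out, w_out = h // p, w // p
    blocks = x[:, :, :h_out * p, :w_out * p].reshape(n, c, h_out, p, w_out, p)
    return blocks.max(axis=(3, 5))

def maxpool2d_backward_sketch(grad_out, x, p):
    """Route each output gradient to the max position(s) of its window."""
    n, c, h, w = x.shape
    h_out, w_out = h // p, w // p
    blocks = x[:, :, :h_out * p, :w_out * p].reshape(n, c, h_out, p, w_out, p)
    maxima = blocks.max(axis=(3, 5), keepdims=True)
    mask = blocks == maxima  # assumption: ties all receive the gradient
    grad_blocks = mask * grad_out[:, :, :, None, :, None]
    grad_x = np.zeros(x.shape, dtype=np.float64)
    grad_x[:, :, :h_out * p, :w_out * p] = grad_blocks.reshape(
        n, c, h_out * p, w_out * p
    )
    return grad_x

x = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
pooled = maxpool2d_forward_sketch(x, 2)
grad_x = maxpool2d_backward_sketch(np.ones_like(pooled), x, 2)
print(pooled[0, 0])   # the four window maxima
print(grad_x[0, 0])   # ones at the max positions, zeros elsewhere
```

Rows and columns dropped by the truncation receive zero gradient in this sketch.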
Average Pooling 2D
Downsamples by taking the mean value in each p x p non-overlapping window.
model.add_avgpool2d(pool_size=2)
| Parameter | Type | Default | Description |
|---|---|---|---|
| pool_size | int | 2 | Size of the pooling window. |
Input shape: (batch, C, H, W)
Output shape: (batch, C, H//p, W//p).
Stored in layer dict: "type": "avgpool2d", "p".
Backward pass: Distributes gradient evenly across all positions in each p x p window.
Global Average Pooling 2D
Computes the mean across spatial dimensions (H, W), reducing (batch, C, H, W) to (batch, C, 1, 1).
model.add_global_avgpool2d()
Stored in layer dict: "type": "globalavgpool2d". No parameters.
Backward pass: Distributes the incoming gradient evenly across all spatial positions.
Upsampling 2D
Nearest-neighbor upsampling by repeating rows and columns.
model.add_upsample2d(scale_factor=2)
| Parameter | Type | Default | Description |
|---|---|---|---|
| scale_factor | int | 2 | Factor by which to repeat each spatial dimension. |
Input shape: (batch, C, H, W)
Output shape: (batch, C, H*scale, W*scale).
Stored in layer dict: "type": "upsample2d", "scale_factor".
Backward pass: Sums gradients from the repeated positions back to the original positions.
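As a minimal sketch of why the backward pass is a sum: every input pixel is copied into a scale x scale block, so its gradient is the sum over that block. The helper names below are hypothetical and their signatures do not match the library's internal functions.

```python
import numpy as np

def upsample_nearest_sketch(x, scale):
    """Nearest-neighbor upsampling of (batch, C, H, W) by repeating rows and columns."""
    return x.repeat(scale, axis=2).repeat(scale, axis=3)

def upsample_nearest_grad_sketch(grad_out, scale):
    """Sum the gradient over each scale x scale block of repeated positions."""
    n, c, hs, ws = grad_out.shape
    h, w = hs // scale, ws // scale
    return grad_out.reshape(n, c, h, scale, w, scale).sum(axis=(3, 5))

x = np.arange(4, dtype=np.float64).reshape(1, 1, 2, 2)
up = upsample_nearest_sketch(x, 2)                     # shape (1, 1, 4, 4)
grad_in = upsample_nearest_grad_sketch(np.ones_like(up), 2)
print(up[0, 0])
print(grad_in[0, 0])  # each input position collects scale * scale gradients
```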
Batch Normalization
Normalizes activations across the batch dimension. Supports 2D (batch, features) and 4D (batch, C, H, W) inputs.
model.add_batchnorm(num_features, epsilon=1e-5, momentum=0.1)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_features | int | required | Number of features/channels to normalize. |
| epsilon | float | 1e-5 | Small constant for numerical stability. |
| momentum | float | 0.1 | Momentum for updating running statistics. running_stat = (1-momentum)*running_stat + momentum*batch_stat. |
Stored in layer dict: "type": "batchnorm", "num_features", "epsilon", "momentum", "running_mean", "running_var", "gamma" (scale, initialized to 1), "beta" (shift, initialized to 0).
Training: Computes batch mean and variance, normalizes, applies gamma/beta, updates running statistics, and stores a cache for backward.
Evaluation: Uses running mean and variance, no cache stored.
Backward: Computes gradients w.r.t. input, gamma, and beta using the cached statistics.
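The training-mode step for a 2D input can be sketched as follows. This is a hedged illustration rather than the library's implementation: the function name is hypothetical, and the use of the biased batch variance (NumPy's default) is an assumption.

```python
import numpy as np

def batchnorm_train_sketch(x, gamma, beta, running_mean, running_var,
                           momentum=0.1, epsilon=1e-5):
    """One training-mode BatchNorm step on a (batch, features) input."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)  # assumption: biased variance over the batch
    x_norm = (x - mean) / np.sqrt(var + epsilon)
    out = gamma * x_norm + beta
    # Running statistics follow the update rule from the parameter table.
    new_running_mean = (1 - momentum) * running_mean + momentum * mean
    new_running_var = (1 - momentum) * running_var + momentum * var
    return out, new_running_mean, new_running_var

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 3)) * 2.0 + 5.0
out, running_mean, running_var = batchnorm_train_sketch(
    x, gamma=np.ones(3), beta=np.zeros(3),
    running_mean=np.zeros(3), running_var=np.ones(3),
)
print(out.mean(axis=0), out.std(axis=0))  # close to 0 and 1 per feature
```

With the default momentum of 0.1, the running statistics move only 10% of the way toward each batch's statistics, so evaluation-mode outputs stabilize after many training batches.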
Layer Normalization
Normalizes across the feature dimension(s) independently for each sample. Supports 2D and 4D inputs.
model.add_layernorm(normalized_shape, epsilon=1e-5)
| Parameter | Type | Default | Description |
|---|---|---|---|
| normalized_shape | int or tuple | required | Shape of the features to normalize. For 2D: an int (number of features). For 4D: a tuple like (C, H, W). |
| epsilon | float | 1e-5 | Small constant for numerical stability. |
Stored in layer dict: "type": "layernorm", "normalized_shape", "epsilon", "gamma", "beta".
Unlike BatchNorm, LayerNorm has no running statistics. It always computes mean and variance on the fly. The backward pass computes dx, dgamma, and dbeta.
Dropout
Randomly zeros a fraction of activations during training for regularization.
model.add_dropout(rate=0.5)
| Parameter | Type | Default | Description |
|---|---|---|---|
| rate | float | 0.5 | Fraction of neurons to drop (set to 0). Must be in [0, 1). |
Stored in layer dict: "type": "dropout", "rate", "mask" (binary mask created during forward pass in training mode).
Training: Each element is kept with probability 1-rate, and surviving elements are scaled by 1/(1-rate) (inverted dropout). The mask is stored for backward.
Evaluation: Identity pass, no masking.
Backward: Multiplies incoming gradient by the stored mask and the same scaling factor.
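Inverted dropout as described above can be sketched in a few lines. The function names and the explicit rng argument are assumptions made for this illustration; the library manages its own mask on the layer dict.

```python
import numpy as np

def dropout_forward_sketch(x, rate, rng):
    """Keep each element with probability 1 - rate and rescale survivors."""
    keep = 1.0 - rate
    mask = (rng.random(x.shape) < keep).astype(x.dtype)
    return x * mask / keep, mask

def dropout_backward_sketch(grad_out, mask, rate):
    """Apply the same mask and scaling to the incoming gradient."""
    return grad_out * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out, mask = dropout_forward_sketch(x, rate=0.5, rng=rng)
grad = dropout_backward_sketch(np.ones_like(x), mask, rate=0.5)
print(out.mean())  # stays near 1.0 because survivors are scaled by 1 / (1 - rate)
```

The rescaling keeps the expected activation unchanged, which is why evaluation mode can be a plain identity pass.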
Embedding Layer
A lookup table that maps integer token indices to dense vectors.
model.add_embedding(vocab_size, embed_dim, init_method="normal")
| Parameter | Type | Default | Description |
|---|---|---|---|
| vocab_size | int | required | Number of unique tokens in the vocabulary. |
| embed_dim | int | required | Dimension of each embedding vector. |
| init_method | str | "normal" | Initialization strategy. |
Stored in layer dict: "type": "embedding", "weights" (shape (vocab_size, embed_dim)), "vocab_size", "embed_dim".
Input: Integer array of shape (batch, seq_len) or (batch,) (1D is reshaped to (batch, 1)).
Output: Embedding vectors of shape (batch, seq_len, embed_dim).
Backward: Sparse gradient -- only the rows corresponding to seen indices are updated. The _last_input key stores the input indices for gradient computation.
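A minimal sketch of the lookup and its sparse gradient, assuming hypothetical helper names. It uses np.add.at so that a token appearing several times in the batch accumulates all of its gradient contributions; whether the library accumulates in exactly this way is an assumption.

```python
import numpy as np

def embedding_forward_sketch(weights, indices):
    """Look up rows of `weights` for integer indices of shape (batch, seq_len)."""
    indices = np.asarray(indices)
    if indices.ndim == 1:
        indices = indices[:, None]  # (batch,) -> (batch, 1)
    return weights[indices], indices

def embedding_backward_sketch(grad_out, indices, vocab_size):
    """Accumulate gradients only into the rows that were looked up."""
    embed_dim = grad_out.shape[-1]
    grad_w = np.zeros((vocab_size, embed_dim))
    np.add.at(grad_w, indices.ravel(), grad_out.reshape(-1, embed_dim))
    return grad_w

weights = np.arange(12, dtype=np.float64).reshape(4, 3)  # vocab_size=4, embed_dim=3
vectors, idx = embedding_forward_sketch(weights, [[1, 1], [3, 0]])
grad_w = embedding_backward_sketch(np.ones_like(vectors), idx, vocab_size=4)
print(vectors.shape)  # (batch, seq_len, embed_dim)
print(grad_w)         # row 2 stays zero because token 2 was never seen
```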
Forward Pass
Input Handling
The Forward(self, inputs, training=False, dropout_rate=0.0) method first normalizes the shape of its input:
- 1D input (features,) -> reshaped to (1, features) (single-sample batch).
- 3D input (C, H, W) -> reshaped to (1, C, H, W) (single-image batch).
- 2D input (batch, features) and 4D input (batch, C, H, W) are used as-is.
All inputs are cast to np.float64 for numerical stability.
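The shape rules above amount to the following sketch. The helper name is hypothetical; the library performs this step inside Forward.

```python
import numpy as np

def normalize_input_shape_sketch(inputs):
    """Cast to float64 and add a leading batch axis to 1D and 3D inputs."""
    x = np.asarray(inputs, dtype=np.float64)
    if x.ndim == 1:        # (features,) -> (1, features)
        x = x.reshape(1, -1)
    elif x.ndim == 3:      # (C, H, W) -> (1, C, H, W)
        x = x.reshape(1, *x.shape)
    return x

single_sample = normalize_input_shape_sketch(np.zeros(5))
single_image = normalize_input_shape_sketch(np.zeros((3, 4, 4)))
batch = normalize_input_shape_sketch(np.zeros((2, 5), dtype=np.int32))
print(single_sample.shape, single_image.shape, batch.shape, batch.dtype)
```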
Layer-by-Layer Computation
The forward pass iterates through self.layers in order. For each layer:
- Dense/Sparse: z = x @ W.T + b, then x = activation(z). Pre-activation z is cached.
- Conv2D: Uses im2col to unfold the input into columns, performs matrix multiplication with flattened kernels, reshapes back, adds bias, then applies activation.
- Flatten: Reshapes to (batch, -1).
- MaxPool2D: Strided view into p x p blocks, takes max along the block axes.
- AvgPool2D: Same strided view, takes mean.
- GlobalAvgPool2D: Mean over axes (2, 3) with keepdims=True.
- Upsample2D: x.repeat(scale, axis=2).repeat(scale, axis=3).
- BatchNorm: Normalizes using batch stats (training) or running stats (eval), then applies gamma * x_norm + beta.
- LayerNorm: Normalizes per-sample using feature stats, then applies gamma * x_norm + beta.
- Dropout: Random mask with inverted scaling during training; identity during eval.
- Embedding: Integer index lookup into the weight matrix.
After each layer, the output is appended to self.outputs, pre-activations to self.pre_activations, and normalization caches to self.batchnorm_cache / self.layernorm_cache.
im2col for Convolutions
The im2col(input_data, filter_h, filter_w, stride=1, pad=0) function converts image batches into column matrices suitable for efficient matrix multiplication:
- Pads the input with zeros if pad > 0.
- Uses NumPy's as_strided to create a view where each receptive field is a row.
- Transposes and reshapes to (N * out_h * out_w, C * filter_h * filter_w).
This allows convolutions to be computed as a single large matrix multiplication: output = col @ W_flat.T, which is significantly faster than nested loops in pure NumPy.
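The idea can be sketched for the stride-1, no-padding case. This is not the library's im2col: it swaps as_strided for NumPy's sliding_window_view (available since NumPy 1.20), which builds the same kind of view with bounds checking, and all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def im2col_sketch(x, k):
    """Unfold (N, C, H, W) into (N * out_h * out_w, C * k * k); stride 1, no padding."""
    n, c, h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    # windows has shape (N, C, out_h, out_w, k, k)
    windows = sliding_window_view(x, (k, k), axis=(2, 3))
    cols = windows.transpose(0, 2, 3, 1, 4, 5).reshape(n * out_h * out_w, c * k * k)
    return cols, out_h, out_w

def conv2d_forward_sketch(x, weights, bias):
    """Valid convolution (cross-correlation, no kernel flip) as one matrix product."""
    n = x.shape[0]
    out_ch, k = weights.shape[0], weights.shape[2]
    cols, out_h, out_w = im2col_sketch(x, k)
    out = cols @ weights.reshape(out_ch, -1).T + bias  # (N * out_h * out_w, out_ch)
    return out.reshape(n, out_h, out_w, out_ch).transpose(0, 3, 1, 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
out = conv2d_forward_sketch(x, w, b)
print(out.shape)  # (batch, out_ch, H - k + 1, W - k + 1)
```

The column order (channel, kernel row, kernel column) matches a plain reshape of the (out_ch, in_ch, k, k) weight tensor, which is what lets the kernels be flattened without any reordering.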
Backward Pass
Automatic Delta Computation
Backward(self, targets=None, output_delta=None) supports two modes:
Mode 1: Automatic (targets provided)
model.Backward(ys)
- If the last layer uses "softmax" activation, the delta is computed as (out - targets) / batch_size (the combined softmax + cross-entropy gradient simplification).
- Otherwise, delta = (out - targets) * derivative(activation, pre_activation) / batch_size.
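The softmax shortcut can be checked numerically. The sketch below is independent of the library: it compares (softmax(z) - targets) / batch_size against a central finite-difference gradient of the mean cross-entropy with respect to the logits z. The helper names are illustrative only.

```python
import numpy as np

def softmax_sketch(z):
    shifted = z - z.max(axis=1, keepdims=True)  # max subtraction for stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy_sketch(z, targets):
    return -np.sum(targets * np.log(softmax_sketch(z))) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))
targets = np.eye(3)[[0, 2, 1, 2]]  # one-hot labels

# Closed-form delta for softmax combined with cross-entropy.
analytic = (softmax_sketch(z) - targets) / z.shape[0]

# Central finite differences on every logit.
numeric = np.zeros_like(z)
eps = 1e-6
for idx in np.ndindex(*z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (
        mean_cross_entropy_sketch(z_plus, targets)
        - mean_cross_entropy_sketch(z_minus, targets)
    ) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # tiny: the two gradients agree up to finite-difference error
```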
Mode 2: Manual (output_delta provided)
model.Backward(None, output_delta=custom_delta)
- Used in reinforcement learning and generative models where the output gradient is computed externally.
- output_delta is reshaped to (batch, features) if 1D.
Per-Layer Gradient Propagation
After computing the output delta, the backward pass iterates from the second-to-last layer back to the first:
For each layer l (current) and l+1 (next), it computes the error err flowing into layer l based on the next layer's type:
| Next Layer Type | Error Computation |
|---|---|
| dense / sparse | err = next_delta @ W_next |
| flatten | err = next_delta.reshape(outputs[l+1].shape) |
| conv2d | err = conv2d_backward_input(next_delta, W_next, outputs[l+1].shape) -- transposed convolution via im2col |
| maxpool2d | err = maxpool2d_backward(next_delta, outputs[l+1], pool_size) -- routes gradient to max positions |
| avgpool2d | err = avgpool2d_backward(next_delta, outputs[l+1], pool_size) -- distributes gradient evenly |
| globalavgpool2d | err = globalavgpool2d_backward(next_delta, outputs[l+1]) -- distributes over spatial dims |
| upsample2d | err = upsample2d_backward(next_delta, outputs[l+1], scale) -- sums repeated positions |
| dropout | err = next_delta * mask / (1 - rate) (or identity if mask is None) |
| batchnorm | err, dgamma, dbeta = batchnorm_backward(next_delta, cache) -- stores d_gamma, d_beta on the layer dict |
| layernorm | err, dgamma, dbeta = layernorm_backward(next_delta, cache) -- stores d_gamma, d_beta on the layer dict |
| embedding | err = next_delta (gradient flows back to previous layer) |
Then, if the current layer has an activation (dense, sparse, conv2d), the error is multiplied by the activation derivative evaluated at the pre-activation: self.deltas[l] = err * derivative(activation, pre_activation).
Important: For BatchNorm and LayerNorm, the backward pass requires that Forward(training=True) was called first to populate the caches. If a cache is None, a ValueError is raised.
Optimizers
All optimizers are implemented in optimizer.py and applied in the update() method. The optimizer is selected via the optimizer parameter in NeuralNet.__init__().
SGD with Momentum
model = NeuralNet(optimizer="sgd", learning_rate=0.01, momentum=0.9)
Update rule for weights (same structure for biases, gamma, beta):
velocity = momentum * velocity - learning_rate * gradient
weight += velocity
Momentum accumulates velocity in the direction of persistent gradients, helping escape shallow local minima and accelerating convergence in consistent gradient directions.
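The update rule above can be exercised on a toy problem. This is a sketch with a hypothetical function name, minimizing f(w) = 0.5 * ||w||^2, whose gradient is simply w.

```python
import numpy as np

def sgd_momentum_step_sketch(weight, grad, velocity, learning_rate=0.01, momentum=0.9):
    """One SGD-with-momentum step following the rule shown above."""
    velocity = momentum * velocity - learning_rate * grad
    return weight + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)

# First step: velocity starts at zero, so this is a plain gradient step.
w_first, v_first = sgd_momentum_step_sketch(w, w, v, learning_rate=0.1)

# Keep stepping; the gradient of 0.5 * ||w||^2 at w is w itself.
w_final, v_final = w_first, v_first
for _ in range(199):
    w_final, v_final = sgd_momentum_step_sketch(w_final, w_final, v_final, learning_rate=0.1)

print(w_first)                  # one plain gradient step from the start
print(np.linalg.norm(w_final))  # close to zero after 200 steps
```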
RMSprop
model = NeuralNet(optimizer="rmsprop", learning_rate=0.001)
Update rule:
v = 0.999 * v + 0.001 * gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)
RMSprop adapts the learning rate per parameter by dividing by a running average of squared gradients. This helps with non-stationary objectives and sparse gradients.
Adagrad
model = NeuralNet(optimizer="adagrad", learning_rate=0.01)
Update rule:
v += gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)
Adagrad accumulates all historical squared gradients. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Note that the learning rate naturally decays over time.
Adam with Bias Correction
model = NeuralNet(optimizer="adam", learning_rate=0.001)
Adam combines momentum (first moment) and RMSprop (second moment) with bias correction:
m = 0.9 * m + 0.1 * gradient # first moment
v = 0.999 * v + 0.001 * gradient^2 # second moment
m_hat = m / (1 - 0.9^t) # bias-corrected first moment
v_hat = v / (1 - 0.999^t) # bias-corrected second moment
weight -= learning_rate * m_hat / (sqrt(v_hat) + 1e-8)
Where t is the global timestep incremented on each update() call. Bias correction prevents the initial estimates from being biased toward zero.
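A minimal sketch of one Adam step using the constants listed above. The function name and the explicit state passing are assumptions for illustration; the library keeps m, v, and t in opt_state and self.t.

```python
import numpy as np

def adam_step_sketch(weight, grad, m, v, t, learning_rate=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    weight = weight - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return weight, m, v

w = np.array([1.0, -3.0])
g = np.array([0.5, -2.0])
w_new, m, v = adam_step_sketch(w, g, np.zeros(2), np.zeros(2), t=1)
print(w - w_new)  # roughly learning_rate * sign(g) on the first step
```

On the first step the bias-corrected moments reduce to m_hat = g and v_hat = g^2, so the update magnitude is close to the learning rate regardless of the gradient's scale. Without the correction, the zero-initialized buffers would shrink that first update.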
L2 Regularization
L2 regularization (weight decay) is applied to all weight gradients before the optimizer step:
grad_w = grad_w + l2_lambda * weights * mask
The mask term ensures that sparse layers only regularize their active connections. L2 regularization penalizes large weights, encouraging simpler models and reducing overfitting.
Optimizer state initialization: On the first call to update(), self.opt_state is lazily initialized with zero-initialized momentum/velocity buffers matching the shape of each layer's trainable parameters.
Loss Functions
All loss functions are implemented in loss.py via ComputeLoss(self, output, target, function="mse", reduction="mean", **kwargs).
Regression Losses
MSE (Mean Squared Error)
model.ComputeLoss(output, target, function="mse", reduction="mean")
loss = (output - target)^2
Standard regression loss. Penalizes large errors quadratically.
MAE (Mean Absolute Error)
model.ComputeLoss(output, target, function="mae", reduction="mean")
loss = |output - target|
More robust to outliers than MSE since errors are not squared.
Huber Loss
model.ComputeLoss(output, target, function="huber", delta=1.0, reduction="mean")
diff = |output - target|
loss = 0.5 * diff^2 if diff < delta
loss = delta * (diff - 0.5*delta) otherwise
Combines MSE for small errors and MAE for large errors. delta controls the transition point.
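An element-wise sketch of the piecewise definition, taking diff as the absolute error. The function name is hypothetical and no reduction is applied here.

```python
import numpy as np

def huber_loss_sketch(output, target, delta=1.0):
    """Per-element Huber loss: quadratic below delta, linear above it."""
    diff = np.abs(output - target)
    quadratic = 0.5 * diff ** 2
    linear = delta * (diff - 0.5 * delta)
    return np.where(diff < delta, quadratic, linear)

errors = np.array([0.5, 3.0, -3.0])
per_element = huber_loss_sketch(errors, np.zeros(3), delta=1.0)
print(per_element)  # small error is squared, large errors grow linearly
```

The two branches agree at diff = delta (both give 0.5 * delta^2), so the loss is continuous at the transition point.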
Smooth L1 Loss
model.ComputeLoss(output, target, function="smooth_l1", reduction="mean")
Huber loss with delta=1.0 hardcoded. Commonly used in object detection.
Classification Losses
Binary Cross-Entropy
model.ComputeLoss(output, target, function="binary_cross_entropy", reduction="mean")
loss = -(target * log(output) + (1 - target) * log(1 - output))
For binary classification with sigmoid output. Output is clipped to [1e-12, 1-1e-12] for numerical stability.
Cross-Entropy / Categorical Cross-Entropy
model.ComputeLoss(output, target, function="cross_entropy", reduction="mean")
loss = -target * log(output)
For multi-class classification with softmax output. Output clipped to [1e-12, 1.0]. Supports "mean", "sum", and "none" (per-element) reduction.
Note: When reduction="mean", the loss is divided by the batch size. When "sum", it is summed. When "none", the raw per-element loss array is returned.
Focal Loss
model.ComputeLoss(output, target, function="focal", alpha=0.25, gamma=2.0, reduction="mean")
Down-weights easy examples and focuses on hard examples:
pt = output * target + (1 - output) * (1 - target)
loss = -(alpha * target * (1-pt)^gamma * log(output) + (1-alpha) * (1-target) * pt^gamma * log(1-output))
Useful for imbalanced datasets. alpha balances positive/negative examples, gamma focuses on hard examples.
Hinge Loss
model.ComputeLoss(output, target, function="hinge", reduction="mean")
loss = max(0, 1 - target * output)
For SVM-style classification. Target should be +1 or -1.
BCE with Logits (Numerically Stable)
model.ComputeLoss(output, target, function="bce_logits", reduction="mean")
loss = max(output, 0) - output * target + log(1 + exp(-|output|))
Computes binary cross-entropy directly from logits (pre-sigmoid values) without explicitly computing sigmoid, avoiding numerical issues for extreme values.
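The stable form can be compared with the naive sigmoid-then-log computation in a short sketch. The function name is hypothetical; np.log1p is used for the last term, which is equivalent to log(1 + exp(-|output|)).

```python
import numpy as np

def bce_with_logits_sketch(logits, target):
    """Per-element binary cross-entropy computed directly from logits."""
    return np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))

logits = np.array([-1000.0, -2.0, 0.0, 3.0, 1000.0])
target = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
stable = bce_with_logits_sketch(logits, target)
print(stable)  # finite everywhere, even at logits of +/-1000
```

Because exp is only ever applied to a non-positive number, it cannot overflow. The naive route would evaluate log(0) for confidently wrong predictions with large-magnitude logits.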
Advanced Losses
Wasserstein Loss
model.ComputeLoss(output, target, function="wasserstein", reduction="mean")
loss = -output * target
Used in Wasserstein GANs. Target is +1 for real and -1 for fake.
Cosine Similarity Loss
model.ComputeLoss(output, target, function="cosine_similarity", reduction="mean")
loss = 1 - cos(output, target)
Measures the cosine distance between vectors. Useful for embedding learning and contrastive tasks.
Triplet Loss
model.ComputeLoss(output, target, function="triplet", margin=1.0, negative=neg_samples, reduction="mean")
d_pos = ||anchor - positive||^2
d_neg = ||anchor - negative||^2
loss = max(0, d_pos - d_neg + margin)
output is the anchor, target is the positive, and negative kwarg provides the negative samples. Used in metric learning (e.g., FaceNet).
NT-Xent (Normalized Temperature-scaled Cross Entropy)
model.ComputeLoss(output, target, function="ntxent", temperature=0.5, reduction="mean")
SimCLR contrastive loss. Computes pairwise cosine similarities, applies temperature scaling, and uses a cross-entropy formulation where positive pairs are on the diagonal.
Generative Losses
KL Divergence (for VAE)
model.ComputeLoss(output, target, function="kl_divergence", mu=mu, logvar=logvar, reduction="mean")
loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar), axis=-1)
KL divergence between the approximate posterior q(z|x) and the prior N(0, I). Used as a regularization term in VAEs.
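The formula can be sanity-checked on two simple cases. This is a sketch with a hypothetical function name that returns one KL value per sample, before any reduction.

```python
import numpy as np

def kl_to_standard_normal_sketch(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over the latent axis."""
    return -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=-1)

# A posterior equal to the prior has zero divergence.
kl_zero = kl_to_standard_normal_sketch(np.zeros((2, 4)), np.zeros((2, 4)))

# Shifting one latent mean by 1 with unit variance costs 0.5 * 1^2.
kl_shift = kl_to_standard_normal_sketch(np.array([[1.0, 0.0]]), np.zeros((1, 2)))
print(kl_zero, kl_shift)
```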
Activation Functions
All activations and their derivatives are implemented in activations.py.
ReLU Family
| Name | Forward | Derivative | Notes |
|---|---|---|---|
| relu | max(0, x) | 1 if x > 0 else 0 | Most common default. |
| leakyrelu | x if x > 0 else 0.01*x | 1 if x > 0 else 0.01 | Small negative slope prevents dying ReLU. |
| elu | x if x > 0 else exp(x)-1 | 1 if x > 0 else exp(x) | Smooth negative region, mean closer to zero. |
| selu | scale * (x if x>0 else alpha*(exp(x)-1)) | scale * (1 if x>0 else alpha*exp(x)) | Self-normalizing; alpha=1.6733, scale=1.0507. |
Sigmoid Family
| Name | Forward | Derivative | Notes |
|---|---|---|---|
| sigmoid | 1 / (1 + exp(-x)) | sigmoid(x) * (1 - sigmoid(x)) | Clipped to [-500, 500] to prevent overflow. |
| tanh | tanh(x) | 1 - tanh(x)^2 | Zero-centered output. |
| softmax | exp(x - max(x)) / sum(exp(x - max(x))) | Handled specially in backward | Numerically stable via max subtraction. |
| softplus | log(1 + exp(x)) | sigmoid(x) | Smooth approximation of ReLU. |
Advanced Activations
| Name | Forward | Derivative | Notes |
|---|---|---|---|
| gelu | 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))) | CDF + x*PDF | Used in Transformer architectures. |
| swish | x * sigmoid(x) | sigmoid(x) + x*sigmoid(x)*(1-sigmoid(x)) | Self-gated, smooth. |
| mish | x * tanh(log(1 + exp(x))) | tanh(sp) + x*sigmoid(x)*(1-tanh(sp)^2) | sp = softplus(x). Smooth and self-regularizing. |
Linear / Identity
| Name | Forward | Derivative | Notes |
|---|---|---|---|
| linear | x | 1 | No transformation. Used for output layers before softmax/sigmoid. |
Weight Initialization
All initializers are in weight_init.py and automatically called by layer addition methods.
Dense & Sparse Layer Initializers
For a layer with n_in inputs and n_out outputs:
| Method | Formula | Best For |
|---|---|---|
| xavier_uniform | U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))) | Tanh/sigmoid activations |
| xavier_normal | N(0, sqrt(2/(n_in+n_out))) | Tanh/sigmoid activations |
| he_uniform | U(-sqrt(6/n_in), sqrt(6/n_in)) | ReLU activations |
| he_normal | N(0, sqrt(2/n_in)) | ReLU activations (default for conv) |
| normal | N(0, 0.1) | General purpose, small initial values |
| orthogonal | SVD-based orthogonal matrix | RNNs, preserving gradient norms |
| zeros | All zeros | Biases, or when you want to start from zero |
| ones | All ones | Special cases |
Convolutional Layer Initializers
Same methods as dense, but fan-in is computed as in_ch * k * k (number of input connections per filter):
| Method | Formula |
|---|---|
| xavier_uniform | U(-sqrt(6/(in_ch*k*k + out_ch)), ...) |
| xavier_normal | N(0, sqrt(2/(in_ch*k*k + out_ch))) |
| he_uniform | U(-sqrt(6/(in_ch*k*k)), ...) |
| he_normal | N(0, sqrt(2/(in_ch*k*k))) (default) |
| normal | N(0, 0.1) |
| orthogonal | SVD on (out_ch, in_ch*k*k), reshaped to (out_ch, in_ch, k, k) |
| zeros | All zeros |
| ones | All ones |
Embedding Layer Initializers
| Method | Formula |
|---|---|
| normal | N(0, 0.1) (default) |
| xavier_uniform | U(-sqrt(6/(vocab_size + embed_dim)), ...) |
| xavier_normal | N(0, sqrt(2/(vocab_size + embed_dim))) |
| zeros | All zeros |
Note: All weights and biases are stored as np.float64 for maximum numerical precision.
Training Utilities
Train Method
The Train method in train.py provides a complete training loop with validation support, metric tracking, and learning rate scheduling.
history = model.Train(
X_train, Y_train,
epochs=10, batch_size=32,
X_val=None, Y_val=None,
loss_function=None,
verbose=True,
scheduler=None,
**loss_kwargs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| X_train | ndarray | required | Training inputs. |
| Y_train | ndarray | required | Training targets. |
| epochs | int | 10 | Number of training epochs. |
| batch_size | int | 32 | Batch size for mini-batch gradient descent. |
| X_val | ndarray or None | None | Validation inputs. If provided, validation metrics are computed each epoch. |
| Y_val | ndarray or None | None | Validation targets. |
| loss_function | str or None | None | Loss function name. If None, auto-detects based on last layer activation ("cross_entropy" for softmax, "mse" otherwise). |
| verbose | bool | True | Whether to print progress. |
| scheduler | LRScheduler or None | None | Learning rate scheduler instance. |
| **loss_kwargs | | | Additional arguments passed to ComputeLoss (e.g., delta for Huber loss). |
Returns: history dict with keys "loss", "accuracy", "val_loss", "val_accuracy", "lr".
Training loop details:
- If a scheduler is provided, scheduler.step(epoch) is called at the start of each epoch to update the learning rate.
- Training data is shuffled each epoch.
- Batches are processed sequentially. For each batch:
  - TrainBatch is called (forward, loss, backward, update).
  - Loss and accuracy are weighted by actual batch size (handles the last incomplete batch).
- Epoch averages are computed and stored in history.
- If validation data is provided, the model is evaluated on it in eval mode: Forward is called with training=False, and nothing cached during that pass is used for parameter updates.
TrainBatch Method
loss, out = model.TrainBatch(xs, ys, loss_function=None, **loss_kwargs)
A single training step that:
- Calls Forward(xs, training=True)
- Auto-detects loss function if not provided
- Computes loss via ComputeLoss
- Calls Backward(ys)
- Calls update() to apply gradients
Returns the scalar loss and the network output.
Learning Rate Schedulers
The LRScheduler class in train.py supports multiple decay strategies:
from Enilnets import LRScheduler
# Step decay: halve LR every 10 epochs
scheduler = LRScheduler(initial_lr=0.001, mode="step", drop=0.5, epochs_drop=10)
# Exponential decay: multiply by 0.95 each epoch
scheduler = LRScheduler(initial_lr=0.001, mode="exponential", decay=0.95)
# Cosine annealing
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=100)
# Warmup + cosine
scheduler = LRScheduler(initial_lr=0.001, mode="warmup_cosine", max_epochs=100, warmup_epochs=5)
| Mode | Formula | Parameters |
|---|---|---|
| "step" | lr * drop^(epoch // epochs_drop) | drop=0.5, epochs_drop=10 |
| "exponential" | lr * decay^epoch | decay=0.95 |
| "cosine" | lr * 0.5 * (1 + cos(pi * epoch / max_epochs)) | max_epochs=100 |
| "warmup_cosine" | Linear warmup then cosine | max_epochs=100, warmup_epochs=5 |
| "plateau" | Returns initial_lr (placeholder) | None |
The scheduler's step(epoch) method returns the learning rate for that epoch. The Train method calls self.set_lr(lr) before each epoch.
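The three closed-form modes from the table can be written as standalone functions to see the values they produce. These are hypothetical helpers, not the LRScheduler class; warmup_cosine is left out because its exact formula is not given above.

```python
import math

def step_lr_sketch(initial_lr, epoch, drop=0.5, epochs_drop=10):
    """Multiply by `drop` once every `epochs_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_drop)

def exponential_lr_sketch(initial_lr, epoch, decay=0.95):
    """Multiply by `decay` every epoch."""
    return initial_lr * decay ** epoch

def cosine_lr_sketch(initial_lr, epoch, max_epochs=100):
    """Half-cosine from initial_lr at epoch 0 down to 0 at max_epochs."""
    return initial_lr * 0.5 * (1 + math.cos(math.pi * epoch / max_epochs))

for epoch in (0, 25, 50, 100):
    print(
        epoch,
        step_lr_sketch(0.001, epoch),
        exponential_lr_sketch(0.001, epoch),
        cosine_lr_sketch(0.001, epoch),
    )
```

Note that the cosine formula reaches zero when epoch equals max_epochs, so max_epochs is usually set to the number of training epochs or larger.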
Metrics Computation
Accuracy
acc = model.compute_accuracy(predictions, targets)
- Multi-class (predictions.shape[-1] > 1): Compares argmax of predictions vs targets.
- Binary (predictions.shape[-1] == 1): Thresholds at 0.5.
Returns mean accuracy as a float.
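As a rough illustration of the two branches, the sketch below reimplements the rule in plain NumPy. The name accuracy_sketch is hypothetical, and two details are assumptions: targets are one-hot in the multi-class case, and a value of exactly 0.5 counts as positive.

```python
import numpy as np

def accuracy_sketch(predictions, targets):
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    if predictions.shape[-1] > 1:
        # Multi-class: compare the index of the largest score per row.
        pred_labels = predictions.argmax(axis=-1)
        true_labels = targets.argmax(axis=-1)  # assumes one-hot targets
    else:
        # Binary: threshold the single output column at 0.5.
        pred_labels = (predictions >= 0.5).astype(int)
        true_labels = (targets >= 0.5).astype(int)
    return float(np.mean(pred_labels == true_labels))

multi_acc = accuracy_sketch([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
                            [[0, 1], [1, 0], [1, 0]])
binary_acc = accuracy_sketch([[0.9], [0.2], [0.6], [0.4]],
                             [[1], [0], [0], [0]])
```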
Precision, Recall, F1
metrics = model.compute_precision_recall_f1(predictions, targets)
# Returns: {"precision": float, "recall": float, "f1": float}
Binary classification metrics. Uses the same multi-class/binary detection as accuracy. Computed with 1e-12 epsilon to prevent division by zero.
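A minimal sketch of how such metrics are typically computed from hard 0/1 labels, with the same 1e-12 epsilon in every denominator. The function name and the exact placement of the epsilon are assumptions, not the library's code.

```python
import numpy as np

def precision_recall_f1_sketch(pred_labels, true_labels, eps=1e-12):
    pred = np.asarray(pred_labels).astype(bool).ravel()
    true = np.asarray(true_labels).astype(bool).ravel()
    tp = np.sum(pred & true)    # predicted positive, actually positive
    fp = np.sum(pred & ~true)   # predicted positive, actually negative
    fn = np.sum(~pred & true)   # predicted negative, actually positive
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}

metrics_example = precision_recall_f1_sketch([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# The epsilon keeps the all-negative case finite instead of dividing by zero.
empty_example = precision_recall_f1_sketch([0, 0], [0, 0])
```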
Model Utilities
Training / Evaluation Mode
model.train() # Set training=True, returns self (for chaining)
model.eval() # Set training=False, returns self
These affect BatchNorm (batch vs running stats) and Dropout (active vs identity).
Learning Rate Control
model.set_lr(0.0001) # Set learning rate
lr = model.get_lr() # Get current learning rate
Gradient Clipping
model.clip_gradients(max_norm=1.0)
Clips the L2 norm of all deltas across all layers:
- Computes total_norm = sqrt(sum(||delta||^2)) over all non-None deltas.
- If total_norm > max_norm, scales all deltas by max_norm / total_norm.
Call this after Backward() and before update() to prevent exploding gradients.
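The two steps above amount to clipping by global norm. The sketch below applies that rule to a plain list of arrays; clip_by_global_norm is a hypothetical helper, and the list stands in for the per-layer deltas.

```python
import numpy as np

def clip_by_global_norm(deltas, max_norm=1.0):
    # Sum squared entries over every non-None delta, then take the root.
    total_sq = sum(float(np.sum(d ** 2)) for d in deltas if d is not None)
    total_norm = float(np.sqrt(total_sq))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        deltas = [None if d is None else d * scale for d in deltas]
    return deltas, total_norm

example_deltas = [np.array([3.0, 0.0]), None, np.array([[4.0]])]
clipped, norm_before = clip_by_global_norm(example_deltas, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(d ** 2) for d in clipped if d is not None)))
```

Because every delta is scaled by the same factor, the direction of the overall update is preserved and only its length changes.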
Layer Freezing / Unfreezing
model.freeze() # Freeze all layers
model.freeze(2) # Freeze layer index 2 only
model.unfreeze() # Unfreeze all layers
model.unfreeze(2) # Unfreeze layer index 2 only
Frozen layers are skipped during update(). The _frozen flag is checked in the optimizer loop. This is useful for transfer learning and fine-tuning.
Weight Get / Set
weights = model.get_weights() # Returns list of dicts, one per layer
model.set_weights(weights) # Restores weights from list of dicts
get_weights() copies "weights", "bias", "gamma", "beta", and "mask" for each layer. set_weights() restores them. Useful for checkpointing, model averaging, and transfer.
Model Copying
net_copy = model.copy()
Creates a deep copy of the network including layers and optimizer state. The new network has the same architecture, weights, and optimizer buffers but is an independent object.
Optimizer State Reset
model.reset_optimizer_state()
Clears all optimizer momentum/velocity buffers and resets the timestep t to 0. Useful when starting training on a new dataset or after significant hyperparameter changes.
NaN / Inf Detection
issues = model.check_nan_inf()
# Returns list of strings like ["Layer 2 weights has NaN/Inf", "Delta 5 has NaN/Inf"]
Checks all weights, biases, gamma, beta, and deltas for non-finite values. Returns an empty list if everything is clean. Call this periodically during training to catch numerical instability early.
Model Summary
model.summary()
Prints a formatted table showing:
- Optimizer, learning rate, L2 lambda
- Per-layer information: type, input/output shapes, parameter counts
- Total parameter count
Example output:
Model Summary
======================================================================
Optimizer: ADAM | LR: 0.001 | L2: 0.01
======================================================================
Layer 0: DENSE Input: 784 Output: 256 Params: 200960
Layer 1: DROPOUT
Layer 2: DENSE Input: 256 Output: 10 Params: 2570
Total Parameters: 203530
======================================================================
Reinforcement Learning
All RL methods are in reinforce.py and bound to NeuralNet.
Evolutionary Strategy (Evolve)
best_score = model.Evolve(inputs, score_fn, noise=0.05, tries=10, sigma=1.0)
A black-box optimization method that perturbs network weights with Gaussian noise and keeps the best variant.
| Parameter | Type | Default | Description |
|---|---|---|---|
| inputs | ndarray | required | Input data to evaluate the network on. |
| score_fn | callable | required | Function that takes network output and returns a scalar score (higher is better). |
| noise | float | 0.05 | Standard deviation scale for weight perturbations. |
| tries | int | 10 | Number of candidate networks to try. |
| sigma | float | 1.0 | Additional scaling factor for noise. |
Algorithm:
- Evaluate the current network on inputs to get a baseline score.
- For each try, create a deep copy of the network, add Gaussian noise to all weights and biases (respecting sparse masks), evaluate the candidate.
- If the candidate scores higher, keep it as the new best.
- Restore the best network.
Returns the best score achieved.
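The loop can be sketched on a bare list of parameter arrays. Everything here is illustrative: evolve_sketch and forward are hypothetical, noise and sigma are assumed to multiply together, each candidate is assumed to perturb the original parameters rather than the current best, and sparse masks are ignored.

```python
import numpy as np

def evolve_sketch(params, forward, inputs, score_fn,
                  noise=0.05, tries=10, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    best_params = [p.copy() for p in params]
    best_score = score_fn(forward(best_params, inputs))  # baseline
    for _ in range(tries):
        # Candidate = copy of the original parameters plus Gaussian noise.
        candidate = [p + noise * sigma * rng.standard_normal(p.shape) for p in params]
        score = score_fn(forward(candidate, inputs))
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

def forward(params, x):
    w, b = params
    return x @ w + b

rng = np.random.default_rng(0)
inputs = rng.standard_normal((8, 3))
target = inputs @ np.array([[1.0], [-2.0], [0.5]])
start = [np.zeros((3, 1)), np.zeros(1)]
score_fn = lambda out: -float(np.mean((out - target) ** 2))  # higher is better
baseline = score_fn(forward(start, inputs))
best_params, best_score = evolve_sketch(start, forward, inputs, score_fn, rng=rng)
```

The returned score can never fall below the baseline, because a candidate only replaces the current best when it scores strictly higher.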
REINFORCE (Policy Gradient)
mean_return = model.Reinforce(
states, actions, returns,
action_type="discrete", std=1.0, normalize_returns=True
)
Monte-Carlo policy gradient method.
| Parameter | Type | Default | Description |
|---|---|---|---|
| states | ndarray | required | Observed states, shape (N, features). |
| actions | ndarray | required | Discrete: (N,) integer indices. Continuous: (N, action_dim). |
| returns | ndarray | required | Discounted returns for each state-action pair, shape (N,) or (N, 1). |
| action_type | str | "discrete" | "discrete" (categorical) or "continuous" (Gaussian). |
| std | float | 1.0 | Standard deviation for continuous Gaussian policy. |
| normalize_returns | bool | True | Whether to z-score normalize returns before computing gradients. |
Discrete actions: Network output is treated as action probabilities. The gradient is (out - one_hot(actions)) * returns / batch_size.
Continuous actions: Network output is treated as action means. The gradient is -(actions - means) / std^2 * returns / batch_size.
Returns the mean of the raw (un-normalized) returns.
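Both gradient expressions are easy to reproduce in NumPy. The helpers below are hypothetical and only mirror the formulas quoted above; the 1e-8 epsilon in the z-score is an assumption. For a softmax policy, (probs - one_hot) is the gradient of -log pi(a|s) with respect to the logits, which is why the discrete form needs no explicit log.

```python
import numpy as np

def zscore(returns, eps=1e-8):
    returns = np.asarray(returns, dtype=float).reshape(-1, 1)
    return (returns - returns.mean()) / (returns.std() + eps)

def discrete_policy_grad(probs, actions, returns):
    probs = np.asarray(probs, dtype=float)
    returns = np.asarray(returns, dtype=float).reshape(-1, 1)
    one_hot = np.eye(probs.shape[1])[np.asarray(actions)]
    return (probs - one_hot) * returns / probs.shape[0]

def continuous_policy_grad(means, actions, returns, std=1.0):
    means = np.asarray(means, dtype=float)
    actions = np.asarray(actions, dtype=float)
    returns = np.asarray(returns, dtype=float).reshape(-1, 1)
    return -(actions - means) / std ** 2 * returns / means.shape[0]

grad_discrete = discrete_policy_grad([[0.5, 0.5], [0.25, 0.75]], [0, 1], [1.0, 2.0])
grad_continuous = continuous_policy_grad([[0.0]], [[1.0]], [2.0])
normalized = zscore([1.0, 2.0, 3.0])
```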
Proximal Policy Optimization (PPO)
policy_loss = model.PPO(
states, actions, old_log_probs, advantages,
action_type="discrete", epsilon=0.2, std=1.0,
value_targets=None, value_coeff=0.5, entropy_coeff=0.01
)
PPO is a policy gradient method that clips the objective to prevent overly large policy updates.
| Parameter | Type | Default | Description |
|---|---|---|---|
| states | ndarray | required | Observed states, shape (N, features). |
| actions | ndarray | required | Discrete: (N,) integers. Continuous: (N, action_dim). |
| old_log_probs | ndarray | required | Log probabilities under the old policy, shape (N, 1). |
| advantages | ndarray | required | Advantage estimates, shape (N, 1). |
| action_type | str | "discrete" | "discrete" or "continuous". |
| epsilon | float | 0.2 | Clipping parameter. |
| std | float | 1.0 | Fixed std for continuous Gaussian policy. |
| value_targets | ndarray or None | None | Target values for value head (not yet implemented in gradient). |
| value_coeff | float | 0.5 | Coefficient for value loss (reserved). |
| entropy_coeff | float | 0.01 | Coefficient for entropy bonus (encourages exploration). |
Discrete PPO:
- Computes action probabilities from network output.
- Computes log probabilities for the taken actions.
- Computes probability ratio ratio = exp(new_log_prob - old_log_prob).
- Computes clipped surrogate objective: min(ratio * advantage, clip(ratio, 1-eps, 1+eps) * advantage).
- Approximates policy gradient: for each sample, if not clipped, gradient flows through the taken action proportional to -advantage / prob.
- Adds entropy gradient: (1 + log_probs) * entropy_coeff / batch_size.
Continuous PPO: Uses Gaussian log probabilities and computes gradient as -(actions - means) / std^2 * advantages / batch_size.
Returns the mean policy loss (negative of the clipped objective).
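The clipped surrogate itself is a few lines of NumPy. This sketch covers only the loss value, not the gradient approximation or the entropy term, and ppo_clipped_loss is a hypothetical name.

```python
import numpy as np

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    objective = np.minimum(unclipped, clipped)
    # The loss is the negative of the mean clipped objective.
    return float(-objective.mean()), ratio

new_lp = np.log(np.array([[0.5], [0.9]]))
old_lp = np.log(np.array([[0.25], [0.9]]))
adv = np.array([[1.0], [-1.0]])
loss_value, ratios = ppo_clipped_loss(new_lp, old_lp, adv)
```

In the first sample the ratio is 2.0 with a positive advantage, so the objective is capped at 1.2; the second sample has ratio 1.0 and passes through unchanged.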
Actor-Critic
value_loss = model.ActorCritic(
states, actions, returns, values,
action_type="discrete", std=1.0
)
Combines policy gradient with a value function baseline.
| Parameter | Type | Default | Description |
|---|---|---|---|
| states | ndarray | required | Observed states. |
| actions | ndarray | required | Taken actions. |
| returns | ndarray | required | Discounted returns, shape (N, 1). |
| values | ndarray | required | Predicted values from value network, shape (N, 1). |
| action_type | str | "discrete" | "discrete" or "continuous". |
| std | float | 1.0 | Std for continuous actions. |
Algorithm:
- Computes advantages: advantages = returns - values.
- Uses the same policy gradient as REINFORCE but weighted by advantages instead of raw returns.
- Returns the mean squared advantage (a proxy for value function error).
Note: This implementation uses a single network. In practice, you may want separate actor and critic networks or a network with two output heads.
RL Utility Functions
compute_returns
from Enilnets import compute_returns
returns = compute_returns(rewards, gamma=0.99)
Computes discounted returns for a single episode:
G_t = reward_t + gamma * G_{t+1}
Iterates backwards through the reward array. Returns an array of the same shape.
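A standalone sketch of the same recursion, for reference. The name discounted_returns is hypothetical and the library's own compute_returns may differ in details such as dtype handling.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# example_returns is [1.75, 1.5, 1.0]
```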
gae (Generalized Advantage Estimation)
from Enilnets.generative.sampling import gae
advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
| Parameter | Type | Default | Description |
|---|---|---|---|
| rewards | ndarray | required | Step rewards, shape (T,). |
| values | ndarray | required | Value estimates including bootstrap, shape (T+1,). |
| gamma | float | 0.99 | Discount factor. |
| lambda_ | float | 0.95 | GAE lambda (0 = high bias, 1 = high variance). |
Computes TD-residuals delta_t = reward_t + gamma * V(s_{t+1}) - V(s_t), then accumulates them with exponential decay: A_t = delta_t + gamma * lambda * A_{t+1}. Returns (advantages, returns) where returns = advantages + values[:T].
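The recursion can be written out directly. gae_sketch is a hypothetical stand-in that follows the formulas in the paragraph above; it is not the library function.

```python
import numpy as np

def gae_sketch(rewards, values, gamma=0.99, lambda_=0.95):
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)   # length T + 1 (includes bootstrap)
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lambda_ * running
        advantages[t] = running
    returns = advantages + values[:T]
    return advantages, returns

# lambda_=0 reduces to one-step TD residuals.
adv_td, _ = gae_sketch([1.0, 2.0], [0.5, 1.0, 0.0], gamma=1.0, lambda_=0.0)
# lambda_=1 with zero values reduces to plain discounted returns.
adv_mc, ret_mc = gae_sketch([1.0, 1.0, 1.0], np.zeros(4), gamma=0.5, lambda_=1.0)
```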
Generative AI Framework
All generative models are in the generative/ subpackage and imported from the top-level Enilnets package.
Variational Autoencoder (VAE)
File: generative/vae.py
A VAE learns a probabilistic latent representation of data. It consists of an encoder (maps data to latent distribution parameters) and a decoder (maps latent samples back to data).
from Enilnets import VAE
vae = VAE(
input_dim=784, latent_dim=32,
encoder_hidden=[512, 256],
decoder_hidden=[256, 512],
activation="swish",
learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_dim | int | required | Dimensionality of input data (flattened). |
| latent_dim | int | required | Dimensionality of the latent space. |
| encoder_hidden | list[int] | [512, 256] | Hidden layer sizes for the encoder. |
| decoder_hidden | list[int] | [256, 512] | Hidden layer sizes for the decoder. |
| activation | str | "swish" | Activation for hidden layers. |
| learning_rate | float | 0.001 | Learning rate for both encoder and decoder. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Architecture:
- Encoder: Dense layers with specified activation, final layer outputs latent_dim * 2 values (mu and logvar) with linear activation.
- Decoder: Dense layers with specified activation, final layer outputs input_dim values with sigmoid activation (assumes data in [0, 1]).
Methods:
| Method | Signature | Description |
|---|---|---|
| encode | encode(x) -> (mu, logvar) | Maps input to latent distribution parameters. |
| decode | decode(z) -> recon | Maps latent samples to reconstructed data. |
| forward | forward(x) -> (recon, mu, logvar, z) | Full forward pass through encoder + reparameterization + decoder. |
| loss | loss(x, recon=None, mu=None, logvar=None) -> float | Computes reconstruction loss (binary cross-entropy) + KL divergence. |
| train_step | train_step(x) -> float | One training step: forward, backward through decoder, backward through encoder, update both networks. |
| Train | Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] | Full training loop. Returns list of average losses per epoch. |
| generate | generate(n_samples=1) -> ndarray | Samples from N(0, I) in latent space and decodes. |
| reconstruct | reconstruct(x) -> ndarray | Encodes and decodes input (reconstruction). |
| interpolate | interpolate(x1, x2, n_steps=10) -> ndarray | Linear interpolation in latent space between two inputs. |
Training details:
- Forward: encode -> reparameterize -> decode.
- Decoder backward: computes d_recon = (recon - x) / batch_size, multiplies by sigmoid derivative recon * (1 - recon), backpropagates through decoder, updates decoder weights.
- Encoder backward: takes the gradient of the loss w.r.t. the latent samples (the decoder input), combines it with the reparameterization gradients to get d_mu and d_logvar, backpropagates through encoder, updates encoder weights.
- Loss = binary cross-entropy reconstruction + KL(q(z|x) || N(0, I)).
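The loss and the sampling step can be sketched in NumPy as follows. The reduction (sum over features, mean over the batch) and the clipping epsilon are assumptions; vae_loss_sketch and reparameterize_sketch are hypothetical names, not the library's functions.

```python
import numpy as np

def reparameterize_sketch(mu, logvar, rng):
    # z = mu + sigma * eps, with sigma = exp(0.5 * logvar) and eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def vae_loss_sketch(x, recon, mu, logvar, eps=1e-12):
    recon = np.clip(recon, eps, 1 - eps)  # keep log() finite
    bce = -np.sum(x * np.log(recon) + (1 - x) * np.log(1 - recon), axis=1)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(bce + kl)), float(np.mean(bce)), float(np.mean(kl))

rng = np.random.default_rng(0)
x = np.full((2, 4), 0.5)
mu = np.zeros((2, 3))
logvar = np.zeros((2, 3))
total, bce_term, kl_term = vae_loss_sketch(x, x.copy(), mu, logvar)
z = reparameterize_sketch(mu, logvar, rng)
```

With mu = 0 and logvar = 0 the approximate posterior equals the prior, so the KL term is zero and the total is just the reconstruction term.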
Generative Adversarial Network (GAN)
File: generative/gan.py
A GAN trains a generator to produce realistic data and a discriminator to distinguish real from fake.
from Enilnets import GAN
gan = GAN(
latent_dim=100, data_dim=784,
generator_hidden=[256, 512],
discriminator_hidden=[512, 256],
g_activation="swish", d_activation="leakyrelu",
loss_type="bce",
learning_rate=0.0002, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| latent_dim | int | required | Dimensionality of the noise vector. |
| data_dim | int | required | Dimensionality of generated data. |
| generator_hidden | list[int] | [256, 512] | Hidden layer sizes for generator. |
| discriminator_hidden | list[int] | [512, 256] | Hidden layer sizes for discriminator. |
| g_activation | str | "swish" | Generator hidden activation. |
| d_activation | str | "leakyrelu" | Discriminator hidden activation. |
| loss_type | str | "bce" | "bce", "bce_logits", or "wasserstein". |
| learning_rate | float | 0.0002 | LR for both networks. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Architecture:
- Generator: Dense layers with g_activation, final layer with tanh activation.
- Discriminator: Dense layers with d_activation, final layer with sigmoid (for BCE) or linear (for Wasserstein).
Methods:
| Method | Signature | Description |
|---|---|---|
| generate | generate(n_samples) -> ndarray | Samples noise and runs generator forward. |
| discriminate | discriminate(x) -> ndarray | Runs discriminator forward. |
| Train | Train(X_train, epochs=10, batch_size=64, d_steps=1, g_steps=1, verbose=True) -> dict | Alternates discriminator and generator training. |
| sample | sample(n_samples=16) -> ndarray | Alias for generate. |
Loss types:
| Type | Discriminator Target | Generator Gradient |
|---|---|---|
| "bce" | Real=1, Fake=0 | -1 / D(fake) |
| "bce_logits" | Real=1, Fake=0 | Logits-based stable gradient |
| "wasserstein" | Real=1, Fake=-1 | -1 (constant) |
Training loop:
- For each batch, train discriminator for d_steps iterations on real + fake data.
- Train generator for g_steps iterations by backpropagating through the discriminator to get gradients w.r.t. the generator output (the discriminator's input), then through the generator.
- Track and report D_loss and G_loss per epoch.
Diffusion Model (DDPM)
File: generative/diffusion.py
Implements Denoising Diffusion Probabilistic Models (DDPM). The model learns to reverse a gradual noising process.
from Enilnets import DiffusionModel
diffusion = DiffusionModel(
data_shape=(784,), time_steps=1000,
beta_schedule="linear", beta_start=1e-4, beta_end=0.02,
denoiser_type="mlp", denoiser_hidden=[512, 512, 512],
learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_shape | tuple | required | Shape of data. (D,) for flattened, (C, H, W) for images. |
| time_steps | int | 1000 | Number of diffusion timesteps. |
| beta_schedule | str | "linear" | "linear" or "cosine" noise schedule. |
| beta_start | float | 1e-4 | Starting beta value (linear schedule). |
| beta_end | float | 0.02 | Ending beta value (linear schedule). |
| denoiser_type | str | "mlp" | "mlp" or "conv" denoiser architecture. |
| denoiser_hidden | list[int] | [512, 512, 512] | Hidden sizes for MLP denoiser. |
| learning_rate | float | 0.001 | Learning rate. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Noise schedules:
- Linear: betas = linspace(beta_start, beta_end, time_steps)
- Cosine: Uses a cosine-squared schedule with offset s=0.008 for smoother noise addition.
Precomputed constants (computed in __init__):
- alphas = 1 - betas
- alphas_cumprod = cumprod(alphas) -- cumulative product of alphas
- sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod -- for forward diffusion
- sqrt_recip_alphas, posterior_variance -- for reverse diffusion
Methods:
| Method | Signature | Description |
|---|---|---|
| train_step | train_step(x_0) -> float | One training step: sample timestep t, add noise, predict noise, compute MSE loss, backpropagate. |
| Train | Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] | Full training loop. |
| sample | sample(n_samples=16, shape=None, clip=True) -> ndarray | Generate samples by iteratively denoising from pure noise. |
| denoise | denoise(x_noisy, t_start, t_end=0) -> ndarray | Denoise a partially noised input from timestep t_start down to t_end. |
Denoiser architectures:
- MLP: Concatenates flattened input with sinusoidal time embedding, passes through dense layers.
- Conv: Stack of conv2d layers. Time embedding is broadcast spatially and added to feature maps.
Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
Reverse diffusion: At each step, predict noise, compute mean of p(x_{t-1} | x_t), add scaled noise (except at t=0).
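A compact sketch of the linear schedule and the forward (noising) step, for flattened data. The posterior variance uses the standard DDPM expression; whether the library computes it exactly this way is an assumption, and the helper names are hypothetical.

```python
import numpy as np

def make_linear_schedule(time_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, time_steps)
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    alphas_cumprod_prev = np.concatenate([[1.0], alphas_cumprod[:-1]])
    return {
        "betas": betas,
        "alphas_cumprod": alphas_cumprod,
        "sqrt_alphas_cumprod": np.sqrt(alphas_cumprod),
        "sqrt_one_minus_alphas_cumprod": np.sqrt(1.0 - alphas_cumprod),
        "sqrt_recip_alphas": np.sqrt(1.0 / alphas),
        # Assumption: standard DDPM posterior variance.
        "posterior_variance": betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod),
    }

def q_sample(x0, t, noise, sched):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, per sample.
    a = sched["sqrt_alphas_cumprod"][t][:, None]
    b = sched["sqrt_one_minus_alphas_cumprod"][t][:, None]
    return a * x0 + b * noise

sched = make_linear_schedule()
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 6))
t = np.array([0, 10, 500, 999])
x_t = q_sample(x0, t, rng.standard_normal(x0.shape), sched)
```

At t = 0 the sample is almost unchanged, while at t = 999 the signal coefficient is close to zero and x_t is nearly pure noise.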
Autoregressive Model (MADE)
File: generative/autoregressive.py
A masked autoregressive model that enforces causality -- each output dimension only depends on previous dimensions.
from Enilnets import AutoregressiveModel
ar = AutoregressiveModel(
data_dim=784, hidden_dims=[512, 512],
data_shape=(28, 28), activation="swish",
learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_dim | int | required | Total number of dimensions. |
| hidden_dims | list[int] | [512, 512] | Hidden layer sizes. |
| data_shape | tuple or None | None | Original shape for reshaping output (e.g., (28, 28)). |
| activation | str | "swish" | Hidden activation. |
| learning_rate | float | 0.001 | Learning rate. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Architecture: Standard MLP with linear output activation.
Causality enforcement: A lower-triangular mask (with zeros on and above diagonal) is applied to the input before feeding it to the network: x_masked[i] = sum_{j < i} mask[i,j] * x[j]. This ensures the i-th output only sees dimensions 0 through i-1.
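The masking formula can be tried on a toy vector. This sketch only shows the mask arithmetic stated above (a strictly lower-triangular matrix applied to the input); how the library feeds the result into the network is not shown here.

```python
import numpy as np

D = 4
# Ones strictly below the diagonal, zeros on and above it.
mask = np.tril(np.ones((D, D)), k=-1)

x = np.array([1.0, 2.0, 3.0, 4.0])
x_masked = mask @ x            # x_masked[i] = sum of x[j] for j < i

# Changing a later dimension leaves earlier masked entries untouched.
x_changed = x.copy()
x_changed[2] = 100.0
x_masked_changed = mask @ x_changed
```

Here x_masked is [0, 1, 3, 6]: position 0 sees nothing, and position i sees only dimensions 0 through i-1.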
Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(x, training=True) -> ndarray | Causal forward pass. |
| loss | loss(x) -> float | MSE between predictions and targets. |
| train_step | train_step(x) -> float | One training step with custom backpropagation. |
| Train | Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] | Full training loop. |
| generate | generate(n_samples=1, shape=None) -> ndarray | Autoregressive sampling: predict dim 0, use it to predict dim 1, etc. |
| complete | complete(partial_x, n_dims=None) -> ndarray | Complete a partial sample by autoregressively filling remaining dimensions. |
Generation: Starts with zeros, iteratively predicts each dimension and adds small Gaussian noise for diversity.
Normalizing Flows (RealNVP)
File: generative/flows.py
Real-valued Non-Volume Preserving (RealNVP) flow for density estimation and sampling.
from Enilnets import RealNVP
flow = RealNVP(
data_dim=784, n_coupling=4, hidden_dim=256,
activation="swish",
learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_dim | int | required | Dimensionality of data. |
| n_coupling | int | 4 | Number of coupling layers. |
| hidden_dim | int | 256 | Hidden dimension for s and t networks. |
| activation | str | "swish" | Activation for coupling networks. |
| learning_rate | float | 0.001 | Learning rate. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Architecture: Each coupling layer has two networks:
- s_net (scale): Predicts log-scale factors. Output activation is tanh.
- t_net (translation): Predicts translation. Output activation is linear.
Both are 3-layer MLPs: data_dim//2 -> hidden_dim -> hidden_dim -> data_dim - data_dim//2.
Coupling transform: For input split into x1 and x2:
y2 = x2 * exp(s(x1)) + t(x1)
output = concat(x1, y2)
log_det += sum(s(x1))
Alternating masks (even/odd splits) ensure all dimensions are transformed.
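A single coupling layer and its inverse can be sketched with toy s and t functions. This is a simplification: it uses a contiguous half split rather than alternating even/odd masks, and the fixed random matrices W_s and W_t are hypothetical stand-ins for the s_net and t_net MLPs.

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn):
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    s, t = s_fn(x1), t_fn(x1)
    y2 = x2 * np.exp(s) + t
    # log|det J| is the sum of the log-scales, one value per sample.
    return np.concatenate([x1, y2], axis=1), s.sum(axis=1)

def coupling_inverse(y, s_fn, t_fn):
    d = y.shape[1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    # x1 passes through unchanged, so s and t can be recomputed from y1.
    x2 = (y2 - t_fn(y1)) * np.exp(-s_fn(y1))
    return np.concatenate([y1, x2], axis=1)

rng = np.random.default_rng(0)
data_dim = 5
W_s = rng.standard_normal((data_dim // 2, data_dim - data_dim // 2))
W_t = rng.standard_normal((data_dim // 2, data_dim - data_dim // 2))
s_fn = lambda h: np.tanh(h @ W_s)   # tanh keeps log-scales bounded
t_fn = lambda h: h @ W_t

x = rng.standard_normal((3, data_dim))
y, log_det = coupling_forward(x, s_fn, t_fn)
x_back = coupling_inverse(y, s_fn, t_fn)
```

The inverse never needs to invert s or t themselves, which is what makes the transform cheap in both directions.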
Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(x) -> (z, log_det) | Maps data to latent space. Returns transformed data and log determinant. |
| inverse | inverse(z) -> x | Maps latent samples back to data space. |
| log_prob | log_prob(x) -> ndarray | Computes log probability: log p(z) + log_det where p(z) is N(0, I). |
| loss | loss(x) -> float | Negative mean log probability. |
| train_step | train_step(x) -> float | Returns loss (training uses Evolve, see below). |
| Train | Train(X_train, epochs=10, batch_size=64, verbose=True) -> list[float] | Trains each coupling layer sequentially using evolutionary strategy. |
| sample | sample(n_samples=1) -> ndarray | Samples from base distribution and applies inverse transform. |
| interpolate | interpolate(x1, x2, n_steps=10) -> ndarray | Linear interpolation in latent space. |
Training: Uses Evolve (evolutionary strategy) rather than analytical backprop through the log-determinant Jacobian. Each coupling layer is trained while keeping previous layers fixed.
Energy-Based Model (EBM)
File: generative/ebm.py
An energy-based model that assigns low energy to real data and high energy to generated samples.
from Enilnets import EnergyBasedModel
ebm = EnergyBasedModel(
data_dim=784, hidden_dims=[512, 512],
activation="swish",
learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_dim | int | required | Dimensionality of data. |
| hidden_dims | list[int] | [512, 512] | Hidden layer sizes for energy network. |
| activation | str | "swish" | Hidden activation. |
| learning_rate | float | 0.001 | Learning rate. |
| optimizer | str | "adam" | Optimizer type. |
| l2_lambda | float | 0.0 | L2 regularization. |
Architecture: MLP mapping data to a scalar energy value. Final layer has linear activation.
Methods:
| Method | Signature | Description |
|---|---|---|
| energy | energy(x) -> ndarray | Computes scalar energy for input data, shape (batch, 1). |
| _energy_grad | _energy_grad(x) -> (energy, grad) | Finite-difference gradient of energy w.r.t. input (eps=1e-4). |
| train_step | train_step(x_data, n_cd_steps=10, step_size=0.1, noise_scale=0.005) -> float | One contrastive divergence step. |
| Train | Train(X_train, epochs=10, batch_size=64, n_cd_steps=10, step_size=0.1, noise_scale=0.005, verbose=True) -> list[float] | Full training loop. |
| sample | sample(n_samples=1, n_steps=100, step_size=0.1, noise_scale=0.005) -> ndarray | Langevin dynamics sampling from random initialization. |
| score | score(x) -> ndarray | Returns the energy gradient (score function). |
Training (Contrastive Divergence):
- Generate negative samples by running Langevin dynamics from random noise for n_cd_steps iterations.
- Compute energy on real data and negative samples.
- Update network to push down energy on real data (target=1) and push up on negative samples (target=-1).
- Loss = mean(energy(data) - energy(negative_samples)).
Sampling: Langevin dynamics iterates x = x - step_size * grad(energy) + noise_scale * N(0, I).
UNet Denoiser
File: generative/unet.py
A UNet architecture for spatial denoising, designed for use with diffusion models.
from Enilnets import UNetDenoiser, time_embedding
unet = UNetDenoiser(
in_ch=1, base_ch=64, time_emb_dim=128,
ch_mult=(1, 2, 4)
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| in_ch | int | required | Number of input channels. |
| base_ch | int | 64 | Base number of channels. |
| time_emb_dim | int | 128 | Dimensionality of time embedding. |
| ch_mult | tuple | (1, 2, 4) | Channel multipliers for each encoder level. |
Architecture:
- Time embedding MLP: time_emb_dim -> time_emb_dim*4 -> time_emb_dim*4 with swish activation.
- Encoder path: len(ch_mult) levels. Each level has 2 conv2d layers (k=1) with swish activation. Time embedding is added to features (broadcast spatially if channel dims match).
- Downsampling: Average pooling by factor 2 between encoder levels.
- Bottleneck: 2 conv2d layers at the deepest level.
- Decoder path: Mirrors encoder with skip connections. Upsamples, concatenates with skip, applies 2 conv2d layers.
- Output: Conv2d (k=1) mapping back to in_ch channels with linear activation.
Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(x, t) -> ndarray | Full UNet forward pass with time conditioning. |
| backward | backward(grad_output) | Raises NotImplementedError. |
| get_params | get_params() -> list[NeuralNet] | Returns all sub-networks for external optimization. |
Important: The UNet uses k=1 convolutions to avoid spatial dimension changes (since the library has no padding support). The backward method is not implemented; use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation for the UNet.
time_embedding function:
emb = time_embedding(t, dim, max_period=10000)
Sinusoidal time embedding used in diffusion models:
freqs = exp(-log(max_period) * arange(dim//2) / (dim//2))
emb = [sin(t * freqs), cos(t * freqs)]
If dim is odd, a zero column is appended.
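A standalone NumPy version of the formula above, written as a sketch. The name time_embedding_sketch is hypothetical, and the output layout (all sines, then all cosines, then the optional zero column) follows the description rather than the library source.

```python
import numpy as np

def time_embedding_sketch(t, dim, max_period=10000):
    t = np.asarray(t, dtype=float).reshape(-1, 1)       # (N, 1)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs                                     # (N, half)
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=1)
    if dim % 2 == 1:
        # Pad odd dimensions with a zero column.
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=1)
    return emb

emb_even = time_embedding_sketch([0, 10, 999], dim=8)
emb_odd = time_embedding_sketch([0, 10, 999], dim=7)
```

At t = 0 every sine entry is 0 and every cosine entry is 1, which makes the first row an easy sanity check.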
Sampling Utilities
File: generative/sampling.py
VAE Reparameterization
from Enilnets import reparameterize
z = reparameterize(mu, logvar)
The reparameterization trick: z = mu + exp(0.5 * logvar) * eps where eps ~ N(0, I).
Enables backpropagation through stochastic nodes by separating the randomness from the parameters.
Langevin Dynamics
from Enilnets import langevin_dynamics
x_sampled = langevin_dynamics(energy_fn, x_init, n_steps=20, step_size=0.1, noise_scale=0.005)
Langevin Monte Carlo for sampling from energy-based models:
for step in range(n_steps):
    energy, grad = energy_fn(x)
    x = x - step_size * grad + noise_scale * N(0, I)
| Parameter | Type | Default | Description |
|---|---|---|---|
| energy_fn | callable | required | Function taking x and returning (energy, grad_energy). |
| x_init | ndarray | required | Initial samples. |
| n_steps | int | 20 | Number of Langevin steps. |
| step_size | float | 0.1 | Gradient descent step size. |
| noise_scale | float | 0.005 | Standard deviation of injected noise. |
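To see the update rule in action without the library, the sketch below runs it on a quadratic energy whose gradient is known in closed form. Both langevin_sketch and quadratic_energy are hypothetical; the energy function follows the documented contract of returning (energy, grad_energy).

```python
import numpy as np

def langevin_sketch(energy_fn, x_init, n_steps=20, step_size=0.1,
                    noise_scale=0.005, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        _, grad = energy_fn(x)
        # Gradient step downhill in energy plus a small Gaussian kick.
        x = x - step_size * grad + noise_scale * rng.standard_normal(x.shape)
    return x

def quadratic_energy(x):
    # E(x) = 0.5 * ||x||^2, so the gradient is x itself.
    return 0.5 * np.sum(x ** 2, axis=1, keepdims=True), x

x_start = np.full((3, 2), 4.0)
x_noiseless = langevin_sketch(quadratic_energy, x_start, noise_scale=0.0)
x_noisy = langevin_sketch(quadratic_energy, x_start, rng=np.random.default_rng(0))
```

With the noise switched off, each step multiplies x by 0.9, so after 20 steps the samples sit at 4 * 0.9^20, close to the energy minimum at the origin.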
Gaussian & Uniform Sampling
from Enilnets import gaussian_sample, uniform_sample
# Gaussian: N(mean, std^2)
samples = gaussian_sample(mean, std, shape=None)
# If shape is None, uses mean.shape
# Uniform: U(low, high)
samples = uniform_sample(low, high, shape)
Gumbel-Softmax
from Enilnets import gumbel_softmax_sample
samples = gumbel_softmax_sample(logits, temperature=1.0, hard=False)
Differentiable sampling from categorical distributions:
- Add Gumbel noise: y = logits + Gumbel(0, 1)
- Apply temperature-scaled softmax: y_soft = softmax(y / temperature)
- If hard=True, use straight-through estimator: y_hard - y_soft + y_soft (discrete forward, continuous backward).
Useful for training models with discrete latent variables.
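The three steps can be sketched in NumPy as below. The name gumbel_softmax_sketch is hypothetical. Since plain NumPy has no automatic differentiation, the hard branch simply returns the one-hot value, which is what the straight-through expression evaluates to in the forward pass.

```python
import numpy as np

def gumbel_softmax_sketch(logits, temperature=1.0, hard=False, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via the inverse CDF; clip to keep both logs finite.
    u = np.clip(rng.random(logits.shape), 1e-12, 1 - 1e-12)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)       # numerically stable softmax
    y_soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return y_soft
    # Forward value of the straight-through estimator: a one-hot vector.
    return np.eye(logits.shape[-1])[y_soft.argmax(axis=-1)]

rng = np.random.default_rng(0)
logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
soft = gumbel_softmax_sketch(logits, temperature=0.5, rng=rng)
hard = gumbel_softmax_sketch(logits, temperature=0.5, hard=True, rng=rng)
```

Lower temperatures push the soft samples closer to one-hot vectors, at the cost of higher-variance gradients in frameworks that do differentiate through them.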
Random Masking
from Enilnets import random_mask
mask = random_mask(shape, ratio)
Generates a binary mask where each element is 1 with probability ratio (keep ratio), 0 otherwise. Returns np.float64 array.
Top-p (Nucleus) Sampling
from Enilnets.generative.sampling import top_p_sampling
samples = top_p_sampling(logits, p=0.9, temperature=1.0)
Nucleus sampling for text generation:
- Convert logits to probabilities with temperature scaling.
- Sort probabilities in descending order.
- Find the smallest set of tokens whose cumulative probability exceeds p.
- Sample from this truncated distribution.
Always keeps at least the top token. Returns one-hot encoded samples.
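A minimal sketch of the four steps, assuming 2D logits of shape (batch, vocab). The name top_p_sketch is hypothetical, and the tie-breaking and renormalization details are assumptions rather than the library's exact behavior.

```python
import numpy as np

def top_p_sketch(logits, p=0.9, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    logits = np.atleast_2d(np.asarray(logits, dtype=float)) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(probs)
    for i, row in enumerate(probs):
        order = np.argsort(-row)                 # most likely token first
        cumulative = np.cumsum(row[order])
        # Keep tokens up to and including the first one that pushes the
        # cumulative mass past p; this always keeps at least one token.
        cutoff = int(np.searchsorted(cumulative, p, side="right")) + 1
        keep = order[:cutoff]
        kept_probs = row[keep] / row[keep].sum()
        out[i, rng.choice(keep, p=kept_probs)] = 1.0
    return out

rng = np.random.default_rng(0)
peaked = top_p_sketch([[10.0, 0.0, 0.0, 0.0]], p=0.9, rng=rng)
spread = top_p_sketch(np.log([[0.5, 0.3, 0.15, 0.05]]), p=0.7, rng=rng)
```

In the first call the top token alone already exceeds p, so it is always selected; in the second call only the two most likely tokens survive truncation.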
Discounted Returns
from Enilnets import compute_returns
returns = compute_returns(rewards, gamma=0.99)
Computes cumulative discounted returns for a single episode:
G_t = reward_t + gamma * G_{t+1}
Iterates backward through the reward array. Returns array of same shape.
Generalized Advantage Estimation (GAE)
from Enilnets.generative.sampling import gae
advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
| Parameter | Type | Default | Description |
|---|---|---|---|
| rewards | ndarray | required | Step rewards, shape (T,). |
| values | ndarray | required | Value estimates including bootstrap V(s_{T+1}), shape (T+1,). |
| gamma | float | 0.99 | Discount factor. |
| lambda_ | float | 0.95 | GAE lambda parameter. |
Algorithm:
advantages = np.zeros(T)
gae_t = 0.0  # running advantage, starts at zero past the last step
for t in reversed(range(T)):
    delta = rewards[t] + gamma * values[t+1] - values[t]
    gae_t = delta + gamma * lambda_ * gae_t
    advantages[t] = gae_t
returns = advantages + values[:T]
GAE provides a bias-variance tradeoff controlled by lambda_: lambda_=0 gives high-bias TD(0), lambda_=1 gives high-variance Monte Carlo.
Model I/O
File: io.py
JSON Serialization
model.Save("model.json")
Saves the model to a human-readable JSON file. All NumPy arrays are converted to Python lists via a custom encoder. Stores:
- version: 3
- layers: All layer dictionaries
- optimizer: Optimizer type string
- learning_rate, l2_lambda, momentum: Hyperparameters
- t: Global timestep
Note: JSON does not preserve NumPy array types; arrays are stored as nested lists.
Pickle Serialization
model.Save("model.pkl")
Saves the model to a binary pickle file. Preserves NumPy arrays exactly. More compact and faster than JSON. Detected automatically by .pkl extension.
Loading Models
model = NeuralNet() # Create a new instance
model.Load("model.json") # or "model.pkl"
The Load method:
- Detects file format by extension (.pkl for pickle, otherwise JSON).
- Loads the payload.
- Reconstructs layers, converting list data back to np.float64 arrays for all weight-related keys (weights, bias, mask, gamma, beta, running_mean, running_var).
- Restores hyperparameters (learning_rate, optimizer_type, l2_lambda, momentum, t).
- Resets optimizer state (opt_state = []) -- you may want to call reset_optimizer_state() after loading.
Important: The loaded model does not restore optimizer momentum buffers. If you need to resume training exactly, you would need to save and load opt_state separately (not currently supported).
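The JSON round trip described above can be illustrated with the standard library alone. The encoder class and the payload layout below are hypothetical (only the top-level keys come from the list of stored fields), so treat this as a sketch of the technique rather than the contents of io.py.

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Convert NumPy arrays and scalars to plain Python types."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)

payload = {
    "version": 3,
    "optimizer": "adam",
    "learning_rate": 0.001,
    # Hypothetical layer dictionary; the real layer keys may differ.
    "layers": [{"type": "dense", "weights": np.ones((2, 3)), "bias": np.zeros(3)}],
}

text = json.dumps(payload, cls=NumpyEncoder)
restored = json.loads(text)

# Arrays come back as nested lists, so convert weight-related keys explicitly.
for layer in restored["layers"]:
    for key in ("weights", "bias"):
        layer[key] = np.asarray(layer[key], dtype=np.float64)
```

The explicit conversion on load is what the note about lost array types refers to: without it, weights would stay as nested Python lists.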
Known Limitations
- No padding in Conv2D: All convolutions use pad=0, so spatial dimensions shrink by k-1 per layer. The UNet works around this by using k=1 convolutions.
- UNet backward not implemented: The UNetDenoiser.backward() method raises NotImplementedError. Use DiffusionModel with denoiser_type="mlp" for fully trainable diffusion, or implement custom backpropagation.
- Flows use evolutionary training: RealNVP uses Evolve (black-box optimization) rather than analytical backprop through the log-determinant Jacobian. This is slower and less precise than gradient-based training.
- GAN training can be unstable: As with all GANs, convergence depends heavily on architecture, learning rates, and data. The library provides three loss types but no spectral normalization or other advanced stabilization techniques.
- No GPU acceleration: Pure NumPy implementation runs on CPU only. Large models and datasets will be slow.
- Stride support is limited: While stride is stored in conv2d layers, the im2col implementation only fully supports stride=1.
- No recurrent layers: No LSTM, GRU, or vanilla RNN support. For sequence modeling, use the embedding layer with dense layers or implement custom recurrence.
- No automatic differentiation: Gradients are hand-coded for each layer type. Adding new layers requires implementing both forward and backward passes.
- Optimizer state not saved: Save() / Load() do not persist optimizer momentum/velocity buffers. Training resumes from scratch in terms of optimizer state.
- Single precision not supported: All computations use np.float64. This provides numerical stability but uses more memory than float32.
Version History
v2.0.0
Major update including:
- New layer types: LayerNorm, Embedding, GlobalAvgPool2D, Upsample2D, Sparse
- Learning rate schedulers: Step decay, exponential decay, cosine annealing, warmup+cosine, plateau
- Reinforcement learning: REINFORCE, PPO, Actor-Critic, Evolutionary Strategy
- Gradient clipping: L2 norm clipping across all layers
- Layer freezing/unfreezing: Fine-grained control over which layers train
- NaN/Inf detection: `check_nan_inf()` for debugging numerical issues
- Improved BatchNorm: Full 2D and 4D support with proper running statistics
- New loss functions: Cosine similarity, triplet margin, NT-Xent (SimCLR), focal loss, BCE with logits, Wasserstein loss
- Generative AI framework: VAE, GAN, Diffusion Model, Autoregressive Model, RealNVP, Energy-Based Model, UNet Denoiser
- Sampling utilities: Reparameterization, Langevin dynamics, Gumbel-Softmax, top-p sampling, GAE
- Perceptual loss utilities: Placeholder functions for VGG-based perceptual loss
- Model copying and state reset: `copy()` and `reset_optimizer_state()`
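As an illustration of the gradient clipping entry above, here is a minimal NumPy sketch of L2-norm clipping computed jointly over several layers' gradients. It is a generic version of the technique under the assumption that gradients are held as a list of arrays; it is not the library's `clip_gradients` code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-12):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Leave gradients untouched when the norm is already within the limit.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two "layers" whose combined gradient norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Using a single scale factor for all layers preserves the direction of the overall update, which per-layer clipping would not.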
Complete API Reference
NeuralNet Methods
| Method | Description |
|---|---|
| `__init__(learning_rate, optimizer, l2_lambda, momentum)` | Constructor |
| `summary()` | Print architecture summary |
| `add_dense(n_in, n_out, activation, init_method, use_bias)` | Add fully connected layer |
| `add_sparse(n_in, n_out, connectivity, activation, init_method)` | Add sparse connected layer |
| `add_conv2d(in_ch, out_ch, k, activation, init_method, stride)` | Add 2D convolution |
| `add_flatten()` | Add flatten layer |
| `add_maxpool2d(pool_size)` | Add max pooling |
| `add_avgpool2d(pool_size)` | Add average pooling |
| `add_global_avgpool2d()` | Add global average pooling |
| `add_upsample2d(scale_factor)` | Add 2x upsampling |
| `add_batchnorm(num_features, epsilon, momentum)` | Add batch normalization |
| `add_layernorm(normalized_shape, epsilon)` | Add layer normalization |
| `add_dropout(rate)` | Add dropout regularization |
| `add_embedding(vocab_size, embed_dim, init_method)` | Add embedding lookup layer |
| `Forward(inputs, training, dropout_rate)` | Forward pass |
| `predict(inputs)` | Alias for Forward |
| `train()` | Set training mode |
| `eval()` | Set evaluation mode |
| `set_lr(lr)` | Set learning rate |
| `get_lr()` | Get learning rate |
| `freeze(layer_idx)` | Freeze layer(s) |
| `unfreeze(layer_idx)` | Unfreeze layer(s) |
| `clip_gradients(max_norm)` | Clip gradient norms |
| `get_weights()` | Copy all weights |
| `set_weights(weights)` | Restore weights |
| `copy()` | Deep copy network |
| `reset_optimizer_state()` | Clear optimizer buffers |
| `check_nan_inf()` | Check for NaN/Inf |
| `Backward(targets, output_delta)` | Backpropagation |
| `update()` | Apply parameter updates |
| `TrainBatch(xs, ys, loss_function, **kwargs)` | Train one batch |
| `Train(X, Y, epochs, batch_size, X_val, Y_val, loss_function, verbose, scheduler, **kwargs)` | Full training loop |
| `ComputeLoss(out, tgt, function, reduction, **kwargs)` | Compute loss |
| `compute_accuracy(pred, tgt)` | Compute classification accuracy |
| `compute_precision_recall_f1(pred, tgt)` | Compute precision, recall, F1 |
| `Evolve(inputs, score_fn, noise, tries, sigma)` | Evolutionary strategy |
| `Reinforce(states, actions, returns, action_type, std, normalize_returns)` | Policy gradient |
| `PPO(states, actions, old_log_probs, advantages, action_type, epsilon, std, value_targets, value_coeff, entropy_coeff)` | Proximal Policy Optimization |
| `ActorCritic(states, actions, returns, values, action_type, std)` | Actor-Critic |
| `Save(file)` | Save model to file |
| `Load(file)` | Load model from file |
Generative Classes
| Class | Module | Description |
|---|---|---|
| `VAE` | `generative.vae` | Variational Autoencoder |
| `GAN` | `generative.gan` | Generative Adversarial Network |
| `DiffusionModel` | `generative.diffusion` | DDPM diffusion model |
| `AutoregressiveModel` | `generative.autoregressive` | MADE-style autoregressive model |
| `RealNVP` | `generative.flows` | RealNVP normalizing flow |
| `EnergyBasedModel` | `generative.ebm` | Energy-based model |
| `UNetDenoiser` | `generative.unet` | UNet for diffusion denoising |
| `LRScheduler` | `train` | Learning rate scheduler |
Generative Loss Functions
| Function | Module | Description |
|---|---|---|
| `kl_divergence_gaussian(mu, logvar, reduction)` | `generative_loss` | KL(q(z\|x) \|\| N(0,I)) |
| `adversarial_loss_discriminator(real_logits, fake_logits, loss_type)` | `generative_loss` | Discriminator loss (BCE/BCE_logits/Wasserstein) |
| `adversarial_loss_generator(fake_logits, loss_type)` | `generative_loss` | Generator loss |
| `diffusion_loss(pred_noise, true_noise, reduction)` | `generative_loss` | MSE noise prediction |
| `nll_loss(log_px, log_det_jacobian, reduction)` | `generative_loss` | Flow negative log-likelihood |
| `energy_loss(data_energy, sample_energy, margin)` | `generative_loss` | EBM contrastive loss |
| `perceptual_loss(x, y, feature_extractor)` | `generative_loss` | Perceptual loss (falls back to MSE) |
| `vgg_loss(x, y)` | `generative_loss` | Placeholder for VGG perceptual loss |
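For reference, the KL term listed first in the table has a well-known closed form when the posterior is a diagonal Gaussian. The sketch below implements that formula in NumPy; the function name, the `(batch, latent_dim)` input shape, and the accepted `reduction` values are assumptions for illustration and may differ from `kl_divergence_gaussian`.

```python
import numpy as np

def kl_gaussian_sketch(mu, logvar, reduction="mean"):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) for inputs of shape (batch, latent_dim)."""
    # Closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2) over latent dims.
    per_sample = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    if reduction == "mean":
        return float(per_sample.mean())
    if reduction == "sum":
        return float(per_sample.sum())
    return per_sample  # any other value: one KL value per sample

# A posterior equal to the prior has zero divergence.
print(kl_gaussian_sketch(np.zeros((4, 8)), np.zeros((4, 8))))
```

Working with `logvar` rather than the variance itself keeps the variance positive without any constraint on the encoder output.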
Sampling Functions
| Function | Module | Description |
|---|---|---|
| `reparameterize(mu, logvar)` | `sampling` | VAE reparameterization trick |
| `langevin_dynamics(energy_fn, x_init, n_steps, step_size, noise_scale)` | `sampling` | MCMC sampling for EBMs |
| `gaussian_sample(mean, std, shape)` | `sampling` | Gaussian sampling |
| `uniform_sample(low, high, shape)` | `sampling` | Uniform sampling |
| `gumbel_softmax_sample(logits, temperature, hard)` | `sampling` | Differentiable categorical sampling |
| `random_mask(shape, ratio)` | `sampling` | Random boolean mask |
| `top_p_sampling(logits, p, temperature)` | `sampling` | Nucleus (top-p) sampling |
| `compute_returns(rewards, gamma)` | `sampling` | Discounted returns |
| `gae(rewards, values, gamma, lambda_)` | `sampling` | Generalized Advantage Estimation |
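The last two rows correspond to standard reinforcement learning recursions. Below is a minimal NumPy sketch of both, written independently of the library. The function names are hypothetical, and the convention that `values` carries one extra bootstrap entry for the state after the final step is an assumption that may not match `gae`.

```python
import numpy as np

def discounted_returns_sketch(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae_sketch(rewards, values, gamma=0.99, lambda_=0.95):
    """Generalized Advantage Estimation; values must have len(rewards) + 1 entries."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lambda_ * running
        advantages[t] = running
    return advantages

returns = discounted_returns_sketch([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75 1.5  1.  ]
```

With `lambda_=1` and an all-zero value estimate, the advantages reduce to the discounted returns, which is a quick sanity check for either function.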
Weight Initialization Functions
| Function | Description |
|---|---|
| `init_weights(n_in, n_out, method)` | Dense/sparse weight initialization |
| `init_conv_weights(in_ch, out_ch, k, method)` | Conv2D weight initialization |
| `init_embedding_weights(vocab_size, embed_dim, method)` | Embedding weight initialization |
Utility Functions
| Function | Description |
|---|---|
| `im2col(input_data, filter_h, filter_w, stride, pad)` | Convert images to column matrix for efficient convolution |
| `time_embedding(t, dim, max_period)` | Sinusoidal time embedding for diffusion models |
| `activate(name, x)` | Apply activation function |
| `derivative(name, x)` | Compute activation derivative |
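To make the time embedding row concrete, here is a hedged NumPy sketch of a sinusoidal timestep embedding in the style commonly used for diffusion models. The cosine-then-sine column layout, the zero padding for odd `dim`, and the default `max_period` are assumptions; the library's `time_embedding` may order or scale its outputs differently.

```python
import numpy as np

def time_embedding_sketch(t, dim, max_period=10000.0):
    """Map timesteps t (shape [batch]) to sinusoidal features of shape [batch, dim]."""
    t = np.asarray(t, dtype=np.float64).reshape(-1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    emb = np.concatenate([np.cos(args), np.sin(args)], axis=1)
    if dim % 2 == 1:
        # Pad with a zero column so the output always has exactly `dim` features.
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=1)
    return emb

emb = time_embedding_sketch([0, 10], dim=8)
print(emb.shape)  # (2, 8)
```

Each timestep gets a distinct, smoothly varying vector, which lets a denoiser condition on the noise level without learning a separate embedding table.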
File details
Details for the file enilnets-2.1.0.tar.gz.
File metadata
- Download URL: enilnets-2.1.0.tar.gz
- Upload date:
- Size: 95.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ccf3348443f28e724a7777ca69ad5bc47e20ce3bf233998a3252d11c80ebf684 |
| MD5 | 9e022ca5618743c173deb82a8f4d9d06 |
| BLAKE2b-256 | 2cb0c57282b1a914cea629dacc67fe61275d3827780df3f5317e47b4cc4d500d |
File details
Details for the file enilnets-2.1.0-py3-none-any.whl.
File metadata
- Download URL: enilnets-2.1.0-py3-none-any.whl
- Upload date:
- Size: 57.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2498d0a6065dbe2ddb631ea3217f1f37349c922f1532659ce10b8d4df769c305 |
| MD5 | 7ef685e61115f5b578601b9ed9179401 |
| BLAKE2b-256 | 2617e0120a4073357d46595edceb36c8d189b843fcc2203d32375b8c507bfb4c |