The Statistical Backbone of Neural Networks: Why Activation and Loss Functions Matter

TLDR: This paper explains that the choice of activation and loss functions in neural networks is not arbitrary but statistically justified. It shows how common loss functions like Mean Squared Error (MSE), Mean Absolute Error (MAE), and Cross-Entropy are derived from Maximum Likelihood Estimation (MLE) under specific probability distributions (Gaussian, Laplace, Bernoulli, Multinomial). It also connects these to Generalized Linear Models (GLMs) and discusses specialized functions for positive value regression and handling extreme outliers (fat tails).

Neural networks are powerful tools for making predictions, from identifying objects in images to forecasting stock prices. At the heart of how these networks learn and make decisions are two fundamental components: activation functions and loss functions. A recent research paper, “DL101 Neural Network Outputs and Loss Functions” by Fernando Berzal, delves into the statistical justifications behind these crucial choices, explaining why certain functions are naturally suited for specific tasks.

Understanding Activation Functions

Activation functions determine the output of a neuron, transforming the weighted sum of its inputs into a final signal. These functions are often non-linear, allowing neural networks to learn complex patterns. For instance:

The linear (identity) function is primarily used in the output layer for regression tasks, where the goal is to predict a continuous numerical value. It simply passes the input through without change.

The logistic function (sigmoid) is an S-shaped function that squashes its input into the range (0, 1). This makes it ideal for binary classification problems, where the output can be interpreted as a probability.

The hyperbolic tangent (tanh) function is similar to the sigmoid but maps inputs to the range (-1, 1). Being zero-centered, it can sometimes lead to faster training convergence.

For problems with multiple categories, the softmax function is typically used in the output layer. It converts a vector of numbers into a probability distribution, ensuring that the probabilities for all classes sum up to 1.

Rectified Linear Units (ReLU) are a popular choice for hidden layers in deep learning models. They are computationally efficient, outputting the input directly if it’s positive, and zero otherwise. Variants like Leaky ReLU address potential issues where neurons might “die” by always outputting zero.

The softplus function, log(1 + e^x), is a smooth, differentiable approximation of ReLU. Its outputs are always strictly positive, and it has stronger ties to likelihood theory than ReLU.
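The activations listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the paper: the function names and the Leaky ReLU slope of 0.01 are assumptions chosen for the example.

```python
import numpy as np

def identity(z):
    # Linear output: passes the input through unchanged.
    return z

def sigmoid(z):
    # Logistic function, output in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent, output in (-1, 1).
    return np.tanh(z)

def relu(z):
    # Input if positive, zero otherwise.
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Small non-zero slope for negative inputs (slope value is an assumption).
    return np.where(z > 0, z, slope * z)

def softplus(z):
    # log(1 + exp(z)), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

def softmax(z):
    # Subtract the maximum before exponentiating to avoid overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid :", sigmoid(z))
print("relu    :", relu(z))
print("softplus:", softplus(z))
print("softmax :", softmax(z), "sum =", softmax(z).sum())
```

Subtracting the maximum inside softmax does not change the result, because softmax is invariant to adding a constant to every input.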

The Role of Loss Functions: Measuring Error

While activation functions shape a neuron’s output, loss functions quantify how far off a model’s predictions are from the actual values. The paper emphasizes that the choice of loss function is far from arbitrary; it’s statistically justified, often stemming from the principle of Maximum Likelihood Estimation (MLE).

Maximum Likelihood Estimation (MLE) is a method for estimating model parameters by maximizing the probability (or likelihood) of observing the training data. Because the logarithm is monotonic, maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and that negative log-likelihood is what serves as the loss. Choosing a loss function is therefore akin to assuming a specific probability distribution for the noise in your data.
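The link between likelihood and loss can be written out explicitly. The notation below is an assumption introduced for illustration, not taken from the paper: n independent training pairs (x_i, y_i), a network f_θ with parameters θ, and, in the second line, Gaussian noise with a fixed standard deviation σ.

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} \prod_{i=1}^{n} p(y_i \mid x_i; \theta)
  = \arg\min_{\theta} \left[ -\sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) \right]

-\log p(y_i \mid x_i; \theta)
  = \frac{\bigl(y_i - f_\theta(x_i)\bigr)^2}{2\sigma^2}
  + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

With σ held fixed, the second term is a constant and the factor 1/(2σ²) only rescales the objective, so minimizing the Gaussian negative log-likelihood reduces to minimizing the sum of squared errors.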

Loss Functions for Regression

In regression problems, where continuous values are predicted, common loss functions include:

Mean Squared Error (MSE): This calculates the average of the squared differences between predicted and actual values. MSE heavily penalizes large errors due to the squaring, making it sensitive to outliers. Statistically, minimizing MSE is equivalent to MLE under the assumption of Gaussian (normal) noise with constant variance, and the prediction that minimizes it is the conditional mean of the output.

Mean Absolute Error (MAE): MAE penalizes errors in proportion to their size, making it less sensitive to outliers than MSE. It reports the average error in the same units as the target variable. Minimizing MAE is equivalent to MLE under the assumption of Laplace noise, and the prediction that minimizes it is the conditional median of the output.
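The mean-versus-median distinction is easy to check numerically. The sketch below uses a small made-up sample with one outlier and searches a grid of constant predictions. The data and the grid are illustrative assumptions.

```python
import numpy as np

# Toy targets with one large outlier (made-up data for illustration).
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions, spaced 0.01 apart.
candidates = np.linspace(0.0, 100.0, 10001)

# residuals[i, j] = y[j] - candidates[i]
residuals = y[None, :] - candidates[:, None]
mse = np.mean(residuals ** 2, axis=1)
mae = np.mean(np.abs(residuals), axis=1)

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]

print("MSE minimizer:", best_mse, "| sample mean:  ", np.mean(y))
print("MAE minimizer:", best_mae, "| sample median:", np.median(y))
```

The outlier drags the MSE minimizer toward itself, while the MAE minimizer stays with the bulk of the data. That is the robustness difference described above.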

Loss Functions for Classification

For classification tasks, where categories are predicted, cross-entropy losses are paramount:

Binary Cross-Entropy (BCE): Used for binary classification (two outcomes). It measures the “distance” between predicted probabilities and true labels (0 or 1). Minimizing BCE is equivalent to MLE under the assumption of a Bernoulli distribution, encouraging the model to output well-calibrated probabilities.

Categorical Cross-Entropy (CCE): Used for multi-class classification (more than two outcomes). It measures how well the predicted probability distribution across classes matches the true one-hot encoded label. Minimizing CCE is equivalent to MLE under the assumption of a categorical distribution (a multinomial distribution with a single trial), which pushes the model to assign high probability to the correct class.
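Both cross-entropy losses are the average negative log-probability the model assigns to the observed label. A minimal NumPy sketch follows. The function names and the clipping constant eps are assumptions added to keep the logarithm finite.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Bernoulli negative log-likelihood, averaged over samples.
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    # Categorical negative log-likelihood: only the true class contributes.
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Binary case: confident-and-right costs less than confident-and-wrong.
y = np.array([1.0, 0.0])
print(binary_cross_entropy(y, np.array([0.9, 0.1])))
print(binary_cross_entropy(y, np.array([0.1, 0.9])))

# Multi-class case: three samples, three classes, one-hot labels.
labels = np.eye(3)
uniform = np.full((3, 3), 1.0 / 3.0)
print(categorical_cross_entropy(labels, uniform))
```

A model that predicts a uniform distribution over K classes incurs a loss of log K, which is a useful baseline when reading training curves.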

The Generalized Linear Model Connection

The paper highlights that the output layer of many deep neural networks can be viewed as a Generalized Linear Model (GLM). This framework provides a strong statistical foundation, where the activation function of the output neurons corresponds to the inverse of a canonical link function, and the loss function is derived from the assumed probability distribution of the response variable. For example, a linear output layer with MSE loss aligns with a Gaussian GLM, while a sigmoid output with BCE loss aligns with a Bernoulli GLM.
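One practical consequence of pairing an output activation with its matching loss is that the gradient with respect to the pre-activation z reduces to "prediction minus target". The sketch below checks this with finite differences for two pairings: sigmoid with BCE, and identity with half squared error. The helper names and test values are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    # Binary cross-entropy as a function of the pre-activation z.
    p = sigmoid(z)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def half_squared_error(z, y):
    # Identity output with loss 0.5 * (z - y)^2.
    return 0.5 * (z - y) ** 2

def numeric_grad(f, z, h=1e-6):
    # Central finite difference.
    return (f(z + h) - f(z - h)) / (2.0 * h)

z, y = 0.7, 1.0

analytic_bce = sigmoid(z) - y            # prediction minus target
numeric_bce = numeric_grad(lambda t: bce_from_logit(t, y), z)

analytic_mse = z - y                     # prediction minus target
numeric_mse = numeric_grad(lambda t: half_squared_error(t, y), z)

print("sigmoid + BCE :", analytic_bce, numeric_bce)
print("identity + MSE:", analytic_mse, numeric_mse)
```

The matching form of the two gradients is what the GLM view predicts when the output activation is the inverse of the canonical link.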

Handling Special Situations

The research also explores less common but important scenarios:

Binary Classification with Bipolar Encoding: When target classes are encoded as +1 and -1 instead of 0 and 1, the hyperbolic tangent (tanh) becomes the suitable output activation for minimizing binary cross-entropy. Tanh is a rescaled sigmoid, tanh(z) = 2·sigmoid(2z) - 1, so its (-1, 1) range matches the bipolar targets.

Regression of Positive Values: For predicting inherently positive quantities like prices or counts, standard regression might not enforce positivity. Solutions include using activation functions like ReLU or softplus, transforming the target variable (e.g., predicting the logarithm of the value), or using loss functions derived from distributions like the Gamma or Poisson, which are naturally suited for positive or count data, respectively.

Dealing with Fat Tails: In some datasets, extreme values (outliers) are more common than a Gaussian distribution would predict. These are called “fat-tailed” distributions. For such cases, assuming a double symmetric Pareto distribution for the noise leads to a logarithmic loss function. This loss is extremely robust to outliers because its gradient shrinks as the error grows, which down-weights very large errors without ignoring them entirely.
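The fat-tail argument can be illustrated by comparing how the gradient of each loss responds to a growing error e. The logarithmic loss below uses the form log(1 + |e| / scale), which is an assumed parameterization for illustration and may differ from the exact expression in the paper. The function names and scale = 1 are likewise assumptions.

```python
import numpy as np

def grad_squared(e):
    # d/de of e^2: grows without bound as the error grows.
    return 2.0 * e

def grad_absolute(e):
    # d/de of |e|: constant magnitude regardless of error size.
    return np.sign(e)

def log_loss(e, scale=1.0):
    # Assumed logarithmic loss: log(1 + |e| / scale).
    return np.log1p(np.abs(e) / scale)

def grad_log(e, scale=1.0):
    # d/de of log(1 + |e| / scale) = sign(e) / (scale + |e|).
    return np.sign(e) / (scale + np.abs(e))

for e in [1.0, 10.0, 100.0]:
    print(f"e={e:6.1f}  squared={grad_squared(e):7.1f}  "
          f"absolute={grad_absolute(e):4.1f}  log={grad_log(e):.4f}")
```

Under this assumed form the gradient never reaches zero, so an outlier still contributes to the update, but its pull falls off roughly as 1/|e|.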

Conclusion

In essence, the choice of activation and loss functions in deep learning is not merely a matter of convenience or empirical success. As Fernando Berzal’s paper meticulously details, these choices are deeply rooted in statistical principles, particularly Maximum Likelihood Estimation and the framework of Generalized Linear Models. By aligning the network’s output layer and its error measurement with the underlying statistical properties of the data, we can build more effective, robust, and theoretically sound machine learning models.