Negative log likelihood (NLL) as a cost function for classification, explained!
Let’s start by defining the function under consideration. In probability, the likelihood function describes the joint probability of observing an event (y) in a sample of data, expressed as a function of the parameters. It can be written mathematically as
$$L(y|\theta) = P(\text{observing event } y \text{ with parameter } \theta)$$
The ‘|’ symbol denotes conditioning, i.e. restricting our view to the cases where the condition \(\theta\) holds.
Practical Example
Let’s take an example of a common classification problem in supervised learning to understand the above description. The goal is to identify the presence of a cat in an image.
One way to train this model is to iteratively increase the likelihood that it predicts the correct output, i.e. we need to optimize
L(y | \(\theta\)) = L(Model predicting correct output | evidence \(\theta\)) (A)
Imagine we have initialized the parameters of the model and we feed it an image from the training set. The model outputs an arbitrary probability \(Y^{'}\) that a cat is present in the image.
Assume \(Y^{'}=0.3\) and that the example image was indeed an image of a cat (evidence \(\theta_{1}\)). Hence the likelihood that our model is predicting the correct output is
$$L(y|\theta_{1}) = Y^{'} \cdot P(\theta_{1})$$
Here \(P(\theta_{1})\) denotes the probability that the evidence fits the criteria/hypothesis. Since we are dealing with binary classification, this probability is either 0 or 1, and since our data is labelled correctly it is always 1.
$$L(y|\theta_{1}) = 0.3$$
Similarly, if the image in question was not that of a cat \((\theta_{2})\), the probability our model assigns to the correct class is \(1-Y^{'}\), i.e.
$$L(y|\theta_{2})\ = 0.7$$
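To make this concrete, here is a minimal Python sketch of the per-sample likelihood; the function name `sample_likelihood` and the example prediction of 0.3 are purely illustrative.

```python
def sample_likelihood(y_pred: float, is_cat: bool) -> float:
    """Likelihood that the model predicts the correct class for one image."""
    # If the image is a cat, the model's correct-class probability is y_pred;
    # otherwise it is 1 - y_pred.
    return y_pred if is_cat else 1.0 - y_pred

print(sample_likelihood(0.3, True))   # 0.3 -> L(y | theta_1)
print(sample_likelihood(0.3, False))  # 0.7 -> L(y | theta_2)
```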
From the above equations, we can see that the likelihood function directly measures the correctness of our model, and so it can be used to optimize our network. Extending the likelihood equation (A) to the entire training set, and using the fact that likelihoods of independent events multiply, the likelihood of the entire training set is \(\prod_{k=1}^{n}L_{k}\), where \(L_{k} = L(y_{k}|\theta_{k})\) is the likelihood of the event \(y_{k}\) that the model predicts the correct output for image \(k\). Hence the final equation is as follows
$$L(y |\theta) = \prod_{k=1}^{n}L_{k}=\prod_{k=1}^{n}Y_{k}\ \ \ (B)$$
In our example, \(L_{k}\) is the same as \(Y_{k}\), the probability the model assigns to the correct class, but let’s stick with the likelihood \(L_{k}\) in the upcoming computations for the sake of generality.
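As a sketch of equation (B), the snippet below computes the per-sample likelihoods and their product with NumPy; the predictions and labels are made-up toy values, and the samples are assumed independent.

```python
import numpy as np

# Toy predictions and labels, purely for illustration.
y_pred = np.array([0.3, 0.9, 0.8, 0.6])       # model's P(cat) for each image
is_cat = np.array([True, True, False, True])  # ground-truth labels

# Per-sample likelihoods L_k: the probability assigned to the correct class.
L_k = np.where(is_cat, y_pred, 1.0 - y_pred)

# Equation (B): likelihood of the whole training set.
dataset_likelihood = np.prod(L_k)
print(L_k)                 # [0.3 0.9 0.2 0.6]
print(dataset_likelihood)  # 0.0324
```

Even with four samples the product is already small; with thousands of samples it would underflow, which is one more reason to move to logarithms next.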
Reason for NLL
For ease of computation, a logarithm is usually applied, which turns the product above into a sum of log likelihoods. This offers two significant advantages: sums are easier to differentiate than products, and adding log probabilities avoids the numerical underflow caused by multiplying many small probabilities.
$$\log L(y|\theta) = \sum_{k=1}^{n}\log L_{k}$$
As the log is negative on the interval (0, 1), we multiply the above function by -1 so that we deal with positive values. Since the likelihood measures the correctness of the model, the negative log likelihood provides a measure of its error. This fits in perfectly, minimization of error being the preferred method of optimization.
$$-\log L(y|\theta) = -\sum_{k=1}^{n}\log L_{k}$$
Finally, we average the above function over the \(n\) samples so that we get the error contributed by a single sample, which can later be minimized by updating the model weights using methods such as gradient descent. This function is called the cost function of the model.
$$Cost\ Fn = -\frac{1}{n}\sum_{k=1}^{n}\log L_{k}$$
The above equation, the negative of the average log likelihood, is also known as the cross-entropy loss; it is closely related to the notion of entropy, which has a wide range of applications in information theory.
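Putting the last few steps together, here is a short sketch (continuing the toy values from the earlier snippet) that computes the per-sample negative log likelihoods and averages them into the cost:

```python
import numpy as np

y_pred = np.array([0.3, 0.9, 0.8, 0.6])       # toy predictions, as before
is_cat = np.array([True, True, False, True])  # toy labels

L_k = np.where(is_cat, y_pred, 1.0 - y_pred)  # per-sample likelihoods
nll_k = -np.log(L_k)                          # per-sample negative log likelihood
cost = nll_k.mean()                           # average NLL = the cost function
print(nll_k)  # approximately [1.204 0.105 1.609 0.511]
print(cost)   # ~0.857
```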
Graphical interpretation
From the graph of \(y=-\log(x)\) below, we can see that \(y \to \infty\) as \(x \to 0\) and \(y = 0\) when \(x = 1\), which is exactly what an error-measuring function should do: unbounded error when the model is 0% correct and zero error when the model is 100% correct.
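For reference, a few lines of matplotlib can reproduce a graph of \(y=-\log(x)\) like the one described here:

```python
import numpy as np
import matplotlib.pyplot as plt

# y = -log(x) on (0, 1]: error explodes as x -> 0 and vanishes at x = 1.
x = np.linspace(0.001, 1.0, 500)
plt.plot(x, -np.log(x))
plt.xlabel("likelihood of the correct prediction (x)")
plt.ylabel("-log(x)")
plt.title("y = -log(x)")
plt.show()
```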
Thus, with an example, we have discussed the likelihood function in probability and why negative log likelihood is used as the cost function for classification tasks in machine learning.
Footnotes
Likelihood is a key part of Bayes’ theorem, and it was later taken up independently in various areas of statistical modelling. The theorem provides a way to find the conditional probability of events by updating probabilities with new evidence/data.