Log-Likelihood From First Principles

Building log-likelihood from joint probability through likelihood, cross-entropy, softmax, and LLM training — plus numerical stability and why the same objective can drive overconfidence.

April 13, 2026

I kept noticing the same pattern when reading about log-likelihood: most explanations either stayed too abstract (“it’s just the log of the likelihood”) or jumped straight into optimization without showing where the object actually comes from.

That is a problem because log-likelihood sits in the middle of a lot of machine learning that otherwise looks unrelated. Binary classification, softmax regression, cross-entropy loss, maximum likelihood estimation, and even the training objective for large language models all reduce to the same core idea: assign high probability to what actually happened, then optimize that objective in a numerically stable way.

In this post, I want to build log-likelihood from first principles and treat it like a study note rather than a glossary entry. I’ll start from joint probability, derive likelihood carefully, take the log and show what simplifies, connect it to cross-entropy, and then walk through how it appears in binary classifiers, softmax models, and LLM training. I’ll also show why extreme probabilities create numerical instability, and why the same objective that makes models accurate can also make them overconfident.

The Problem

The phrase “maximize likelihood” sounds simple until you try to use it in a real model.

Suppose I have a dataset of observations:

\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(n)}\}

and a model with parameters θ that assigns probabilities to data points. In principle, I want to choose θ so the observed data looks as plausible as possible under the model.

That sounds straightforward, but a few issues show up immediately.

First, individual probabilities multiply across examples. If I assume samples are independent, the probability of the whole dataset becomes a product of many terms. Products of small numbers get tiny very quickly.

Second, optimization on products is awkward. Sums are much easier to differentiate, reason about, and implement.

Third, the loss functions engineers use in practice often appear under different names. In one place it is “negative log-likelihood.” In another it is “binary cross-entropy.” In another it is “categorical cross-entropy.” In language modeling it becomes “next-token prediction loss.” These are often the same object viewed from slightly different angles.

Fourth, probability objectives have side effects. When I maximize likelihood aggressively, I often push predicted probabilities toward 0 or 1. That improves the likelihood on training data, but it can also make the model badly calibrated and overly certain.

So the real question is not just “what is log-likelihood?” The more useful question is:

How does log-likelihood arise naturally from probability, why does taking the log help so much, and why does this same objective connect to classification, softmax, cross-entropy, and LLM training?

Initial Thinking

My own first mental model of likelihood was too vague. I treated it as “probability, but with parameters.” That is not wrong, but it is not enough to work with.

A few incomplete intuitions usually show up early:

“Likelihood is just probability.” Close, but not exactly. The formula can look identical, but the role of the variable changes.

“Taking the log is only a mathematical trick.” It is a trick, but also much more than that. It changes products into sums, preserves optima because log is monotonic, and gives losses with clean gradients.

“Cross-entropy is a different concept from likelihood.” In most supervised classification settings, minimizing cross-entropy is exactly the same as maximizing log-likelihood.

“Overconfidence comes from using deep networks, not the objective.” Architecture matters, but the training objective itself pushes probability mass toward observed labels. That pressure is part of the story.

The clean way through this is to start from the probability of the data and be explicit about every step.

Breakdown

From joint probability to likelihood

Start with a probabilistic model p(x | θ), where θ are the parameters and x is observed data.

Given a dataset D = {x^(1), …, x^(n)}, the joint probability of the entire dataset under the model is

p(\mathcal{D} \mid \theta) = p(x^{(1)}, x^{(2)}, \dots, x^{(n)} \mid \theta)

If I assume the examples are independent and identically distributed, this becomes

p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x^{(i)} \mid \theta)

This quantity is the probability of observing the dataset under parameter setting θ.

Now the key shift:

  • In probability, I usually think of θ as fixed and x as variable.
  • In likelihood, I treat the observed data x as fixed and think of the same expression as a function of θ.

So I define the likelihood as

L(\theta; \mathcal{D}) = p(\mathcal{D} \mid \theta)

Under the independence assumption,

L(\theta; \mathcal{D}) = \prod_{i=1}^{n} p(x^{(i)} \mid \theta)

Same formula. Different perspective.

That perspective matters because now I can ask:

\theta^* = \operatorname*{arg\max}_{\theta} L(\theta; \mathcal{D})

This is maximum likelihood estimation.

Why take the log?

The product form is mathematically correct but operationally annoying. So I take the log:

\log L(\theta; \mathcal{D}) = \log \left( \prod_{i=1}^{n} p(x^{(i)} \mid \theta) \right)

Using \log(ab) = \log a + \log b,

\log L(\theta; \mathcal{D}) = \sum_{i=1}^{n} \log p(x^{(i)} \mid \theta)

This is the log-likelihood.

Because log is strictly increasing, maximizing likelihood and maximizing log-likelihood are equivalent:

\operatorname*{arg\max}_{\theta} L(\theta; \mathcal{D}) = \operatorname*{arg\max}_{\theta} \log L(\theta; \mathcal{D})

This one step does three important things:

  1. It converts a product into a sum.
  2. It makes gradients easier to compute.
  3. It improves numerical stability, because sums of logs are easier to represent than products of tiny probabilities.
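A quick NumPy sketch makes point 3 concrete: the product of many small probabilities underflows to zero in floating point, while the sum of their logs stays well within range. The probability 0.01 and the dataset size are arbitrary choices for illustration:

```python
import numpy as np

# 1000 i.i.d. examples, each assigned probability 0.01 by the model
probs = np.full(1000, 0.01)

# The direct product underflows to exactly 0.0 in float64
joint = np.prod(probs)

# The sum of logs is perfectly representable
log_joint = np.sum(np.log(probs))

print(joint)      # 0.0, underflow
print(log_joint)  # about -4605.17
```

Once the joint probability underflows, its log is negative infinity and the objective is useless; working in log-space avoids the problem entirely.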

In practice, we usually minimize the negative log-likelihood:

\mathcal{L}_{\text{NLL}}(\theta) = - \sum_{i=1}^{n} \log p(x^{(i)} \mid \theta)

That is just a convention so we can use minimization-based optimizers.

A concrete binary classification derivation

Now let me make this real with binary labels y^(i) ∈ {0, 1}.

Suppose the model predicts

p(y=1 \mid x; \theta) = \hat{y}

and therefore

p(y=0 \mid x; \theta) = 1 - \hat{y}

A compact way to write the probability of the observed label is

p(y \mid x; \theta) = \hat{y}^{y}(1-\hat{y})^{1-y}

Why does this work?

  • If y = 1, then p(y | x; θ) = ŷ
  • If y = 0, then p(y | x; θ) = 1 − ŷ

For a dataset:

L(\theta) = \prod_{i=1}^{n} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1-\hat{y}^{(i)}\right)^{1-y^{(i)}}

Take the log:

\log L(\theta) = \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log (1-\hat{y}^{(i)}) \right]

Negate it:

\mathcal{L}_{\text{NLL}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log (1-\hat{y}^{(i)}) \right]

That is the familiar binary cross-entropy loss.

So binary cross-entropy is not just inspired by likelihood. It is literally the negative log-likelihood for a Bernoulli model.
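The compact exponent form can be sanity-checked in a couple of lines; `bernoulli_prob` is a throwaway helper for this post, not part of any library:

```python
import numpy as np

def bernoulli_prob(y, y_hat):
    # p(y | x) = y_hat^y * (1 - y_hat)^(1 - y)
    return y_hat**y * (1 - y_hat)**(1 - y)

y_hat = 0.8
print(bernoulli_prob(1, y_hat))  # picks y_hat when y = 1, so 0.8
print(bernoulli_prob(0, y_hat))  # picks 1 - y_hat when y = 0, so about 0.2
```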

Why this loss punishes bad confident predictions so strongly

Let me isolate one example.

If the true label is y = 1, the loss is

-\log(\hat{y})

If the model predicts:

  • ŷ = 0.9, loss is about 0.105
  • ŷ = 0.5, loss is about 0.693
  • ŷ = 0.01, loss is about 4.605

That shape is the point. Being confidently wrong is much worse than being uncertain.

This is exactly what I want if the goal is to fit probabilities to observed outcomes. But it also hints at why models can become overconfident: the objective rewards moving the correct class probability upward as much as possible.

Connection to cross-entropy

Now I want to show the deeper equivalence.

Cross-entropy between a true distribution pp and a model distribution qq is

H(p, q) = - \sum_x p(x) \log q(x)

In supervised learning, I usually treat the empirical data distribution as the truth. For classification with one-hot labels, the target distribution for one sample is:

p(y=k \mid x) = \begin{cases} 1 & \text{if } k = y_{\text{true}} \\ 0 & \text{otherwise} \end{cases}

Then cross-entropy for one example becomes

H(p, q) = - \sum_{k} p_k \log q_k

Since only the true class has p_k = 1, this simplifies to

H(p, q) = -\log q_{y_{\text{true}}}

That is exactly the per-example negative log-likelihood.

For a dataset:

\frac{1}{n}\sum_{i=1}^{n} H(p^{(i)}, q^{(i)}) = -\frac{1}{n}\sum_{i=1}^{n} \log q(y^{(i)} \mid x^{(i)})

So minimizing cross-entropy is equivalent to maximizing log-likelihood.
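A three-class numeric example (the distribution values are made up) confirms the collapse from full cross-entropy to a single log term:

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])   # model distribution over 3 classes
true_class = 1
p = np.zeros(3)
p[true_class] = 1.0              # one-hot target

cross_entropy = -np.sum(p * np.log(q))   # full definition
nll = -np.log(q[true_class])             # per-example negative log-likelihood

print(np.isclose(cross_entropy, nll))    # True
```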

There is also a more general identity:

H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)

Since H(p) does not depend on the model parameters, minimizing cross-entropy is equivalent to minimizing the KL divergence from the true distribution to the model.

That gives a useful interpretation:

  • Likelihood view: make observed data probable.
  • Cross-entropy view: make the model distribution match the data distribution.
  • KL view: reduce the inefficiency of using the model distribution instead of the true one.

These are the same training pressure seen through three different lenses.
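The cross-entropy identity behind the KL view is easy to verify numerically. Here is a small NumPy sketch with made-up distributions p and q:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

H_pq = -np.sum(p * np.log(q))   # cross-entropy H(p, q)
H_p = -np.sum(p * np.log(p))    # entropy H(p)
kl = np.sum(p * np.log(p / q))  # KL divergence D_KL(p || q)

print(np.isclose(H_pq, H_p + kl))  # True
```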

Why logs are so natural in information theory too

The appearance of the log is not an accident.

The quantity

-\log q(x)

is the surprisal of event x under model q. Rare events have high surprisal. Common events have low surprisal.

Then cross-entropy is just expected surprisal under the true distribution:

H(p, q) = \mathbb{E}_{x \sim p}[-\log q(x)]

So minimizing negative log-likelihood means I am training the model to be less surprised by the outcomes that actually happen.

That interpretation is especially useful when we move to language models.

Softmax: multi-class log-likelihood

For multi-class classification, I usually start with logits z_1, z_2, …, z_K. These are unnormalized scores. Softmax converts them into probabilities:

p(y=k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}

Suppose the true class is t. Then the likelihood for that example is

p(y=t \mid x) = \frac{e^{z_t}}{\sum_{j=1}^{K} e^{z_j}}

Take the negative log:

-\log p(y=t \mid x) = -\log \left( \frac{e^{z_t}}{\sum_{j=1}^{K} e^{z_j}} \right)

Use log rules:

-\log p(y=t \mid x) = -z_t + \log \sum_{j=1}^{K} e^{z_j}

This is one of the most important expressions in practical machine learning:

\mathcal{L} = \log \sum_{j=1}^{K} e^{z_j} - z_t

This is the softmax cross-entropy loss written directly in terms of logits.

A few things become obvious from this form:

  • increasing the true class logit z_t lowers the loss
  • increasing competing logits raises the loss
  • the \log \sum_j e^{z_j} term normalizes all scores into a probability distribution

This is also where the log-sum-exp trick comes from, which I’ll get to when I discuss numerical stability.
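The logit-space form above can be checked against the explicit softmax route; this is my own small verification with arbitrary logits:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])
t = 0  # index of the true class

# Route 1: compute softmax probabilities, then take -log of the true class
probs = np.exp(logits) / np.sum(np.exp(logits))
loss_via_probs = -np.log(probs[t])

# Route 2: work directly in logit space, log-sum-exp minus the true logit
loss_via_logits = np.log(np.sum(np.exp(logits))) - logits[t]

print(np.isclose(loss_via_probs, loss_via_logits))  # True
```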

Binary classifier implementation from the likelihood view

Let me implement a simple logistic regression classifier from scratch, but using the negative log-likelihood directly so the connection stays visible.

For binary classification:

z = w^\top x + b
\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}

Loss for one example:

\ell(\hat{y}, y) = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]

Here is a minimal NumPy implementation.

import numpy as np

class LogisticRegressionNLL:
    def __init__(self, n_features: int):
        self.w = np.zeros((n_features, 1))
        self.b = 0.0

    @staticmethod
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    @staticmethod
    def binary_nll(y_true, y_prob, eps=1e-12):
        # Clamp to avoid log(0)
        y_prob = np.clip(y_prob, eps, 1 - eps)
        return -np.mean(
            y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
        )

    def predict_proba(self, X):
        z = X @ self.w + self.b
        return self.sigmoid(z)

    def fit(self, X, y, lr=0.1, epochs=1000):
        y = y.reshape(-1, 1)
        n = X.shape[0]

        for _ in range(epochs):
            y_prob = self.predict_proba(X)

            # Gradients of negative log-likelihood
            dz = y_prob - y
            dw = (X.T @ dz) / n
            db = np.mean(dz)

            self.w -= lr * dw
            self.b -= lr * db

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

The important detail here is that the gradient simplifies nicely:

\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y

That clean derivative is one reason the log-likelihood formulation is so practical. A lot of ugly algebra collapses.
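A finite-difference check confirms that ŷ − y really is the gradient of the per-example loss with respect to the logit. This is my own sketch, with arbitrary values of z and y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(z, y):
    # Per-example binary negative log-likelihood as a function of the logit
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, h = 0.7, 1.0, 1e-6
numeric = (nll(z + h, y) - nll(z - h, y)) / (2 * h)  # central difference
analytic = sigmoid(z) - y                            # the claimed gradient

print(np.isclose(numeric, analytic, atol=1e-6))      # True
```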

Showing the instability with extreme values

Now the failure mode.

If the model predicts ŷ = 1 when the true label is 0, the loss contains

-\log(1-\hat{y})

If ŷ = 1, then this becomes −log(0), which is undefined and numerically becomes infinity.

Likewise, if ŷ = 0 and the true label is 1, I get −log(0) again.

That is why practical code either:

  1. clips probabilities away from 0 and 1, or
  2. computes the loss from logits directly using stable formulas.

For binary classification, the stable form is usually derived from the logit z, not the probability ŷ. One common version is:

\ell(z, y) = \max(z, 0) - zy + \log(1 + e^{-|z|})

This avoids numerical overflow when z is very positive or very negative.
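A quick comparison makes the failure visible. This is my own illustration; the logit value 40 is an arbitrary extreme, and `naive_bce` is the unstable probability-space version:

```python
import numpy as np

def naive_bce(z, y):
    # Go through probabilities: breaks when sigmoid saturates to exactly 1.0
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def stable_bce(z, y):
    # max(z, 0) - z*y + log(1 + exp(-|z|)), never exponentiates a large number
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = 40.0, 0.0  # very confident logit, wrong label
with np.errstate(divide="ignore", over="ignore"):
    print(naive_bce(z, y))   # inf, because log(1 - 1.0) = log(0)
print(stable_bce(z, y))      # 40.0, finite and correct
```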

For softmax, the stable trick is to subtract the maximum logit before exponentiating:

\text{softmax}(z_i) = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}} \quad\text{where}\quad m = \max_j z_j

This works because adding or subtracting the same constant from all logits does not change softmax probabilities:

\frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}

But it prevents huge exponentials from blowing up.

This is a good example of something that looks like a mathematical detail in class but becomes a production concern immediately.
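A small sketch makes both the overflow and the fix concrete; the logit values here are arbitrary, chosen to force the failure:

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))  # shift all logits by the maximum
    return e / e.sum()

big = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(big))  # [nan nan nan], exp(1000) overflows to inf
print(stable_softmax(big))     # roughly [0.665 0.245 0.090], well-defined
```

The stable version produces the same distribution as the naive one whenever the naive one is representable, because the shift cancels in the ratio.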

Why softmax and log-likelihood fit together so naturally

Softmax is not just a convenient normalization layer. It is the probability model that makes the multi-class log-likelihood work cleanly.

The workflow is:

  1. model produces logits z
  2. softmax turns logits into class probabilities
  3. log-likelihood measures the log probability assigned to the correct class
  4. optimization pushes up the correct class logit relative to the others

That means the objective is fundamentally relative. The model is not just asked to make the correct class big; it is asked to make it big compared to competitors.

This relative competition is why softmax classifiers can become so peaky. Once the correct class starts dominating, the gradient keeps rewarding sharper separation.

LLM training objective is just token-level log-likelihood

This is where the same idea scales up.

A language model defines a probability distribution over token sequences:

p(x_1, x_2, \dots, x_T)

Using the chain rule of probability:

p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

where x_{<t} means all previous tokens.

This is the exact same joint-to-product step as before, except that here the factorization comes from the chain rule of probability, which is always exact, rather than from an i.i.d. assumption.
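A tiny numeric sketch, with made-up conditional probabilities for a three-token sequence, shows that the product of conditionals and the sum of their logs describe the same joint:

```python
import numpy as np

# Toy autoregressive factorization (all probabilities are made up)
p_x1 = 0.5              # p(x_1)
p_x2_given_x1 = 0.4     # p(x_2 | x_1)
p_x3_given_prev = 0.25  # p(x_3 | x_1, x_2)

joint = p_x1 * p_x2_given_x1 * p_x3_given_prev
log_joint = np.log(p_x1) + np.log(p_x2_given_x1) + np.log(p_x3_given_prev)

print(joint)                                  # 0.05
print(np.isclose(np.log(joint), log_joint))   # True
```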

Take the log:

\log p(x_1, \dots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t})

Training the model means maximizing this sum over all sequences in the dataset, or equivalently minimizing

\mathcal{L}_{\text{NLL}} = - \sum_{t=1}^{T} \log p(x_t \mid x_{<t})

So when people say an LLM is trained with “next-token prediction,” this is the precise objective underneath it. At every position, the model produces logits over the vocabulary, softmax turns them into probabilities, and the loss is the negative log probability of the actual next token.

This also explains perplexity. Perplexity is just exponentiated average negative log-likelihood:

\text{Perplexity} = \exp\left( -\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t}) \right)

Lower perplexity means the model is less surprised by real text.

Why this objective can produce overconfidence

Now the subtle part.

Log-likelihood rewards assigning high probability to observed outcomes. If the model class is flexible enough, and especially if the labels are treated as exact one-hot truth, the optimizer keeps improving the objective by pushing the correct class probability closer to 1.

In binary classification, if the true label is 1, the loss is

-\log(\hat{y})

and this keeps decreasing as ŷ approaches 1.

In softmax classification, if class t is correct, the loss is

-\log \frac{e^{z_t}}{\sum_j e^{z_j}}

and this keeps decreasing as z_t grows relative to the other logits.

Nothing in the pure objective says “be uncertain when uncertainty is appropriate.” It only says “fit the observed labels well.”

That has a few consequences:

1. One-hot labels are harsh targets

A one-hot target says the correct class should have probability 1 and all others 0. In messy real data, that is often stronger than reality.

For example, in language, many next tokens could be plausible, but the dataset only records one continuation. Training still treats that one observed token as the full target.

2. Separable data can drive logits to large magnitudes

In classification problems that are nearly separable, maximum likelihood may keep increasing weight magnitudes because sharper probabilities improve the objective.

3. Cross-entropy cares about ranking and mass placement, not calibration

A model can have excellent accuracy and low cross-entropy while still outputting probabilities that are systematically too extreme.

This is why techniques like label smoothing, temperature scaling, regularization, and calibration methods matter. They counteract the tendency of likelihood-based training to produce sharp distributions.

A simple example is label smoothing. Instead of a one-hot target like

[0, 1, 0, 0]

I train against something like

[0.025, 0.925, 0.025, 0.025]

This changes the target distribution so the model is not rewarded as strongly for collapsing all mass onto one class.
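A small sketch (all distributions are made up for illustration) shows how smoothing changes which prediction the loss prefers:

```python
import numpy as np

def cross_entropy(target, probs):
    return -np.sum(target * np.log(probs))

def smooth(one_hot, eps=0.1):
    # Move eps of the probability mass to a uniform distribution over classes
    k = one_hot.shape[0]
    return one_hot * (1 - eps) + eps / k

one_hot = np.array([0.0, 1.0, 0.0, 0.0])
target = smooth(one_hot)  # [0.025, 0.925, 0.025, 0.025]

confident = np.array([0.001, 0.997, 0.001, 0.001])  # nearly collapsed
moderate = np.array([0.03, 0.91, 0.03, 0.03])       # keeps some mass back

# Against the one-hot target, sharper is always better;
# against the smoothed target, the moderate prediction wins.
print(cross_entropy(one_hot, confident) < cross_entropy(one_hot, moderate))  # True
print(cross_entropy(target, moderate) < cross_entropy(target, confident))    # True
```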

A mental model that ties it all together

The simplest way I think about log-likelihood now is this:

  • probability asks: if the parameters are fixed, how likely is this data?
  • likelihood asks: given this data, which parameters make it most plausible?
  • log-likelihood asks the same question in a form that is additive, stable, and optimizable
  • cross-entropy is the same objective viewed as expected surprise
  • softmax makes that objective work cleanly for multiple classes
  • LLM training applies the same principle token by token across sequences

That is why these topics keep collapsing into each other.

A small diagram makes the flow clearer:

From data to training objectives
Observed data → probabilistic model p(data | θ) → likelihood L(θ) = p(data | θ) → take the log → log-likelihood Σ log probabilities → negate for optimization → negative log-likelihood / cross-entropy → binary classifier · softmax multi-class classifier · LLM next-token training
One chain from observed data to the concrete losses used in classifiers and language models.

Solution

The clean way to understand log-likelihood is to stop treating it as an isolated definition and instead treat it as the center of a family of equivalent training objectives.

The core workflow looks like this:

  1. Start with the probability of the observed dataset under a parameterized model.
  2. Under factorization assumptions, rewrite the joint probability as a product of simpler conditional probabilities.
  3. Take the log so the objective becomes a sum.
  4. Maximize log-likelihood, or equivalently minimize negative log-likelihood.
  5. Recognize that in classification, this is cross-entropy.
  6. Recognize that in sequence modeling, this is token-level next-step prediction.

That gives me one framework that covers:

  • Bernoulli models for binary classification
  • categorical models with softmax for multi-class classification
  • autoregressive factorization for language models

It also explains implementation choices that otherwise look arbitrary:

  • why losses are often written in log-space
  • why libraries expose binary_cross_entropy_with_logits instead of asking for probabilities
  • why softmax is paired with cross-entropy
  • why calibration fixes are often needed after training

The trade-off is that likelihood-based training is excellent at rewarding correct predictions, but not automatically good at representing uncertainty in a calibrated way. That is not a flaw in the math; it is a consequence of what the objective is asking the model to do.

Implementation

I want to keep implementation focused on the ideas, not bury it in framework code.

1. Binary negative log-likelihood from probabilities

import numpy as np

def binary_nll_from_probs(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    y_prob = np.clip(y_prob, eps, 1 - eps)
    loss = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()

This matches the Bernoulli negative log-likelihood exactly. The clipping is there only to avoid log(0).

2. Binary negative log-likelihood from logits

In practice, I would rather compute from logits directly.

def binary_nll_from_logits(y_true, logits):
    y_true = np.asarray(y_true)
    logits = np.asarray(logits)

    # Stable binary cross-entropy with logits
    loss = np.maximum(logits, 0) - logits * y_true + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

This version is numerically safer for extreme values.

3. Softmax cross-entropy from logits

def softmax(logits):
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_vals = np.exp(shifted)
    return exp_vals / np.sum(exp_vals, axis=1, keepdims=True)

def multiclass_nll(y_true, logits):
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), y_true]))

The important line is the shift by np.max(logits, axis=1, keepdims=True). That is the log-sum-exp stabilization in action.

4. Token-level language modeling objective

Even without implementing a full model, the objective is simple to write.

def sequence_nll(token_log_probs):
    """
    token_log_probs: list or array of log p(x_t | x_<t) for the observed tokens
    """
    token_log_probs = np.asarray(token_log_probs)
    return -token_log_probs.sum()

def average_token_nll(token_log_probs):
    token_log_probs = np.asarray(token_log_probs)
    return -token_log_probs.mean()

def perplexity(token_log_probs):
    return np.exp(average_token_nll(token_log_probs))

This is exactly what scales up in LLM training: gather the log probability assigned to each observed next token, sum them, negate, average.

5. A tiny numerical illustration of overconfidence

Suppose the true class is 1 in binary classification.

probs = np.array([0.60, 0.80, 0.95, 0.999])
losses = -np.log(probs)
print(losses)

This yields approximately:

[0.511, 0.223, 0.051, 0.001]

The loss keeps rewarding sharper confidence as long as the model is correct.

Now look at a confidently wrong prediction:

wrong_probs = np.array([0.40, 0.20, 0.05, 0.001])
wrong_losses = -np.log(wrong_probs)
print(wrong_losses)

That yields roughly:

[0.916, 1.609, 2.996, 6.908]

So the objective strongly discourages confident mistakes and strongly rewards confident correct predictions. That is exactly why it works well, and also part of why it can overshoot into overconfidence.

What I'd Do Differently

If I were teaching this earlier in my own learning process, I would spend less time on the slogan “maximize likelihood” and more time on the derivation from the joint probability of the dataset. That is the step that makes everything downstream feel inevitable instead of arbitrary.

I would also separate three questions more explicitly:

  • how do I derive the objective?
  • how do I compute it stably?
  • how well do its probabilities reflect real uncertainty?

Those are related but not identical.

In production systems, I would be careful not to stop at low loss or high accuracy. If the model’s probabilities drive decisions, ranking thresholds, risk policies, or user-facing confidence, calibration matters. A model trained with pure negative log-likelihood can still be miscalibrated even when it predicts well.

I would also emphasize logits more aggressively. A lot of instability disappears once I keep computations in logit space until the last possible moment. This matters in classifiers, and it matters even more in large-vocabulary softmax models.

Finally, for language modeling and large classifiers, I would treat one-hot labels as a modeling choice rather than a law of nature. Label smoothing, distillation targets, and calibration layers are not hacks around likelihood; they are ways of expressing a more realistic target distribution when the world is noisier than the labels suggest.

Key Takeaways

  • Log-likelihood comes directly from the probability of the observed dataset; under factorization assumptions, the joint probability becomes a product, and the log turns that product into a sum.
  • Maximizing likelihood and maximizing log-likelihood are equivalent because the log function is monotonic, but log-likelihood is much easier to optimize and compute stably.
  • Binary cross-entropy and categorical cross-entropy are just negative log-likelihoods for Bernoulli and categorical models.
  • Softmax pairs naturally with log-likelihood because it turns logits into a normalized probability distribution, and the loss becomes the negative log probability of the true class.
  • LLM training is the same idea applied autoregressively: maximize the log probability of each observed next token given previous tokens.
  • The same objective that makes models effective also pushes them toward sharp probabilities, which is one reason overconfidence and calibration problems show up in practice.