Why Is the Loss Function of Large Models Cross-Entropy?

Prologue

When first entering the world of large models, due to gaps in foundational math knowledge such as linear algebra, probability theory, and information theory, it’s easy to get lost among numerous terms: logprob (log probability), likelihood, NLL (Negative Log Likelihood), cross entropy, perplexity. They frequently appear in various corners of papers and documentation, yet they all feel like acquaintances you only know by name — you see them often, but don’t truly understand them.

Then one day, after slowly catching up on some basic math knowledge and immersing myself in the company context for long enough, I finally realized during a chat with ChatGPT: the above set of concepts are essentially different perspectives of the same thing. Enter through the door of probability theory and it’s called NLL; step through the door of information theory and it’s called cross entropy; look through the door of PyTorch and it’s F.cross_entropy — different paths leading to the same destination, all essentially trying to characterize “how far the model’s current output is from the expected result.”

“Viewed from the side, a mountain looks like a ridge; viewed from the end, a single peak” — in a high-dimensional field like large models, this feeling of blind men touching an elephant is everywhere. But we three-dimensional creatures can only rely on long-term immersion, cross-verifying knowledge from different domains, until one day we suddenly have an epiphany — ah, so this is the same mountain.

What this article aims to do is talk about the most fundamental concept in the large model domain — the “many faces” of cross entropy as a loss function.

NLL-entropy.png

Author: 木鸟杂记 https://www.qtmuniao.com/2026/03/29/llm-loss/ Please cite the source when reposting

From “Continuation” to NLL — The Loss Function from a Probability Perspective

Before discussing the loss function, let’s first perceive the working principle of large models in the most direct way. Don’t be dazzled by the superstructures like Agents, reasoning, multimodality, and world models. At their foundation, they are only doing one thing: given context, predict the next token. For example, if you ask “The capital of France is,” the model will think “Paris” is very reasonable, “Tokyo” doesn’t seem right, and “database” is completely absurd. The so-called “language model” is essentially a machine that judges “how to continue writing.”
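
To make this concrete, here is a minimal sketch of asking a real causal language model for its next-token probabilities. It assumes the Hugging Face `transformers` package and the public `gpt2` checkpoint (both are assumptions of this illustration, not part of the original text); the exact numbers will vary between models, but the ordering — “Paris” high, “Tokyo” low, “database” negligible — is the point.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (batch, seq_len, vocab_size)

# Distribution over the whole vocabulary for the *next* token.
next_token_probs = logits[0, -1].softmax(dim=-1)

for candidate in [" Paris", " Tokyo", " database"]:
    tid = tok.encode(candidate)[0]             # first sub-token of the candidate
    print(f"{candidate!r:12s} p = {next_token_probs[tid].item():.4f}")
```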

This angle is crucial because all the subsequent prob, logprob, and loss revolve around this action — whether the model continues writing correctly, and what ruler should I use to measure it.

With this starting point, let’s continue to break down the training objective of language models. Suppose we have a sample where the prefix is “How long does a refund take to arrive?” and the suffix is the standard answer. The metric we care about transforms into: given this prefix, what is the probability that the model will produce this suffix? Expressed as a conditional probability, this is $P(\text{suffix} \mid \text{prefix})$. The whole story begins here.

But when the model answers, it doesn’t “spit out half the splendor of the Tang dynasty in a single breath,” as the poet’s line goes. That is, the model doesn’t produce the entire answer at once; it continues writing one token at a time. In the first step, it sees the prefix and predicts token1; in the second step, it sees prefix + token1 and predicts token2; and so on.

For example, when a large model generates the sentence “The capital of France is Paris”:

First, it uses “The capital of France is” as context to generate “Paris.”

Then, it uses “The capital of France is Paris” as context to generate “.”

This is like an escape room — every door requires you to answer the code correctly to pass. Getting a whole sentence right can be broken down into getting every word in that sentence right. Therefore, the probability of continuing to write the entire suffix is the product of the conditional probabilities of predicting the next token correctly at each step:

$$P(\text{suffix} \mid \text{prefix}) = \prod_{t=1}^{T} P(y_t \mid \text{prefix}, y_1, \dots, y_{t-1})$$
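
A toy sketch of this factorization, with made-up per-step probabilities (a real model would supply its own numbers):

```python
# Each step: (context so far, next token, the model's probability for that token).
steps = [
    ("The capital of France is",       " Paris", 0.62),   # made-up number
    ("The capital of France is Paris", ".",      0.81),   # made-up number
]

p_suffix = 1.0
for context, token, p in steps:
    print(f"P({token!r} | {context!r}) = {p}")
    p_suffix *= p                                # chain rule: multiply the conditionals

print("P(suffix | prefix) =", p_suffix)          # 0.62 * 0.81 = 0.5022
```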

But multiplication brings a fatal engineering problem — the final probability value will be absurdly small. A sentence of 100 tokens, with an average prediction probability of 0.1 at each step, results in a total probability of $0.1^{100} = 10^{-100}$. Lower-precision floating-point formats can’t even represent a number that small (if you’re interested, look up floating-point underflow). The solution in mathematics is very simple: take the logarithm of both sides. Moreover, logarithms have an excellent property — they turn multiplication into addition:

$$\log P(\text{suffix} \mid \text{prefix}) = \sum_{t=1}^{T} \log P(y_t \mid \text{prefix}, y_{<t})$$

What was originally a long string of decimal multiplication now becomes numerical accumulation, which is more stable, easier to analyze, and convenient to break down by token. This is the same principle as keeping accounts when investing — you usually wouldn’t say “my current assets are last year’s principal multiplied by 1.05 then by 0.97 then by 1.12,” you might say “for my investments, I gained 5 points the year before last, lost 3 points last year, and so far this year, I’m up 12 points.” Using cumulative addition to record changes is far cleaner than using multiplication to record totals. That is, we don’t need to precisely focus on the total; the cumulative change already gives us a rough grasp of your investment ability. So in engineering, log probability (logprob) is more commonly used than direct probability multiplication (prob). You can roughly understand logprob as a “total score” of the model’s tendency to output a certain answer.
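
A small numerical sketch of why the log form is preferred. The per-token probability 0.1 and the length 300 are arbitrary choices for illustration, and single precision is used deliberately to make the underflow visible:

```python
import numpy as np

per_token_prob, n_tokens = 0.1, 300

# Naive product in 32-bit floats: underflows to exactly 0.0 well before 300 steps.
prod = np.float32(1.0)
for _ in range(n_tokens):
    prod *= np.float32(per_token_prob)
print("product of probs:", prod)                 # 0.0 -- all information lost

# Sum of log probabilities: stable, and trivially decomposable per token.
logprob = n_tokens * np.log(per_token_prob)
print("sum of log probs:", logprob)              # about -690.8, i.e. log(1e-300)
```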

At this point, things become elegant. If we were to design a loss function for large model training (crudely: the model’s training objective is to minimize this loss value), it needs to satisfy two intuitions: if the probability $p$ that the model continues with the correct answer is high, the loss should be small; if the probability is low, the loss should be large. Since $\log p$ is monotonically increasing, and we want a monotonically decreasing function, the simplest way is to take the negative: $-\log p$.

Why is this function we constructed through twists and turns usable? This is the most basic method in mathematical modeling — we tend to use the simplest form that satisfies various properties (such as monotonicity, i.e., trends, and rate of change, i.e., derivatives, etc.) to construct formulas.

And $-\log p$ as a loss function happens to satisfy three properties:

  • First, no penalty for being right — when $p = 1$, $-\log p = 0$, which is the right trend.
  • Second, heavy penalty for being wrong, and the more wrong, the heavier the penalty — this is the most interesting part of the curve. At $p = 0.9$, the loss is only 0.105; at $p = 0.1$, it rises to 2.303; at $p = 0.01$, it soars to 4.605. The lower the probability, the heavier the penalty. This is very much like a spring — a light pull produces barely any force, but near its limit the restoring force increases dramatically. This is the right rate of change.
  • Third, naturally consistent with probability modeling — we originally wanted to maximize the probability that the model continues with the correct answer, i.e., $\max p$, which is equivalent to maximizing its logarithm, $\max \log p$; transforming this into the minimization form more common in optimization problems gives $\min\,(-\log p)$.

So this loss function wasn’t pulled out of thin air; it grew step by step from the simple idea of how to measure “whether the model can continue to write the correct answer.”
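
The numbers quoted above are easy to check yourself (natural logarithm, as is conventional for this loss):

```python
import math

for p in [1.0, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<4}  ->  -log(p) = {-math.log(p):.3f}")
# p = 1.0   ->  0.000   (right answer: no penalty)
# p = 0.9   ->  0.105
# p = 0.5   ->  0.693
# p = 0.1   ->  2.303
# p = 0.01  ->  4.605   (the more wrong, the sharply heavier the penalty)
```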

At this point, we arrive at a name that looks quite strange — NLL, Negative Log Likelihood. As just derived, it is simply $-\log P(\text{suffix} \mid \text{prefix})$: a penalty that grows the less likely the model is to continue with the correct answer. What we call large model “training” is the process of trying to minimize it over massive amounts of data.

For the entire paragraph written by the model, its NLL is

$$\mathrm{NLL} = -\frac{1}{T}\sum_{t=1}^{T} \log P(y_t \mid \text{prefix}, y_{<t})$$

That is, we accumulate the negative log probability of the correct token at each step and divide by the number of tokens $T$ in the paragraph, giving the average loss per token. The concept itself isn’t complicated; the reason it looks scary is simply that the name “negative log likelihood” wraps a simple thing in three layers. Peel them apart — negative sign, logarithm, likelihood — and each layer is exactly what we just discussed (the likelihood is the probability of continuing with something that looks right).
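
A minimal sketch of that per-token average in PyTorch. The logits and targets here are random stand-ins; in real training the logits come from the model and the targets are the actual next tokens of the sample.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 12
logits = torch.randn(seq_len, vocab_size)               # model scores at each position (stand-in)
targets = torch.randint(0, vocab_size, (seq_len,))      # the tokens that actually come next (stand-in)

log_probs = F.log_softmax(logits, dim=-1)               # log q(token | context) for every vocab entry
token_nll = -log_probs[torch.arange(seq_len), targets]  # -log q(correct token), one value per position
avg_nll = token_nll.mean()                              # sum over tokens, divide by T

print(token_nll)   # per-token losses
print(avg_nll)     # the training loss for this sample
```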

Cross Entropy — Re-examining from an Information Theory Perspective

If negative log likelihood is climbing Mount Lu via the path of probability theory, then cross entropy is ascending via the path of information theory.

Information theory was born in the field of communications — in Shannon’s “A Mathematical Theory of Communication.” It doesn’t care about what a message semantically says, but tries to quantify how much “information” a message contains in a statistical sense. In computer terms, it’s the minimum number of bits required to transmit that message.

So how do we determine the amount of information in a message?

  • Discrete method. A common method in computing is to continuously eliminate the uncertainty of the message through binary division. Each binary division introduces one bit for encoding.

  • Continuous method. Fit it with a function: use the probability $p(x)$ to represent the uncertainty of a message $x$, and use the negative logarithm to capture the trend — the more certain (the higher the probability), the less information, decreasing logarithmically. That is, $I(x) = -\log p(x)$. In information theory, this value is also called “self-information.”

Furthermore, we can introduce the most fundamental concept in information theory — entropy (also called information entropy; it has a deep connection with thermodynamic entropy, and both can be linked to the idea of entropy increase). It is defined as “the expected (weighted average) amount of information in a system.” That is, the amount of information describes a single event, while entropy describes the system as a whole:

$$H(p) = \sum_x p(x)\, I(x) = -\sum_x p(x)\log p(x)$$

For example, suppose a system contains three events a, b, and c, with a occurring with probability 0.1, b with probability 0.4, and c with probability 0.5. Then the entropy of this system is

$$H = -(0.1\log_2 0.1 + 0.4\log_2 0.4 + 0.5\log_2 0.5) \approx 1.36 \text{ bits}$$

That is, the weighted average of the amount of information for all events, which is the expected value (E) of the amount of information.
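
The toy system above in a few lines of Python (base-2 logarithm, so the result is in bits):

```python
import math

p = {"a": 0.1, "b": 0.4, "c": 0.5}

# Self-information of each event: I(x) = -log2 p(x).
info = {x: -math.log2(px) for x, px in p.items()}
print(info)   # a: ~3.32 bits, b: ~1.32 bits, c: 1.0 bit

# Entropy: the probability-weighted average of the self-information.
H = sum(px * info[x] for x, px in p.items())
print(H)      # ~1.36 bits
```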

Now let’s talk about cross entropy, which describes the consistency between two distributions: if the real world (system 1) poses questions according to some distribution (law) $p$, and the large model (system 2) answers with another distribution (law) $q$, how “surprised,” on average, will the model be by what actually happens? Written as a formula, this is

$$H(p, q) = -\sum_x p(x)\log q(x)$$

That is: the amount of information measured with the model’s distribution $q$, averaged with the real world’s weights $p$.

Another example — you go to an unfamiliar city with an old map in your hand (the model distribution $q$), while the city has already undergone massive demolition and reconstruction (the true distribution $p$). The cross entropy formula aims to characterize how many extra detours you will take on average while walking with this old map. The more accurate the map, the fewer detours; the more outdated the map, the harder it is for you to move. In short: cross entropy measures the degree of alignment between the model distribution and the true distribution.
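
A toy sketch of the same idea with made-up distributions — a “world” $p$ and two candidate “maps” $q$. The better map gives the smaller cross entropy:

```python
import math

p      = {"rain": 0.7, "sun": 0.2, "snow": 0.1}   # how the world actually behaves (made up)
q_good = {"rain": 0.6, "sun": 0.3, "snow": 0.1}   # a map close to reality
q_bad  = {"rain": 0.1, "sun": 0.1, "snow": 0.8}   # a badly outdated map

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x); events with p(x) = 0 contribute nothing.
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

print(cross_entropy(p, q_good))   # ~1.20 bits: few "detours"
print(cross_entropy(p, q_bad))    # ~3.02 bits: the old map keeps misleading you
```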

So what is its relationship with NLL? We construct a language system in which there is a vocabulary, and each event is a token. In the data world, all our text can be represented as some sequence of tokens in the vocabulary. When the large model performs inference (i.e., continues to write the next token), it is essentially selecting the token with the highest probability from the vocabulary based on the current state, within the distribution it has learned. And what can ultimately characterize this distribution? That’s right, “entropy.” So how do we measure whether the large model’s training conforms to the true distribution? That’s right, by minimizing “cross entropy.” That is, the “world map” we have trained can be used to predict paths in the physical world.

Returning to the formulas, let’s build a bridge. Every time we continue to write the next word, the words so far form a “current cross-section” of the system. Standing in this current state, when we need to choose the next word, we have a selection probability for each word in the vocabulary. In this transient system, each word is an event, the selection probabilities form the model distribution $q$, and $-\log q(\text{word})$ is each word’s amount of information.

In language model training, this can actually be framed as a classification problem. That is, for each specific training sample, the prefix is fixed and the next token is determined: its probability is 1, and the probabilities of all other tokens are 0. We call this distribution one-hot. Substituting it into the cross entropy formula $H(p, q) = -\sum_x p(x)\log q(x)$, since only the correct token has $p(x) = 1$, the entire summation collapses into a single term: $-\log q(\text{correct token})$. Isn’t this exactly NLL?
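
A quick check of this collapse in PyTorch, with a toy 4-token vocabulary. The explicit one-hot sum, the single-term form, and `F.cross_entropy` all produce the same number — which is exactly why the training loss, NLL, and PyTorch’s cross entropy are the same thing:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5, 0.1]])    # model scores over a 4-token "vocabulary" (toy)
target = torch.tensor([0])                         # the correct next token is index 0

q = F.softmax(logits, dim=-1)                      # model distribution q
p = F.one_hot(target, num_classes=4).float()       # true distribution p: one-hot

full_sum    = -(p * q.log()).sum()                 # H(p, q) = -sum_x p(x) log q(x)
single_term = -q[0, target].log()                  # collapses to -log q(correct token)
builtin     = F.cross_entropy(logits, target)      # what PyTorch computes directly

print(full_sum.item(), single_term.item(), builtin.item())   # three equal numbers
```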

There is an easy pitfall here worth mentioning separately: what cross entropy measures is not how similar two pieces of text are semantically — that is something that requires comparing semantic vectors. What it looks at is something more fundamental: for every token in the true sequence, whether the model has assigned it a sufficiently high probability. That is, statistical alignment, not similarity on individual sentences.

Then why is the loss function of large models specifically $-\log p$, and not MSE, hinge loss, or some other penalty? Because it’s not just “good enough” — multiple paths point to it simultaneously: probability theory says it’s the natural form of maximum likelihood, information theory says it’s the optimal measure of encoding cost, numerical computation says it turns multiplication into summation and stabilizes gradients, and in terms of optimization behavior, because it penalizes low-probability events more heavily, the training signal is strong. Four directions all point to the same formula; this is no mere industry convention or historical baggage. Probability theory and information theory, two paths that converge on the same expression in the end — this convergence of different routes is one of the most fascinating aspects of the world.

Summary

Compressing this article into one sentence: we want to know how far the model is from the correct answer; the most natural way is to look at how much probability it assigned to the correct answer; taking the negative logarithm of that probability yields a loss function that is consistent with probability theory, consistent with information theory, and easy to optimize — probability theory calls it NLL, information theory calls it cross entropy, and they are the same thing.

If you are transitioning from systems, data, or Infra to learn about large models, don’t rush to chew through an information theory textbook from cover to cover. First, anchor a few basic intuitions: the most direct way a model works is “predicting the next token,” the loss function attempts to quantify how far the model is from the correct answer, cross entropy looks at probability alignment rather than semantic similarity, and logprob is the key tool that makes probability calculations stable.

Many formula derivations have been simplified for ease of understanding without rigorous proofs. Please feel free to point out any inappropriate parts.


I am 青藤木鸟, a programmer who enjoys photography and focuses on large-scale data systems. Follow my WeChat public account “木鸟杂记” for more articles on distributed systems, storage, and databases. After following, reply “资料” to receive the distributed database study materials I have compiled, or reply “优惠券” for a 20%-off coupon for 《系统日知录》, my paid column on large-scale data systems.

We also have WeChat groups on distributed systems and databases; add me on WeChat (qtmuniao) and I will invite you in — please include the note “分布式系统群”. If you would rather not join a group, there is also a forum for distributed systems and databases (click here); feel free to drop by.

wx-distributed-system-s.jpg