Prologue
When you first enter the world of large models, gaps in foundational math such as linear algebra, probability theory, and information theory make it easy to get lost among a crowd of terms: logprob (log probability), likelihood, NLL (negative log likelihood), cross entropy, perplexity. They show up in every corner of papers and documentation, yet they all feel like acquaintances you only know by name: you see them often, but never truly understand them.
Then one day, after slowly catching up on some basic math and immersing myself in the work context for long enough, I finally realized during a chat with ChatGPT that this whole set of concepts is really just different perspectives on the same thing. Enter through the door of probability theory and it's called NLL; step through the door of information theory and it's called cross entropy; look through the door of PyTorch and it's F.cross_entropy. Different paths, same destination: all of them try to characterize how far the model's current output is from the expected result.
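To make that concrete, here is a minimal PyTorch sketch (the logits and targets are made up purely for illustration): F.cross_entropy on raw logits, F.nll_loss on log-softmax outputs, and a hand-picked average of negative log probabilities all land on the same number.

```python
import torch
import torch.nn.functional as F

# A toy batch of raw model outputs (logits) over a 5-token vocabulary,
# plus the "correct" next-token ids the model should predict.
logits = torch.randn(4, 5)            # (batch, vocab)
targets = torch.tensor([0, 2, 1, 4])  # (batch,)

# Information-theory door: cross entropy between the model's distribution
# and the one-hot target distribution.
ce = F.cross_entropy(logits, targets)

# Probability-theory door: negative log likelihood of the target tokens
# under the model's log-softmaxed distribution.
log_probs = F.log_softmax(logits, dim=-1)
nll = F.nll_loss(log_probs, targets)

# Logprob door: pick out the log probability assigned to each target token
# and average its negative value by hand.
picked = log_probs[torch.arange(4), targets]
manual = -picked.mean()

print(ce.item(), nll.item(), manual.item())  # all three match (up to float error)
```

Under the hood, F.cross_entropy is simply log_softmax followed by nll_loss, which is why the three names keep pointing at the same quantity.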
“Viewed from the side, a mountain looks like a ridge; viewed from the end, a single peak” — in a high-dimensional field like large models, this feeling of blind men touching an elephant is everywhere. But we three-dimensional creatures can only rely on long-term immersion, cross-verifying knowledge from different domains, until one day we suddenly have an epiphany — ah, so this is the same mountain.
What this article sets out to do is discuss the most fundamental concept in the large model domain: the “many faces” of cross entropy as a loss function.


