木鸟杂记

Large-Scale Data Systems

Prologue

When you first enter the world of large models, gaps in foundational math — linear algebra, probability theory, information theory — make it easy to get lost among terms like logprob (log probability), likelihood, NLL (negative log likelihood), cross entropy, and perplexity. They appear in every corner of papers and documentation, yet they all feel like acquaintances you know only by name: you see them often, but never truly understand them.

Then one day, after slowly catching up on some basic math and steeping in the company context long enough, it finally clicked during a chat with ChatGPT: these concepts are essentially different perspectives on the same thing. Enter through the door of probability theory and it's called NLL; step through the door of information theory and it's called cross entropy; look through the door of PyTorch and it's F.cross_entropy — different paths leading to the same destination, all essentially trying to characterize "how far the model's current output is from the expected result."

“Viewed from the side, a mountain looks like a ridge; viewed from the end, a single peak” — in a high-dimensional field like large models, this feeling of blind men touching an elephant is everywhere. But we three-dimensional creatures can only rely on long-term immersion, cross-verifying knowledge from different domains, until one day we suddenly have an epiphany — ah, so this is the same mountain.

What this article aims to do is talk about the most fundamental concept in the large model domain — the “many faces” of cross entropy as a loss function.
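To make the equivalence concrete, here is a minimal sketch in plain Python (toy numbers, no PyTorch): the NLL of the target token, the cross entropy against a one-hot target distribution, and perplexity are the same quantity seen through different doors.

```python
import math

# Toy next-token distribution predicted by a model over a 4-token vocabulary.
probs = [0.7, 0.1, 0.1, 0.1]
target = 0  # index of the ground-truth token

# Probability-theory view: negative log likelihood of the target token.
nll = -math.log(probs[target])

# Information-theory view: cross entropy between the one-hot target
# distribution and the model distribution -- only the target term survives.
one_hot = [1.0 if i == target else 0.0 for i in range(len(probs))]
cross_entropy = -sum(p * math.log(q) for p, q in zip(one_hot, probs) if p > 0)

# Perplexity is simply the exponential of the (average) NLL.
perplexity = math.exp(nll)

assert abs(nll - cross_entropy) < 1e-12
```

With a one-hot target, all terms of the cross-entropy sum vanish except the target token's, which is exactly why F.cross_entropy on class indices computes the NLL.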


Read more »

I joined an LLM company in early 2024, having previously worked in the infra industry (databases, storage, etc.), so I have some very basic insights on switching careers. I haven’t shared on Bilibili for a long time; this live stream forced me to get back in gear. I answered some of your questions and bridged a bit of the information gap. This post is a slightly more organized summary of some points mentioned during the stream, with some materials I find valuable attached at the end.

Bilibili live stream: https://www.bilibili.com/video/BV1uckJBkEto


Read more »

Since becoming distinctly self-aware, never have I clashed so intensely with the world and with myself as I have this year—yet the result is strangely magical: I have become even more peaceful. Many subconscious reactions, many habitual practices, when excavated inward, can be traced back to such ancient reinforcement chains. Just as Shi Tiesheng said—the bullet fired in youth strikes squarely between the brows at this age.

Thus, whether forced or spontaneous, this year has become an inevitable journey of inward growth—observing and tracing the subtle origins of my emotional shifts, as in the investigation of things to extend knowledge. Seeing heaven and earth, seeing all beings, ultimately serves to see oneself. Although old inertia will persist for some time, the beginning of awareness is the seed that shapes a different trajectory.

Foguang Temple's Sutra Pillar and East Main Hall

Read more »

Princeton COS 597R “Deep Dive into Large Language Models” is a graduate course at Princeton University that systematically explores the principles of large language models, their preparation and training, architectural evolution, and applications in cutting-edge directions such as multimodality, alignment, tool use, and related issues. Note that this course focuses on conceptual understanding rather than engineering implementation.

I previously worked on distributed systems and database kernels, and over the past two years I moved to a large-model company to work on data. These notes are mainly my organization and distillation of the course's papers; what sets them apart is that I combine hands-on experience from solving practical problems at work, offering a career-switcher's perspective, in the hope of helping others who want to move from engineering into algorithms.

This article comes from my paid column "System Thinking Daily." If you'd like more large-model analysis, consider subscribing; coupon information is at the end of the article.

This article mainly focuses on the foundational work of large models — Transformer.

First, we need to clarify the problem domain: what Transformer tries to solve is the sequence modeling problem, with the main representatives being language modeling and machine translation. Second, we need to know the problems existing in predecessor methods — RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) — in order to understand the innovation of Transformer. Finally, the key points of Transformer’s solution lie in the “multi-head attention mechanism” and “positional encoding”.
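As a preview of the first of those two key points, here is a minimal single-head scaled dot-product attention in plain Python — a sketch for intuition only (no batching, no multiple heads, no positional encoding):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of d-dimensional vectors, one per token.
    Each output row is a weighted average of the value vectors,
    weighted by how well the query matches each key.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

Because this weighted sum is permutation-invariant over the keys, the model needs positional encoding — the second key point — to recover token order.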

Read more »

As cloud infrastructure continues to mature, emerging companies typically move their infrastructure to the cloud in order to achieve business goals quickly. Developing on the cloud is actually quite different from traditional development using physical machines directly. The cloud emphasizes sharing and elasticity more, and as scale grows, isolation becomes important as well. These changes also force us to make some adjustments when developing. For large-scale data processing on the cloud, I mainly have experience with Spark and Ray, using Python as the primary language. Starting from these technology stacks, I’d like to share some development practices that have proven to be fairly effective.

When using Ray for large-scale data processing on the cloud, the basic idea is: build the minimum parallelizable unit, perform functional and performance testing, and then scale using ray.data (e.g., map, map_batches). When using Spark, it’s slightly different; compared to Ray, Spark is somewhat less flexible but has better abstraction and encapsulation. You can think about data processing from the perspective of the dataset as a whole, and Spark will automatically scale and handle fault tolerance based on the number of partitions and parallelism you set.
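A minimal sketch of that workflow (the column name and paths below are hypothetical): write the per-batch transform as a plain function, verify it locally on a tiny sample, then hand the same function to ray.data to scale out.

```python
def clean_batch(batch: dict) -> dict:
    """Pure per-batch transform: operates on a dict of column -> list.
    The "text" column and the cleaning logic are illustrative only."""
    batch["text"] = [t.strip().lower() for t in batch["text"]]
    return batch

# 1) Functional test on a tiny in-memory batch -- no cluster needed.
sample = {"text": ["  Hello ", "WORLD"]}
assert clean_batch(sample) == {"text": ["hello", "world"]}

# 2) Once the unit is verified, scale out with ray.data (sketch):
# import ray
# ds = ray.data.read_parquet("s3://bucket/input/")
# ds = ds.map_batches(clean_batch)
# ds.write_parquet("s3://bucket/output/")
```

Keeping the unit a pure function also makes performance testing simple: time it on one batch, and the cluster-level throughput estimate falls out of the parallelism you configure.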

Read more »

streamlit is a Python library for quickly developing simple web apps. Its slogan is:

A faster way to build and share data apps

It is very popular in machine learning, data science, and even today's large language model space. Its advantages are quite prominent:

  1. Uses the favorite language of developers in the above fields: Python. No need to write frontend code; just pip install and you’re ready to go.
  2. With just a few lines of code, you can quickly whip up a web page for data visualization, labeling, and other small tools.
  3. It also supports rich third-party component extensions, such as the community-developed code_editor.

Of course, if you also need low latency, high concurrency, or deep customization, then sorry — that’s the part streamlit has traded off. But for small tools intended for internal use by a handful of people, streamlit is simply a godsend. You could say it occupies this small ecological niche so perfectly that it was acquired by Snowflake for $800 million in 2022.

In this article, let’s take a look at its basic design philosophy and some simple practices.

Design Philosophy

Its basic design philosophy can be summarized as:

  1. Write frontend in a backend language
  2. Rebuild upon receiving new events
  3. Support session-level caching
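These three ideas can be illustrated without streamlit itself. Below is a pure-Python sketch of the execution model: each user event reruns the whole "script" from top to bottom, and a per-session dict plays the role of st.session_state (the counter app is a hypothetical example).

```python
# Survives across reruns, scoped to one browser session -- the stand-in
# for st.session_state.
session_state = {}

def run_script(event=None):
    """Stands in for your streamlit script: on every user event the whole
    function reruns from the top, rebuilding the page."""
    if "count" not in session_state:
        session_state["count"] = 0
    if event == "button_clicked":
        session_state["count"] += 1
    return f"Current count: {session_state['count']}"

# First render: no event yet.
assert run_script() == "Current count: 0"
# Each click triggers a full rerun; only session state persists between runs.
assert run_script("button_clicked") == "Current count: 1"
assert run_script("button_clicked") == "Current count: 2"
```

This "rebuild everything, cache what must survive" model is what makes streamlit scripts so short — and also why it trades away low latency and fine-grained control.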
Read more »

Introduction

In certain workloads, memory usage grew steadily over time until an OOM occurred; the cause turned out to be memory fragmentation. Replacing the system's default memory allocator (glibc malloc) with jemalloc effectively bounded the memory growth.

To understand the principles behind this, I sought out jemalloc's original paper: A Scalable Concurrent malloc(3) Implementation for FreeBSD. Of course, jemalloc today may differ significantly from when the paper was published in 2006, so this article covers only what the paper described at the time. For more on jemalloc's current mechanisms, see the documentation and source code in its GitHub repository.

Background

Before discussing the main ideas of the paper, let’s briefly review the role and boundaries of a memory allocator. In short:

  1. Downward, it requests large chunks of memory from the operating system (using system calls like sbrk, mmap)
  2. Upward, it serves memory allocation requests of various sizes from the application layer (malloc(size)), and releases that memory once the application indicates it is no longer needed (free)

In the simplest terms, the allocator’s functions are very simple: allocation and deallocation (malloc and free). One might imagine the implementation is also very straightforward—just use a table to keep track of all used and unallocated memory (a bit of bookkeeping), and then:

  1. When a malloc request comes in, first look in the free list; if there’s not enough, ask the OS for more
  2. When a free request comes in, return it to the free list; if there’s too much free memory, return it to the OS
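The naive bookkeeping described above can be sketched in a few dozen lines. The toy allocator below keeps a sorted free list of (offset, size) holes, does first-fit malloc, and coalesces adjacent holes on free — real allocators (jemalloc included) are vastly more sophisticated, and the names here are invented for illustration.

```python
class ToyAllocator:
    """Toy bookkeeping allocator over a fixed heap of `heap_size` bytes."""

    def __init__(self, heap_size):
        self.free_list = [(0, heap_size)]   # (offset, size) holes, sorted by offset
        self.allocated = {}                 # offset -> size of live allocations

    def malloc(self, size):
        # First fit: scan the free list for the first hole that is big enough.
        for i, (off, sz) in enumerate(self.free_list):
            if sz >= size:
                if sz == size:
                    self.free_list.pop(i)
                else:
                    self.free_list[i] = (off + size, sz - size)
                self.allocated[off] = size
                return off
        return None  # a real allocator would ask the OS for more (sbrk/mmap)

    def free(self, off):
        size = self.allocated.pop(off)
        self.free_list.append((off, size))
        self.free_list.sort()
        # Coalesce adjacent holes -- the naive defense against fragmentation.
        merged = [self.free_list[0]]
        for o, s in self.free_list[1:]:
            mo, ms = merged[-1]
            if mo + ms == o:
                merged[-1] = (mo, ms + s)
            else:
                merged.append((o, s))
        self.free_list = merged
```

Even this toy version hints at the hard problems: first-fit scans are O(n), the single free list is a lock bottleneck under concurrency, and interleaved alloc/free patterns can still leave holes no request fits — exactly the issues jemalloc's design targets.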
Read more »

Snowflake was founded in 2012 by two former Oracle employees, targeting cloud-native data warehouses from the very beginning. Therefore, its architectural design was (at the time) considered very “radical.” This forward-looking vision brought extraordinary returns — Snowflake went public in 2020 with a market capitalization reaching as high as $70 billion, setting the record for the largest software IPO in history.

In this article, we combine two papers — The Snowflake Elastic Data Warehouse and Building An Elastic Query Engine on Disaggregated Storage — to roughly discuss its architectural design.

This article comes from my column “System Thinking Daily.” If you find it helpful, feel free to subscribe to support me.

I have wanted to write this article for a long time, but I got stuck while reading the papers last time — there was too much information, and carpet-bombing reading soon drowned me in details. At that time, I only read two-thirds of it and then put it aside. Last week (2024-07-07), when mentioning the disaggregated-storage Snowflake in the article Spark: How to Scale Down in the Cloud, a reader asked me to write about it, so I picked it up again.

Compared with the push approach last time, this time I adopted a pull approach: that is, instead of passively reading papers, I first thought about how I would design such a cloud-native data warehouse and what problems I might encounter. With these questions in mind, I went back to the papers for answers, and found that the efficiency improved dramatically, which also prevented this article from being abandoned again.

Read more »

I first heard about it from a passing mention on a podcast, so I sought out the audiobook for my commute. This is the 1939 translation by Fu Lei, with a faint, old-fashioned vernacular style. It’s a short book; I finished it in a few days. I like listening to things while walking—what enters my ears and what meets my eyes, the philosopher’s concise aphorisms and the myriad scenes of the street, always produce a curious chemical reaction in my mind, occasionally sending a jolt through me even in the height of summer.

Lately my emotions have been rather turbulent, and listening to this book during my daily commute has brought me comfort and calm on several occasions. The causes of happiness and unhappiness the book points out all hit upon certain flaws and traits of mine, so after finishing it I felt I should write something down.

Bertrand Russell’s “The Conquest of Happiness”

When humanity moved from the hunting era to the agricultural era, we gained relative stability of life but lost outward exploration and adventure. The industrial era and accelerating urbanization detached blue-collar and white-collar workers even further from nature; only a small number of entrepreneurs still maintain anything like a jungle way of life.

Choosing stability means having a great deal of “boredom” to dispel. But most people excessively concentrate their attention on themselves—for example, the persecution maniac (obsessing over behavior that doesn’t conform to childhood prejudices or social conditioning), the narcissist (excessive vanity seeking external praise), and the megalomaniac (excessive desire for power)—which causes this boredom to grow wildly in fantasy until it fills people’s hearts.

Read more »

ray.data is a wrapper layer built on top of Ray Core. With ray.data, users can implement large-scale heterogeneous data processing (mainly mixing CPU and GPU stages) with very little code. In one sentence: simple and easy to use, but full of pitfalls.

In the previous post, we started from the user interface and briefly outlined ray.data's main APIs. In this post, we take a macro view and walk through its basic principles. In later posts, we will combine code details with practical experience to cover several important topics: execution scheduling, data formats, and a pitfall-avoidance guide.

This article comes from my column "System Thinking Daily." If you find it helpful, consider subscribing to support me.

Overview

At a high level, a ray.data processing job can be roughly divided into three sequential stages:

  1. Data Loading: Reading data from external systems into Ray’s Object Store (e.g., read_parquet)
  2. Data Transformation: Using various operators to transform data in the Object Store (e.g., map/filter/repartition)
  3. Data Write-back: Writing data from the Object Store back to external storage (e.g., write_parquet)
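The shape of those three stages can be mimicked in plain Python (the record schema and the stand-in functions below are invented for illustration; in real ray.data each stage streams blocks through the Object Store):

```python
# 1) Data loading: read records from an external source
#    (an in-memory stand-in for read_parquet).
def load():
    return [{"text": "hello"}, {"text": "ray data"}]

# 2) Data transformation: apply operators record by record (like ds.map).
def transform(rows):
    return [{**r, "length": len(r["text"])} for r in rows]

# 3) Data write-back: persist the results (a stand-in for write_parquet).
sink = []
def write_back(rows):
    sink.extend(rows)

write_back(transform(load()))
```

What ray.data adds on top of this simple shape — pipelined execution across stages, block-level parallelism, and spilling via the Object Store — is exactly what the following posts dig into.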
Read more »

Because I could never grasp the physical meaning of the various matrix operations, my repeated attempts over the years to get into machine learning were all turned away at the door. By chance, a colleague recommended MIT's classic linear algebra open course. After a few lectures, it was quite exhilarating; the door that had been shut tight seemed to open a crack.

So this series will share a few interesting points from the course in each article. To avoid being obscure, each installment will be as self-contained and concise as possible, so feel free to dip in anywhere. As a consequence, the series sacrifices some rigor and is not systematic; it only aims to spark a little interest. Note: the examples are all generated by KimiChat.

Read more »

This is a problem I encountered a long time ago. It was quite interesting, so I still remember it to this day. The problem borrows the context of TCP and asks you to implement a key piece of TCP logic: “in-order assembly”:

  1. From the TCP layer’s perspective, IP layer packets are received out of order.
  2. From the application layer’s perspective, data delivered by the TCP layer is in order.

What’s interesting about this problem is that by borrowing the TCP context, you can first discuss some TCP fundamentals with the candidate, then pivot to introduce this problem. This way, you can test both foundational knowledge and engineering coding skills.

Problem

struct Packet {
  size_t offset;
  size_t length;
  uint8_t *data;
};

// Implement "in-order delivery" semantics
class TCP {
  // Called by the application layer: read up to `count` bytes into `buf`,
  // in order, and return the number of bytes actually read
  size_t read(void *buf, size_t count);
  // TCP-layer callback: receives IP-layer packets in random order
  void receive(Packet* p);
  // TCP-layer callback: all data has been sent and the connection is closed
  void finish();
};
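For reference, here is a sketch of the core "in-order assembly" logic in Python, mirroring the C++ interface above: buffer out-of-order segments keyed by offset, advance a next_offset cursor as gaps fill in, and let read() hand out only contiguous bytes. It ignores overlapping and duplicate segments for brevity.

```python
class TCP:
    def __init__(self):
        self.segments = {}          # offset -> bytes, out-of-order buffer
        self.next_offset = 0        # first byte not yet contiguous
        self.stream = bytearray()   # contiguous, ready-to-read bytes
        self.read_pos = 0           # how far the application has read

    def receive(self, offset, data):
        """TCP-layer callback: segments may arrive in any order."""
        self.segments[offset] = data
        # Drain every buffered segment that is now contiguous.
        while self.next_offset in self.segments:
            seg = self.segments.pop(self.next_offset)
            self.stream += seg
            self.next_offset += len(seg)

    def read(self, count):
        """Application-layer call: return up to `count` in-order bytes."""
        chunk = bytes(self.stream[self.read_pos:self.read_pos + count])
        self.read_pos += len(chunk)
        return chunk
```

The dict-plus-cursor structure is the heart of the answer; in an interview you can then probe the follow-ups the real protocol forces (overlapping retransmissions, sequence-number wraparound, bounding the reassembly buffer, i.e. flow control).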
Read more »