A Data Visualization Powerhouse — the Interesting Philosophy of Streamlit

streamlit is a Python library for quickly developing simple web apps. Its slogan is:

A faster way to build and share data apps

It is very popular in machine learning, data science, and even today's large language model space. Its advantages are quite prominent:

  1. Uses the favorite language of developers in the above fields: Python. No need to write frontend code; just pip install and you’re ready to go.
  2. With just a few lines of code, you can quickly whip up a web page for data visualization, labeling, and other small tools.
  3. It also supports rich third-party component extensions, such as the community-developed code_editor.

Of course, if you also need low latency, high concurrency, or deep customization, then sorry — that’s the part streamlit has traded off. But for small tools intended for internal use by a handful of people, streamlit is simply a godsend. You could say it occupies this small ecological niche so perfectly that it was acquired by Snowflake for $800 million in 2022.

In this article, let’s take a look at its basic design philosophy and some simple practices.

Design Philosophy

Its basic design philosophy can be summarized as:

  1. Write frontend in a backend language
  2. Rebuild upon receiving new events
  3. Support session-level caching

Author: 木鸟杂记 https://www.qtmuniao.com/2025/03/18/streamlit/ Please indicate the source when reposting

These three points build on one another. Because the frontend is written in a backend language, every user request triggers a full re-execution of the code, so partial refresh is not part of the basic model. To avoid the redundant data loading that full re-execution would cause, streamlit introduces fine-grained caching (session-level state is lost whenever the user actively refreshes the page). Caching the parts that don't need refreshing and reusing them achieves an effect similar to partial refresh.

Summed up in one sentence: Sequential execution for simplicity, on-demand caching for efficiency.
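
To make the rerun-plus-cache idea concrete, here is a toy model in plain Python. It is only a sketch and does not use streamlit itself: `run_script`, `load_data`, and the `events` list are hypothetical names invented for illustration, and each dict in `events` stands in for the widget values after one user interaction.

```python
# Toy model of "rerun on every event, cache on demand". No streamlit involved.
load_calls = 0


def load_data(path):
    """Stand-in for an expensive load; counts how often it really runs."""
    global load_calls
    load_calls += 1
    return f"rows from {path}"


def run_script(widget_values, cache):
    """One full top-to-bottom run of the 'script' for the current widget values."""
    path = widget_values["path"]
    if path not in cache:  # on-demand caching: load only on a cache miss
        cache[path] = load_data(path)
    page = [f"table: {cache[path]}"]
    if widget_values["show_stats"]:
        page.append("stats")
    return page


cache = {}
events = [
    {"path": "a.parquet", "show_stats": False},  # first visit
    {"path": "a.parquet", "show_stats": True},   # checkbox toggled
    {"path": "b.parquet", "show_stats": True},   # new path typed in
]
# Every event reruns the whole script, but the loader only runs on cache misses.
pages = [run_script(values, cache) for values in events]
```

The script body runs three times, once per event, while the expensive load runs only twice, once per distinct path.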

An Example

Let’s keep the functionality simple: the user inputs a local path to a Parquet file, and we read and display it visually.

# app.py

import streamlit as st
import pandas as pd
import pyarrow.parquet as pq
import os

# Load a Parquet file into a pandas DataFrame
def load_parquet(file_path):
    return pq.read_table(file_path).to_pandas()

st.title("Student GPA Visualization")

file_path = st.text_input("Path to the Parquet file:", value="students.parquet")

# Check that the file exists
if os.path.exists(file_path):
    # Load the data
    data = load_parquet(file_path)
    st.write("Data preview:")
    st.dataframe(data)

    # Option 1: show summary statistics
    if st.checkbox("Statistics"):
        st.write(data.describe())

    # Option 2: draw a bar chart of GPA ('绩点') by student name ('姓名')
    if st.checkbox("Bar chart"):
        st.bar_chart(data.set_index('姓名')['绩点'])
else:
    st.error("The file path does not exist. Please check it and try again.")

We simply construct data with two columns — student name and GPA — and store it in a Parquet file (the construction code is placed in the appendix). Run the following command:

streamlit run app.py

Then visit http://localhost:8501 to see the page:

[Figure: streamlit-example.png, screenshot of the running page]

From this example, we can roughly see that streamlit’s syntax is quite concise:

  1. Component construction: Through interfaces like st.title, st.dataframe, st.checkbox, you can quickly construct many standard components without worrying about their styles.
  2. Sequential execution: Unlike JavaScript's event-driven model, streamlit code is executed sequentially from top to bottom, making it very easy to understand and debug. Every time you re-enter a path or click a checkbox, the entire script reruns and the whole page is re-rendered.

Caching

So the question arises: if the student table contains a large amount of data, wouldn’t it be redundant and slow to fully re-execute and reload every time a path is re-entered? To address this, we can use streamlit’s caching mechanism to cache the data.

You can explicitly cache using st.session_state:

st.session_state[file_path] = data
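
The one-line assignment above only stores the data; to actually skip the reload, the script also has to check the key before loading. A minimal sketch of that check-then-store pattern follows. The helper name `get_or_load` and the `fake_loader` function are hypothetical, and a plain dict stands in for `st.session_state` (which supports the same dictionary-style access) so the snippet runs without streamlit.

```python
from typing import Any, Callable, MutableMapping


def get_or_load(state: MutableMapping[str, Any], key: str,
                loader: Callable[[str], Any]) -> Any:
    """Return state[key], calling loader(key) only the first time key is seen."""
    if key not in state:
        state[key] = loader(key)
    return state[key]


# Demonstration with a plain dict in place of st.session_state.
state = {}
calls = []


def fake_loader(path):
    """Hypothetical loader that records each real load."""
    calls.append(path)
    return {"path": path}


first = get_or_load(state, "students.parquet", fake_loader)
second = get_or_load(state, "students.parquet", fake_loader)  # served from state
```

In the app, the equivalent call would presumably look like `data = get_or_load(st.session_state, file_path, load_parquet)`.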

You can also cache by adding the @st.cache_data decorator to the data-loading function. In this case, the cache key is the function’s input parameter, which in this example is also file_path.

@st.cache_data
def load_parquet(file_path):
    return pq.read_table(file_path).to_pandas()
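
To see what "the cache key is the function's input parameter" means in practice, the sketch below uses the standard library's `functools.lru_cache` as a stand-in for `@st.cache_data`. This is an analogy only, not how streamlit implements its cache; `load_table` and the file names are made up for illustration.

```python
from functools import lru_cache

calls = []  # records every time the function body really executes


@lru_cache(maxsize=None)  # stand-in for @st.cache_data in this sketch
def load_table(file_path):
    calls.append(file_path)
    return f"table loaded from {file_path}"


load_table("students.parquet")   # first call: body runs
load_table("students.parquet")   # same argument: served from the cache
load_table("teachers.parquet")   # new argument: body runs again
```

One consequence of keying on the path alone: if the file's contents change on disk while the path stays the same, the cached result is still returned.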

But in streamlit, we can cache not only data but also components (widgets). For example, for st.dataframe, if we don’t want it to re-render every time, just give it a key!

st.dataframe(data, key=f"df-{file_path}")

This way, as long as the key doesn't change, streamlit treats the dataframe as the same element across reruns and can keep its state rather than rebuilding it from scratch.

At this point, we roughly understand streamlit’s philosophy from an intuitive perspective: keeping simplicity through sequential execution, maintaining efficiency through on-demand caching. Let’s summarize with a diagram:

[Figure: streamlit-architecture.png, diagram summarizing sequential execution and on-demand caching]

Summary

This article very briefly analyzed streamlit’s design philosophy through a small example to help everyone build an intuitive understanding. If you have similar GUI needs for internal team use, it’s worth a try.

But due to space constraints, we didn’t delve into how it’s implemented behind the scenes, nor did we cover more advanced usage. If you’re interested in these topics, please leave a comment and let me know.

References

Official documentation: https://docs.streamlit.io/get-started/fundamentals/advanced-concepts

Appendix

Libraries to install:

pip install streamlit pyarrow pandas

Code to construct data:

import pandas as pd

# Build the student data; column names: '姓名' = name, '绩点' = GPA
data = {
    '姓名': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    '绩点': [3.5, 3.8, 2.9, 3.2, 3.9]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Write it to a Parquet file
df.to_parquet('students.parquet', index=False)

print("Student data written to students.parquet.")
