As cloud infrastructure matures, emerging companies typically move their infrastructure to the cloud to reach business goals faster. Developing on the cloud is quite different from traditional development directly on physical machines: the cloud puts more emphasis on sharing and elasticity, and as scale grows, isolation becomes important as well. These changes force some adjustments in how we develop. For large-scale data processing on the cloud, my experience is mainly with Spark and Ray, using Python as the primary language. Starting from this stack, I'd like to share some development practices that have proven fairly effective.
When using Ray for large-scale data processing on the cloud, the basic idea is: build the minimum parallelizable unit, perform functional and performance testing, and then scale using ray.data (e.g., map, map_batches). When using Spark, it’s slightly different; compared to Ray, Spark is somewhat less flexible but has better abstraction and encapsulation. You can think about data processing from the perspective of the dataset as a whole, and Spark will automatically scale and handle fault tolerance based on the number of partitions and parallelism you set.
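To make this concrete, here is a minimal sketch of that workflow in Ray; the input path, column names, and transforms are illustrative assumptions rather than a real pipeline:

```python
import ray
import pandas as pd

def clean_row(row: dict) -> dict:
    # The minimum parallelizable unit: functionally test this
    # on a handful of records before scaling out.
    row["text"] = row["text"].strip().lower()
    return row

def count_tokens(batch: pd.DataFrame) -> pd.DataFrame:
    # A vectorized batch-level step for better throughput.
    batch["n_tokens"] = batch["text"].str.split().str.len()
    return batch

ray.init()

# Scale the tested units out with ray.data.
ds = ray.data.read_parquet("s3://my-bucket/raw/")  # hypothetical path
ds = ds.map(clean_row)
ds = ds.map_batches(count_tokens, batch_format="pandas", batch_size=1024)
ds.write_parquet("s3://my-bucket/cleaned/")
```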
Author: 木鸟杂记 https://www.qtmuniao.com/2025/06/14/data-processing-on-cloud/ Please indicate the source when reposting
Sharing
Sharing can be roughly divided into two aspects: on one hand, the sharing of environments across multiple development machines; on the other hand, the sharing between development and production environments. Why is there a need for multiple development machines to share environments? First, we always have various reasons for needing to migrate development machines; second, there may be distinctions between CPU and GPU machines; third, colleagues, development machines, and production machines occasionally want to share things through the file system.
If you switch machines for development, you’ve likely encountered some pain points: the environment you painstakingly configured on one machine has to be redone on another. Of course, some might solidify this process into a script that runs on each startup. But even if this process can be standardized and automated, each installation still takes some time and can’t achieve an out-of-the-box experience.
Just like Java’s slogan “Write Once, Run Anywhere,” when configuring development machine environments, can we achieve a “configure once, switch machines freely” effect? Below I share a rather crude practice.
To achieve multi-machine sharing, you can apply to the cloud provider for a large shared cloud disk that supports POSIX semantics and multi-point read/write. Then place all user directories on this cloud disk. When a new development machine is available, simply mount this disk and symlink your user directory to the local user root directory (/home). This solves the data sharing problem. Going further, can the account system also be shared? If accounts across multiple machines are not unified, it means the same data directories need to be authorized differently for multiple accounts on multiple machines. Authorization is manageable, but ownership can only belong to one entity—granting it to one machine’s account means it can’t be given to another machine’s account. The root of this problem lies in the fact that account IDs (uid and gid) on different machines may conflict. Therefore, to truly achieve seamless multi-machine user directory sharing, we also need to implement a shared account system.
Studying the Linux user system reveals that the minimal set of all user information resides in two files:
- Basic user information: /etc/passwd
- User group information: /etc/group
As long as all machines share these two files, they share the account system. But new problems arise: where is this information generated? And what if a user maliciously modifies these files?
For the first problem, borrowing a common pattern from distributed systems, you can designate one development machine as the “master”: all users are created on this machine. The other machines are “followers”: they mount these two files to replace their own system user-information files. Since all UIDs/GIDs are generated by a single master, conflicts in the account system are avoided.
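Before a follower swaps in the shared files, it is worth checking that its existing local accounts don't collide with the master's. A small hypothetical helper (the shared-disk path is an assumption):

```python
def load_passwd(path: str) -> dict[int, str]:
    """Map uid -> username from a passwd-format file."""
    users = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            # Format: name:password:uid:gid:gecos:home:shell
            name, _pw, uid, *_rest = line.split(":")
            users[int(uid)] = name
    return users

master = load_passwd("/mnt/shared/etc/passwd")  # assumed shared-disk path
local = load_passwd("/etc/passwd")

# Report uids present on both sides that name different users.
for uid, name in local.items():
    if uid in master and master[uid] != name:
        print(f"uid {uid} conflict: local={name} master={master[uid]}")
```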
For the second problem, a group administrator creates all accounts centrally. But we can't deny other users sudo privileges (otherwise installing anything locally becomes a hassle), so there is no technical way to fully prevent someone from changing the permissions or contents of these two files; this can only be handled by convention among team members.
In this way, we initially achieve our goal of “configure once, switch machines freely.”
But under this system, there are still some practical issues to solve, such as using tools like conda as much as possible to install all environments and dependencies into your own user directory rather than the system disk of each machine. We’ll elaborate on this in the Isolation section.
Elasticity
This is also one of the cloud’s biggest selling points: on-demand elasticity, pay-as-you-go. But reality is never that rosy.
Cost
From a cost perspective, elastic resources are always much more expensive than reserved instances (annual/monthly subscriptions). Therefore, as cloud provider users, we typically purchase some reserved instances to meet most scenario requirements, and then temporarily use elastic resources when occasionally needed. But if your load doesn’t converge, you face a dilemma: sizing the reserved pool close to the upper demand limit results in a lot of idle waste; sizing it close to the lower demand limit means frequently using elastic resources, which is also expensive. If your resource usage is relatively stable, you can save money consistently; but if it’s quite variable, sorry—the cloud’s elasticity only solves the problem of availability, not cost. For regular usage, in order to reduce costs and increase efficiency, you still need to precisely calculate and manage your resource consumption.
If the number of machines is relatively large, we typically use container technology to pool all machine resources and use Kubernetes to orchestrate and schedule tasks. Thanks to Kubernetes' openness, most computing frameworks can now be deployed on k8s with customized Operators at the push of a button, such as KubeRay and Spark on k8s. This allows resources to be pooled and allocated more effectively, which leads to another classic problem: scheduling. In the scheduling domain, we can consider each pending task along two attributes, urgency and resource amount. Here are a few usage scenarios:
- Preemption: Suppose resources are exhausted and many tasks are queued, but a new urgent task arrives. To obtain resources quickly, it needs a higher priority so it can jump the queue and grab resources as soon as they free up. If the need is even more urgent, high priority plus preemptive scheduling can snatch resources directly from lower-priority tasks. Of course, a better approach is manual intervention: actively kill some (not all) pods of non-urgent tasks to temporarily yield resources. Therefore, in distributed systems, retrying subtasks on different machines is a hard requirement (a minimal sketch of this follows below).
- Deadlock: Can scheduling deadlock? Yes. Consider this scenario: the cluster has 100 CPUs left, and there are two Ray tasks; task A needs 80 cores and task B needs 60. Either task alone could be scheduled and start, but if A grabs 70 cores and B grabs 30, neither can start. Here you need gang scheduling (e.g., volcano), where a task is only granted resources, all at once, after its full requirement can be satisfied. With this guarantee, either A gets scheduled or B does, and the deadlock-like situation above cannot occur.
- Elasticity: When running computing frameworks on k8s, elasticity is typically at the task level: if the pool has more resources, use more; if fewer, run at lower concurrency and more slowly. Spark does this very well. KubeRay's elastic scheduling, by contrast, is currently a joke; we'll see how it evolves. But Spark's good elasticity brings its own problem: if a user allocates a lot of resources to a large task without setting a priority, it can easily eat the entire pool.
Therefore, for large task scheduling on the cloud, a task scheduler that supports gang scheduling, priority scheduling, and has relatively high scheduling efficiency (e.g., Spark spinning up tens of thousands of pods at once without choking) is essential.
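On the fault-tolerance point raised under Preemption: in Ray, retrying a subtask elsewhere is a per-task knob. A minimal sketch (the retry counts and workload are arbitrary):

```python
import ray

ray.init()

# max_retries re-runs the task on another available node if its worker
# dies, e.g. because the pod was preempted or the node was reclaimed.
# retry_exceptions=False keeps application errors from being retried.
@ray.remote(max_retries=3, retry_exceptions=False)
def process_shard(shard_id: int) -> int:
    # ... real work over one shard of the dataset ...
    return shard_id

results = ray.get([process_shard.remote(i) for i in range(128)])
```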
Code
From a code perspective, scaling programs to large scale is not free, though the cost keeps dropping. In ancient times there was Hadoop MapReduce; later, for lower latency and better usability, came Spark and Flink; in the machine learning era, Ray became prevalent. Conceptually, Ray is like one large distributed computer with the classic master-worker architecture: it pools all memory into an object-level KV abstraction called the object store, and pools all CPUs/GPUs as the substrate for concurrent execution, with support for fractional logical allocation.
Ray supports very fine-grained parallelism with maximum flexibility, thus meeting the complex and ever-changing data processing needs of the large model era. But the cost is that some common high-level abstractions in other distributed systems have to be built by yourself in Ray:
- Logical scheduling. Ray schedules against quantifiable logical labels (interested readers can refer to this blog post). The gap between logical and physical resources has to be closed by your own tuning. For example, if an actor/worker logically declares that it needs 10G of memory but physically uses far more, then several such workers packed onto one machine can easily blow up that machine's memory. Overcommitting CPU/GPU may not take a machine down, but any mismatch between logical and physical resources causes either waste or severe contention.
- Upstream-downstream coordination. To better support data processing, Ray wraps the ray.data library on top of Ray Core. But if the production and consumption speeds of upstream and downstream operators are not kept in balance, you easily end up either with upstream accumulation gradually blowing up machine memory, or with downstream idleness wasting resources.
- Non-standard data items. In large-scale data processing you often discover, hours in, that one or two anomalous items have killed the task. Without proper error handling or checkpoint/resume, you pay a painful price. Ray does offer configuration to swallow such errors silently, but the cost is conserved: if you don't deal with bad items, your downstream colleagues will have to (see the sketch after this list).
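A hedged sketch touching the first and third items above: the resource figures are purely logical claims that Ray schedules by but does not enforce, and an anomalous item is quarantined instead of killing the run (all values are illustrative):

```python
import ray

ray.init()

# A logical claim only: Ray packs workers onto nodes by these numbers
# but does not enforce them, so they must roughly match real usage or
# a node's memory can still be oversubscribed.
@ray.remote(num_cpus=1, memory=10 * 1024**3)  # claims 10 GiB
def process_item(item: dict):
    try:
        item["text"] = item["text"].strip().lower()
        return item
    except Exception as exc:
        # Quarantine bad items rather than letting one kill the job;
        # log enough to reproduce the failure later.
        print(f"bad item {item.get('id')}: {exc!r}")
        return None

items = [{"id": 1, "text": " Hello "}, {"id": 2, "text": None}]
results = [r for r in ray.get([process_item.remote(x) for x in items]) if r]
```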
Debugging
In large-scale parallel systems, errors are very common, but reproducing them is usually hard: too much data, too complex an environment, too slow to re-run, and so on. To locate problems quickly when they occur, you need to make the system observable:
- Log collection. Log every critical-path step, using the logs to divide the processing code into “cells,” so that when an error occurs the relevant code segment can be located quickly.
- Metrics statistics. Data processing typically runs in multiple stages. When an OOM occurs, it's likely that some stage has accumulated data and blown up the machine. If you collect statistics on a few key processing metrics, you can quickly find the cause when performance issues arise. Metrics come at the application level and the system level; the above is the application level, in user code. For the system level, go one layer down, e.g., each Ray actor's resource usage over time; then another layer down, e.g., each node's resource usage and network IO over time; and expand outward to the storage systems you depend on, e.g., object-storage read/write traffic by bucket/prefix.
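A minimal application-level pattern combining both points: log the boundaries of each “cell” and record a coarse per-stage timing metric (stage names and the in-process metrics store are illustrative):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")
stage_seconds: dict[str, float] = {}  # crude in-process metrics store

@contextmanager
def stage(name: str):
    # Each stage is one "cell": its boundaries show up in the logs and
    # its wall time lands in the metrics dict for later inspection.
    log.info("stage %s: start", name)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] = time.perf_counter() - t0
        log.info("stage %s: done in %.2fs", name, stage_seconds[name])

with stage("decode"):
    pass  # ... decode inputs ...
with stage("transform"):
    pass  # ... apply transforms ...
```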
With these logs and metrics in place, when a problem occurs you can read off the error time from the logs and then look back at the various metrics around that moment. With enough information gathered, simple cases can be located directly; for more complex cases, if the offending data item was printed, you can use it for a minimal reproduction, and solidify that reproduction as a unit test.
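Once the offending item has been captured, pinning it down as a unit test is cheap. A hypothetical example (the module and transform under test are assumptions):

```python
# test_bad_item.py -- run with pytest
from mypipeline import clean_row  # hypothetical module under test

def test_item_4217_does_not_crash():
    # Copied verbatim from the logs of the failed run.
    bad_item = {"id": 4217, "text": None}
    # After the fix, the transform should skip (return None) rather
    # than raise on this item.
    assert clean_row(bad_item) is None
```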
Another common debugging need is performance. There may be multiple reasons for this:
- The minimum execution unit’s performance is not very good
- Resource allocation is unreasonable during scaling, causing resource contention
- Upstream and downstream production-consumption speeds don’t match
The latter two are actually common problems in distributed systems. Mature computing platforms can solve these issues at the framework level (backpressure), but this somewhat reduces flexibility. Flexibility and ease of use are always two ends of a trade-off, and due to its design philosophy and somewhat immature state, Ray requires us to handle these problems manually.
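One lever Ray Data does expose for the production-consumption mismatch is capping a stage's concurrency; a hedged sketch, assuming a recent Ray release where map_batches accepts a concurrency argument (path and numbers are illustrative):

```python
import ray

def expensive_transform(batch):
    # ... e.g. run a model over the batch ...
    return batch

ray.init()
ds = ray.data.read_parquet("s3://my-bucket/raw/")  # hypothetical path

# Cap the slow stage at a fixed number of concurrent workers so a fast
# upstream producer cannot pile its output up in memory indefinitely.
ds = ds.map_batches(expensive_transform, batch_size=256, concurrency=8)
```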
Additionally, if there’s unexplained slowdown or even hanging, you can directly log into the machine and use some Linux command-line tools to check system resource usage; use tools like py-spy to see what the code is actually doing; and thus determine what problem the system has and where it’s hanging.
Isolation
From a lifecycle perspective, isolation mainly divides into the development phase and the runtime phase.
Development Phase
As previously mentioned, during the development phase multiple developers may share one development machine, and where there is sharing, there is a need for isolation. The most natural solution is to make full use of Linux's own account system. Some teams, for convenience, use root across the board; since everyone has their own development habits, version conflicts easily arise when tools and dependencies get installed later. I therefore suggest avoiding root and installing all software with tools like conda into your own user directory as much as possible. This makes migration across machines easier on one hand, and on the other avoids affecting other users on the same machine, which is exactly the isolation we are discussing here.
Additionally, a single user often faces multiple conflicting environments. Taking Python development as an example, if you install all project dependencies into the user directory, conflicts are inevitable as long as there are multiple projects. Even within the same project, conflicts can occasionally arise when developing across multiple branches. Therefore, tools like virtualenv or conda are also needed to isolate dependencies.
Runtime Phase
With containerization, pooling a batch of physical machines and carving out an isolated logical runtime environment on demand is now trivial. Moreover, with container technology plus a dependency-management toolchain (such as poetry or uv), we no longer need to worry about the dependency drift between development and production environments that used to occur so frequently. That is:
- Use the same pyproject.toml file to manage the project's dependencies
- During development, use poetry/uv with this file to install dependencies
- At runtime, use poetry/uv in the Dockerfile for dependency management
Of course, the above only covers Python dependencies; there will certainly be other OS-level library dependencies like CUDA. But the principle is the same—use the Dockerfile as the single source of truth for project dependencies, using the commands in that Dockerfile for both development environments and image building.
More radically, you can even directly use the Dockerfile in the project to temporarily spin up an environment for development, thus achieving complete unity between development and runtime. You can also think about this problem from another angle:
- During development, the image should be writable
- But in production, the image is read-only
So dependency management during development and runtime is still slightly different.
Summary
Developing in cloud environments changes some paradigms, giving rise to many new development practices. This article only briefly organizes some common problems and practices from the author’s experience, using large-scale data processing on the cloud as an entry point. If it can provide you with a little inspiration, the goal is achieved. Due to limited experience and space, there are certainly areas not covered—welcome to leave comments for discussion and additions.
