How to Virtualize Neural Processing Units for Cloud Platforms?

Modern cloud platforms have deployed neural processing units (NPUs) to meet the increasing demand for machine learning (ML) services. However, the current ways of using NPUs in the cloud suffer from either low resource utilization or poor isolation between multi-tenant applications, due to the lack of system virtualization support for fine-grained resource sharing. The de facto standard for offering NPUs to cloud users is to exclusively assign one NPU device to one user VM via PCIe pass-through, which prevents other users from sharing the same NPU. This inevitably leads to underutilized hardware whenever the single ML workload cannot saturate the NPU. To better utilize NPUs, modern cloud platforms implement limited virtualization support: they enable time-sharing of an NPU device at the task level, along with task preemption to allocate the NPU to prioritized users. However, this coarse-grained time-multiplexing of a single NPU board still suffers from significant resource underutilization, because it supports neither the concurrent execution of multi-tenant DNN workloads nor fine-grained resource allocation on NPU cores. It therefore cannot leverage multi-tenant workloads to improve utilization, and none of these sharing mechanisms provides sufficient security or performance isolation in a multi-tenant cloud environment.

To understand NPU utilization in the cloud, we thoroughly investigate a real Google Cloud TPUv2 and test various DNN workloads. Inside the TPU core, we profile the resource utilization of its main compute components: the systolic array (SA) and the vector unit (VU). We find that most DNN inference workloads significantly underutilize the compute resources of the TPU core. The reason is that many DNN inference workloads place imbalanced demands on SAs and VUs: they are either SA-intensive or VU-intensive. As a result, SA-intensive workloads inevitably underutilize VUs, and vice versa.
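The effect of this imbalance can be illustrated with a toy calculation (the cycle counts below are invented for exposition, not measured TPU numbers): when the SA and VU run in parallel within a core, the busier unit determines the kernel's duration, and the other unit idles for the remainder.

```python
# Toy illustration (invented numbers) of why imbalanced operator demand
# leaves one compute unit mostly idle. The busier unit bounds the kernel's
# runtime; the other unit's utilization is its busy cycles over that bound.

def utilization(sa_cycles, vu_cycles):
    """Return (SA utilization, VU utilization) assuming the two units
    run in parallel and the kernel finishes when the busier unit does."""
    total = max(sa_cycles, vu_cycles)
    return sa_cycles / total, vu_cycles / total

# An SA-intensive kernel: the SA is fully busy, the VU idles ~83% of the time.
sa_util, vu_util = utilization(sa_cycles=900, vu_cycles=150)
print(f"SA: {sa_util:.0%}, VU: {vu_util:.0%}")  # SA: 100%, VU: 17%
```

Concurrently running an SA-intensive kernel alongside a VU-intensive one is exactly the opportunity the next paragraph's design exploits.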

We first developed V10, a hardware-assisted multi-tenant NPU (ISCA'23). It enables fine-grained concurrent execution of ML workloads with an operator scheduler in the NPU, which exploits the idle cycles caused by the imbalanced use of SAs and VUs in an ML kernel and enables concurrent execution of operators from different ML workloads. V10's hardware scheduler offers the flexibility to enforce different priorities to satisfy different service-level agreements (SLAs) for ML services. The scheduler guarantees that each workload receives computation cycles in proportion to its relative priority: it prioritizes the workload that currently receives the lowest share of resources with respect to its priority. With this scheduling policy, V10 dynamically controls the resource allocation to each workload.
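The scheduling policy above can be sketched in a few lines. This is a software approximation of the idea, not V10's actual hardware design; the field names (`priority`, `cycles_used`, `ready_ops`) are illustrative assumptions.

```python
# Hedged sketch of priority-proportional scheduling: among workloads with
# a ready operator, pick the one whose consumed cycles, normalized by its
# priority, are currently the lowest.

def pick_next(workloads):
    """workloads: list of dicts with 'priority', 'cycles_used', 'ready_ops'.
    Returns the workload to schedule next, or None if nothing is ready."""
    ready = [w for w in workloads if w["ready_ops"]]
    if not ready:
        return None
    # Lowest normalized share of compute cycles wins the next slot.
    return min(ready, key=lambda w: w["cycles_used"] / w["priority"])

workloads = [
    {"name": "A", "priority": 2, "cycles_used": 300, "ready_ops": ["matmul"]},
    {"name": "B", "priority": 1, "cycles_used": 200, "ready_ops": ["relu"]},
]
# A's normalized share is 300/2 = 150; B's is 200/1 = 200, so A runs next.
print(pick_next(workloads)["name"])  # A
```

Repeating this decision on every scheduling slot converges each workload's cycle share toward its relative priority.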

Based on the hardware-assisted multi-tenant NPU, we propose a full-stack NPU virtualization solution that provides fine-grained hardware resource management through the vNPU abstraction. For optimal resource efficiency, we must allocate different numbers of SAs and VUs to a DNN workload based on its demands, so the vNPU abstraction must give the user the flexibility to customize a vNPU. To minimize changes to guest software stacks, a vNPU instance mirrors the hierarchy of a physical NPU board, allowing existing compiler/driver stacks to work largely unmodified. Each vNPU is exposed to the guest VM as a PCIe device that resembles an NPU board. The guest NPU driver can query the hierarchy of the emulated vNPU, such as the number of chips and the number of cores per chip, and the guest ML framework can handle data distribution according to the vNPU configuration, exactly as it does for a bare-metal NPU board.
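A minimal sketch of what such a board-shaped vNPU configuration might look like follows. All class and field names are assumptions for exposition; the real interface is defined by the device model the hypervisor emulates.

```python
# Illustrative vNPU configuration mirroring a physical NPU board's
# hierarchy (board -> chips -> cores), with per-core SA/VU counts.
from dataclasses import dataclass, field

@dataclass
class VNPUCore:
    num_sas: int      # systolic arrays allocated to this core
    num_vus: int      # vector units allocated to this core
    sram_mb: int      # on-core SRAM share

@dataclass
class VNPUChip:
    cores: list       # list of VNPUCore
    hbm_gb: int       # HBM share for this chip

@dataclass
class VNPU:
    chips: list = field(default_factory=list)

    def topology(self):
        """Roughly what a guest driver would learn by querying the device."""
        return {
            "num_chips": len(self.chips),
            "cores_per_chip": len(self.chips[0].cores) if self.chips else 0,
        }

vnpu = VNPU(chips=[VNPUChip(cores=[VNPUCore(num_sas=2, num_vus=1,
                                            sram_mb=16)], hbm_gb=8)])
print(vnpu.topology())  # {'num_chips': 1, 'cores_per_chip': 1}
```

Because the structure matches a physical board, the guest framework's data-distribution logic needs no special cases for virtual devices.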

To optimize NPU utilization while guaranteeing service-level objectives (SLOs), we should assign a proper combination of SAs, VUs, and SRAM/HBM to each vNPU instance. The vNPU manager then takes the SA/VU configuration and the requested memory size as input. It aims to fit as many vNPUs as possible onto a physical NPU to maximize the utilization of both compute and memory resources. To maximize compute utilization, the vNPU manager groups vNPUs such that their total number of compute units is as large as possible without exceeding the number of compute units available on an NPU core. We also allow oversubscription of compute units by mapping more vNPUs to the core, so that when one vNPU is idle, another can use its compute units. When multiple vNPU instances contend for the same compute unit, the vNPU scheduler decides which vNPU executes. As multiple application instances share the same NPUs, we enforce runtime isolation between vNPU instances: security isolation to prevent malicious attacks, and performance isolation to provide SLO guarantees.
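The grouping step resembles bin packing with an oversubscription headroom. The sketch below uses a simple first-fit-decreasing heuristic with invented capacity numbers; the actual manager's algorithm and limits may differ.

```python
# First-fit-decreasing sketch of grouping vNPUs onto NPU cores so that the
# sum of requested compute units approaches (and may oversubscribe) each
# core's capacity. CORE_CAPACITY and OVERSUB are illustrative values.

CORE_CAPACITY = 8    # compute units (SAs + VUs) per physical core
OVERSUB = 1.5        # allow mapping up to 1.5x capacity per core

def pack(vnpu_requests):
    """vnpu_requests: list of (name, units). Returns a list of groups,
    one per core, each a list of vNPU names mapped to that core."""
    cores = []       # each entry: [used_units, [names]]
    for name, units in sorted(vnpu_requests, key=lambda r: -r[1]):
        for core in cores:                      # first core with room
            if core[0] + units <= CORE_CAPACITY * OVERSUB:
                core[0] += units
                core[1].append(name)
                break
        else:                                   # no core fits: open a new one
            cores.append([units, [name]])
    return [names for _, names in cores]

# Four vNPUs requesting 6, 5, 4, and 3 units pack onto two cores of
# effective capacity 12: [a, b] (11 units) and [c, d] (7 units).
print(pack([("a", 6), ("b", 5), ("c", 4), ("d", 3)]))
```

With oversubscription, contention for a unit is resolved at runtime by the vNPU scheduler rather than at mapping time.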

NPU virtualization can be integrated into modern cloud platforms. For instance, we can integrate the vNPU manager into the KVM hypervisor, which exposes vNPU instances as mediated PCIe devices to the guest VM. The guest NPU driver also needs modifications to be aware of the para-virtualized interface, such as KVM hypercalls. As KVM supports hot-plugging of PCIe devices, it provides the system support for vNPU allocation and (re)mapping. We can also extend the Kubernetes scheduler to implement vNPU mapping policies: the kube-scheduler assigns a score to each available NPU node, ranks all the nodes, and places the VM on the node with the highest score. We can extend this scoring mechanism to rank nodes by their remaining NPU hardware resources, such as the number of free cores, compute units, and memory.
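A scoring extension of this kind could look like the sketch below. The weights, field names, and 0-100 scale are assumptions chosen to mimic kube-scheduler's scoring convention, not a real plugin implementation.

```python
# Hedged sketch of an NPU-aware node-scoring function in the style of a
# kube-scheduler Score extension: nodes with more free NPU resources
# (cores, compute units, memory) score higher. Weights are invented.

def score_node(node):
    """node: dict of free/total NPU resources. Returns a 0-100 score."""
    w_cores, w_units, w_mem = 0.4, 0.4, 0.2
    s = (w_cores * node["free_cores"] / node["total_cores"]
         + w_units * node["free_units"] / node["total_units"]
         + w_mem * node["free_mem_gb"] / node["total_mem_gb"])
    return round(100 * s)

nodes = {
    "npu-node-1": {"free_cores": 2, "total_cores": 4,
                   "free_units": 8, "total_units": 16,
                   "free_mem_gb": 8, "total_mem_gb": 32},
    "npu-node-2": {"free_cores": 4, "total_cores": 4,
                   "free_units": 16, "total_units": 16,
                   "free_mem_gb": 32, "total_mem_gb": 32},
}
best = max(nodes, key=lambda n: score_node(nodes[n]))
print(best)  # npu-node-2, the fully idle node
```

A real deployment would fold such a function into the scheduler's existing scoring pipeline so NPU availability is weighed alongside CPU and memory.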

To learn more about NPU virtualization, please check out our recent publications listed below.