How to Make Unified GPU Memory/Storage Architecture Truly Usable for AI?

As we use GPUs to scale deep learning workloads over large data sets, we face the well-known memory wall. Although GPUs provide ever-increasing parallelism, their on-board memory capacity remains limited, due to space and power constraints as well as DRAM scaling issues. At the same time, deep neural network (DNN) models, which have become the killer applications of GPUs, demand a growing amount of memory to achieve improved training efficiency and scalability. For instance, the largest DNNs today typically work with terabytes of data (e.g., tensors), whereas a state-of-the-art GPU device has only tens of gigabytes of on-board DRAM. This gap will only widen if not addressed properly.

To overcome the GPU memory wall, a promising and practical approach is to expand the limited GPU memory with flash memory, which provides larger capacity at a lower cost. Following this approach, a few architectural solutions have been developed in both academia and industry. For example, ZnG directly replaces the GPU on-board DRAM with low-latency flash chips, and AMD's SSG integrates flash-based solid-state drives (SSDs) into the GPU board. Unfortunately, the limited bandwidth of flash chips becomes the performance bottleneck, especially in comparison with the high-bandwidth memory in GPUs. An alternative approach is to use an external (off-board) flash-based SSD to back the GPU on-board memory, forming a heterogeneous memory and storage system. For example, GPU vendors have been connecting GPUs with SSDs via PCIe links to bypass the host CPU, allowing direct data movement between the SSD and the GPU.

However, these existing solutions still suffer from suboptimal performance. Specifically, the end-to-end application performance remains bottlenecked by the SSD bandwidth. Although we can scale up the SSD bandwidth by stacking multiple SSD devices or flash chips, the aggregated bandwidth is still capped by the PCIe interface. Even if we upgrade standard PCIe to recent interconnect techniques such as NVLink, the data transfer bandwidth is still much lower than the on-chip memory bandwidth of GPUs. To tolerate slow flash accesses, developers have to carefully manage data across the heterogeneous memory devices to exploit the data locality of workloads. This inevitably complicates GPU memory management and hurts development productivity.

Ideally, we wish to transparently expand the GPU memory using low-cost flash memory, while achieving performance similar to that of a GPU with unlimited on-chip memory. According to our characterization study of diverse DNN models, we believe this is feasible. In particular, our study reveals that (1) only a small portion (less than 10%) of tensors are active in each iteration of deep learning training, and (2) a majority of inactive tensors remain inactive for a long period of time, which offers ample opportunity to move tensor data across heterogeneous memory devices. Therefore, if we can intelligently move inactive tensors from the fast GPU memory to the slow memories (i.e., host memory and flash memory), we can not only improve the utilization of precious fast GPU memory but also hide the data access overheads introduced by the slow memories.
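To make the intuition concrete, here is a minimal sketch (not the G10 implementation; all names, time units, and thresholds are illustrative assumptions) of how one might identify migration candidates from per-tensor activity traces: a tensor whose inactive gap between two uses is long enough can be swapped out and brought back without stalling the GPU.

```python
# Hypothetical sketch: given each tensor's active time steps within one
# training iteration, find tensors with an inactive gap long enough to
# hide a round-trip migration to slower memory.

def inactive_gaps(active_steps, horizon):
    """Return (start, length) of each inactive interval in [0, horizon)."""
    gaps, prev = [], -1
    for s in sorted(active_steps) + [horizon]:
        if s - prev > 1:
            gaps.append((prev + 1, s - prev - 1))
        prev = s
    return gaps

def migration_candidates(tensors, horizon, min_gap):
    """Tensors that have at least one inactive gap of >= min_gap steps."""
    out = {}
    for name, steps in tensors.items():
        long_gaps = [g for g in inactive_gaps(steps, horizon) if g[1] >= min_gap]
        if long_gaps:
            out[name] = long_gaps
    return out

# Toy iteration of 100 steps: an early-layer activation is used in the
# forward pass and again near the end of the backward pass, while a
# weight tensor stays active throughout.
tensors = {
    "conv1.act": {0, 1, 98, 99},
    "fc.weight": set(range(100)),
}
cands = migration_candidates(tensors, horizon=100, min_gap=50)
# "conv1.act" qualifies (inactive for steps 2..97); "fc.weight" does not.
```

This mirrors the observation above: the long-lived inactive intervals of tensors like `conv1.act` are exactly the windows in which data can travel to and from slow memory for free.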

We present a unified GPU memory and storage architecture named G10, driven by the fact that the tensor behaviors of deep learning workloads are highly predictable. G10 integrates the host memory, GPU memory, and flash memory into a unified memory space, scaling the GPU memory capacity while enabling transparent data placement and migration. On top of this unified architecture, G10 uses compiler techniques to characterize the tensor behaviors of a deep learning workload, so it can schedule data placements and migrations in advance, taking into account the available bandwidth of flash memory and host memory. This cooperation between deep learning compilers and the unified memory architecture enables G10 to hide data transfer overheads transparently.
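The key to hiding transfer overheads is issuing each swap-in early enough that it completes before the tensor is needed. A back-of-the-envelope sketch (illustrative numbers only, not figures from the paper) of that scheduling constraint:

```python
# Hypothetical sketch of compiler-planned prefetching: knowing when a
# swapped-out tensor is next used and the effective flash-to-GPU
# bandwidth, compute the latest time the swap-in can be issued so the
# transfer finishes just in time.

def prefetch_issue_time(next_use_s, tensor_bytes, bandwidth_bytes_per_s):
    """Latest start time (seconds) for the transfer to finish by next_use_s."""
    transfer_s = tensor_bytes / bandwidth_bytes_per_s
    return max(0.0, next_use_s - transfer_s)

# Example: a 2 GiB tensor over an assumed ~6 GiB/s effective flash path
# takes ~0.33 s to transfer, so a use at t = 1.0 s requires the prefetch
# to be issued no later than t ~= 0.67 s.
t = prefetch_issue_time(1.0, 2 * 2**30, 6 * 2**30)
```

Because the compiler knows the whole iteration's tensor schedule ahead of time, it can solve this constraint for every swapped-out tensor offline, which is what lets slow flash accesses overlap with GPU computation instead of stalling it.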

We presented G10 at MICRO'23; the full paper is available below: