Beyond GPU Memory Limits with Unified Memory on Pascal
Juliana Delacruz edited this page 3 weeks ago


Modern computer architectures have a hierarchy of memories of varying size and performance. GPU architectures are approaching a terabyte per second of memory bandwidth that, coupled with high-throughput computational cores, creates an ideal system for data-intensive tasks. However, fast memory is expensive, and modern applications striving to solve larger and larger problems can be limited by GPU memory capacity. Since the capacity of GPU memory is significantly lower than that of system memory, it creates a barrier for developers accustomed to programming just one memory space. With the legacy GPU programming model there is no easy way to "just run" your application when you're oversubscribing GPU memory. Even if your dataset is only slightly bigger than the available capacity, you would still need to manage the active working set in GPU memory. Unified Memory is a much more intelligent memory management system that simplifies GPU development by providing a single memory space directly accessible by all GPUs and CPUs in the system, with automatic page migration for data locality.
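To make the contrast concrete, here is a minimal sketch (function names are hypothetical, error checking omitted) of the legacy explicit-copy model next to a Unified Memory allocation:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the body is irrelevant to the comparison.
__global__ void kernel(float *x, int n) { }

void legacy_model(float *host, int n) {
    // Legacy model: two pointers, two explicit copies, and the device
    // buffer must fit in GPU memory in its entirety.
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

void unified_model(int n) {
    // Unified Memory: one pointer valid in both CPU code and GPU
    // kernels; the driver migrates pages automatically.
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    kernel<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    // Host code can read and write x directly here.
    cudaFree(x);
}
```

The second version needs no separate host buffer and no copy calls, which is the productivity win the rest of this post builds on.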


Page migration allows the accessing processor to benefit from L2 caching and the lower latency of local memory. Moreover, migrating pages to GPU memory ensures that GPU kernels take advantage of the very high bandwidth of GPU memory (e.g. 720 GB/s on a Tesla P100). And page migration is completely invisible to the developer: the system automatically manages all data movement for you. Sounds great, right? With the Pascal GPU architecture Unified Memory is even more powerful, thanks to Pascal's larger virtual memory address space and Page Migration Engine, which enable true virtual memory demand paging. It's also worth noting that manually managing memory movement is error-prone, which hurts productivity and delays the day when you can finally run your whole code on the GPU to see those great speedups that others are bragging about. Developers can spend hours debugging their codes because of memory coherency issues. Unified Memory brings big benefits for developer productivity. In this post I will show you how Pascal can enable applications to run out of the box with larger memory footprints and achieve great baseline performance.


For a moment you can completely forget about GPU memory limitations while developing your code. Unified Memory was introduced in 2014 with CUDA 6 and the Kepler architecture. This relatively new programming model allowed GPU applications to use a single pointer in both CPU functions and GPU kernels, which greatly simplified memory management. CUDA 8 and the Pascal architecture significantly improve Unified Memory functionality by adding 49-bit virtual addressing and on-demand page migration. The large 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration Engine allows GPU threads to fault on non-resident memory accesses, so the system can migrate pages from anywhere in the system to the GPU's memory on demand for efficient processing. In other words, Unified Memory transparently enables out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It "just works" without any modifications to the application.
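A sketch of what "just works" means in practice: on Pascal, a managed allocation can exceed physical GPU memory, and pages fault in as the kernel touches them. The 24 GB size below is a hypothetical value chosen to oversubscribe a 16 GB Tesla P100; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch every element so the GPU faults each page in on demand.
__global__ void inc(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    // Hypothetical: 24 GB of floats, larger than a 16 GB P100.
    size_t bytes = 24ull * 1024 * 1024 * 1024;
    size_t n = bytes / sizeof(float);

    float *data;
    // A single allocation visible to CPU and GPU; on Pascal it is not
    // limited by GPU memory capacity.
    cudaMallocManaged(&data, bytes);

    for (size_t i = 0; i < n; i++) data[i] = 0.0f;  // CPU populates pages

    // Kernel accesses trigger page faults; the driver migrates (and,
    // under memory pressure, evicts) pages automatically.
    inc<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```

On pre-Pascal GPUs the same cudaMallocManaged() call would fail once the requested size exceeds available GPU memory, since Kepler and Maxwell cannot demand-page.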


CUDA 8 also provides new ways to optimize data locality by offering hints to the runtime, so it is still possible to take full control over data migrations. Today it's hard to find a high-performance workstation with just one GPU. Two-, four- and eight-GPU systems are becoming common in workstations as well as large supercomputers. The NVIDIA DGX-1 is one example of a high-performance integrated system for deep learning with eight Tesla P100 GPUs. If you thought it was difficult to manually manage data between one CPU and one GPU, now you have eight GPU memory spaces to juggle between. Unified Memory is crucial for such systems, and it enables more seamless code development on multi-GPU nodes. Whenever a particular GPU touches data managed by Unified Memory, this data may migrate to the local memory of that processor, or the driver can establish direct access over the available interconnect (PCIe or NVLink).
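The hint mechanisms mentioned above are cudaMemAdvise() and cudaMemPrefetchAsync(), both introduced in CUDA 8. A minimal sketch (device 0 and the sizes are assumptions, error checking omitted) of prefetching data to the GPU before a launch and back to the CPU afterwards:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    size_t n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; i++) x[i] = 1.0f;

    int device = 0;  // assume the work runs on GPU 0
    // Hint: keep the physical pages resident on GPU 0 when possible.
    cudaMemAdvise(x, n * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, device);
    // Move the pages to GPU 0 ahead of the launch instead of paying
    // page-fault latency on first touch inside the kernel.
    cudaMemPrefetchAsync(x, n * sizeof(float), device, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);

    // Prefetch back to the CPU before the host reads the results.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

Other advice values such as cudaMemAdviseSetReadMostly (replicate read-only pages on each accessing processor) and cudaMemAdviseSetAccessedBy (map pages for remote access instead of migrating them) cover the multi-GPU cases described above.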