The IO Subsystem for the Modern, GPU-Accelerated Data Center
The new unit of computing is the data center and at its core are NVIDIA GPUs and NVIDIA networks. Accelerated computing requires accelerated input/output (IO) to maximize performance. NVIDIA Magnum IO™, the IO subsystem of the modern data center, is the architecture for parallel, asynchronous, and intelligent data center IO, maximizing storage and network IO performance for multi-GPU, multi-node acceleration.
Magnum IO also introduces the enhancements needed to accelerate IO and communications in multi-tenant data centers, known as Magnum IO for Cloud-Native Supercomputing.
Magnum IO GPUDirect over an InfiniBand network enables Verizon’s breakthrough distributed volumetric video architecture. By placing their technology into edge computing centers, located at sports centers around the United States and in Verizon facilities, they’re able to bring 3D experiences to media and serve up new options for putting you in the game.
Bypasses the CPU to enable direct IO among GPU memory, network, and storage, resulting in 10X higher bandwidth.
Relieves CPU contention to create a more balanced GPU-accelerated system that delivers peak IO bandwidth, requiring up to 10X fewer CPU cores and cutting CPU utilization by 30X.
Provides optimized implementation for current and future platforms, whether data transfers are fine-grained and latency-sensitive, coarse-grained and bandwidth-sensitive, or collectives.
Magnum IO utilizes storage IO, network IO, in-network compute, and IO management to simplify and speed up data movement, access, and management for multi-GPU, multi-node systems. Magnum IO supports NVIDIA CUDA-X™ libraries and makes the best use of a range of NVIDIA GPU and NVIDIA networking hardware topologies to achieve optimal throughput and low latency.
[Developer Blog] Magnum IO - Accelerating IO in the Modern Data Center
In multi-node, multi-GPU systems, slow CPU single-thread performance sits in the critical path of data access from local or remote storage devices. With storage IO acceleration, the GPU bypasses the CPU and system memory and accesses remote storage through 8X 200 Gb/s NICs, achieving up to 1.6 Tb/s of raw storage bandwidth.
Technologies Included:
NVIDIA NVLink® fabric and RDMA-based network IO acceleration reduces IO overhead, bypassing the CPU and enabling direct GPU-to-GPU data transfers at line rate.
In-network computing delivers processing within the network, eliminating the latency introduced by traversing to the endpoints and any hops along the way. Data processing units (DPUs) introduce software-defined, network hardware-accelerated computing, including pre-configured data processing engines and programmable engines.
To deliver IO optimizations across compute, network, and storage, users need advanced telemetry and deep troubleshooting techniques. Magnum IO management platforms empower research and industrial data center operators to efficiently provision, monitor, manage, and preventatively maintain the modern data center fabric.
Magnum IO interfaces with NVIDIA CUDA-X high performance computing (HPC) and artificial intelligence (AI) libraries to speed up IO for a broad range of use cases—from AI to scientific visualization.
Today, data science and machine learning (ML) are the world's largest compute segments. Modest improvements in the accuracy of predictive ML models can translate into billions of dollars to the bottom line. To enhance accuracy, the RAPIDS™ Accelerator library has a built-in accelerated Apache Spark shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. Combined with NVIDIA networking, Magnum IO software, GPU-accelerated Spark 3.0, and RAPIDS, the NVIDIA data center platform is uniquely positioned to speed up these huge workloads at unprecedented levels of performance and efficiency.
Adobe Achieves 7X Speedup in Model Training with Spark 3.0 on Databricks for a 90% Cost Savings
To unlock next-generation discoveries, scientists rely on simulation to better understand complex molecules for drug discovery, physics for new sources of energy, and atmospheric data to better predict extreme weather patterns. Magnum IO exposes hardware-level acceleration engines and smart offloads, such as RDMA, GPUDirect, and NVIDIA SHARP, while leveraging the 400 Gb/s bandwidth and ultra-low latency of NVIDIA Quantum-2 InfiniBand networking.
In multi-tenant environments, user applications can be subject to indiscriminate interference from neighboring application traffic, often without being aware of it. Magnum IO, on the latest NVIDIA Quantum-2 InfiniBand platform, features new and improved capabilities for mitigating that negative impact on a user’s performance, delivering optimal results as well as the most efficient high performance computing (HPC) and machine learning deployments at any scale.
Largest Interactive Volume Visualization - 150TB NASA Mars Lander Simulation
AI models continue to explode in complexity as they take on next-level challenges, such as conversational AI and deep recommender systems. Conversational AI models like NVIDIA’s Megatron-BERT require over 3,000X more compute to train than image classification models like ResNet-50. Enabling researchers to keep pushing the envelope of what’s possible with AI requires powerful performance and massive scalability. The combination of HDR 200 Gb/s InfiniBand networking and the Magnum IO software stack delivers efficient scalability to thousands of GPUs in a single cluster.
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
Enables IO transfers directly to and from GPU memory, removing the expensive data-path bottleneck through the CPU and system memory. Avoiding the extra copy through system memory cuts latency overhead, which matters most for smaller transfers, and relieves the CPU utilization bottleneck. A minimal usage sketch follows the links below.
Read Blog: GPUDirect Storage: A Direct Path Between Storage and GPU Memory
Watch Webinar: NVIDIA GPUDirect Storage: Accelerating the Data Path to the GPU
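As an illustration of the programming model, here is a minimal sketch of a GPUDirect Storage read using the cuFile API from libcufile. Error checking is omitted and the file path is a placeholder; this is a sketch, not a complete application.

```c
// Minimal sketch of a GPUDirect Storage read with the cuFile API (libcufile).
// Error checking is omitted; the file path is a placeholder.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void) {
    const size_t size = 1 << 20;             // 1 MiB read, for illustration
    void *dev_buf = NULL;

    cuFileDriverOpen();                       // bring up the GDS driver
    cudaMalloc(&dev_buf, size);               // destination buffer in GPU memory
    cuFileBufRegister(dev_buf, size, 0);      // register the GPU buffer for DMA

    int fd = open("/path/to/data.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    // DMA directly from storage into GPU memory, bypassing a CPU bounce buffer.
    cuFileRead(handle, dev_buf, size, /*file_offset=*/0, /*dev_offset=*/0);

    cuFileHandleDeregister(handle);
    close(fd);
    cuFileBufDeregister(dev_buf);
    cudaFree(dev_buf);
    cuFileDriverClose();
    return 0;
}
```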
Logically presents networked storage, such as NVMe over Fabrics (NVMe-oF), as a local NVMe drive, allowing the host OS or hypervisor to use a standard NVMe driver instead of a remote networking storage protocol.
Set of libraries and optimized NIC drivers for fast packet processing in user space, providing a framework and common API for high-speed networking applications.
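Assuming this entry describes the Data Plane Development Kit (DPDK), a minimal receive-loop sketch looks like the following; port 0 is assumed, and error handling and packet processing are elided.

```c
// Hedged DPDK sketch: poll packets from port 0 in user space.
// Assumes one NIC bound to a DPDK driver; error handling omitted.
#include <string.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

int main(int argc, char **argv) {
    rte_eal_init(argc, argv);                 // claim huge pages, probe NICs
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbufs", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    struct rte_eth_conf conf;
    memset(&conf, 0, sizeof(conf));
    rte_eth_dev_configure(0, 1, 1, &conf);    // port 0: one RX, one TX queue
    rte_eth_rx_queue_setup(0, 0, 1024, rte_eth_dev_socket_id(0), NULL, pool);
    rte_eth_tx_queue_setup(0, 0, 1024, rte_eth_dev_socket_id(0), NULL);
    rte_eth_dev_start(0);

    struct rte_mbuf *bufs[BURST];
    for (;;) {
        // Poll the NIC from user space: no interrupts, no kernel network stack.
        uint16_t n = rte_eth_rx_burst(0, 0, bufs, BURST);
        for (uint16_t i = 0; i < n; ++i)
            rte_pktmbuf_free(bufs[i]);        // "process" and drop, for illustration
    }
    return 0;
}
```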
Provides access for the network adapter to read or write memory data buffers directly in peer devices. Allows RDMA-based applications to use the peer device computing power without the need to copy data through the host memory.
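For illustration, a hedged sketch of the key application-side step: registering a GPU buffer with an RDMA-capable adapter through the verbs API. It assumes the GPUDirect RDMA peer-memory kernel module is loaded so the RDMA stack can pin and map the device pointer; queue-pair setup and connection establishment are omitted.

```c
// Hedged sketch: registering GPU memory for GPUDirect RDMA via libibverbs.
// Assumes a peer-memory module (e.g., nvidia-peermem) is loaded.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(devs[0]);   // first RDMA device
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *gpu_buf = NULL;
    const size_t size = 1 << 20;
    cudaMalloc(&gpu_buf, size);              // buffer lives in GPU memory

    // Once registered, the NIC can DMA to/from GPU memory without staging
    // the data in host memory.
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered GPU buffer, rkey=0x%x\n", mr ? mr->rkey : 0);

    if (mr) ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```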
Open-source, production-grade communication framework for data-centric and high performance applications. Includes a low-level interface that exposes fundamental network operations supported by underlying hardware. Package includes: MPI and SHMEM libraries, Unified Communication X (UCX), NVIDIA SHARP, KNEM, and standard MPI benchmarks.
Brings topology-aware communications primitives through tight synchronization between the communicating processors.
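Assuming this entry refers to collective communication with NCCL, a minimal single-process, multi-GPU all-reduce sketch is shown below; buffers are left uninitialized for brevity, and NCCL selects NVLink, PCIe, or InfiniBand paths based on the detected topology.

```c
// Minimal single-process, multi-GPU NCCL all-reduce sketch (illustrative only).
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                    // fixed-size arrays below

    ncclComm_t comms[8];
    int devs[8];
    float *sendbuf[8], *recvbuf[8];
    cudaStream_t streams[8];
    const size_t count = 1 << 20;

    for (int i = 0; i < ndev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, devs);        // one communicator per local GPU

    // Launch the all-reduce on every GPU inside a group call.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```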
Offers a parallel programming interface based on the OpenSHMEM standard, creating a global address space for data spanning the memory of multiple GPUs across multiple servers. A minimal usage sketch follows the blog link below.
Read Blog: Accelerating NVSHMEM 2.0 Team-Based Collectives Using NCCL
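The sketch below follows the ring-shift example from the NVSHMEM documentation: each PE allocates a symmetric buffer and, from device code, writes its PE ID into that buffer on its right neighbor.

```c
// NVSHMEM ring-shift sketch: each PE writes its ID to its right neighbor.
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);   // one-sided put into the peer's copy
}

int main(void) {
    nvshmem_init();
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);                 // one GPU per PE on the node

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *destination = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);

    int msg = -1;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```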
Open-source, production-grade communication framework for data-centric and high performance applications. Includes a low-level interface that exposes fundamental network operations supported by underlying hardware, as well as a high-level interface for constructing protocols found in MPI, OpenSHMEM, PGAS, Spark, and other high performance and deep learning applications.
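To ground the interface description, here is a hedged sketch of bringing up a UCX context and worker through the UCP layer; endpoint creation, address exchange, and actual transfers are omitted, so this only shows initialization.

```c
// Hedged sketch: initializing UCX's UCP layer (context + worker).
#include <ucp/api/ucp.h>

int main(void) {
    ucp_config_t *config;
    ucp_config_read(NULL, NULL, &config);     // read UCX_* environment settings

    ucp_params_t params = {0};
    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA;  // tag matching + RMA

    ucp_context_h context;
    ucp_init(&params, config, &context);
    ucp_config_release(config);

    ucp_worker_params_t wparams = {0};
    wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
    wparams.thread_mode = UCS_THREAD_MODE_SINGLE;

    ucp_worker_h worker;
    ucp_worker_create(context, &wparams, &worker);

    /* ... create endpoints, post sends/receives ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```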
The set of features that accelerates switch and packet processing. ASAP2 offloads data steering and security from the CPU into the network, boosting efficiency, adding control, and isolating these functions from malicious applications.
The NVIDIA® BlueField® DPU offloads critical network, security, and storage tasks from the CPU, serving as the best solution for addressing performance, networking efficiency, and cybersecurity concerns in the modern data center.
Reduces MPI communication time and improves overlap between computation and communication. Employed by NVIDIA Mellanox InfiniBand adapters to offload the processing of MPI messages from the host machine onto the network card, enabling zero-copy handling of MPI messages.
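The application-side pattern that benefits is standard nonblocking MPI: the adapter can match and deliver the message while the host computes. A minimal, hedged sketch of that overlap pattern (plain MPI, no vendor-specific calls; run with an even number of ranks):

```c
// Minimal MPI sketch of compute/communication overlap. Hardware tag matching
// lets the adapter progress the exchange while do_local_work() runs on the CPU.
#include <mpi.h>

enum { N = 1 << 20 };

static void do_local_work(double *buf, int n) {
    for (int i = 0; i < n; ++i)
        buf[i] = buf[i] * 0.5 + 1.0;          // stand-in for application compute
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double sendbuf[N], recvbuf[N], work[N];
    int peer = rank ^ 1;                      // pair ranks 0<->1, 2<->3, ...

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[1]);

    do_local_work(work, N);                   // overlap compute with the transfer

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}
```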
Improves upon the performance of data reduction and aggregation algorithms, such as in MPI, SHMEM, NCCL, and others, by offloading these algorithms from the GPU or the CPU to the network switching elements, and eliminating the need to send data multiple times between endpoints. SHARP integration boosts NCCL performance by 4X and demonstrates a 7X performance increase for MPI collectives latency.
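The offload is transparent to application code: a standard all-reduce like the sketch below is unchanged, and SHARP-capable switches perform the reduction in-network when the MPI or NCCL library deployment is configured to use it (a deployment assumption, not shown here).

```c
// Standard MPI all-reduce. With SHARP enabled in the communication library,
// the sum is computed by the switching elements instead of by repeatedly
// sending data between endpoints; the application code does not change.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;              // each rank contributes its rank ID
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %f\n", global);

    MPI_Finalize();
    return 0;
}
```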
Enables network orchestration, provisioning, configuration management, and task management for Ethernet solutions, with in-depth visibility into fabric health and traffic utilization.
Provides debugging, monitoring, management, and efficient provisioning of InfiniBand fabrics in the data center. Supports real-time network telemetry with AI-powered cyber intelligence and analytics.