The Role of Networking in AI Workloads: An Overview

In my last blog post, I gave a high-level overview of artificial intelligence to clear up some of the confusion around the topic. As a network engineer, today I want to talk about the implications it has on the network. We will go over key topics such as GPU-to-GPU communication, NCCL, InfiniBand, RDMA and RoCE, technologies the industry has adopted for HPC/AI networks.

What are the requirements for a network carrying ML/AI workloads?

The main requirements for ML/AI workloads are:

  • High bandwidth: Amazon's P5 instance, for example, supports up to 3,200 Gbps of aggregated network bandwidth. The network also needs to be non-blocking, meaning it can handle the maximum possible bandwidth across all of its ports simultaneously, without any data transfer hindering or blocking the others (a back-of-envelope sketch of the volumes involved follows this list).
  • Low latency: Low latency is crucial for HPC workloads, especially AI model training, because it ensures fast communication between distributed computing resources. This rapid exchange of data between nodes or GPUs speeds up the iterative training process and reduces the time to convergence for complex models. Latencies vary, but are typically on the order of microseconds (µs) to milliseconds (ms).
  • Lossless network: Even a small amount of packet loss during training can noticeably extend the time required to train an AI model. This is why mechanisms like TCP congestion control are a poor fit for these networks: TCP has to encounter loss before it throttles its transmission rate.
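
To get a feel for why bandwidth matters so much, here is a rough back-of-envelope sketch in Python. The numbers (a 175-billion-parameter model, fp16 gradients, a 3,200 Gbps node) are illustrative assumptions on my part, and real training overlaps communication with computation, but the order of magnitude is telling:

```python
# Back-of-envelope: how long does one full gradient synchronization take?
# Assumptions (illustrative only): 175e9 parameters, fp16 gradients (2 bytes each),
# and a node with 3,200 Gbps of aggregate network bandwidth.

params = 175e9                                # number of model parameters
bytes_per_param = 2                           # fp16
grad_bytes = params * bytes_per_param         # ~350 GB of gradients per sync step
link_bytes_per_s = 3200e9 / 8                 # 3,200 Gbps expressed in bytes per second

print(f"Gradient payload: {grad_bytes / 1e9:.0f} GB")
print(f"Transfer time   : {grad_bytes / link_bytes_per_s:.2f} s at full line rate")
```

Even under these generous assumptions, a single synchronization step keeps the link busy for the better part of a second, and that happens at every iteration.
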
Let's see what creates the conditions for these requirements and which technologies help achieve them.

GPU-to-GPU Communication

One key thing to understand about machine learning workloads is that the computations performed by these complex algorithms are powered by GPUs, in contrast to traditional workloads that are powered by CPUs. Training a model like ChatGPT, for example, requires substantial amounts of compute power in order to keep the training time reasonable. The training time depends on multiple aspects:
  • Model complexity: For reference, ChatGPT has around 175 billion parameters. Parameters are the small knobs that the model adjusts and fine-tunes across each computation. For example, in a model trying to distinguish pictures of cats from dogs, one neuron might detect the presence of shapes relevant to cats versus dogs, while another might recognize the texture of fur. Adjusting each parameter over and over again allows the model to learn from data and improve its predictions to the point where they become very accurate. More about neural networks in my previous blog post.
  • Dataset size : The larger the dataset used for training, the more time required to process all the data.
  • Hardware capabilities: The speed and memory of each GPU largely influence the training time. That's why Nvidia is making bank today :) They are leading the industry in AI-optimized GPU technologies and everything related to them.
  • Parallelism and distribution: As machine learning models grow in complexity and size, we can't rely on training a model with just one GPU, so we need to scale the number of GPUs based on the computational demands. Because GPUs have a very large number of cores, we can run the required computations in parallel. However, this parallel processing comes with the challenge of keeping the data and learning processes synchronized and distributed across all GPUs involved in the training job. These communications between GPUs are orchestrated by libraries like NCCL (NVIDIA Collective Communications Library) and MPI (Message Passing Interface), which are described below; a minimal setup sketch follows this list.
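
To make the orchestration part a bit more concrete, here is a minimal sketch of how a distributed PyTorch job joins a training cluster. It assumes PyTorch with a CUDA build and a launcher (torchrun, for example) that sets the usual RANK/WORLD_SIZE/LOCAL_RANK environment variables; it is an illustration of the moving parts, not a production recipe.

```python
import os
import torch
import torch.distributed as dist

def join_training_cluster():
    # Each GPU is driven by one process; the launcher (e.g. torchrun) sets
    # RANK (this process's id) and WORLD_SIZE (total number of processes).
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # "nccl" selects NVIDIA's collective library as the communication backend,
    # so collective operations run GPU-to-GPU over the fastest available path.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Pin this process to one local GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = join_training_cluster()
    print(f"rank {rank}/{world_size} is using GPU {local_rank}")
    dist.destroy_process_group()
```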

Why are GPUs required for ML workloads?

One key difference between GPUs and CPUs is that CPUs have a smaller number of cores optimized for sequential processing, while GPUs have a large number of cores, which makes them ideal for parallel computing. That parallelism is crucial for the performance of the complex deep learning algorithms used to train models like ChatGPT.
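
A quick way to see this difference is a large matrix multiplication, the bread-and-butter operation of neural networks. This is a rough sketch assuming PyTorch with a CUDA-capable GPU; the exact numbers depend entirely on your hardware.

```python
import time
import torch

n = 8192
a_cpu, b_cpu = torch.randn(n, n), torch.randn(n, n)

start = time.time()
a_cpu @ b_cpu                        # runs on a handful of CPU cores
cpu_time = time.time() - start

a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()
torch.cuda.synchronize()
start = time.time()
a_gpu @ b_gpu                        # runs on thousands of GPU cores in parallel
torch.cuda.synchronize()             # wait for the asynchronous GPU kernel to finish
gpu_time = time.time() - start

print(f"CPU: {cpu_time:.2f}s  GPU: {gpu_time:.3f}s")
```
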
Because of the need for massive amounts of compute, there are two well-known design patterns in use:

  • Vertical scaling: increasing the capacity of a single resource, such as a server, so it can handle more load.
  • Horizontal scaling: increasing the capacity of the whole system by adding multiple servers to the same logical system or cluster.

Horizontal scaling is where the network comes into the picture. As John Gage, chief scientist and master geek at Sun Microsystems, once said: "The network is the computer." There is only so much you can achieve by scaling vertically; the true power of computing comes from interconnecting multiple individual systems across the network and orchestrating them to behave as one. Using the network to train large and complex models in a distributed way requires specific design patterns, as well as taking into account the nature of these workloads.

NCCL: The Backbone of Multi-GPU Training

As mentioned above, ML models harness GPUs and their large number of cores to parallelize operations and achieve faster training times. Because the GPUs each run their own tasks, they need to synchronize and aggregate data at certain points in time to ensure data integrity. This can be achieved using NCCL for Nvidia GPUs (other GPU vendors have equivalent libraries, and MPI is a more general option). NCCL, which stands for NVIDIA Collective Communications Library, enables deep learning frameworks such as PyTorch or TensorFlow to coordinate communication between Nvidia GPUs in a distributed setup as quickly and efficiently as possible. These communication patterns, also known as collective communication operations, are essential in parallel and distributed computing. Some of the key collective operations include the following (a short code sketch follows the list):
  • AllReduce: Combines data from all participating GPUs into a single result and copies it back to each GPU. Useful for tasks like calculating the total loss during training or summarizing statistics across all data.
  • Reduce: Combines data from all participating GPUs into a single result and sends it to one GPU.
  • Broadcast: Sends data from one GPU to all other participating GPUs. Ideal for sharing initial model parameters or pushing important updates to every GPU.
  • Gather: Collects data from all GPUs onto one GPU, which ends up with a copy of the data from every other GPU.
  • AllGather: Collects data from all GPUs and distributes it to all GPUs, so each GPU ends up with a copy of the data from every other GPU.
  • ReduceScatter: Combines data from all GPUs and then scatters the result across the GPUs, each GPU receiving one chunk of it.
  • Scatter: Splits data held by one GPU and distributes the pieces across multiple GPUs.
  • Send/Recv: Point-to-point transfer between two GPUs. These primitives can be chained into patterns such as a ring, where each GPU sends to the next one and receives from the previous one.
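
To see what a couple of these operations look like from the framework side, here is a minimal sketch using torch.distributed, which calls NCCL underneath when the backend is "nccl". It assumes the process group from the earlier sketch has already been initialized.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device("cuda", torch.cuda.current_device())

# AllReduce: every GPU contributes a tensor, every GPU gets back the element-wise sum.
local_grad = torch.full((4,), float(rank), device=device)
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)   # result is identical on all ranks

# Broadcast: rank 0 pushes its tensor (e.g. initial weights) to everyone else.
weights = torch.randn(4, device=device) if rank == 0 else torch.empty(4, device=device)
dist.broadcast(weights, src=0)

# AllGather: every GPU ends up with a copy of every other GPU's tensor.
chunks = [torch.empty(4, device=device) for _ in range(world_size)]
dist.all_gather(chunks, local_grad)
```
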
These collective operations are crucial in HPC environments for distributing and synchronizing work efficiently and in parallel, which is exactly what large model training jobs require. Some of them, such as all-reduce and all-gather, are data-hungry and therefore bandwidth intensive, and they can expose bottlenecks in a network that was not designed and optimized for them. Let's look a bit closer at what happens during all-gather and all-reduce:

All-gather collects data from every GPU and distributes it to every other GPU in the training job, so each GPU ends up knowing what data the others hold. All-reduce does more or less the same, but it combines the data from all GPUs before distributing the result. ML clusters can use thousands of GPUs: consider Nvidia's DGX platform, which has 8x Nvidia H100 GPUs per node. A cluster of 128 nodes gives us 1,024 GPUs that need to exchange, combine and synchronize data with each other, and that puts a lot of stress on the network.
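
How much stress? Here is a rough sketch based on the textbook cost of a ring all-reduce, in which each GPU sends roughly 2·(N−1)/N times the size of its own buffer per operation. The gradient size per GPU is my own assumption, purely for illustration.

```python
# Rough per-GPU traffic for one ring all-reduce across the cluster.
# Assumptions (illustrative): 1,024 GPUs, 10 GB of gradient data per GPU per step.
n_gpus = 1024
grad_gb = 10.0                                   # hypothetical gradient buffer per GPU

sent_per_gpu_gb = 2 * (n_gpus - 1) / n_gpus * grad_gb
total_gb = sent_per_gpu_gb * n_gpus              # data crossing the fabric per step

print(f"Each GPU sends : ~{sent_per_gpu_gb:.1f} GB per all-reduce")
print(f"Whole cluster  : ~{total_gb / 1024:.1f} TB per training step")
```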

What are PyTorch and TensorFlow, and where does NCCL fit in the picture?

PyTorch and TensorFlow are frameworks that enable developers to build and train neural networks. NCCL fits in as a backend to both, and each framework abstracts the collective communication away from the developer. Depending on the deep learning algorithm configured in PyTorch or TensorFlow, the framework chooses the GPU communication pattern that best fits the use case. In other words, when you use the distributed training features in these frameworks, you don't need to manually manage how data is synchronized or communicated between different GPUs or nodes.
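
As an illustration of that abstraction, here is a hedged sketch of a PyTorch training step using DistributedDataParallel. The model, optimizer and batch are placeholders; the point is that there is no explicit all-reduce call anywhere, because DDP hooks the gradient all-reduce (via NCCL) into the backward pass for you.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group was initialized with the NCCL backend (see earlier sketch)
# and that this process owns one local GPU.
device = torch.device("cuda", torch.cuda.current_device())

model = DDP(nn.Linear(1024, 10).to(device),       # placeholder model
            device_ids=[torch.cuda.current_device()])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 1024, device=device)     # placeholder batch
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()        # DDP all-reduces the gradients across GPUs here, using NCCL
optimizer.step()       # every rank applies the same averaged gradients
```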

InfiniBand: interconnecting HPC hosts

If you've been exploring AI networking, you've likely noticed that InfiniBand is a hot topic in this space. It's particularly relevant for AI and high-performance computing (HPC) workloads. Let's take a bird's-eye view to understand why it's so crucial. Starting with Ethernet: it has been the standard in networking since 1980, powering most local area networks. Ethernet uses frames for data transmission and MAC addresses to identify devices, with speeds ranging from 10 Mbps to beyond 100 Gbps. However, in the world of high-performance computing, which is critical for AI, Ethernet often falls short due to several limitations:
  • Protocol overhead: TCP/IP includes mechanisms for ensuring data integrity, such as error checking, packet reordering, retransmission and congestion control. These are very useful for traditional workloads, but they add processing overhead that can make the network the bottleneck in HPC environments.
  • Congestion management: TCP's congestion control has to encounter packet loss or retransmissions before it kicks in and adapts, which doesn't work in AI/HPC networks that have to be lossless. In practice this makes plain Ethernet a lossy fabric.
  • Non-predictable nature: The path data takes from source to destination can vary based on multiple factors such as congestion, routing changes and so on. This variability is problematic for HPC workloads, which require consistent and predictable performance.

Developed in 2000 to address Ethernet's shortcomings, InfiniBand takes a different approach, focusing on application-centric communication. Unlike Ethernet, where data must traverse from the application to the OS, then through the network stack and finally onto the wire (and the reverse at the destination), InfiniBand uses RDMA (Remote Direct Memory Access) to bypass the operating system and create a direct "channel" for communication between applications.

This approach substantially reduces the layers of communication required to transmit data from one application to another. InfiniBand's key features are:
  • Extremely low latency, achieved through hardware offloading and acceleration mechanisms; as a result, end-to-end communication between applications can be as low as about 1 microsecond.
  • Scalability: a single InfiniBand subnet can scale up to roughly 48,000 nodes.
  • Quality of service: packets can be mapped to different "channels", and InfiniBand can assign important traffic to dedicated channels so that other traffic does not consume the bandwidth required for business-critical applications.

InfiniBand Architecture

InfiniBand's architecture is divided into five layers, similar to the TCP/IP model. Let's discuss each layer from top to bottom:

Upper layer

Applications are consumers of the InfiniBand message service. The top layer defines the methods an application uses to access the set of services provided by InfiniBand. This is where libraries like NCCL or MPI live.

Transport layer

InfiniBand's transport layer is about creating a virtual channel directly between applications, bypassing the OS entirely. This is achieved through hardware-based transport services in the NICs, known as HCAs (Host Channel Adapters), which conserves valuable CPU resources.

Network Layer

This is where we get to the stuff that interests us network engineers. The upper and transport layers focus on the individual host, but something has to route, switch and deliver the data from one host to another; that is the role of the network layer. InfiniBand routers are used to interconnect different InfiniBand subnets, which allows us to isolate traffic and scale effectively. Routers use the source and destination GIDs (InfiniBand's network-layer addresses) to deliver packets to the destination node.

Link Layer

What if devices are in the same local subnet? No need to route, right? Exactly, and that's where the link layer comes in. InfiniBand assigns a LID (Local ID) to each node in the subnet. LIDs are assigned and maintained by a subnet manager. If you've read a bit about InfiniBand, you will often hear that "the InfiniBand network is the natural SDN network", and this is one of the reasons: InfiniBand networks use a centralized management architecture, which is a core tenet of SDN.

The switches' forwarding tables are populated by the subnet manager. The subnet manager is essentially the brain of the network: it holds the logic required to switch packets from one LID to another, maps destination LIDs to exit ports, and programs those mappings into the switches' hardware. It works a bit like the FIB in a regular switch.
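
Conceptually, and only conceptually (this is a toy illustration, not how a real subnet manager or switch ASIC is implemented), the table each switch ends up with is just a mapping from destination LID to egress port:

```python
# Toy illustration of a per-switch forwarding table programmed by the
# subnet manager: destination LID -> egress port.
forwarding_table = {
    0x0001: 1,   # traffic for LID 0x0001 leaves via port 1
    0x0002: 1,
    0x0003: 4,
    0x0004: 7,
}

def egress_port(dest_lid: int) -> int:
    """Look up the exit port for a packet, much like a FIB lookup."""
    return forwarding_table[dest_lid]

print(egress_port(0x0003))   # -> 4
```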

The link layer also has its own flow-control mechanism, used to adjust the transmission rate between sender and receiver, a bit like TCP's window mechanism, except that it is credit-based: the sending node tracks the receiver's available buffer space and transmits data only if there is enough room for it. This is what makes InfiniBand a "lossless network".
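
Here is a deliberately simplified sketch of that idea, just to show the shape of credit-based flow control (the real mechanism is applied per virtual lane and implemented in hardware):

```python
# Simplified credit-based link-level flow control: the receiver advertises how many
# buffer slots ("credits") it has free; the sender only transmits while it holds
# credits, so nothing is ever dropped.

class Link:
    def __init__(self, receiver_buffer_slots: int):
        self.credits = receiver_buffer_slots    # advertised by the receiver

    def send(self, packet: str) -> bool:
        if self.credits == 0:
            return False                        # sender waits; the packet is NOT dropped
        self.credits -= 1                       # one receive buffer slot is now in use
        print(f"sent {packet}")
        return True

    def receiver_freed_one_slot(self):
        self.credits += 1                       # receiver drained a buffer, credit returned

link = Link(receiver_buffer_slots=2)
link.send("pkt-1")              # ok
link.send("pkt-2")              # ok
link.send("pkt-3")              # blocked: no credits left, sender pauses instead of dropping
link.receiver_freed_one_slot()
link.send("pkt-3")              # ok now
```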

Physical layer

After all this is done, you still have to put data on the wire, right? This layer is where that happens. Examples of cables used for InfiniBand interconnects are DAC (Direct Attach Copper), used for short-range connections such as within a rack or between adjacent racks in data centers, and AOC (Active Optical Cables), used for longer-distance InfiniBand connections.

So InfiniBand seems like the obvious interconnect of choice, right? Well, it depends on the use case; InfiniBand has some downsides:
  • Cost: InfiniBand hardware and cables can be more expensive due to their specialized nature.
  • Infrastructure: Ethernet's widespread adoption means more existing infrastructure, making InfiniBand less convenient for integration.
  • Compatibility: InfiniBand may require specific configurations and support, potentially complicating setups in environments standardized on Ethernet.
  • Scalability: While highly scalable within its ecosystem, InfiniBand's less ubiquitous presence can make scaling beyond specialized applications or environments more challenging.
  • Support: Ethernet's broad adoption ensures extensive vendor and community support, which might be less readily available for InfiniBand outside specific high-performance computing contexts.
Due to these limitations, work has been done to enable RDMA over Converged Ethernet (RoCE), which aims to achieve capabilities similar to InfiniBand over widespread, familiar Ethernet.

RoCE: Bridging RDMA and Ethernet for AI

RoCE stands for RDMA over Converged Ethernet, and its first implementation was called RoCEv1, which was originally a layer 2 protocol. First, let's look at what InfiniBand's packet format looks like and what happens at the link layer:


LRH (Local Route Header) is InfiniBand's link-layer header and contains the source and destination LIDs discussed above. GRH (Global Route Header) is the network-layer header, used when the packet has to be routed via an InfiniBand router. RoCEv1 replaces the LRH with an Ethernet MAC header, making the packet switchable on layer 2 Ethernet networks:

This is great because we get the benefits of RDMA as well as Ethernet capabilities in the same layer 2 network. But because RoCEv1 was designed for layer 2 networks, it is not routable, and that's why RoCEv2 was developed. RoCEv2 extends RoCEv1 so that it can be transported over layer 3 networks. It does so by simply continuing up the stack and replacing InfiniBand's GRH with standard IP and UDP headers.
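
To make the encapsulation differences easy to compare, here is a small illustrative sketch of the three header stacks. It is simplified (CRCs and optional extended headers are left out); the UDP destination port 4791 is the one registered for RoCEv2.

```python
# Simplified view of the header stack in each variant (payloads and CRCs omitted).
# Note: in native InfiniBand, the GRH is only present when the packet crosses subnets.
infiniband = ["LRH (LIDs)", "GRH (GIDs)", "BTH (transport)", "RDMA payload"]
rocev1     = ["Ethernet (MACs)", "GRH (GIDs)", "BTH (transport)", "RDMA payload"]
rocev2     = ["Ethernet (MACs)", "IP", "UDP (dst port 4791)", "BTH (transport)", "RDMA payload"]

for name, stack in [("InfiniBand", infiniband), ("RoCEv1", rocev1), ("RoCEv2", rocev2)]:
    print(f"{name:10s}: " + " | ".join(stack))
```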



RoCE can be implemented either over a lossy network, an option called Resilient RoCE that I won't cover here, or over a lossless network, which is what ML workloads require. RoCE achieves losslessness by using DCQCN.

DCQCN: RoCE's way of preventing packet loss

DCQCN (Data Center Quantized Congestion Notification) is a congestion control scheme specifically designed for RoCE networks. It was developed because Ethernet does not provide the congestion control mechanisms inherent to InfiniBand. DCQCN uses Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC) to prevent packet loss; let's see what that means and how it works:

Explicit Congestion Notification (ECN) 

ECN marking happens in the IP header of the RoCEv2 packet. As packets enter an Ethernet switch, its buffer queue fills up. With ECN, we configure a buffer-utilization threshold which, when exceeded, causes the switch to mark packets with the CE (Congestion Experienced) value in the ECN field of the IP header of the RoCEv2 packet. This differs from classic TCP, which signals congestion by dropping packets: ECN notifies the endpoints of congestion before the switch has to start dropping packets.
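
As a toy illustration of the marking decision (real switches do this in hardware, often with probabilistic marking between a low and a high threshold, which I'm simplifying away here):

```python
# Toy illustration of ECN marking at a switch egress queue.
ECT = 0b10   # ECN-Capable Transport, set by the sender
CE  = 0b11   # Congestion Experienced, set by the switch

MARK_THRESHOLD = 100          # queue depth (in packets) above which marking starts

def enqueue(packet: dict, queue_depth: int) -> dict:
    """Mark the packet's ECN bits instead of dropping it when the queue is deep."""
    if queue_depth > MARK_THRESHOLD and packet["ecn"] == ECT:
        packet["ecn"] = CE    # the receiver will echo this so the sender slows down
    return packet

pkt = {"ecn": ECT, "payload": b"..."}
print(enqueue(pkt, queue_depth=150))   # -> the ecn field becomes CE (0b11)
```

In DCQCN, the receiving NIC that sees CE-marked packets sends congestion notification packets (CNPs) back to the sender, which then reduces its transmission rate before any queue overflows.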

Priority-based Flow Control (PFC)

PFC is an enhancement of the Ethernet pause frame, where a receiving node can send a PAUSE frame if it detects congestion, halting the sender's transmission for a specified period of time and giving the receiver time to recover from the congested state. PFC goes one step further and implements the same pause mechanism per class of service (CoS). This means that only traffic belonging to a specific class (or priority) is paused, while other traffic continues to flow. This capability is crucial in environments that mix loss-sensitive traffic (like RDMA) with other types of traffic. By selectively pausing only the traffic that is experiencing congestion, PFC helps maintain lossless transport for high-priority flows without impacting the throughput of the entire network.
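
A minimal sketch of the per-priority idea (again purely illustrative; real PFC is negotiated per 802.1p priority and enforced in NIC and switch hardware):

```python
# Toy per-priority pause state, as opposed to a classic PAUSE frame that
# stops the whole port.
paused = {priority: False for priority in range(8)}   # 8 classes of service (0-7)

def receive_pfc_frame(priority: int, pause: bool) -> None:
    """A PFC frame pauses or resumes exactly one traffic class."""
    paused[priority] = pause

def can_transmit(priority: int) -> bool:
    return not paused[priority]

receive_pfc_frame(priority=3, pause=True)   # e.g. the RDMA traffic class gets paused
print(can_transmit(3))   # False - the lossless class holds off
print(can_transmit(0))   # True  - best-effort traffic keeps flowing
```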

Together, these two mechanisms form the DCQCN congestion control scheme, which provides the lossless behavior required by ML/HPC workloads.

RoCE vs InfiniBand

In data centers whose customers train very large language models, you're often caught choosing between InfiniBand and Ethernet (RoCE included). InfiniBand offers top-notch speed and minimal delay, perfect when performance can't be compromised, but it comes with a higher cost and a bit of a learning curve. RoCE, particularly RoCEv2, strikes a balance: it brings some of InfiniBand's high-performance magic to the familiar terrain of Ethernet, making it a great pick for data centers aiming to amp up their processing power without a major network overhaul. RoCE is ideal when you want to integrate HPC workloads into a broader network infrastructure, while InfiniBand is better when you want dedicated little "compute islands".

Conclusion

Phew, that was a lot, right? 😄 We just dipped our toes into the massive ocean of HPC/AI networking. I tried to keep it high-level to avoid turning this into a novel, but there’s so much more we could talk about. Got thoughts or questions? Hit me up on LinkedIn or wherever you like to chat. Let’s keep this conversation going!
