In my last blog post I gave a high-level overview of artificial intelligence to clear up some of the confusion around the topic. As a network engineer, today I want to talk about the implications I see for the network. We will go over key topics such as GPU-to-GPU communication, NCCL, as well as InfiniBand, RDMA and RoCE, which have been implemented in the industry for HPC/AI networks.
What are the requirements for a network carrying ML/AI workloads?
The key requirements for ML/AI workloads are:
- High bandwidth: Amazon's P5 instance, for example, supports up to 3,200 Gbps of aggregate network bandwidth. The network needs to be non-blocking, meaning it can handle the maximum possible bandwidth across all of its ports simultaneously, without any of the data transfers hindering or blocking the others.
- Low latency: Low latency is crucial for HPC workloads, especially AI model training, because it ensures fast communication between distributed computing resources. This rapid exchange of data between nodes or GPUs accelerates the iterative training process, reducing the time to convergence for complex models. Latencies vary, but are typically on the order of microseconds (µs) to milliseconds (ms).
- Lossless network: Even a small amount of packet loss during the training of AI models can significantly extend training time. This is why mechanisms like TCP congestion control do not suit these networks: they need to encounter loss before throttling the transmission rate.
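To put these requirements into perspective, here is a rough back-of-the-envelope sketch (my own simplification, not vendor data) of how long a single gradient synchronization could take for a 175-billion-parameter model at different link speeds, assuming FP16 gradients and a ring-style all-reduce that moves roughly twice the gradient volume over the wire:

```python
# Back-of-the-envelope estimate of one full gradient synchronization.
# Assumptions (illustrative only): FP16 gradients (2 bytes per parameter),
# a ring all-reduce transferring ~2x the gradient size per GPU, and the
# network link being the only bottleneck (no overlap with computation).

def allreduce_time_seconds(num_params: float, link_gbps: float) -> float:
    payload_bytes = num_params * 2          # FP16: 2 bytes per parameter
    wire_bytes = payload_bytes * 2          # ring all-reduce ~2x data volume
    link_bytes_per_sec = link_gbps * 1e9 / 8
    return wire_bytes / link_bytes_per_sec

params = 175e9  # GPT-3-scale model, as mentioned later in this post
for gbps in (100, 400, 3200):
    t = allreduce_time_seconds(params, gbps)
    print(f"{gbps:>5} Gbps -> ~{t:.1f} s per gradient sync")
```

Even this crude math shows why link speeds in the hundreds or thousands of gigabits per second matter: a synchronization of roughly this size can happen on every training iteration.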
GPU-to-GPU Communication
One thing to understand about machine learning workloads is that the computations performed by these complex algorithms are powered by GPUs, in contrast to traditional workloads that are powered by CPUs. Training a model like ChatGPT, for example, requires substantial amounts of compute power to keep the training time manageable. The training time depends on multiple aspects:
- Model complexity: Just for reference, ChatGPT has around 175 billion parameters. Parameters are the small knobs that the model adjusts and fine-tunes across each computation. For example, in a model trying to distinguish pictures of cats and dogs, one neuron might detect the presence of shapes relevant to cats versus dogs while another might recognize the texture of fur. Adjusting each parameter over and over again allows the model to learn from data and improve its predictions to the point where they become very accurate. More info about neural networks in my previous blog post.
- Dataset size: The larger the dataset used for training, the more time is required to process all the data.
- Hardware capabilities: The speed and memory of each GPU largely influence the training time. That's why Nvidia is making bank today :) They are leading the industry in AI-optimized GPU technologies and everything related to them.
- Parallelism and distribution: As machine learning models grow in complexity and size, we can't rely on training a model with just one GPU, so we need to scale the number of GPUs based on the computational demands. Because GPUs have a very large number of cores, we can run the required computations in parallel. However, this parallel processing comes with the challenge of ensuring that data and learning processes are synchronized and distributed across all GPUs involved in the training job. These communications between GPUs are orchestrated by libraries like NCCL (NVIDIA Collective Communications Library) and MPI (Message Passing Interface), which will be described below.
Why are GPUs required for ML workloads?
One key difference between GPUs and CPUs is that CPUs have a smaller number of cores optimized for sequential processing, while GPUs have a large number of cores, which makes them ideal for parallel computing. That parallelism is crucial for the performance of the complex deep learning algorithms used to train models like ChatGPT (a short sketch illustrating this follows the two scaling patterns below). Because of the need for massive amounts of compute, there are two well-known design patterns that are used:
- Vertical scaling : Vertical scaling means increasing the capacity of a single resource, such as a server to handle more load.
- Horizontal scaling : Horizontal scaling means increasing the capacity of the whole system by adding multiple servers to the same logical system/cluster.
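Before moving on to how the network ties these GPUs together, here is a minimal PyTorch sketch (my own illustration) of the parallel-computing advantage mentioned above: the same matrix multiplication run on the CPU and then on a GPU.

```python
import time
import torch

# Illustrative only: the same matrix multiplication on CPU and on GPU.
# The GPU spreads the work across thousands of cores in parallel.
size = 4096
a = torch.randn(size, size)
b = torch.randn(size, size)

start = time.time()
c_cpu = a @ b
print(f"CPU matmul: {time.time() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()            # wait for the host-to-GPU copies
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()            # wait for the GPU kernel to finish
    print(f"GPU matmul: {time.time() - start:.3f} s")
```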
Horizontal scaling is where the network comes into the picture. As John Gage, chief scientist and master geek at Sun Microsystems, once said: "The network is the computer." There is only so much you can achieve by scaling vertically; the true power of computing comes from interconnecting multiple individual systems across the network and orchestrating them to behave as one single system. Using the network to train large and complex models in a distributed way requires specific design patterns, as well as taking into account the nature of these workloads.
NCCL: The Backbone of Multi-GPU Training
NCCL provides a set of collective communication operations that GPUs use to exchange and combine data during distributed training:
- AllReduce: Combines data from all participating GPUs into a single result and copies it back to each GPU. Useful for tasks like calculating the total loss during training or summarizing statistics across all data.
- Reduce: Combines data from all participating GPUs into a single result and sends it to one GPU.
- Broadcast: Sends data from one GPU to all other participating GPUs. Ideal for sharing initial model parameters or distributing important updates to all GPUs.
- Gather: Collects data from all GPUs onto one GPU, resulting in that GPU having a copy of the data from all other GPUs.
- AllGather: Collects data from all GPUs and distributes it to all GPUs, resulting in each GPU having a copy of the data from every other GPU.
- ReduceScatter: Combines data from all GPUs, then scatters the result across all GPUs so that each GPU holds a portion of it.
- Scatter: Takes data from one GPU and distributes it in chunks across multiple GPUs.
- Send/Recv: Point-to-point data transfer between two GPUs. NCCL also chains these into a ring involving all GPUs, where each GPU sends to the next one in the ring and receives from the previous one.
AllGather collects data from all GPUs and distributes it to every other GPU in the training job, so every GPU knows what data the other GPUs have. AllReduce does more or less the same, but it combines the data from all GPUs before distributing the result. ML clusters can use thousands of GPUs: consider Nvidia's DGX platform, which has 8x Nvidia H100 GPUs per node. If we have a cluster of 128 nodes, that means 1,024 GPUs need to exchange, combine and synchronize data with each other, which puts a lot of stress on the network.
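To make these collectives a bit more concrete, here is a minimal sketch using PyTorch's torch.distributed module with the NCCL backend (my own illustration; it assumes a launcher such as torchrun sets up the rank and world-size environment variables, and one GPU per process):

```python
import torch
import torch.distributed as dist

# Minimal sketch: each process drives one GPU and takes part in the
# collectives described above. Launch with, for example:
#   torchrun --nproc_per_node=8 collectives_demo.py
dist.init_process_group(backend="nccl")   # NCCL handles the GPU-to-GPU transport
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())  # single-node simplification

# Every GPU starts with its own tensor (here: just its rank).
x = torch.full((4,), float(rank), device="cuda")

# AllReduce: sum the tensors from all GPUs; the result lands on every GPU.
dist.all_reduce(x, op=dist.ReduceOp.SUM)

# AllGather: every GPU ends up with a copy of every GPU's tensor.
gathered = [torch.zeros(4, device="cuda") for _ in range(world_size)]
dist.all_gather(gathered, x)

# Broadcast: rank 0 pushes its tensor to everyone else.
dist.broadcast(x, src=0)

print(f"rank {rank}: after all_reduce -> {x.tolist()}")
dist.destroy_process_group()
```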
What are PyTorch and TensorFlow, and where does NCCL fit in the picture?
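PyTorch and TensorFlow are the deep learning frameworks in which models are defined and trained; NCCL sits underneath them as the communication backend that actually moves tensors between GPUs. As a simplified sketch of how the pieces fit together in PyTorch (my own illustration, not an exhaustive description), wrapping a model in DistributedDataParallel is enough for the framework to call NCCL collectives to synchronize gradients during the backward pass:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch only: the application talks to PyTorch, PyTorch talks to NCCL.
dist.init_process_group(backend="nccl")        # choose NCCL as the collective backend
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])    # DDP hooks into the gradients

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device="cuda")
y = torch.randn(32, 1024, device="cuda")

loss = criterion(model(x), y)
loss.backward()        # under the hood: NCCL AllReduce on the gradients
optimizer.step()
dist.destroy_process_group()
```

TensorFlow follows the same idea with its own distribution strategies (e.g. MultiWorkerMirroredStrategy), which can also use NCCL for the cross-GPU reductions.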
InfiniBand: interconnecting HPC hosts
Traditional TCP/IP over Ethernet has several shortcomings for these workloads:
- Protocol overhead: TCP/IP includes mechanisms for ensuring data integrity, such as error checking, packet reordering, retransmission and congestion control. These are very useful for traditional workloads, but they add processing overhead that makes the network the bottleneck in HPC environments.
- Congestion management: TCP/IP congestion control only kicks in after it encounters packet loss or retransmissions, which makes Ethernet a lossy fabric by design. That doesn't work for AI/HPC networks, which need to be lossless.
- Non-predictable nature: The path data takes from source to destination can vary based on multiple factors such as congestion, routing changes, etc. This variability is problematic for HPC workloads, which require consistent and predictable performance.
Developed in 2000 to address Ethernet's shortcomings, InfiniBand takes a different approach, focusing on application-centric communication. Unlike Ethernet, where data must traverse from the application to the OS, then to the network stack and finally onto the wire (and vice versa at the destination), InfiniBand lets applications exchange messages directly with each other, bypassing the operating system and most of the network stack.
This approach substantially reduces the layers of communication required to transmit data from one application to another. InfiniBand's key features are:
- Extremely low latency, achieved through hardware offloading and acceleration mechanisms; as a result, end-to-end communication between applications can be as low as 1 microsecond.
- Scalability: a single InfiniBand subnet can scale up to 48,000 nodes.
- Quality of service: packets can be mapped to different "channels" (virtual lanes), and InfiniBand can assign important traffic to dedicated channels so that other traffic does not consume the bandwidth required for business-critical applications.
InfiniBand Architecture
InfiniBand's architecture is divided into five layers, similar to the TCP/IP stack. Let's discuss each layer from top to bottom:
Upper layer
Applications are consumers of the InfiniBand message service. The top layer defines the methods an application uses to access the set of services provided by InfiniBand. This is where libraries like NCCL or MPI live.
Transport layer
Link Layer
The switches' forwarding tables are populated by the subnet manager. The subnet manager is the brain of the network: it holds all the logic required to switch packets from one LID to another, mapping destination LIDs to exit ports and programming them into the switches' hardware. It works a bit like the FIB in a regular switch.
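As a purely conceptual illustration (not real subnet-manager or switch code), you can picture the forwarding table the subnet manager programs into each switch as a simple map from destination LID to exit port:

```python
# Toy illustration of a switch forwarding table as programmed by the
# subnet manager: destination LID -> exit port. This is a conceptual
# sketch, not how OpenSM or switch firmware actually works.
lft = {
    1: 1,   # LID 1 is reachable via port 1
    2: 1,   # LID 2 also sits behind port 1
    3: 4,   # LID 3 behind port 4
    4: 7,   # LID 4 behind port 7
}

def exit_port(dest_lid: int) -> int:
    """Look up the exit port for a destination LID, much like a FIB lookup."""
    if dest_lid not in lft:
        raise ValueError(f"no route programmed for LID {dest_lid}")
    return lft[dest_lid]

print(f"a packet destined to LID 3 leaves on port {exit_port(3)}")
```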