DiLoCo vs DeMo

May 1, 2025

The Bandwidth Bottleneck

One of the main obstacles to scaling distributed training systems is the heavy communication cost of sharing either the model states or the gradients of every node in the system at each optimiser step.

If, for example, we were attempting to train an FP32 1B-parameter model across 250 nodes, each of these nodes would have to share its 4 GB of local model state with all 249 other nodes every optimiser step. This would be extremely time-consuming, causing a lot of idle compute time across the global system. One can also imagine how this becomes completely infeasible when trying to scale to 100B/1000B-parameter models.
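To make the cost concrete, here is the back-of-the-envelope arithmetic for that example. The 10 Gbit/s link speed is a hypothetical assumption chosen purely for illustration:

```python
# Back-of-the-envelope cost of naive all-to-all weight sharing.
# Assumed numbers: 1e9 FP32 parameters, 250 nodes, 10 Gbit/s links (hypothetical).
params = 1_000_000_000
bytes_per_param = 4                       # FP32
nodes = 250

payload = params * bytes_per_param        # 4 GB of local model state
sent_per_step = payload * (nodes - 1)     # one full copy to every peer

link_gbps = 10                            # hypothetical link speed
seconds = sent_per_step * 8 / (link_gbps * 1e9)
print(f"{sent_per_step / 1e12:.1f} TB sent per node per step, "
      f"about {seconds / 60:.0f} min at {link_gbps} Gbit/s")
```

At those assumed numbers, every node ships roughly a terabyte per optimiser step and spends over ten minutes doing it, which is where the idle compute comes from.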

Traditionally, the most promising distributed systems of the past have attempted to address this issue with a mechanism called Butterfly All-Reduce, which reduces the maximum bandwidth required by each node via an efficient communication protocol: nodes split their models into N different shards (N being the total number of nodes in the system) and only share one of these shards at a time, thereby cutting the required bandwidth to 1/log₂(N) of the naive all-to-all cost.
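To give a feel for how this works, below is a toy single-process simulation of a butterfly-style all-reduce: a recursive-halving reduce-scatter followed by a recursive-doubling all-gather. The node count, vector size, and variable names are illustrative assumptions; a real implementation performs these pairwise exchanges over the network:

```python
import numpy as np

# Toy in-memory simulation of a butterfly-style all-reduce. Each "node" is just
# a numpy array; real systems run these pairwise exchanges over the network.
N, D = 8, 16                                   # nodes (power of two), vector size
data = [np.random.randn(D) for _ in range(N)]  # each node's local gradients
expected = np.sum(data, axis=0)                # what a correct all-reduce yields

# Phase 1, reduce-scatter: log2(N) rounds of pairwise exchanges, halving the
# active segment each round, until every node owns one fully reduced shard.
seg = [(0, D)] * N                             # active [start, end) per node
stride = N // 2
while stride >= 1:
    for i in range(N):
        j = i ^ stride                         # butterfly partner for this round
        if i < j:
            s, e = seg[i]
            mid = (s + e) // 2
            data[i][s:mid] += data[j][s:mid]   # i keeps and reduces the lower half
            data[j][mid:e] += data[i][mid:e]   # j keeps and reduces the upper half
            seg[i], seg[j] = (s, mid), (mid, e)
    stride //= 2

# Phase 2, all-gather: mirror the pattern, doubling each node's owned block
# until every node holds the complete reduced vector.
stride = 1
while stride < N:
    for i in range(N):
        j = i ^ stride
        if i < j:
            (si, ei), (sj, ej) = seg[i], seg[j]
            data[i][sj:ej] = data[j][sj:ej]
            data[j][si:ei] = data[i][si:ei]
            seg[i] = seg[j] = (min(si, sj), max(ei, ej))
    stride *= 2

assert all(np.allclose(d, expected) for d in data)
```

Each node participates in only log₂(N) pairwise exchanges per phase, never talking to all N-1 peers at once, which is where the savings over naive all-to-all come from.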

Whilst this algorithm initially did well to tackle communication bottlenecks, the rapid growth in the size of SOTA models over time has meant that this reduction in bandwidth cost has not been enough for distributed training systems to catch up. A case in point: if we were to scale our example to a distributed training system of 1,000 nodes, Butterfly All-Reduce would only cut the maximum bandwidth requirement to roughly 1/10 of the naive cost, since log₂(1000) ≈ 10.

Thankfully, over the past 12 months there has been a flurry of great publications addressing this exact issue. Two of the most promising of these advancements are DeMo (Decoupled Momentum Optimization) and DiLoCo (Distributed Low-Communication Training of Language Models), which we'll cover in more detail in the rest of this post.

Whilst each method tries to reduce the communication cost by 500-1000x in different ways, they both break from the assumption that all nodes within a distributed training system have to be fully synced, opening the door to a world where nodes can hold varying local model states as long as they periodically synchronise with each other.

DiLoCo

DiLoCo was published by Arthur Douillard et al. in 2023, and in mid-2024 the talented team at Prime Intellect published OpenDiLoCo, an open-source implementation of the original paper. DiLoCo aims to reduce the communication burden amongst nodes by reducing the frequency of all-reduce steps by up to 500x.

The algorithm achieves this by introducing the concept of pseudo-gradients and allowing each node to update its local model weights every optimiser step. After a number of these steps, referred to as inner steps, have completed, nodes compute their pseudo-gradients by calculating the difference between their current model weights and the model weights at inner step 0.

Source: DiLoCo paper

Nodes then all-reduce their pseudo-gradients and use a second optimiser, called the outer optimiser, to adjust their model weights based on those averaged pseudo-gradients. The OpenDiLoCo paper shows that even for a single node, applying the DiLoCo protocol leads to faster loss convergence, and increasing the number of nodes in the system speeds convergence up further.
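A minimal sketch of one such round is shown below, assuming torch.distributed is already initialised across the nodes. H = 500 inner steps and the AdamW-inner/Nesterov-SGD-outer pairing follow the paper; everything else (function names, the data iterator) is our own scaffolding:

```python
import torch.distributed as dist

H = 500  # inner steps between all-reduces

def diloco_round(model, inner_opt, outer_opt, data_iter, loss_fn):
    # Snapshot the weights as they were at inner step 0.
    initial = [p.detach().clone() for p in model.parameters()]

    # Inner phase: H ordinary local optimiser steps (AdamW in the paper),
    # with no communication at all.
    for _ in range(H):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

    for p, p0 in zip(model.parameters(), initial):
        # Pseudo-gradient: step-0 weights minus current weights.
        p.grad = p0 - p.detach()
        # Average the pseudo-gradients across nodes; the only communication.
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
        # The outer step is taken from the step-0 weights, not the drifted ones.
        p.data.copy_(p0)

    # One outer-optimiser step (SGD with Nesterov momentum in the paper).
    outer_opt.step()
    outer_opt.zero_grad()
```

Note that with an outer learning rate of 1 and no momentum, the outer step would simply recover each node's local weights; the Nesterov momentum on the averaged pseudo-gradients is what makes the synchronised update better than any single node's drift.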

Source: OpenDiLoCo paper

By setting the number of inner steps per outer step to 500, this protocol is therefore able to cut the communication volume by 500x. It's important to realise, though, that this doesn't reduce the maximum bandwidth required by each node in the system; it simply reduces the frequency at which gradients are all-reduced. When an all-reduce does occur, it still requires the same amount of bandwidth as butterfly all-reduce.

There are a number of live projects currently implementing DiLoCo in the wild. Here at DSTRBTD we're running a modified version of OpenDiLoCo's original code on Bittensor's Subnet 38, pre-training a 1B-parameter model from scratch. The folks at Prime Intellect are RL-training a 32B-parameter model using their enhanced version of OpenDiLoCo, publicly available on their GitHub.

DeMo

DeMo was published by the inspiring team at Nous Research in late 2024, after having teased the public with DisTrO and its preliminary report a few months earlier. DeMo is a unique optimiser that removes the need for global all-reduces and manages to reduce the overall system's communication requirement by several orders of magnitude. It separates faster-moving momentum components from slower ones and only shares a compressed version of the fast components with all other nodes in the group.

By doing so, DeMo can reduce the amount of data that needs to be shared amongst nodes by many orders of magnitude. The floor on how little data can be shared hasn't been fully explored yet, and in theory the communication requirement could be reduced by more than 1000x. Unlike DiLoCo, DeMo has the benefit of reducing the maximum bandwidth required by each node, at the cost of requiring communication every optimiser step.
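A heavily simplified, single-tensor sketch of that core loop might look as follows. This is our own illustration, not the paper's code: per-chunk processing, the signSGD-style update, and the real gather logic are all omitted, and beta and k are illustrative values:

```python
import numpy as np
from scipy.fft import dct, idct

beta, k = 0.9, 32   # momentum decay and components to share (illustrative values)

def demo_step(momentum, grad):
    # Accumulate the new gradient into the local, decoupled momentum buffer.
    momentum = beta * momentum + grad

    # Move to the frequency domain and keep only the k largest components:
    # these are the "fast-moving" parts worth communicating right now.
    freq = dct(momentum, norm="ortho")
    idx = np.argsort(np.abs(freq))[-k:]
    sparse = np.zeros_like(freq)
    sparse[idx] = freq[idx]

    # Remove what we transmit from the buffer so it isn't re-sent later;
    # the slow-moving residue stays local and keeps accumulating.
    extracted = idct(sparse, norm="ortho")
    momentum -= extracted

    payload = (idx, freq[idx])   # the only data shared with the other nodes
    return momentum, extracted, payload
```

The payload is just k indices and k coefficients per tensor, rather than the full gradient, which is where the orders-of-magnitude reduction comes from.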

The incredible team at Templar are live and running with an implementation of DeMo that has had many successful runs pre-training a 1.2B-parameter model on Bittensor's Subnet 3, whilst Nous Research have also launched a testnet on Ethereum called Psyche, where they are testing DeMo in a p2p environment.

DeMo + DiLoCo

When reading about these two advancements, one is tempted to ask: can we not leverage both ideas in the same system? The idea would be to have all nodes in the distributed system run DeMo as an inner optimiser, constantly sharing a small fraction of fast-moving components with all other nodes. Every H inner steps, all nodes would calculate pseudo-gradients between the model state at inner step 0 and the state at inner step H and all-reduce those gradients with the rest of the system, as in the sketch below.
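Sketched in code, reusing the structure of the DiLoCo round above, the hybrid might look like this. It is entirely speculative: demo_opt is a hypothetical optimiser with the standard torch Optimizer interface whose .step() internally shares compressed momentum components with the other nodes, and nothing here is a recipe known to converge well:

```python
import torch.distributed as dist

def hybrid_round(model, demo_opt, outer_opt, data_iter, loss_fn, H=500):
    # Snapshot the weights at inner step 0, as in plain DiLoCo.
    initial = [p.detach().clone() for p in model.parameters()]

    # Inner phase: DeMo-style steps with frequent, tiny payloads.
    # demo_opt is hypothetical; its .step() is assumed to handle the
    # compressed momentum exchange internally.
    for _ in range(H):
        x, y = next(data_iter)
        demo_opt.zero_grad()
        loss_fn(model(x), y).backward()
        demo_opt.step()

    # Outer phase: a rare, heavier DiLoCo-style sync.
    for p, p0 in zip(model.parameters(), initial):
        p.grad = p0 - p.detach()                       # pseudo-gradient
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)  # average across nodes
        p.data.copy_(p0)                               # restart from step 0

    outer_opt.step()                                   # apply averaged update
    outer_opt.zero_grad()
```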

In theory, one would assume that such a system would converge much faster than either of the two on its own. Follow-on papers from the original DiLoCo authors seem to be heading in this direction as well, communicating partial components of the pseudo-gradients more frequently to reduce peak bandwidth requirements.

Here at DSTRBTD we were initially very intrigued by this idea and tested it out in a number of experiments earlier in the year; however, we found that local model states were regressing after the DiLoCo outer optimiser steps. Now that we have our own version of DiLoCo live, we'll be looking to re-explore this idea in more detail over the coming months.