Currently, the cost barrier to training state-of-the-art language models is extremely high. GPT-4 is suspected to have cost more than $100 million to train, whilst Anthropic's CEO predicts that training SOTA models will cost $1 billion this year and $10 billion next year. This means that only a small oligarchy of well-funded tech giants has the ability to train these models.
As these models grow more intelligent and have a greater impact on our daily lives, so does the power of their owners. They end up deciding how these models should be censored and whose values they should incorporate. In effect, this means we get to be governed by AI trained on a constitution we never voted for.
Blockchain, the decentralised movement and, more specifically, Bittensor have proved that they can provide alternatives to this centralised approach by incentivising the masses to pool their resources to carry out useful work. As the co-founder of Bittensor, Const, often mentions, the collective amount of compute that goes into mining Bitcoin far exceeds the compute of any single Google, Microsoft, OpenAI or Anthropic data centre.
Granted, machine learning requires a different type of compute. But if a decentralised mechanism can incentivise that specific type of compute in a similar way, whilst accurately validating it, then in theory it can gain access to a similar amount of compute, if not more, to train a model with a trillion parameters.
Our proposed solution is a subnetwork that incentivises compute, bandwidth and latency. The compute powers the training of each miner's local version of the model, while the bandwidth and latency power the averaging of each miner's local gradients using an operation called butterfly all-reduce. Once this process completes successfully, each miner holds a unified, globally averaged gradient that it can use to update its model weights.
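As a rough illustration, the sketch below (not the subnet's actual code) shows where the averaged gradient fits into a miner's training step; the toy model, data and the butterfly_all_reduce placeholder are assumptions for readability.

```python
# Minimal sketch of one miner's step: compute local gradients, average them across
# the group via butterfly all-reduce, then apply the shared update.
import torch

model = torch.nn.Linear(512, 512)            # placeholder local model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def butterfly_all_reduce(grad):
    """Placeholder: exchange gradient chunks with peers and return the group average."""
    return grad                               # real network exchange omitted in this sketch

x, y = torch.randn(32, 512), torch.randn(32, 512)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                               # compute local gradients

for p in model.parameters():
    p.grad = butterfly_all_reduce(p.grad)     # replace local grads with the global average

optimizer.step()                              # every miner applies the same averaged update
optimizer.zero_grad()
```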
This particular synapse has required the bulk of our attention for the past few months. It is, in our opinion, the most unstable part of distributed training. It requires each miner to split its gradients into n batches, where n is the number of miners in the all-reduce group, send those batches to its peers, and average the gradient batches it receives; a simplified sketch follows the failure cases below.
If some miners fail to send their averaged gradients, then local gradients are used instead.
If some miners send gradients for the wrong model, then they are banned and their gradients are disregarded.
If miners are too slow to send their gradients, then they are also banned and local gradients are used instead.
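The simulation below is a hedged, simplified illustration of this chunking and of the fallback behaviour described above. The group size, gradient shapes and failure set are made up for the example, and the real synapse exchanges chunks over the network rather than in-process.

```python
# Simplified butterfly all-reduce: each of the n miners splits its gradient into n
# chunks and sends chunk i to miner i, who averages the copies it receives and
# returns the result to the group.
import numpy as np

n = 4                                                  # miners in the all-reduce group
grads = [np.random.randn(8) for _ in range(n)]         # each miner's local gradient
failed = {2}                                           # e.g. miner 2 never sends its chunk back

# Reduce step: miner i averages the i-th chunk collected from every miner.
averaged_chunks = []
for i in range(n):
    received = [np.array_split(grads[j], n)[i] for j in range(n)]
    averaged_chunks.append(np.mean(received, axis=0))

# Gather step: reassemble the global gradient; if the miner responsible for a chunk
# failed (or was banned), fall back to the corresponding slice of the local gradient.
def global_gradient(local_grad):
    local_chunks = np.array_split(local_grad, n)
    chunks = [local_chunks[i] if i in failed else averaged_chunks[i] for i in range(n)]
    return np.concatenate(chunks)

print(global_gradient(grads[0]))                       # miner 0's resulting update direction
```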
Stabilising this synapse call will be the main aim of this subnet in the short term. Initially, we will target this by setting a high minimum bandwidth requirement. We will then work on making all-reduce groups, and the nodes within them, easier to discover. Finally, we will upgrade this synapse to detect and penalise miners with slow gradient communication.
During distributed training, node failures are almost inevitable, especially during the all-reduce phase, which requires very high bandwidth. If a node fails to complete an all-reduce operation, its local model falls out of sync with the global model. A mechanism is therefore needed to enable an out-of-sync node to pull the latest model.
Hivemind attempts to achieve this in a decentralised way by enabling any node to call the function load_state_from_peers, which loads the latest model state from any available peer in the network.
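For reference, a minimal sketch of that Hivemind pattern might look as follows; the toy model, run_id and optimizer settings here are illustrative assumptions, not the subnet's actual configuration.

```python
# Hedged sketch of Hivemind's recovery path: an out-of-sync node asks its peers
# for the latest optimizer/model state.
import torch
import hivemind

model = torch.nn.Linear(128, 128)             # placeholder model for illustration
dht = hivemind.DHT(start=True)                # join (or start) the peer-to-peer DHT

opt = hivemind.Optimizer(
    dht=dht,
    run_id="distributed-training-run",        # assumed run name
    optimizer=torch.optim.AdamW(model.parameters(), lr=1e-4),
    batch_size_per_step=32,
    target_batch_size=4096,
)

# Pull the newest state from whichever peer currently has it.
opt.load_state_from_peers()
```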
From our experience testing this function for the past 4 months on Bittensor's testnet 80, we've found it to be unreliable and a significant strain on nodes' bandwidth. It also runs a number of background processes that can sometimes cause nodes to hang or crash unexpectedly.
As a result, this subnet instead asks the first validator that has a copy of the latest model state to upload it to a shared HuggingFace repo. We effectively sacrifice a degree of decentralisation (for now) in return for network stability. This is an interim solution; once the network is stable, the aim is to replace it with a more robust decentralised option.
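A hedged sketch of this interim recovery path is shown below; the repository name and filename are placeholders rather than the subnet's actual repo, and the upload assumes the validator is authenticated with write access.

```python
# Interim recovery path: a validator uploads the latest model state to a shared
# HuggingFace repo, and an out-of-sync miner downloads it instead of asking peers.
import torch
from huggingface_hub import HfApi, hf_hub_download

REPO_ID = "example-org/global-model"          # assumed shared repo

# Validator side: the first validator holding the latest global state uploads it.
model = torch.nn.Linear(512, 512)             # placeholder for the real model
torch.save(model.state_dict(), "global_model.pt")
HfApi().upload_file(
    path_or_fileobj="global_model.pt",
    path_in_repo="global_model.pt",
    repo_id=REPO_ID,
)

# Out-of-sync miner side: pull the latest state and resume training from it.
local_path = hf_hub_download(repo_id=REPO_ID, filename="global_model.pt")
model.load_state_dict(torch.load(local_path))
```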
Local Samples Accumulated & Local Epoch:
Each neuron keeps track of its own local epoch and local samples accumulated.
Global Samples Accumulated:
Each validator logs on WandB the number of samples it has verified for each miner over time. The sum of every validator's verified-sample count is the Global Samples Accumulated.
Global Epoch:
The first validator to run a successful all-reduce with its pooled miners increments the Global Epoch on WandB by 1 and uploads the new model weights to HuggingFace.
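A hedged sketch of how these global counters could be derived from the validator logs on WandB is shown below; the project path and metric keys are assumed placeholders rather than the subnet's actual names.

```python
# Derive the global counters from the per-validator runs logged on WandB.
import wandb

api = wandb.Api()
validator_runs = api.runs("example-entity/distributed-training")   # one run per validator

# Global Samples Accumulated = sum of every validator's verified-sample count.
global_samples = sum(run.summary.get("local_samples_accumulated", 0) for run in validator_runs)

# Global Epoch = highest epoch any validator has reached after a successful all-reduce.
global_epoch = max((run.summary.get("local_epoch", 0) for run in validator_runs), default=0)

print(global_samples, global_epoch)
```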
Karim Foda
Mikkel Loose