DiLoCo vs DeMo

As models scale to billions of parameters, communication overhead becomes the key barrier to scaling distributed training systems. This post explores two groundbreaking solutions, DiLoCo and DeMo, that challenge the need for full synchronization and drastically reduce bandwidth demands by up to 1000x. We also share early results from experiments combining both techniques.