ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

ACCO introduces a novel approach to distributed sharded LLM training by synchronizing delayed gradients to minimize communication overhead and GPU idle time,...

Level: advanced

By Unknown

Category: research