Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum
Explore variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, designed to accelerate LLM pretraining through orthogonal momentum updates an...