This research introduces GRADE-STE, a novel approach to LLM alignment that replaces policy gradients with direct backpropagation through a Gumbel-Softmax relaxation...
Level: advanced
By Lukas Abrie Nel
Category: research