GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

This research introduces GRADE-STE, an approach to LLM alignment that replaces policy gradients with direct backpropagation through a Gumbel-Softmax relaxation...

Level: advanced

By Lukas Abrie Nel

Category: research