Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent
This work explains how transformers may leverage softmax attention to perform in-context classification via kernel gradient descent, offering a theoretical...
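To make the claimed correspondence concrete, here is a minimal sketch (not the paper's exact construction) of a single step of kernel gradient descent on in-context examples, using a softmax (normalized exponential dot-product) kernel. Starting from the zero function and taking one functional gradient step on squared loss, the query's prediction becomes a kernel-weighted sum of the context labels, which has the same form as softmax attention with labels as values. The function name and learning rate `eta` are illustrative:

```python
import numpy as np

def kernel_gd_prediction(X, y, x_query, eta=1.0):
    """One step of kernel (functional) gradient descent from f = 0.

    On squared loss over in-context examples (X, y), the update is
        f(x) <- f(x) + eta * sum_i k(x, x_i) * (y_i - f(x_i)),
    which, with f = 0, reduces to eta * sum_i k(x, x_i) * y_i.
    Here k is a softmax kernel, so the result matches a softmax-attention
    readout with the labels y_i playing the role of values.
    """
    scores = X @ x_query                      # dot-product similarities
    weights = np.exp(scores - scores.max())   # numerically stable exponentials
    weights /= weights.sum()                  # normalize: softmax attention weights
    return eta * weights @ y                  # kernel-weighted sum of context labels

# A query near the first context example inherits (mostly) its label.
X = np.array([[10.0, 0.0], [0.0, 10.0]])
y = np.array([1.0, -1.0])
pred = kernel_gd_prediction(X, y, np.array([1.0, 0.0]))
```

Because the weights are normalized exponentials of dot products, the query's prediction concentrates on the labels of the most similar context points, which is the intuition behind reading softmax attention as one step of kernel gradient descent.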