How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
This research establishes theoretical bounds for transformer optimality in in-context recall, revealing how attention design and parameterization dictate gen...