Explore Core Attention Disaggregation (CAD), a novel architecture that decouples attention operations to enable high-throughput, long-context training on dedicat...
Level: advanced
By Yonghao Zhuang and 8 other authors
Category: research