Efficient Long-context Language Model Training by Core Attention Disaggregation

Explore Core Attention Disaggregation (CAD), a novel architecture that decouples attention operations to enable high-throughput, long-context training on dedicat...

Level: advanced

By Yonghao Zhuang and 8 other authors

Category: research