Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

This research evaluates how rendering long text as visual input affects token efficiency in decoder-based multimodal LLMs, achieving up to 48% reduction in ...

Level: advanced

By Yanhong Li, Zixuan Lan, Jiawei Zhou

Category: research