Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
This research evaluates how rendering long text as visual input affects token efficiency in decoder-based multimodal LLMs, achieving up to a 48% reduction in ...