Abstract: Vision-Language Models (VLMs) demand substantial computational resources during inference, largely because of the large number of visual input tokens needed to represent visual information. Previous ...