Vision-Language Model (VLM)
A model that processes both images and text, enabling tasks like image captioning, visual question answering, and document understanding. VLMs extend language models with visual perception, typically by pairing a vision encoder with the language model and projecting image features into the same representation space as text tokens, so the model can attend over both modalities in a single sequence.
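The projection idea above can be sketched with NumPy. This is a minimal illustration under assumed shapes and names (the stand-in `encode_image`, the dimensions, and `W_proj` are all hypothetical, not any real VLM's API): a vision encoder yields one embedding per image patch, a linear map bridges it into the language model's embedding space, and the resulting image "tokens" are concatenated with text-token embeddings into one sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VISION = 64   # vision-encoder output dimension (assumed)
D_TEXT = 128    # language-model embedding dimension (assumed)

def encode_image(image: np.ndarray, n_patches: int = 16) -> np.ndarray:
    """Stand-in vision encoder: returns one embedding per image patch."""
    return rng.standard_normal((n_patches, D_VISION))

def project_to_text_space(patches: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Linear projection bridging the vision and text representation spaces."""
    return patches @ W  # shape: (n_patches, D_TEXT)

# Hypothetical inputs: a dummy image and 8 already-embedded text tokens.
W_proj = rng.standard_normal((D_VISION, D_TEXT))
text_embeddings = rng.standard_normal((8, D_TEXT))

image_tokens = project_to_text_space(encode_image(np.zeros((224, 224, 3))), W_proj)
sequence = np.concatenate([image_tokens, text_embeddings], axis=0)
print(sequence.shape)  # image "tokens" and text tokens now share one space
```

A real VLM would replace `encode_image` with a trained vision transformer and feed `sequence` to the language model's transformer layers; the key point is only that both modalities end up as vectors of the same width.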