Interpretable Multimodal Transformers: Bridging the Gap Between Visual and Textual Representations
Abstract
With the increasing prevalence of multimodal data in applications such as image captioning, visual question answering, and multimedia content analysis, the need for interpretable models that effectively bridge visual and textual representations has become critical. This paper presents a comprehensive review of interpretable multimodal transformers, examining their architectures, attention mechanisms, and the challenges of integrating visual and textual information. We discuss recent advancements in the field, evaluate existing methods, and propose directions for future research to enhance both the interpretability and the effectiveness of multimodal transformers.
License
Copyright (c) 2024 MZ Computing Journal
This work is licensed under a Creative Commons Attribution 4.0 International License.