Interpretable Multimodal Transformers: Bridging the Gap Between Visual and Textual Representations

Authors

  • Aderinsola Aderinokun, Department of Computer Science, University of Lagos, Nigeria

Abstract

Multimodal data is increasingly common in applications such as image captioning, visual question answering, and multimedia content analysis, which makes interpretable models that bridge visual and textual representations a critical need. This paper presents a comprehensive review of interpretable multimodal transformers, covering their architectures, mechanisms, and the challenges of integrating visual and textual information. We discuss recent advances in the field, evaluate existing methods, and propose directions for future research to improve the interpretability and effectiveness of multimodal transformers.

Published

2024-07-15