Unified Multimodal Transformers: Improving Vision-Language Models with Knowledge-Guided Attention Mechanisms

Authors

  • Aderinsola Aderinokun, Department of Computer Science, University of Lagos, Nigeria

Abstract

Unified multimodal transformers have revolutionized the field of vision-language models by enabling more robust and efficient integration of visual and textual data. However, there remain challenges in aligning visual and linguistic modalities effectively, particularly in leveraging domain-specific knowledge to enhance model performance. This research paper presents a novel approach that incorporates knowledge-guided attention mechanisms within unified multimodal transformers to improve the fusion of vision and language information. By integrating domain knowledge, our method addresses the limitations of traditional attention mechanisms in capturing complex cross-modal relationships, leading to enhanced performance in tasks such as image captioning, visual question answering, and multimodal sentiment analysis.
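To make the idea of knowledge-guided attention concrete, the sketch below shows one plausible reading of the abstract: a cross-attention block in which a domain-knowledge prior (for example, relatedness scores from a knowledge graph) is injected as an additive bias on the attention logits between text tokens and visual regions. This is a minimal illustrative sketch, not the paper's implementation; the module name, the bias format, and all hyperparameters are assumptions.

```python
# Minimal sketch (illustrative only) of knowledge-guided cross-modal attention.
# Assumption: domain knowledge arrives as a (text-token x visual-region) bias
# matrix that is added to the attention logits before the softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeGuidedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)        # queries from text tokens
        self.kv_proj = nn.Linear(dim, 2 * dim)   # keys/values from visual tokens
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, text_tokens, visual_tokens, knowledge_bias=None):
        # text_tokens:    (B, T, dim)
        # visual_tokens:  (B, V, dim)
        # knowledge_bias: (B, T, V) additive prior over text-region pairs (optional)
        B, T, _ = text_tokens.shape
        V = visual_tokens.shape[1]

        q = self.q_proj(text_tokens).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(visual_tokens).chunk(2, dim=-1)
        k = k.view(B, V, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, V, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention logits: (B, heads, T, V)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        if knowledge_bias is not None:
            # Inject the domain-knowledge prior, shared across attention heads.
            logits = logits + knowledge_bias.unsqueeze(1)

        attn = F.softmax(logits, dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(fused)


if __name__ == "__main__":
    # Toy usage: 4 text tokens attending over 9 image regions.
    block = KnowledgeGuidedCrossAttention(dim=64, num_heads=4)
    text = torch.randn(2, 4, 64)
    vision = torch.randn(2, 9, 64)
    bias = torch.zeros(2, 4, 9)  # hypothetical knowledge-graph relatedness scores
    out = block(text, vision, knowledge_bias=bias)
    print(out.shape)  # torch.Size([2, 4, 64])
```

In this reading, the additive bias steers attention toward text-region pairs that the domain knowledge marks as related, while leaving the rest of the transformer unchanged; other realizations (e.g., gating or mask-based guidance) are equally consistent with the abstract.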

Published

2024-09-08

How to Cite

Aderinokun, A. (2024). Unified Multimodal Transformers: Improving Vision-Language Models with Knowledge-Guided Attention Mechanisms. MZ Journal of Artificial Intelligence, 1(2). Retrieved from http://mzjournal.com/index.php/MZJAI/article/view/272