Comprehending visual and textual data simultaneously underlies much of the recent progress in AI systems. A large body of work addresses multi-modal problems such as Visual Question Answering (VQA) [1,2,3,4], Visual Grounding [5,6], and Image Captioning [7,8].
However, these tasks do not require a fine-grained understanding of the data to solve; that is, a model can exploit shortcuts. For instance, VQA models are known to rely on word-only and image-only shortcuts [9]. The Visual Entailment (VE) task [10] tries to alleviate this problem. Given an image-hypothesis pair, the VE task involves determining whether the hypothesis entails, contradicts, or is neutral with respect to the image (ref. Figure 1). Observe that solving the VE task involves propositional logic; this property reduces the shortcuts a model can take.
**Figure 1 - Visual Entailment Dataset**
Zhang et al. [11] showed that soft-attention-based models could not count in VQA. This is because soft attention normalizes the attention scores, distributing them equally among multiple instances of the same object; summing the attention-weighted features then loses the count information (ref. Figure 2). Our analysis confirms this for VE as well: we evaluated the counting ability of OFA (One for All) [2], a state-of-the-art multimodal, multitask pre-trained model, on VE and found that it, too, struggles on hypotheses that require counting.
**Figure 2 - Averaging Effect of Soft Attention**
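To make the averaging effect concrete, here is a toy numeric sketch (our illustration, not code from [11]). Assume every instance of the queried object produces the same feature vector; soft attention over k identical regions assigns each a weight of 1/k, so the attention-weighted sum is the same whether there is one object or five:

```python
import numpy as np

# Identical feature vector for every instance of the queried object.
v = np.array([1.0, 2.0, 3.0])

def attended_feature(num_objects: int) -> np.ndarray:
    """Soft attention over num_objects identical regions."""
    # Softmax over identical logits gives each region a weight of 1/num_objects.
    weights = np.full(num_objects, 1.0 / num_objects)
    # Attention-weighted sum of the (identical) region features.
    return weights @ np.tile(v, (num_objects, 1))

print(attended_feature(1))  # [1. 2. 3.]
print(attended_feature(5))  # [1. 2. 3.] -- identical output: the count is lost
```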
In this project, we investigate the impact of appending a count module to a soft-attention-based model on the VE task. In particular, we design an object detection-based count module (OCM) and compare it against the baseline of Zhang et al. [11]'s count module (ZCM) on the task of Visual Entailment. We show that OCM performs on par with ZCM on the VE dataset.
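For intuition, the sketch below shows one way an object detection-based count feature might be computed: count the detector boxes whose confidence for the queried class clears a threshold, then embed that count as a dense feature to be fused with the image-text representation. The module name, the threshold, and the dimensions are illustrative assumptions, not our exact architecture:

```python
import torch
import torch.nn as nn

class ObjectCountModule(nn.Module):
    """Hypothetical OCM head: turns detector confidences into a count embedding."""

    def __init__(self, max_count: int = 20, count_dim: int = 64):
        super().__init__()
        self.max_count = max_count
        # Embed the (clipped) object count as a dense feature vector.
        self.count_embedding = nn.Embedding(max_count + 1, count_dim)

    def forward(self, det_scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # det_scores: (num_boxes,) detector confidences for the queried class.
        count = (det_scores > threshold).sum().clamp(max=self.max_count).long()
        # The resulting (count_dim,) feature can be concatenated with the
        # model's fused image-text representation before classification.
        return self.count_embedding(count.unsqueeze(0)).squeeze(0)

ocm = ObjectCountModule()
scores = torch.tensor([0.9, 0.8, 0.7, 0.2])  # confidences for, e.g., "dog"
count_feature = ocm(scores)                   # encodes a count of 3
```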
The rest of the report is structured as follows:
In this section, we start by providing a brief description of Zhang et al.'s count module (ZCM). We then contextualize our work in relation to other efforts on the problem of counting in multi-modal data.