Unsolvable Problem Detection - Evaluating Trustworthiness of Vision Language Models
About the paper
- Project page: https://github.com/AtsuMiyai/UPD/
Motivation:
- With the development of powerful foundation Vision-Language Models (VLMs) such as LLaVA-1.5, we can now solve visual question answering (VQA) quite well by simply plugging in these foundation VLMs as zero-shot learners (i.e., without fine-tuning on the VQA task).
- However, similar to hallucination in LLMs, where the model confidently provides false answers, VLMs also face a hallucination problem: they always pick an answer from the given answer set, even when the question is unsolvable (a very important point: unsolvable with respect to the given answer set).
- To systematically benchmark the problem, the authors proposed three new challenges/types of unsolvable problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD) (see the construction sketch after this list).
- Note that these problems are not open-ended but closed, i.e., the answer is limited to a given answer set. Moreover, the answer set designed by the authors does not completely cover the answer space, i.e., it lacks answers like “None of the above” or “I don’t know,” making the problem unsolvable with respect to the given answer set.
- On these unsolvable problems, VLMs like GPT-4V tend to give hallucinated answers, which lets the authors evaluate the trustworthiness of these models.
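To make the three settings concrete, here is a minimal sketch of how one could derive them from a standard multiple-choice VQA sample (the paper builds its benchmark on top of MMBench). The field names and helper functions below are hypothetical illustrations, not the authors' actual construction code.

```python
# Hypothetical MMBench-style sample: an image, a question, lettered options,
# and the ground-truth option letter. Field names are illustrative only.
standard_sample = {
    "image": "traffic_light.jpg",
    "question": "What color is the traffic light?",
    "options": {"A": "Red", "B": "Green", "C": "Yellow", "D": "Blue"},
    "answer": "B",
}

def make_aad(sample):
    """Absent Answer Detection: drop the correct option, so no option is right."""
    s = dict(sample)
    s["options"] = {k: v for k, v in sample["options"].items() if k != sample["answer"]}
    s["answer"] = None  # unsolvable with respect to the remaining answer set
    return s

def make_iasd(sample, unrelated_options):
    """Incompatible Answer Set Detection: swap in an answer set unrelated to the question."""
    s = dict(sample)
    s["options"] = unrelated_options  # e.g., options taken from a different question
    s["answer"] = None
    return s

def make_ivqd(sample, unrelated_image):
    """Incompatible Visual Question Detection: swap in an image unrelated to the question."""
    s = dict(sample)
    s["image"] = unrelated_image
    s["answer"] = None
    return s

aad = make_aad(standard_sample)
iasd = make_iasd(standard_sample, {"A": "Dog", "B": "Cat", "C": "Horse", "D": "Bird"})
ivqd = make_ivqd(standard_sample, "cat_on_sofa.jpg")
```

In all three cases the question remains a closed multiple-choice problem; it only becomes unsolvable relative to the options (or image) it is paired with.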
Introduction
Hallucination in LLMs is a well-known problem in which the model generates information that is coherent but factually incorrect, or disconnected from the question.
Example 1: Question: “How many letters ‘I’ are in the word ‘Apple’?” -> LLM: “There are 3 letters ‘I’ in the word ‘Apple’.” (factually incorrect: there is no “I” in “Apple”)
Example 2: Question: “What is the color of the sky?” -> LLM: “The ocean is blue.” (disconnected from the question)
Visual Question Answering (VQA) is a challenging task in which a model generates natural language answers to questions about a given image. The question is usually “open-ended,” which means the answer is not limited to a fixed set of answers. Therefore, the model needs to understand both visual information from the image and textual information from the question.
Comparing VQA to the Question Answering (QA) task in NLP, VQA is more challenging because the model needs to understand both visual and textual information. However, it is less challenging in terms of the search space for the answer: in VQA, the ground truth is constrained by the visual content of the given image, whereas in QA, the answer can be any piece of world knowledge.
Given this background, we can see that the three types of problems proposed by the authors are not “open-ended” but closed, i.e., the answer is limited to a given answer set. Moreover, the answer set designed by the authors does not completely cover the answer space (no “None of the above” or “I don’t know”), which is exactly what makes the problem unsolvable with respect to the given answer set.
Fortunately, the authors are aware of this limitation and discuss training-free solutions for it: adding an extra option (e.g., “None of the above”) to the answer set, or adding an instruction to withhold an answer when no option fits (e.g., “If all the options are incorrect, answer F. None of the above” or “If the given image is irrelevant to the question, answer F. The image and question are irrelevant”). A prompt-construction sketch is given below.
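Both training-free mitigations boil down to edits of the prompt. The sketch below shows how such prompts could be assembled; `build_prompt` and its wording are my own illustration based on the description above, not the paper’s exact prompt templates.

```python
def build_prompt(question, options, mode="base"):
    """Assemble a multiple-choice VQA prompt.

    mode = "base":            the plain setting (pick a letter directly)
    mode = "add_option":      training-free mitigation 1 - append "None of the above"
    mode = "add_instruction": training-free mitigation 2 - tell the model it may refuse

    The wording is paraphrased from the post; the paper's exact prompts may differ.
    """
    opts = dict(options)
    escape_letter = chr(ord(max(opts)) + 1)  # next unused option letter

    if mode == "add_option":
        opts[escape_letter] = "None of the above"

    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in opts.items()]
    lines.append("Answer with the option's letter from the given choices directly.")

    if mode == "add_instruction":
        lines.append(f"If all the options are incorrect, answer {escape_letter}. None of the above.")

    return "\n".join(lines)


print(build_prompt(
    "What color is the traffic light?",
    {"A": "Red", "B": "Green", "C": "Yellow"},
    mode="add_option",
))
```

The key difference is that “add option” changes the answer set itself, while “add instruction” keeps the original options and only tells the model it is allowed to refuse.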
Benchmarking
Three types of accuracy:
- Standard Accuracy: The accuracy on standard VQA task where the correct answer is in the answer set.
- Unsolvable Accuracy: The accuracy on the proposed unsolvable problems, where the correct answer is not in the answer set. The model should not pick any of the given options; with the training-free approaches, an extra option (e.g., “None of the above”) is added to the answer set, and the model should choose it.
- Dual Accuracy: The accuracy on standard-UPD pairs, where a pair counts as a success only if the model is correct on both the standard question and the UPD question (i.e., if the model cannot answer the standard question correctly, we do not need to evaluate the UPD question). A minimal metric sketch follows this list.
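For concreteness, here is a minimal sketch of how these three numbers relate, assuming each standard question is paired with one UPD question and per-question correctness has already been judged. This mirrors the description above, not the authors’ evaluation code.

```python
def upd_metrics(pairs):
    """Compute Standard, Unsolvable (UPD), and Dual Accuracy over standard-UPD pairs.
    Each pair is (standard_correct, upd_correct), both booleans."""
    n = len(pairs)
    standard_acc = sum(s for s, _ in pairs) / n
    upd_acc = sum(u for _, u in pairs) / n
    dual_acc = sum(s and u for s, u in pairs) / n  # correct on both questions
    return standard_acc, upd_acc, dual_acc

# Toy example: the model answers the standard question right every time
# but detects the unsolvable counterpart only once out of three.
print(upd_metrics([(True, True), (True, False), (True, False)]))
# -> (1.0, 0.333..., 0.333...)
```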
Some interesting observations (IMO):
- The original Standard Accuracy is high, showing that the VLMs can solve the standard VQA task well.
- Dual Accuracy (Base) is low, showing that the VLMs are not good at detecting unsolvable problems. It is also worth noting that the authors used the prompt “Answer with the option’s letter from the given choices directly.” to explicitly tell the model to choose an answer from the given answer set, so it is not surprising that the Dual Accuracy is low.
- The Dual Accuracy (Base) of LLaVA and GPT-4V is not that bad, showing that these models can detect unsolvable problems to some extent without any additional approaches/aids. Given that the models are explicitly asked to choose an answer from the given answer set, it is quite interesting that GPT-4V can ignore this instruction and respond correctly with an answer outside the answer set.
- Adding an instruction helps the model detect unsolvable problems better; however, it reduces the model’s performance on the standard VQA task (Section 5.3).