Reference:

Jain, J., Yang, J., & Shi, H. (2023). VCoder: Versatile Vision Encoders for Multimodal Large Language Models. arXiv. https://doi.org/10.48550/arxiv.2312.14233

 

Article:

  • Global issue
    • Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks
      • from visual question-answering and image captioning to visual reasoning and image generation
Multimodal Large Language Models are language models capable of understanding and generating images in addition to text. By learning the interactions between text and images, these models can perform more complex and diverse tasks.

What is Image Captioning?
The task of generating text that describes the content of a given image (a combination of computer vision and natural language processing)
  • Focused issue
    • When prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail

GPT-4V described the image well, but gave a wrong answer on the counting task.
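A minimal sketch of reproducing this kind of counting prompt with an open MLLM through Hugging Face transformers; the checkpoint name, prompt template, and image URL are illustrative assumptions, not taken from the paper:

```python
# Sketch: asking an off-the-shelf MLLM (LLaVA-1.5) a counting question.
# Model id, prompt template, and image URL are illustrative assumptions.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any test image
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nCount the number of each type of object in this image. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
# The generated answer often miscounts or hallucinates objects,
# which is exactly the perception failure the paper targets.
print(processor.decode(output_ids[0], skip_special_tokens=True))
```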

  • Key method: 
    1. Feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities (i.e., a multimodal setup where the image is the input and text is the output)
    2. Use the images from COCO and outputs from off-the-shelf (existing) vision perception models to create the COCO Segmentation Text (COST) dataset for training and evaluating MLLMs (see the sketch after this list)
      • COST dataset composition:
        1. Images from COCO
        2. Questions from GPT-4
        3. Segmentation outputs from OneFormer
        4. For the object order perception task:
          • Incorporating depth map output from DINOv2 DPT
    3. Introduce metrics to assess the object perception abilities in MLLMs on COST dataset
      • CS, HS, and DS (explained below)
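A rough sketch of how one COST-style training sample could be assembled. The function and field names are hypothetical; in the paper, the images come from COCO, the question templates are generated with GPT-4, and the ground-truth object lists come from OneFormer segmentation outputs.

```python
# Hypothetical sketch of assembling one COST-style sample:
# a COCO image, a question, and a ground-truth answer derived from
# an off-the-shelf segmentation model's output (e.g., OneFormer).
from collections import Counter

def build_cost_sample(image_id, question, segmentation_objects):
    """segmentation_objects: category names detected in the segmentation map."""
    counts = Counter(segmentation_objects)
    # Ground-truth sentence enumerating the perceived objects and their counts.
    answer = ", ".join(f"{n} {cat}{'s' if n > 1 else ''}" for cat, n in sorted(counts.items()))
    return {
        "image_id": image_id,   # COCO image id
        "question": question,   # question template (generated with GPT-4 in the paper)
        "answer": f"The objects present in the image are: {answer}.",
    }

sample = build_cost_sample(
    image_id=39769,
    question="Can you count the objects present in the image?",
    segmentation_objects=["cat", "cat", "remote", "remote", "couch"],
)
print(sample["answer"])  # The objects present in the image are: 2 cats, 1 couch, 2 remotes.
```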
  • Versatile vision enCoder (VCoder)
    • VCoder is added on top of the LLaVA-1.5 model (the baseline model)
    • DepthCoder / SegCoder: deliver additional perception information (perception modalities) about the image
      • Use depth maps and segmentation maps
    • Frozen parameters: ImCoder, MLP, and LLM:
      • The parameters of the LLaVA-1.5 components (ImCoder, MLP, and LLM) are frozen, so the original reasoning performance is preserved

Applying VCoder to Multimodal LLMs
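A simplified PyTorch-style sketch of the adapter idea above: the original LLaVA-1.5 components (ImCoder, MLP projector, LLM) stay frozen, while lightweight SegCoder/DepthCoder projections map the extra perception modalities into the LLM's embedding space. Module names and dimensions here are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of the VCoder idea: frozen LLaVA-1.5 backbone,
# trainable adapters for the extra perception modalities.
import torch
import torch.nn as nn

class PerceptionAdapter(nn.Module):
    """Projects features of a perception modality (seg/depth map) into LLM token space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):        # feats: (batch, num_patches, vision_dim)
        return self.proj(feats)      # (batch, num_patches, llm_dim)

# Frozen components of the baseline LLaVA-1.5 (placeholders here).
im_coder = nn.Identity()   # stands in for the frozen CLIP image encoder (ImCoder)
llm = nn.Identity()        # stands in for the frozen language model
for p in list(im_coder.parameters()) + list(llm.parameters()):
    p.requires_grad = False

# Trainable adapters for the added control inputs.
seg_coder = PerceptionAdapter()     # segmentation map features -> LLM embedding space
depth_coder = PerceptionAdapter()   # depth map features -> LLM embedding space

# Token streams are concatenated before being fed to the frozen LLM.
rgb_tokens = torch.randn(1, 576, 4096)             # from ImCoder + frozen MLP projector
seg_tokens = seg_coder(torch.randn(1, 576, 1024))
depth_tokens = depth_coder(torch.randn(1, 576, 1024))
llm_input = torch.cat([seg_tokens, depth_tokens, rgb_tokens], dim=1)
```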

  • Evaluation Metrics for Object Identification:
    • Compare the object counts in the ground truth and prediction to calculate a count score (CS) and a hallucination score (HS)
      • CS:
        • The percentage of correct object counts predicted by the MLLM with respect to the ground-truth sentence
        • The higher the CS, the better
      • HS:
        • The percentage of extra object counts predicted by the MLLM that do not exist in the ground-truth sentence
        • The lower the HS, the better
    • Depth Score (DS):
      • Obtain the (depth-ordered) position of objects in every category, then compute the absolute difference between the position values of objects of the same category in the ground truth and the prediction
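A minimal sketch of how these scores could be computed from per-category counts and positions. The function names are hypothetical, and the paper's exact formulation (synonym matching, averaging over images) may differ.

```python
# Hedged sketch of count score (CS), hallucination score (HS), and depth score (DS).
def count_and_hallucination_scores(gt_counts, pred_counts):
    """gt_counts / pred_counts: dicts mapping category name -> object count."""
    total_gt = sum(gt_counts.values())
    correct = sum(min(pred_counts.get(cat, 0), n) for cat, n in gt_counts.items())
    extra = sum(n for cat, n in pred_counts.items() if cat not in gt_counts)
    cs = 100.0 * correct / total_gt                          # higher is better
    hs = 100.0 * extra / max(sum(pred_counts.values()), 1)   # lower is better
    return cs, hs

def depth_score(gt_positions, pred_positions):
    """positions: dicts mapping category -> depth-ordered position value."""
    shared = set(gt_positions) & set(pred_positions)
    return sum(abs(gt_positions[c] - pred_positions[c]) for c in shared) / max(len(shared), 1)

cs, hs = count_and_hallucination_scores(
    gt_counts={"person": 3, "dog": 1},
    pred_counts={"person": 2, "dog": 1, "frisbee": 1},  # "frisbee" is hallucinated
)
print(f"CS={cs:.1f}%, HS={hs:.1f}%")  # CS=75.0%, HS=25.0%
```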

  • Limitation:
    1. Built the COST dataset using OneFormer, which can only perceive objects belonging to a limited number of categories because it was trained on a closed-set vocabulary dataset (=> need more classes)
    2. One-to-one word matching requires manually defining a mapping between synonymous words (=> need to explore ways to move beyond manually defined synonym mappings)
    3. Inaccuracy in the segmentation map may cause the VCoder to fail (=> need to explore ways to reduce the over-dependency on control inputs so that inaccurate context from the perception modalities can be handled)
  • Trends:
    • With the arrival of powerful large language models such as ChatGPT, research attention has concentrated on multimodality. Image description has advanced so far that, as in this paper, mostly incremental room for improvement remains.
    • What new research topic could I focus on? Instead of supplying control inputs for the perception modalities (supervised learning), wouldn't a process where the model generates them itself (unsupervised learning) be needed?

 
