Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
TL;DR Highlight
MM-OOD: a framework that adds image+text multimodal reasoning to text-only OOD detection, improving zero-shot detection of anomalous samples on top of CLIP
Who Should Read
ML engineers who need to detect out-of-distribution (OOD) samples in production CV systems. Particularly useful for building zero-shot OOD detection pipelines that require no additional training.
Core Mechanics
- The existing EOE baseline uses text-only LLM reasoning to imagine outlier classes; MM-OOD instead leverages multimodal LLMs (e.g., LLaVA) with image+text understanding to generate more diverse outliers
- For Near OOD (visually similar classes like dog vs wolf), feeds actual ID images directly into MLLM to extract similar outlier class labels
- For Far OOD (semantically different classes like food vs cars), introduces sketch-generate-elaborate 3-stage framework: text outlier sketch → Stable Diffusion v1.5 OOD image generation → feed generated images back to MLLM for final label refinement
- Directly feeding ID images to the MLLM introduces an inductive bias: the MLLM explores only the neighborhood of the ID space. For far OOD this is circumvented by feeding generated OOD images instead
- Uses GPT-4 to first generate broad categories, then LLaVA-1.5-7B or Qwen2-VL to propose outlier class labels, and CLIP text encoder for ID/OOD classification scoring
- When the Near/Far OOD distinction is uncertain in practice, mix the outlier labels from both branches at a 0.5:0.5 ratio
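The 0.5:0.5 mixing in the last bullet can be sketched as a simple label-pool merge. This is an illustrative helper under assumptions, not the paper's exact procedure; the function name and sampling strategy are hypothetical:

```python
import random

def mix_outlier_labels(near_labels, far_labels, ratio=0.5, total=None, seed=0):
    """Combine outlier class labels from the Near-OOD and Far-OOD branches.

    Draws `ratio` of the final pool from the near branch and the rest from
    the far branch. Illustrative sketch only: the paper specifies a 0.5
    mixing ratio but not this exact sampling mechanism.
    """
    rng = random.Random(seed)
    if total is None:
        total = min(len(near_labels), len(far_labels)) * 2
    n_near = int(total * ratio)
    n_far = total - n_near
    near = rng.sample(near_labels, min(n_near, len(near_labels)))
    far = rng.sample(far_labels, min(n_far, len(far_labels)))
    return near + far
```

The merged pool is then encoded with the CLIP text encoder alongside the ID labels, exactly as in the single-branch case.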
Evidence
- Near OOD: ImageNet-10 FPR95 3.84% (vs. EOE 7.01%, a 3.17 pp improvement; vs. Energy 13.81%, a 9.97 pp improvement)
- Far OOD average (L=12×K): FPR95 4.33%, AUROC 99.56% — outperforming all baselines, including EOE, MaxLogit, Energy, and MCM
- Food-101 dataset with LLaVA: average FPR95 1.12% (vs. EOE 2.22%, roughly a 50% relative reduction)
- LLaVA-1.5 average: FPR95 1.94%, AUROC 99.62% vs. EOE's 3.13%, 99.35% (a consistent advantage across multiple settings of the primary-category count M)
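FPR95 in the numbers above is the false-positive rate on OOD samples at the threshold where 95% of ID samples are still accepted. A minimal sketch of that metric, assuming the paper's convention that a higher score means more ID-like:

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False-positive rate on OOD data at 95% true-positive rate on ID data.

    Convention: higher score = more ID-like. The threshold is the
    5th-percentile ID score, so 95% of ID samples score at or above it.
    """
    s = sorted(id_scores)
    thresh = s[int(0.05 * len(s))]  # accept the top 95% of ID scores
    # Fraction of OOD samples mistakenly accepted as ID.
    return sum(o >= thresh for o in ood_scores) / len(ood_scores)
```

Lower is better: a perfect detector pushes every OOD score below the 95%-TPR threshold, giving FPR95 of 0.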
How to Apply
- To add OOD detection to CLIP-based image classifiers: generate broad categories via GPT-4 → feed ID images + text prompts to LLaVA for outlier class labels → encode both ID/outlier labels with CLIP text encoder → compute detection score as S(x) = max_ID_score - 0.25 * max_OOD_score
- For systems where far OOD matters (medical imaging, autonomous driving): apply sketch-generate-elaborate pattern: (1) sketch text outlier labels with LLM (2) generate corresponding images with Stable Diffusion (3) feed generated images+text to MLLM for final outlier label refinement — both LLaVA and Qwen2-VL work
- When Near/Far OOD type is unknown in advance, generate outlier labels from both branches and mix 0.5:0.5 for the CLIP classifier — works without separate configuration
Code Example
# Example MLLM prompt for Near OOD detection (based on paper Appendix A)
prompt_template = """
Q: Given the image category [{id_class}] and this image,
please suggest visually similar categories that are not directly
related or belong to the same primary group as [{id_class}].
Provide suggestions that share visual characteristics but are
from broader and different domains than [{id_class}].
A: There are {num_outliers} classes similar to [{id_class}],
and they are from broader and different domains than [{id_class}]:
"""
# Far OOD: sketch-generate-elaborate pipeline
def sketch_generate_elaborate(id_labels, mllm, diffusion_model):
    # 1. Sketch: draft outlier classes using text-only prompts
    sketch_labels = mllm(prompt_sketch(id_labels))
    # 2. Generate: pick representative outlier labels and render OOD images
    representative = mllm(prompt_select_representative(sketch_labels))
    ood_image = diffusion_model.generate(representative)
    # 3. Elaborate: feed the generated image back into the MLLM to refine labels
    final_labels = mllm(prompt_elaborate(id_labels, ood_image))
    return final_labels
# CLIP-based OOD detection score computation
import torch
import torch.nn.functional as F
def compute_ood_score(image_feat, text_feats_id, text_feats_ood, beta=0.25):
    all_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    logits = F.cosine_similarity(image_feat.unsqueeze(0), all_feats)
    softmax_all = F.softmax(logits, dim=0)
    K = len(text_feats_id)
    id_score = softmax_all[:K].max()
    ood_score = softmax_all[K:].max()
    return id_score - beta * ood_score  # higher value indicates ID
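A quick smoke test of the score above with random unit-normalized features. The dimensions are hypothetical (512 matches CLIP ViT-B/32), and the function is repeated so the snippet runs standalone:

```python
import torch
import torch.nn.functional as F

def compute_ood_score(image_feat, text_feats_id, text_feats_ood, beta=0.25):
    # Same scoring rule as above: max ID softmax prob minus beta times
    # max outlier softmax prob, over cosine similarities to all labels.
    all_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    logits = F.cosine_similarity(image_feat.unsqueeze(0), all_feats)
    probs = F.softmax(logits, dim=0)
    K = text_feats_id.shape[0]
    return probs[:K].max() - beta * probs[K:].max()

torch.manual_seed(0)
d = 512                                              # assumed embedding size
text_id = F.normalize(torch.randn(10, d), dim=-1)    # 10 ID class embeddings
text_ood = F.normalize(torch.randn(20, d), dim=-1)   # 20 outlier embeddings

# An image feature aligned with ID class 0 should score higher than noise.
id_like = compute_ood_score(text_id[0], text_id, text_ood)
random_img = compute_ood_score(F.normalize(torch.randn(d), dim=-1),
                               text_id, text_ood)
```

In a real pipeline the embeddings come from CLIP's image and text encoders, and a threshold on the score (e.g., calibrated to 95% TPR on held-out ID data) makes the ID/OOD decision.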
Original Abstract
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.