Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
TL;DR Highlight
MM-OOD: a framework that adds image+text multimodal reasoning to text-only OOD detection, improving zero-shot detection of anomalous samples on top of CLIP
Who Should Read
ML engineers who need to detect out-of-distribution (OOD) samples in production CV systems. Particularly useful for building zero-shot OOD detection pipelines that require no additional training.
Core Mechanics
- The existing EOE baseline uses text-only LLM reasoning to imagine outlier classes; MM-OOD instead leverages multimodal LLMs (e.g., LLaVA) with image+text understanding to generate more diverse outliers
- For Near OOD (visually similar classes like dog vs wolf), feeds actual ID images directly into MLLM to extract similar outlier class labels
- For Far OOD (semantically different classes like food vs cars), introduces sketch-generate-elaborate 3-stage framework: text outlier sketch → Stable Diffusion v1.5 OOD image generation → feed generated images back to MLLM for final label refinement
- Directly feeding ID images to the MLLM introduces an inductive bias: the MLLM explores only the neighborhood of the ID space. For far OOD this is circumvented by feeding generated OOD images instead
- Uses GPT-4 to first generate broad categories, then LLaVA-1.5-7B or Qwen2-VL to propose outlier class labels, and CLIP text encoder for ID/OOD classification scoring
- When the Near/Far OOD distinction is uncertain in practice, mix the outlier labels from both branches at a 0.5:0.5 ratio
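The 0.5:0.5 mixing in the last bullet can be sketched as a simple label-pool merge. This is an illustrative helper under assumptions, not the paper's exact procedure; the function name and sampling strategy are hypothetical:

```python
import random

def mix_outlier_labels(near_labels, far_labels, ratio=0.5, total=None, seed=0):
    """Combine outlier class labels from the Near-OOD and Far-OOD branches.

    Draws `ratio` of the final pool from the near branch and the rest from
    the far branch. Illustrative sketch only: the paper specifies a 0.5
    mixing ratio but not this exact sampling mechanism.
    """
    rng = random.Random(seed)
    if total is None:
        total = min(len(near_labels), len(far_labels)) * 2
    n_near = int(total * ratio)
    n_far = total - n_near
    near = rng.sample(near_labels, min(n_near, len(near_labels)))
    far = rng.sample(far_labels, min(n_far, len(far_labels)))
    return near + far
```

The merged pool is then encoded with the CLIP text encoder alongside the ID labels, exactly as in the single-branch case.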
Evidence
- Near OOD: ImageNet-10 FPR95 3.84% (vs. EOE 7.01%, a 3.17 pp improvement; vs. Energy 13.81%, a 9.97 pp improvement)
- Far OOD average (L=12×K): FPR95 4.33%, AUROC 99.56% — outperforming all baselines, including EOE, MaxLogit, Energy, and MCM
- Food-101 dataset with LLaVA: average FPR95 1.12% (vs. EOE 2.22%, roughly a 50% relative reduction)
- LLaVA-1.5 average: FPR95 1.94%, AUROC 99.62% vs. EOE's 3.13%, 99.35% (a consistent advantage across multiple settings of the primary-category count M)
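FPR95 in the numbers above is the false-positive rate on OOD samples at the threshold where 95% of ID samples are still accepted. A minimal sketch of that metric, assuming the paper's convention that a higher score means more ID-like:

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False-positive rate on OOD data at 95% true-positive rate on ID data.

    Convention: higher score = more ID-like. The threshold is the
    5th-percentile ID score, so 95% of ID samples score at or above it.
    """
    s = sorted(id_scores)
    thresh = s[int(0.05 * len(s))]  # accept the top 95% of ID scores
    # Fraction of OOD samples mistakenly accepted as ID.
    return sum(o >= thresh for o in ood_scores) / len(ood_scores)
```

Lower is better: a perfect detector pushes every OOD score below the 95%-TPR threshold, giving FPR95 of 0.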
How to Apply
- To add OOD detection to CLIP-based image classifiers: generate broad categories via GPT-4 → feed ID images + text prompts to LLaVA for outlier class labels → encode both ID/outlier labels with CLIP text encoder → compute detection score as S(x) = max_ID_score - 0.25 * max_OOD_score
- For systems where far OOD matters (medical imaging, autonomous driving): apply sketch-generate-elaborate pattern: (1) sketch text outlier labels with LLM (2) generate corresponding images with Stable Diffusion (3) feed generated images+text to MLLM for final outlier label refinement — both LLaVA and Qwen2-VL work
- When Near/Far OOD type is unknown in advance, generate outlier labels from both branches and mix 0.5:0.5 for the CLIP classifier — works without separate configuration
Code Example
# Example MLLM prompt for Near OOD detection (based on paper Appendix A)
prompt_template = """
Q: Given the image category [{id_class}] and this image,
please suggest visually similar categories that are not directly
related or belong to the same primary group as [{id_class}].
Provide suggestions that share visual characteristics but are
from broader and different domains than [{id_class}].
A: There are {num_outliers} classes similar to [{id_class}],
and they are from broader and different domains than [{id_class}]:
"""
# Far OOD: sketch-generate-elaborate pipeline
def sketch_generate_elaborate(id_labels, mllm, diffusion_model):
    # 1. Sketch: draft outlier classes using text-only prompts
    sketch_labels = mllm(prompt_sketch(id_labels))
    # 2. Generate: pick representative outlier labels and render OOD images
    representative = mllm(prompt_select_representative(sketch_labels))
    ood_image = diffusion_model.generate(representative)
    # 3. Elaborate: feed the generated image back into the MLLM to refine labels
    final_labels = mllm(prompt_elaborate(id_labels, ood_image))
    return final_labels
# CLIP-based OOD detection score computation
import torch
import torch.nn.functional as F
def compute_ood_score(image_feat, text_feats_id, text_feats_ood, beta=0.25):
    all_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    logits = F.cosine_similarity(image_feat.unsqueeze(0), all_feats)
    softmax_all = F.softmax(logits, dim=0)
    K = len(text_feats_id)
    id_score = softmax_all[:K].max()
    ood_score = softmax_all[K:].max()
    return id_score - beta * ood_score  # higher value indicates ID
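A quick smoke test of the score above with random unit-normalized features. The dimensions are hypothetical (512 matches CLIP ViT-B/32), and the function is repeated so the snippet runs standalone:

```python
import torch
import torch.nn.functional as F

def compute_ood_score(image_feat, text_feats_id, text_feats_ood, beta=0.25):
    # Same scoring rule as above: max ID softmax prob minus beta times
    # max outlier softmax prob, over cosine similarities to all labels.
    all_feats = torch.cat([text_feats_id, text_feats_ood], dim=0)
    logits = F.cosine_similarity(image_feat.unsqueeze(0), all_feats)
    probs = F.softmax(logits, dim=0)
    K = text_feats_id.shape[0]
    return probs[:K].max() - beta * probs[K:].max()

torch.manual_seed(0)
d = 512                                              # assumed embedding size
text_id = F.normalize(torch.randn(10, d), dim=-1)    # 10 ID class embeddings
text_ood = F.normalize(torch.randn(20, d), dim=-1)   # 20 outlier embeddings

# An image feature aligned with ID class 0 should score higher than noise.
id_like = compute_ood_score(text_id[0], text_id, text_ood)
random_img = compute_ood_score(F.normalize(torch.randn(d), dim=-1),
                               text_id, text_ood)
```

In a real pipeline the embeddings come from CLIP's image and text encoders, and a threshold on the score (e.g., calibrated to 95% TPR on held-out ID data) makes the ID/OOD decision.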
Original Abstract
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.