LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection
TL;DR Highlight
LOCUS tackles optimal model selection among hundreds of LLMs by embedding each model as a 128-dimensional vector.
Who Should Read
MLOps and backend engineers who want to route queries automatically to the best-matching model, or assemble cost-efficient model combinations. Also useful for teams building model-selection automation or LLM routers.
Core Mechanics
- Compresses each LLM's per-query correctness (0/1 scores) through an attention layer into a single 128-dimensional 'model embedding' vector — works with just API responses, no model weight access needed
- No retraining for new models — evaluate just ~128 queries and the model is immediately placed in the existing embedding space (training-free onboarding)
- Embedding distance actually reflects capability differences — Pearson correlation of 0.887 (Euclidean distance) with answer disagreement rate
- A portfolio of just 15-20 models can reproduce the routing accuracy of the full 112-model pool — maximizing embedding space coverage via k-center/k-medoids
- Embedding distance supports fallback routing — when the selected model is down, substitute with the nearest embedding neighbor, retaining 85% of original routing accuracy
- Model fingerprinting application — identical models hidden behind different APIs always converge to the same embedding position when independently sampled
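The attention-based compression in the first bullet can be illustrated with a minimal NumPy sketch. This is not the paper's architecture — the probe vector, value projection, and toy sizes here are assumptions for illustration — but it shows the key mechanism: a deterministic forward pass pools (query encoding, 0/1 score) pairs into one fixed-size model embedding, so any number of evaluations yields the same-shaped vector with no retraining.

```python
import numpy as np

rng = np.random.default_rng(0)
d_query, d_embed, n_evals = 16, 8, 128  # toy sizes; the paper uses 128-dim embeddings

# Per-evaluation features: a query encoding concatenated with its 0/1 correctness score
query_encs = rng.normal(size=(n_evals, d_query))
scores = rng.integers(0, 2, size=(n_evals, 1)).astype(float)
features = np.concatenate([query_encs, scores], axis=1)  # (n_evals, d_query + 1)

# Hypothetical learned parameters: a probe vector attends over evaluations,
# and a value projection maps the pooled features into the embedding space
probe = rng.normal(size=(d_query + 1,))
W_value = rng.normal(size=(d_query + 1, d_embed))

attn_logits = features @ probe                 # (n_evals,) attention scores
attn = np.exp(attn_logits - attn_logits.max())
attn /= attn.sum()                             # softmax weights over evaluations
model_embedding = (attn[:, None] * features).sum(axis=0) @ W_value  # (d_embed,)

# The output size is independent of n_evals, which is what makes
# training-free onboarding of new models possible
print(model_embedding.shape)  # (8,)
```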
Evidence
- Routing accuracy: LOCUS 64.70% vs EmbedLLM 59.60% vs IRT-Net 63.37% at 1024 samples — comparable embeddings with up to 4.8x fewer evaluation samples than baselines
- Embedding distance ↔ answer disagreement rate: Pearson 0.887 (Euclidean), Spearman 0.876 — quantitative verification that the space is geometrically meaningful
- New model embedding from 128 queries: accuracy difference less than 1% compared to encoder trained on full data
- From 112 models (total 1930B params), ~150B parameter budget (8% of total) approaches full-pool routing accuracy
How to Apply
- For LLM router construction: evaluate each model on 128-256 benchmark queries → generate embeddings via LOCUS encoder → feed new query encoding and each model embedding to correctness predictor → route to highest-probability model
- For model portfolio reduction: apply k-center or k-medoids clustering on all model embeddings → select 15-20 models covering the embedding space → remove remaining redundant models while maintaining routing performance
- For failover design: pre-cache top-k neighbor lists based on embedding distance → auto-switch to neighbor #1 when primary model is unavailable (85% accuracy retention)
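The failover design above can be sketched as a pre-computed neighbor cache plus a fallback lookup. This is a conceptual sketch with made-up embeddings; `route_with_fallback` and `neighbor_cache` are illustrative names, not part of the LOCUS release.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical pool: model name -> 128-dim LOCUS model embedding
model_pool = {f"model_{i}": rng.normal(size=128) for i in range(6)}
names = list(model_pool)
embs = np.stack([model_pool[n] for n in names])

# Pre-cache each model's neighbors, nearest first (Euclidean distance)
dists = np.linalg.norm(embs[:, None, :] - embs[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)  # a model is not its own fallback
neighbor_cache = {n: [names[j] for j in np.argsort(dists[i])]
                  for i, n in enumerate(names)}

def route_with_fallback(primary: str, available: set) -> str:
    """Return the primary model if it is up, else its nearest available neighbor."""
    if primary in available:
        return primary
    for candidate in neighbor_cache[primary]:
        if candidate in available:
            return candidate
    raise RuntimeError("no model available")

# If model_0 is down, traffic shifts to its closest neighbor in embedding space
backup = route_with_fallback("model_0", available=set(names) - {"model_0"})
```

Caching the full sorted neighbor list (rather than only neighbor #1) handles the case where several models go down at once.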
Code Example
# LOCUS usage example (conceptual code; locus_encoder, locus_predictor,
# sample_queries, correctness_labels, and model_pool are placeholders)
from sentence_transformers import SentenceTransformer
# 1. Query encoding
encoder = SentenceTransformer('all-mpnet-base-v2')
query_embedding = encoder.encode("What is the derivative of x^2?")
# 2. Prepare model evaluation data (query embeddings + 0/1 correctness labels)
evaluations = [
    {"query_emb": encoder.encode(q), "score": y}
    for q, y in zip(sample_queries, correctness_labels)  # ~128-256 samples suffice
]
# 3. Generate the model embedding with the LOCUS encoder (no retraining required)
#    see github.com/patel-shivam/locus_code_release for the released interface
model_embedding = locus_encoder.forward(evaluations)  # shape: (128,)
# 4. Predict correctness probability for the new query, then route
scores = {
    name: locus_predictor(emb, query_embedding)
    for name, emb in model_pool.items()
}
best_model = max(scores, key=scores.get)
# 5. Portfolio reduction: select 15 representative models via k-center
#    (greedy farthest-first sampling over the model-embedding matrix)
from sklearn.metrics import pairwise_distances
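Step 5 can be implemented with greedy farthest-first traversal, a standard 2-approximation for the k-center objective. `k_center_select` is an illustrative helper written for this post, not part of the LOCUS release; the 112x128 matrix below is random stand-in data.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def k_center_select(embeddings: np.ndarray, k: int) -> list:
    """Greedily pick k row indices that maximize coverage of the embedding space."""
    dist = pairwise_distances(embeddings)  # (n, n) Euclidean distances
    selected = [0]                         # arbitrary seed point
    min_dist = dist[0].copy()              # distance from each point to the selected set
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))     # farthest point from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected

rng = np.random.default_rng(0)
model_embeddings = rng.normal(size=(112, 128))  # e.g. 112 models x 128-dim embeddings
portfolio = k_center_select(model_embeddings, k=15)
```

Replacing the greedy traversal with scikit-learn's k-medoids (via `sklearn_extra`) trades the coverage guarantee for more centrally located representatives.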
Original Abstract
The rapidly growing ecosystem of Large Language Models (LLMs) makes it increasingly challenging to manage and utilize the vast and dynamic pool of models effectively. We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model's capabilities across queries. LOCUS is an attention-based approach that generates embeddings by a deterministic forward pass over query encodings and evaluation scores via an encoder model, enabling seamless incorporation of new models to the pool and refinement of existing model embeddings without having to perform any retraining. We additionally train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries. Experiments show that LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce informative and robust embeddings. Moreover, the learned embedding space is geometrically meaningful: proximity reflects model similarity, enabling a range of downstream applications including model comparison and clustering, model portfolio selection, and resilient proxies of unavailable models.