AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
TL;DR Highlight
MoE+LoRA combinations slow inference by 2.5x — AdaFuse solves this by fusing all layer adapters in a single CUDA kernel call, achieving 2.4x speedup.
Who Should Read
ML engineers operating LLM serving infrastructure who need to reduce inference latency when serving multiple LoRA adapters or MoE-style structures. Especially useful for teams fine-tuning and deploying multi-task/multi-domain models.
Core Mechanics
- Dynamic adapters (structures that activate different LoRA per input) add only 1-5% parameters but increase inference latency by 250-950% — the problem is the number of CUDA kernel calls, not compute volume
- Existing layer-wise/block-wise routing makes routing decisions per layer, creating a structural limitation where adapters can't be pre-fused
- AdaFuse uses token-level pre-gating ("route once at the first layer, apply results to all layers") to statically fix execution paths
- A custom CUDA kernel called SGMM (Segmented Gather Matrix Multiplication) fuses all activated LoRA across all layers into the backbone in just 1 kernel call
- "Fused switching" operations that subtract previous token adapter and add new adapter as tokens change are also handled with 1 SGMM kernel call
- 2.7x faster than the previously fastest dynamic adapter (PESC) on Llama2-7B and Mistral-7B, with only 29% latency increase over the baseline backbone
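The kernel-launch bottleneck above can be made concrete with a back-of-the-envelope count. The per-layer launch breakdown below (one router launch plus two GEMMs per selected adapter) is an illustrative assumption, not a figure from the paper:

```python
# Hypothetical kernel-launch counts per decoded token for a model with
# `num_layers` transformer layers and top-k adapter selection.
def layer_wise_launches(num_layers, top_k):
    # assumed per layer: 1 router launch + 2 GEMM launches (down/up projection)
    # for each of the top_k selected adapters
    return num_layers * (1 + 2 * top_k)

def adafuse_launches():
    # 1 router launch at the first layer + 1 fused SGMM launch covering all layers
    return 2

print(layer_wise_launches(num_layers=32, top_k=2))  # 160
print(adafuse_launches())                           # 2
```

Even with compute volume held roughly constant, collapsing ~160 small launches into 2 is what removes the latency overhead.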
Evidence
- Llama2-7B baseline 2.4ms/token vs AdaFuse 3.1ms/token (+29%), compared to MoRAL 8.6ms (+258%) and MoLA 25.3ms (+954%)
- Domain-specific task average accuracy: Llama2-7B AdaFuse 83.60%, best baseline MoLA 84.20% — competitive
- Mistral-7B domain task average: AdaFuse 87.24% slightly above PESC 87.06% and MoRAL 87.05%
- Without SGMM (Simple merge only) 4.2ms/token → with SGMM 3.1ms/token — 26% additional reduction from kernel optimization alone
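The percentage figures above can be sanity-checked directly from the quoted ms/token numbers:

```python
# Recompute the latency overheads quoted above from the raw ms/token figures.
base = 2.4       # Llama2-7B backbone
adafuse = 3.1
moral = 8.6
mola = 25.3

def overhead_pct(latency, baseline):
    return round((latency - baseline) / baseline * 100)

print(overhead_pct(adafuse, base))  # 29
print(overhead_pct(moral, base))    # 258
print(overhead_pct(mola, base))     # 954

# SGMM ablation: 4.2 ms/token without SGMM -> 3.1 ms/token with SGMM
print(round((4.2 - 3.1) / 4.2 * 100))  # 26
```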
How to Apply
- If you're running multi-task LoRA serving with an MoE-LoRA stack that routes per layer, consolidate routing into a single decision at the first layer and propagate the result to all layers (a pre-gating architecture change)
- In inference servers with high adapter switching costs, instead of 2 separate kernel calls (unmerge previous + merge new adapter per token), apply the SGMM pattern: concatenate and process with a single batched GEMM to dramatically reduce kernel launch overhead
- When designing new dynamic adapters, adopting a "decide-once, apply-everywhere" principle — making routing decisions only once upfront regardless of layer depth — makes the inference path static and improves compatibility with existing LLM optimizations like PagedAttention
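The single-GEMM switching pattern from the second bullet can be sketched in NumPy by folding the subtract/add signs into the concatenated low-rank factors. Variable names here are illustrative, not the paper's API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.standard_normal((d, d))            # backbone weight with previous adapter merged in

# previous and new adapters (up: [d, r], down: [r, d])
up_prev, down_prev = rng.standard_normal((d, r)), rng.standard_normal((r, d))
up_new,  down_new  = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# naive: two separate updates (two kernel launches on GPU)
W_two_calls = W - up_prev @ down_prev + up_new @ down_new

# fused: fold the minus sign into the concatenated "up" factor,
# so unmerge + merge become one rank-2r GEMM
up_cat   = np.concatenate([-up_prev, up_new], axis=1)      # [d, 2r]
down_cat = np.concatenate([down_prev, down_new], axis=0)   # [2r, d]
W_fused = W + up_cat @ down_cat

assert np.allclose(W_two_calls, W_fused)
```

The equivalence holds because matrix multiplication distributes over the block concatenation; on GPU, the win is one kernel launch instead of two per switched token.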
Code Example
# AdaFuse token-level pre-gating core logic (conceptual code)

# Existing layer-wise routing:
#   for each layer l:
#       gate_l = router_l(hidden_l)        # routing decision per layer
#       output_l = backbone_l(hidden_l) + sum(gate_l[i] * lora_i(hidden_l))

# AdaFuse pre-gating:
# 1. Route only once at the first layer
gate = router_first_layer(x_first)         # shape: [num_adapters]
top_k_indices = gate.topk(k=2).indices

# 2. Merge selected LoRAs across all layers at once using SGMM
#    fused_down[l] = concat([lora_down[l][i] for i in top_k_indices])
#    fused_up[l]   = concat([lora_up[l][i] for i in top_k_indices])
#    SGMM: backbone[l] += fused_down[l] @ fused_up[l]  (all l processed simultaneously)
sgmm_kernel(
    backbone_weights=backbone_weights_all_layers,  # pointer array
    lora_down=fused_down_all_layers,
    lora_up=fused_up_all_layers,
    gates=gate[top_k_indices],
    num_layers=num_layers,
)  # single CUDA kernel call completes the merge across all layers

# 3. Standard forward pass with merged backbone (no adapter overhead)
for l in range(num_layers):
    hidden = fused_backbone[l](hidden)     # standard matrix multiplication only
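A runnable NumPy analogue of the SGMM merge step above: the per-layer GEMM pairs collapse into one batched contraction. The real kernel gathers weights through pointer arrays on the GPU; the shapes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, r = 4, 16, 4                           # layers, hidden dim, total rank after concat
backbone = rng.standard_normal((L, d, d))
fused_up = rng.standard_normal((L, d, r))    # concatenated up-projections per layer
fused_down = rng.standard_normal((L, r, d))  # concatenated down-projections per layer

# reference: one small GEMM per layer (2L kernel launches on GPU)
ref = backbone + np.stack([fused_up[l] @ fused_down[l] for l in range(L)])

# SGMM-style: merge all layers in a single batched contraction
merged = backbone + np.einsum('ldr,lrk->ldk', fused_up, fused_down)

assert np.allclose(ref, merged)
```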
Original Abstract
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.