AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
TL;DR Highlight
MoE+LoRA combinations slow inference by 2.5x — AdaFuse solves this by fusing all layer adapters in a single CUDA kernel call, achieving 2.4x speedup.
Who Should Read
ML engineers operating LLM serving infrastructure who need to reduce inference latency when serving multiple LoRA adapters or MoE-style structures. Especially useful for teams fine-tuning and deploying multi-task/multi-domain models.
Core Mechanics
- Dynamic adapters (structures that activate different LoRA per input) add only 1-5% parameters but increase inference latency by 250-950% — the problem is the number of CUDA kernel calls, not compute volume
- Existing layer-wise/block-wise routing makes routing decisions per layer, creating a structural limitation where adapters can't be pre-fused
- AdaFuse uses token-level pre-gating ("route once at the first layer, apply results to all layers") to statically fix execution paths
- A custom CUDA kernel called SGMM (Segmented Gather Matrix Multiplication) fuses all activated LoRA across all layers into the backbone in just 1 kernel call
- "Fused switching" operations that subtract previous token adapter and add new adapter as tokens change are also handled with 1 SGMM kernel call
- 2.7x faster than the previously fastest dynamic adapter (PESC) on Llama2-7B and Mistral-7B, with only 29% latency increase over the baseline backbone
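The kernel-launch bottleneck above can be made concrete with a back-of-the-envelope count. The per-layer launch breakdown below (one router launch plus two GEMMs per selected adapter) is an illustrative assumption, not a figure from the paper:

```python
# Hypothetical kernel-launch counts per decoded token for a model with
# `num_layers` transformer layers and top-k adapter selection.
def layer_wise_launches(num_layers, top_k):
    # assumed per layer: 1 router launch + 2 GEMM launches (down/up projection)
    # for each of the top_k selected adapters
    return num_layers * (1 + 2 * top_k)

def adafuse_launches():
    # 1 router launch at the first layer + 1 fused SGMM launch covering all layers
    return 2

print(layer_wise_launches(num_layers=32, top_k=2))  # 160
print(adafuse_launches())                           # 2
```

Even with compute volume held roughly constant, collapsing ~160 small launches into 2 is what removes the latency overhead.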
Evidence
- Llama2-7B baseline 2.4ms/token vs AdaFuse 3.1ms/token (+29%), compared to MoRAL 8.6ms (+258%) and MoLA 25.3ms (+954%)
- Domain-specific task average accuracy: Llama2-7B AdaFuse 83.60%, best baseline MoLA 84.20% — competitive
- Mistral-7B domain task average: AdaFuse 87.24% slightly above PESC 87.06% and MoRAL 87.05%
- Without SGMM (Simple merge only) 4.2ms/token → with SGMM 3.1ms/token — 26% additional reduction from kernel optimization alone
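The percentage figures above can be sanity-checked directly from the quoted ms/token numbers:

```python
# Recompute the latency overheads quoted above from the raw ms/token figures.
base = 2.4       # Llama2-7B backbone
adafuse = 3.1
moral = 8.6
mola = 25.3

def overhead_pct(latency, baseline):
    return round((latency - baseline) / baseline * 100)

print(overhead_pct(adafuse, base))  # 29
print(overhead_pct(moral, base))    # 258
print(overhead_pct(mola, base))     # 954

# SGMM ablation: 4.2 ms/token without SGMM -> 3.1 ms/token with SGMM
print(round((4.2 - 3.1) / 4.2 * 100))  # 26
```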
How to Apply
- If you're running multi-task LoRA serving with an MoE-LoRA stack that routes per layer, consolidate routing into a single decision at the first layer and propagate the result to all layers (a pre-gating architecture change)
- In inference servers with high adapter switching costs, instead of 2 separate kernel calls (unmerge previous + merge new adapter per token), apply the SGMM pattern: concatenate and process with a single batched GEMM to dramatically reduce kernel launch overhead
- When designing new dynamic adapters, adopting a "decide-once, apply-everywhere" principle — making routing decisions only once upfront regardless of layer depth — makes the inference path static and improves compatibility with existing LLM optimizations like PagedAttention
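The single-GEMM switching pattern from the second bullet can be sketched in NumPy by folding the subtract/add signs into the concatenated low-rank factors. Variable names here are illustrative, not the paper's API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.standard_normal((d, d))            # backbone weight with previous adapter merged in

# previous and new adapters (up: [d, r], down: [r, d])
up_prev, down_prev = rng.standard_normal((d, r)), rng.standard_normal((r, d))
up_new,  down_new  = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# naive: two separate updates (two kernel launches on GPU)
W_two_calls = W - up_prev @ down_prev + up_new @ down_new

# fused: fold the minus sign into the concatenated "up" factor,
# so unmerge + merge become one rank-2r GEMM
up_cat   = np.concatenate([-up_prev, up_new], axis=1)      # [d, 2r]
down_cat = np.concatenate([down_prev, down_new], axis=0)   # [2r, d]
W_fused = W + up_cat @ down_cat

assert np.allclose(W_two_calls, W_fused)
```

The equivalence holds because matrix multiplication distributes over the block concatenation; on GPU, the win is one kernel launch instead of two per switched token.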
Code Example
# AdaFuse token-level pre-gating core logic (conceptual code)

# Existing layer-wise routing:
#   for each layer l:
#       gate_l = router_l(hidden_l)        # routing decision per layer
#       output_l = backbone_l(hidden_l) + sum(gate_l[i] * lora_i(hidden_l))

# AdaFuse pre-gating:
# 1. Route only once at the first layer
gate = router_first_layer(x_first)         # shape: [num_adapters]
top_k_indices = gate.topk(k=2).indices

# 2. Merge selected LoRAs across all layers at once using SGMM
#    fused_down[l] = concat([lora_down[l][i] for i in top_k_indices])
#    fused_up[l]   = concat([lora_up[l][i] for i in top_k_indices])
#    SGMM: backbone[l] += fused_down[l] @ fused_up[l]  (all l processed simultaneously)
sgmm_kernel(
    backbone_weights=backbone_weights_all_layers,  # pointer array
    lora_down=fused_down_all_layers,
    lora_up=fused_up_all_layers,
    gates=gate[top_k_indices],
    num_layers=num_layers,
)  # single CUDA kernel call completes the merge across all layers

# 3. Standard forward pass with merged backbone (no adapter overhead)
for l in range(num_layers):
    hidden = fused_backbone[l](hidden)     # standard matrix multiplication only
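A runnable NumPy analogue of the SGMM merge step above: the per-layer GEMM pairs collapse into one batched contraction. The real kernel gathers weights through pointer arrays on the GPU; the shapes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, r = 4, 16, 4                           # layers, hidden dim, total rank after concat
backbone = rng.standard_normal((L, d, d))
fused_up = rng.standard_normal((L, d, r))    # concatenated up-projections per layer
fused_down = rng.standard_normal((L, r, d))  # concatenated down-projections per layer

# reference: one small GEMM per layer (2L kernel launches on GPU)
ref = backbone + np.stack([fused_up[l] @ fused_down[l] for l in range(L)])

# SGMM-style: merge all layers in a single batched contraction
merged = backbone + np.einsum('ldr,lrk->ldk', fused_up, fused_down)

assert np.allclose(ref, merged)
```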
Original Abstract
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.