Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models
TL;DR Highlight
Letting the model decide where and how to insert adapters matches or exceeds Full Fine-tuning accuracy while training only 1.4% of parameters, outperforming fixed-structure methods such as LoRA.
Who Should Read
MLOps/research engineers who need to fine-tune LLMs on multiple tasks simultaneously. Especially useful when fixed-structure PEFT like LoRA shows inconsistent performance across tasks and you want a more flexible architecture.
Core Mechanics
- Instead of fixing adapter insertion points and paths, a sigmoid gating function learns, during training, the probability of inserting an adapter at each layer
- In multi-task settings, each task gets independent structural parameters that dynamically route and activate needed paths from a shared adapter pool
- Sparsity Regularization (structural complexity penalty) added to the loss function automatically removes unnecessary paths and reduces parameter waste
- Optimal performance at λ=1.0. Too high (5.0) removes even essential paths, degrading performance
- Maintains MNLI 86%+ even with 15% noise injection — gating deactivates damaged paths for noise resilience
- Achieves accuracy equal to or better than Full Fine-tuning (100% of parameters) while training only 1.4% of parameters
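The sparsity penalty's pruning effect can be seen in isolation. A minimal sketch (toy values, not from the paper): with no task gradient, the penalty λ·sigmoid(gate) alone drives a gate's probability toward zero, which is exactly how unnecessary paths get removed.

```python
import torch

# Toy sketch (values not from the paper): absent a task signal, the
# sparsity penalty lambda * sigmoid(gate) pushes the gate probability
# toward zero, i.e. the candidate adapter path gets pruned.
gate = torch.zeros(1, requires_grad=True)  # sigmoid(0) = 0.5 at init
opt = torch.optim.SGD([gate], lr=1.0)
for _ in range(200):
    opt.zero_grad()
    sparsity_loss = 1.0 * torch.sigmoid(gate).sum()  # lambda_sparse = 1.0
    sparsity_loss.backward()
    opt.step()
final_prob = torch.sigmoid(gate).item()  # close to 0 after optimization
```

In the full method the task loss counteracts this pressure for paths the task actually needs, so only genuinely redundant gates collapse to zero.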
Evidence
- MNLI 87.4%, BoolQ 89.6% — exceeds Full FT (87.2%, 89.5%) with only 1.4% of parameters
- Up to 1.5 percentage points higher accuracy than LoRA (0.85% params, MNLI 86.5%) and Prefix-Tuning (0.5% params, 85.9%)
- Maintains MNLI 86%+ at 15% noise injection; virtually no performance change below 10% noise
- 30% fewer parameters than AdapterFusion (2.0% params, 86.8%) with higher accuracy
How to Apply
- For multi-task fine-tuning: attach independent structural parameters per task with a shared adapter pool to automatically separate shared/dedicated paths without representation conflicts
- Sparsity weight λ tuning: start in the 0.5-1.0 range and monitor validation performance. λ>2.0 risks excessive path pruning and performance degradation
- When deploying on noisy real-world data (OCR errors, user typos, etc.), structural gating auto-suppresses damaged paths, reducing the need for separate noise preprocessing
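When sweeping λ as suggested above, it helps to track how many paths remain active alongside validation accuracy. A minimal monitoring helper, assuming a 0.5 activation threshold (the threshold is an illustrative choice, not specified in the paper):

```python
import torch

def gate_summary(gate_params, threshold=0.5):
    """Mean gate probability and count of still-active paths.

    While sweeping lambda, a sharp drop in active paths together with
    falling validation accuracy signals the sparsity weight is too high.
    The 0.5 threshold is an illustrative choice, not from the paper.
    """
    probs = torch.stack([torch.sigmoid(g.detach().reshape(())) for g in gate_params])
    return probs.mean().item(), int((probs >= threshold).sum())

# Toy gate logits standing in for learned structure parameters
gates = [torch.tensor(1.5), torch.tensor(-2.0), torch.tensor(0.0), torch.tensor(3.0)]
mean_p, active = gate_summary(gates)  # here only the -2.0 gate is inactive
```

Logging this pair once per epoch makes the λ>2.0 over-pruning regime visible before validation accuracy fully degrades.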
Code Example
```python
# Structure gating core logic (PyTorch pseudocode)
import torch
import torch.nn as nn


class StructureLearnableAdapter(nn.Module):
    def __init__(self, d_model, r=64):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        self.act = nn.ReLU()
        # Structure parameter: whether to insert an adapter in this layer
        self.gate_param = nn.Parameter(torch.zeros(1))

    def forward(self, h):
        gate = torch.sigmoid(self.gate_param)  # probability in (0, 1)
        adapter_out = self.up(self.act(self.down(h)))
        return gate * adapter_out + (1 - gate) * h  # soft on/off


# Loss function: task loss + structure sparsity penalty
def total_loss(task_loss, gate_params, lambda_sparse=1.0):
    sparsity_loss = sum(torch.sigmoid(g).sum() for g in gate_params)
    return task_loss + lambda_sparse * sparsity_loss


# Multi-task: separate gating parameters per task
class MultiTaskRouter(nn.Module):
    def __init__(self, num_tasks, num_adapters):
        super().__init__()
        # Adapter combination weights per task
        self.task_gates = nn.Parameter(torch.zeros(num_tasks, num_adapters))

    def forward(self, h, task_id, adapters):
        weights = torch.sigmoid(self.task_gates[task_id])  # (num_adapters,)
        out = h
        for k, adapter in enumerate(adapters):
            out = out + weights[k] * adapter(h)
        return out
```
Original Abstract
This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.