Structure-Learnable Adapter Fine-Tuning: Parameter-Efficient LLM Fine-Tuning That Learns the Structure Itself

Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

Sep 3, 2025 • Ming Gong, Yingnan Deng, Nia Qi +3

TL;DR Highlight

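Lets the model decide for itself where and how to plug in adapters, reaching Full Fine-tuning-level accuracy while training only 1.4% of the parameters and outperforming LoRA on accuracy.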

Who Should Read

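MLOps and research engineers who need to fine-tune an LLM on several tasks at once. Especially relevant when a fixed-structure PEFT method such as LoRA shows uneven performance across tasks and you want the structure itself to adapt.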

Core Mechanics

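Adapter insertion points and paths are not fixed. A sigmoid gating function learns, during training, the probability that a given layer gets an adapter.
In multi-task settings, each task keeps its own structure parameters, and dynamic routing activates only the paths that task needs from a shared adapter pool.
Sparsity Regularization (a structural complexity penalty) is added to the loss function, so unnecessary paths are pruned automatically and fewer parameters are wasted.
Performance peaks at λ=1.0. When λ is too large (5.0), even essential paths are pruned and performance drops.
With injected noise of 15% or less, MNLI accuracy stays at or above 86%: gating deactivates corrupted paths, which is where the noise robustness comes from.
Trains only 1.4% of the parameters while matching or slightly exceeding the accuracy of Full Fine-tuning (100%).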
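
The gating and sparsity terms above can be written compactly. This is a sketch that mirrors the code example on this page, not the paper's own notation: the symbols \alpha_l (gate logit of layer l), A_l (the adapter at layer l), and g_l (gate probability) are names assumed here for illustration.

```latex
\begin{aligned}
g_l &= \sigma(\alpha_l), \\
h_l' &= g_l \, A_l(h_l) + (1 - g_l)\, h_l, \\
\mathcal{L} &= \mathcal{L}_{\text{task}} + \lambda \sum_{l} g_l .
\end{aligned}
```

Under this reading, a larger λ pushes every g_l toward 0 more strongly, which is consistent with the reported drop at λ=5.0: the penalty starts to outweigh the task loss even on paths the task needs.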

Evidence

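MNLI 87.4% and BoolQ 89.6%: exceeds Full FT (87.2%, 89.5%) while training 1.4% of the parameters.
Higher accuracy than LoRA (0.85% parameters, MNLI 86.5%) by 0.9%p and than Prefix-Tuning (0.5%, 85.9%) by 1.5%p.
MNLI stays at or above 86% even at a 15% noise injection rate. At 10% noise or less, performance is nearly unchanged.
Uses 30% fewer parameters than AdapterFusion (2.0% parameters, 86.8%) with higher accuracy.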
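
The comparisons above are simple arithmetic on the reported figures. The short script below recomputes them so the trade-off is visible in one place. It only uses the MNLI numbers listed in this section; the dictionary layout and the `compare` helper are illustrative names, not anything from the paper.

```python
# (trainable parameters in %, MNLI accuracy in %) as listed in the Evidence section
reported = {
    "Structure-learnable adapter": (1.4, 87.4),
    "Full fine-tuning": (100.0, 87.2),
    "LoRA": (0.85, 86.5),
    "Prefix-Tuning": (0.5, 85.9),
    "AdapterFusion": (2.0, 86.8),
}

ours_params, ours_acc = reported["Structure-learnable adapter"]


def compare(name):
    """Return (accuracy gain in %p, ratio of our trainable params to the baseline's)."""
    params, acc = reported[name]
    acc_gain = round(ours_acc - acc, 2)
    param_ratio = round(ours_params / params, 3)
    return acc_gain, param_ratio


for name in reported:
    if name != "Structure-learnable adapter":
        gain, ratio = compare(name)
        print(f"{name}: +{gain} %p accuracy at {ratio}x the trainable parameters")
```

A ratio below 1 means the method trains fewer parameters than that baseline (AdapterFusion, Full fine-tuning), and a ratio above 1 means it trains more (LoRA, Prefix-Tuning).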

How to Apply

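Multi-task fine-tuning: attach independent structure parameters to each task and keep a shared adapter pool. Shared and task-specific paths are then separated automatically, without representation conflicts between tasks.
Tuning the sparsity weight λ: start in the 0.5~1.0 range and monitor validation performance. λ>2.0 risks degrading performance through excessive path pruning.
Deploying on noisy production data (OCR errors, typos in user input, and so on): structure gating automatically suppresses corrupted paths, which can reduce the burden of separate noise preprocessing.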
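
Before relying on the robustness claim for your own data, it may be worth measuring accuracy at a few noise rates yourself. This page does not say how noise was injected in the paper, so the helper below is only one possible setup: it assumes token-level random replacement, and `inject_token_noise`, `vocab`, and `seed` are hypothetical names introduced here.

```python
import random


def inject_token_noise(tokens, rate, vocab, seed=0):
    """Return a copy of `tokens` with round(rate * len(tokens)) positions
    replaced by a different item drawn from `vocab`.

    Assumption: noise is modeled as random token replacement. `vocab` must
    contain at least two distinct items so a different token always exists.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    noisy = list(tokens)
    n_corrupt = round(rate * len(noisy))
    for pos in rng.sample(range(len(noisy)), n_corrupt):
        # Exclude the current token so the position really changes.
        candidates = [v for v in vocab if v != noisy[pos]]
        noisy[pos] = rng.choice(candidates)
    return noisy


sentence = "the movie was surprisingly good and the ending felt earned".split()
vocab = sorted(set(sentence)) + ["<unk>"]
for rate in (0.0, 0.10, 0.15):
    print(rate, " ".join(inject_token_noise(sentence, rate, vocab)))
```

Running the evaluation set through this helper at each rate and comparing accuracy against the clean run gives a curve comparable in spirit to the 10% and 15% figures reported above.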

Code Example


# Core structure-gating logic (PyTorch sketch)
import torch
import torch.nn as nn

class StructureLearnableAdapter(nn.Module):
    def __init__(self, d_model, r=64):
        super().__init__()
        self.down = nn.Linear(d_model, r)  # bottleneck down-projection
        self.up = nn.Linear(r, d_model)    # projection back to d_model
        self.act = nn.ReLU()
        # Structure parameter: gate logit deciding whether this layer uses the adapter.
        # Initialized to 0, so the gate starts at sigmoid(0) = 0.5.
        self.gate_param = nn.Parameter(torch.zeros(1))

    def forward(self, h):
        gate = torch.sigmoid(self.gate_param)  # probability in (0, 1)
        adapter_out = self.up(self.act(self.down(h)))
        return gate * adapter_out + (1 - gate) * h  # soft on/off

# Loss function: task loss + structural sparsity penalty
def total_loss(task_loss, gate_params, lambda_sparse=1.0):
    # .sum() reduces each gate tensor to a scalar, so gate tensors of
    # different shapes (per-layer and per-task) can be combined.
    sparsity_loss = sum(torch.sigmoid(g).sum() for g in gate_params)
    return task_loss + lambda_sparse * sparsity_loss

# Multi-task: separate gating parameters per task
class MultiTaskRouter(nn.Module):
    def __init__(self, num_tasks, num_adapters):
        super().__init__()
        # Per-task mixing logits over the shared adapter pool
        self.task_gates = nn.Parameter(torch.zeros(num_tasks, num_adapters))

    def forward(self, h, task_id, adapters):
        weights = torch.sigmoid(self.task_gates[task_id])  # (num_adapters,)
        # The residual h is added once here, so each adapter in the pool
        # should return only its update, not h + update.
        out = h
        for k, adapter in enumerate(adapters):
            out = out + weights[k] * adapter(h)
        return out
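
Once the per-task gates are trained, they can be inspected to see which adapters in the pool are shared, task-specific, or unused. The sketch below is an assumption about how one might do that: the page does not describe how gates are binarized, so the 0.5 threshold, the `active_paths` helper, and the example logits are all hypothetical.

```python
import torch


def active_paths(gate_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Boolean mask of gates whose sigmoid probability exceeds `threshold`."""
    return torch.sigmoid(gate_logits) > threshold


# Hypothetical learned logits: 2 tasks x 4 adapters in the shared pool
task_gates = torch.tensor([[2.0, -3.0, 0.5, -1.0],
                           [-2.0, 1.5, 0.1, -4.0]])

mask = active_paths(task_gates)  # shape (num_tasks, num_adapters)
shared = mask.all(dim=0)         # adapters every task keeps on
unused = ~mask.any(dim=0)        # adapters no task uses: candidates for removal

print("per-task mask:", mask.tolist())
print("shared by all tasks:", shared.tolist())
print("unused by any task:", unused.tolist())
```

In this toy example the third adapter is shared by both tasks and the fourth is used by neither, which is the kind of shared versus dedicated split the routing is meant to discover.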

Terminology

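Adapter: A small plug-in module inserted between layers without touching the LLM's original weights. Like a USB adapter, it adds functionality without modifying the original device.

PEFT: Short for Parameter-Efficient Fine-Tuning. A family of fine-tuning methods that cut cost by training only a very small fraction of the model's parameters instead of all of them. LoRA, Adapter, and Prefix-Tuning all belong to it.

LoRA: A PEFT technique that trains only a pair of small low-rank matrices for each adapted weight instead of the whole model. Fine-tuning is possible with roughly 0.1~1% of the parameters.

Sparsity Regularization: A technique that adds a "structural complexity penalty" to the loss function so that unnecessary connections or modules are switched off automatically. Like a diet, it trims the parts that do no useful work.

Differentiable Gating: A technique that expresses whether a module is on or off as a probability between 0 and 1 rather than a binary 0/1 decision, so that gradients can flow through it during backpropagation. This is what makes "where to insert adapters" something that can be optimized by training.

MNLI: Multi-Genre Natural Language Inference. An NLU benchmark task that asks for the logical relation between two sentences (entailment, neutral, or contradiction).

BoolQ: A yes/no question-answering benchmark. The task is to read a short passage and answer a Boolean question, so factual consistency and context understanding matter.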
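
The Differentiable Gating entry above is easy to verify directly. The following minimal sketch contrasts a sigmoid gate with a hard 0/1 threshold on the same logit; the variable names and values are illustrative only.

```python
import torch

logit = torch.tensor(0.3, requires_grad=True)  # gate logit to be learned
x = torch.tensor(2.0)                          # stand-in for an adapter output

# Soft gate: sigmoid is differentiable, so the logit receives a gradient.
soft = torch.sigmoid(logit) * x
soft.backward()
print("gradient through soft gate:", logit.grad)

# Hard gate: the comparison yields a boolean tensor with no gradient history,
# so nothing downstream of it can update the logit.
hard = (logit > 0).float() * x
print("hard gate tracks gradients:", hard.requires_grad)
```

The soft gate yields a positive gradient on the logit, while the hard gate is detached from the graph, which is why the structure decision is relaxed to a probability during training.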

Original Abstract

This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.