Energy Consumption Analysis of LLM Inference and Efficiency Optimization Strategies

Energy Considerations of Large Language Model Inference and Efficiency Optimizations

Apr 24, 2025 • Jared Fernandez, Clara Na, Vashisth Tiwari +3

TL;DR Highlight

An empirical analysis showing that combining vLLM with CUDA Graphs can cut LLM inference energy by up to 73%.

Who Should Read

ML engineers and DevOps practitioners who operate or design LLM serving infrastructure, especially teams that need to manage cloud cost and energy efficiency together.

Core Mechanics

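Combining vLLM with CUDA Graph serialization reduces energy by up to 73% relative to baseline PyTorch (measured on the BurstGPT and Azure Conversation datasets).
Speculative Decoding (a draft model predicts several tokens ahead of the main model) saves energy only at batch sizes ≤16; at batch=128 it consumes 25.65% more.
MoE (Mixture-of-Experts) architectures use up to 54.24% more energy than a dense model with the same number of active parameters, because of inefficient fused kernels.
Tensor Parallelism (splitting a model across several GPUs) lowers latency but raises energy (4 GPUs: latency -61%, energy +55%).
FLOPs-based theoretical energy estimates fall far short of measured consumption, with a gap of up to 506%, because they do not reflect real workloads.
Software optimizations pay off most on the A100 GPU (PyTorch compile: 29.9% on A100 vs. 1.96% on A6000).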
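The Tensor Parallelism figures above imply a simple trade-off that is worth working through. The sketch below assumes that the reported energy is the total across all GPUs and that it is drawn over the reported latency window, which this summary does not state; the function name `relative_power` is illustrative only.

```python
def relative_power(latency_change_pct: float, energy_change_pct: float) -> float:
    """Average power draw relative to the baseline, given percent changes
    in latency and energy (negative = decrease, positive = increase)."""
    latency_ratio = 1 + latency_change_pct / 100
    energy_ratio = 1 + energy_change_pct / 100
    # energy = average power * time, so power ratio = energy ratio / time ratio
    return energy_ratio / latency_ratio


# 4-GPU figures from the list above: latency -61%, energy +55%
power_4gpu = relative_power(-61, 55)
print(f"4 GPUs draw about {power_4gpu:.2f}x the single-GPU average power")
```

Under these assumptions, four GPUs finish sooner but draw roughly four times the average power, which is why total energy still rises.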

Evidence

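BurstGPT: the gap between measured energy and the theoretical estimate is 506.52% with PyTorch and narrows to 63.75% after vLLM optimizations.
Azure Conversation: optimizations cut energy by 72.18%; Azure Code: 37.58%.
OLMoE (1B-7B) consumes 54.24% more energy than OLMo-1B; at batch=8 its fused kernel is 63% slower than the corresponding GEMM.
Tensor Parallelism across 2 and 4 GPUs reduces latency by 40.16% and 61.34% while increasing energy by 29.3% and 55.23%, respectively.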
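To make the BurstGPT gaps above concrete, the following sketch converts each percentage gap into a multiplier over the theoretical estimate. It assumes the gap is defined as (measured - theoretical) / theoretical, which this summary does not spell out; the helper name is illustrative.

```python
def gap_to_multiplier(gap_pct: float) -> float:
    """Convert a percent gap over the theoretical estimate into a multiplier."""
    return 1 + gap_pct / 100


pytorch_x = gap_to_multiplier(506.52)  # BurstGPT, baseline PyTorch
vllm_x = gap_to_multiplier(63.75)      # BurstGPT, after vLLM optimizations
implied_saving_pct = (1 - vllm_x / pytorch_x) * 100

print(f"PyTorch: {pytorch_x:.2f}x the estimate, vLLM: {vllm_x:.2f}x")
print(f"Implied saving from optimization: {implied_saving_pct:.1f}%")
```

Under the assumed definition, the two gaps imply a saving of about 73%, which lines up with the headline figure.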

How to Apply

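When choosing an LLM serving stack, switching from baseline HuggingFace Transformers to vLLM with enforce_eager=False (CUDA Graphs enabled) can substantially reduce energy on the same hardware.
Before adopting Speculative Decoding, check the batch size of your workload: it helps for real-time, low-concurrency requests such as chatbots (batch ≤16) but backfires for offline batch processing.
When scaling to multiple GPUs, use Tensor Parallelism if latency matters more than energy efficiency; if the goal is to minimize cost and energy, a single GPU with a well-tuned batching strategy is more efficient.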
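The three guidelines above can be sketched as a small decision helper. This is only an illustration: `ServingPlan` and `plan_serving` are hypothetical names, the batch-size threshold of 16 is taken from the findings summarized here, and the single-GPU choice assumes the model fits in one GPU's memory.

```python
from dataclasses import dataclass


@dataclass
class ServingPlan:
    use_speculative_decoding: bool
    tensor_parallel_gpus: int


def plan_serving(batch_size: int, optimize_for: str, available_gpus: int = 1) -> ServingPlan:
    """Hypothetical helper encoding the guidelines above."""
    if optimize_for not in ("latency", "energy"):
        raise ValueError("optimize_for must be 'latency' or 'energy'")
    # Speculative decoding saved energy only at batch sizes up to 16.
    speculative = batch_size <= 16
    # Tensor Parallelism trades energy for latency, so use it only for latency goals.
    gpus = available_gpus if optimize_for == "latency" else 1
    return ServingPlan(speculative, gpus)


chat = plan_serving(batch_size=1, optimize_for="latency", available_gpus=4)
offline = plan_serving(batch_size=128, optimize_for="energy", available_gpus=4)
print(chat)
print(offline)
```

A real deployment would also weigh model size, request arrival patterns, and hardware, so treat the thresholds as starting points to validate with measurement.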

Code Example

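# Example: an energy-conscious LLM serving setup with vLLM
from codecarbon import EmissionsTracker
from vllm import LLM, SamplingParams

# enforce_eager=False keeps CUDA Graph capture enabled (key to the energy savings)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enforce_eager=False,       # CUDA Graphs on (this is the default)
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

# vLLM handles continuous batching automatically.
# Larger batches generally improve energy efficiency.
sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding; more energy-efficient than beam search
    max_tokens=64,
)

# Placeholder prompts; replace these with your own workload
prompts = [
    "Summarize the benefits of continuous batching.",
    "Explain what a KV cache is in one sentence.",
]

# Measure emissions around the inference call with the CodeCarbon library
tracker = EmissionsTracker()
tracker.start()
# Submit all requests at once; vLLM schedules the batches itself
outputs = llm.generate(prompts, sampling_params)
emissions = tracker.stop()  # estimated emissions in kg of CO2-equivalent
print(f"CO2 emissions: {emissions} kg")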
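The tracker above reports estimated emissions rather than energy per token. A minimal sketch of a joules-per-token measurement follows, assuming the `pynvml` bindings (the `nvidia-ml-py` package) and a GPU that supports NVML's total energy counter. The helper names are hypothetical, and this is not presented as the paper's measurement method.

```python
import time
from typing import Callable, Tuple


def nvml_energy_reader(gpu_index: int = 0) -> Callable[[], float]:
    """Return a function that reads cumulative GPU energy in joules via NVML."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # NVML reports millijoules consumed since the driver was last loaded.
    return lambda: pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) / 1000.0


def joules_per_token(run: Callable[[], int],
                     read_joules: Callable[[], float]) -> Tuple[float, float]:
    """Run inference once; return (joules per generated token, seconds elapsed).

    `run` must return the number of tokens it generated.
    """
    start_j, start_t = read_joules(), time.perf_counter()
    tokens = run()
    elapsed = time.perf_counter() - start_t
    return (read_joules() - start_j) / tokens, elapsed


# Stand-in reader and workload so the sketch runs anywhere; on a GPU host,
# pass nvml_energy_reader() and a function that wraps llm.generate(...).
fake_readings = iter([100.0, 164.0])
per_token, _ = joules_per_token(lambda: 64, lambda: next(fake_readings))
print(f"{per_token:.2f} J/token")
```

Passing the energy reader as an argument keeps the measurement logic testable without hardware and makes it easy to swap in a different power source, such as a wall-power meter.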

Terminology

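Speculative Decoding: A small draft model first predicts several tokens, and the large model then verifies them in a single pass. It works for the same reason that reviewing a draft is faster than writing everything yourself.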

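Tensor Parallelism: Splits the model's weight matrices across multiple GPUs so that they compute simultaneously. Dividing work among several people speeds it up but adds coordination cost; in the same way, total energy can go up.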

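PagedAttention: Manages the KV cache in GPU memory in pages, much like an operating system's virtual memory, to reduce waste. It is the core technique behind vLLM and is what makes larger batches possible.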

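KV Cache: Memory where a Transformer stores the results computed for earlier tokens so they can be reused. The cache grows with every generated token and can take up a large share of GPU memory.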

CUDA Graph: Captures a sequence of GPU operations as a graph ahead of time so that repeated runs incur less overhead. It is like reusing a saved order form instead of writing a new one each time.

Continuous Batching: When requests have different output lengths, a new request is slotted in as soon as an earlier one finishes. It is like a bus that fills its empty seats at every stop.

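MoE (Mixture-of-Experts): An architecture that activates only a subset of "expert" layers for each input. It resembles calling on only the relevant specialists from a larger group, but on GPUs this selection step can itself introduce inefficiency.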

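Prefill: The phase in which the LLM processes the entire input prompt in one pass before it starts generating a response. It is the preparatory "reading" step for long inputs: GPU-intensive, but fast because the whole prompt is processed in parallel.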
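The KV Cache and PagedAttention entries above are easier to appreciate with numbers. The sketch below applies the standard sizing formula (keys plus values, per layer, per KV head). The model shape is an assumption based on the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) with 16-bit cache values; the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head dimension 128
per_token = kv_cache_bytes(32, 8, 128, num_tokens=1)
full_context = kv_cache_bytes(32, 8, 128, num_tokens=8192)
print(f"{per_token // 1024} KiB per token, "
      f"{full_context // 2**30} GiB for one 8192-token sequence")
```

At this scale a single long sequence occupies about a gibibyte, so how the cache is allocated largely determines how many requests fit in a batch.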

Original Abstract

As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.