Energy Considerations of Large Language Model Inference and Efficiency Optimizations
TL;DR Highlight
An empirical analysis showing that vLLM with CUDA Graphs can cut LLM inference energy consumption by up to 73%.
Who Should Read
ML engineers and DevOps teams operating or designing LLM serving infrastructure, especially teams concerned with both cloud costs and energy efficiency.
Core Mechanics
- vLLM with CUDA Graphs enabled achieves up to a 73% energy reduction vs. vanilla PyTorch (tested on the BurstGPT and Azure Conversation datasets)
- Speculative Decoding saves energy only at batch sizes ≤16; at batch size 128 it actually increases energy by 25.65%
- MoE models (OLMoE 1B-7B) consume 54.24% more energy than dense models (OLMo-1B) at batch size 8 due to fused-kernel overhead
- Energy optimization and throughput optimization are not always aligned; different workloads call for different strategies
Evidence
- BurstGPT: gap from theoretical optimal reduced from 506.52% (PyTorch) to 63.75% (vLLM optimized)
- Azure Conversation: 72.18% energy reduction; Azure Code: 37.58% reduction with optimization
- OLMoE (1B-7B) consumes 54.24% more energy than OLMo-1B; at batch size 8, the fused MoE kernels incur significant overhead vs. dense GEMM
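The 73% headline figure can be sanity-checked against the BurstGPT gap numbers above: a 506.52% gap from the theoretical optimum means 6.0652× optimal energy, a 63.75% gap means 1.6375×, and the ratio of the two gives the reduction. A quick illustrative check (the arithmetic is mine; the percentages come from the bullets above):

```python
# Convert "gap from theoretical optimal" percentages into multiples of optimal.
pytorch_factor = 1 + 506.52 / 100  # 6.0652x the theoretical optimum
vllm_factor = 1 + 63.75 / 100      # 1.6375x the theoretical optimum

# Fractional energy reduction going from vanilla PyTorch to optimized vLLM.
reduction = 1 - vllm_factor / pytorch_factor
print(f"{reduction:.1%}")  # ~73%, matching the headline number
```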
How to Apply
- When choosing your LLM serving stack, switch from vanilla HuggingFace Transformers to vLLM with enforce_eager=False (CUDA Graphs enabled, the default) for significant energy savings on the same hardware.
- Before adopting Speculative Decoding, check your workload's batch size first: it helps for real-time single requests (batch ≤16) such as chatbots, but hurts at the high batch sizes typical of offline processing.
- For MoE models, profile energy consumption separately — they may use more energy than dense models of similar active parameter count.
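The batch-size cutoff from the first two bullets could be encoded as a simple serving-config heuristic. A minimal sketch, where the threshold of 16 comes from the findings above and the function name is hypothetical:

```python
def use_speculative_decoding(batch_size: int, threshold: int = 16) -> bool:
    """Heuristic from the findings above: speculative decoding saves energy
    only for small batches (e.g. interactive chat); at large offline batch
    sizes it can increase total energy use instead."""
    return batch_size <= threshold

print(use_speculative_decoding(8))    # True: interactive chatbot traffic
print(use_speculative_decoding(128))  # False: offline batch processing
```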
Code Example
# Example setup for energy-efficient LLM serving with vLLM
from vllm import LLM, SamplingParams
# enforce_eager=False → enable CUDA Graph capture (key to the energy savings)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enforce_eager=False,  # enable CUDA Graphs (the default)
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)
# Continuous Batching is handled automatically by vLLM
# Larger batch sizes lead to better energy efficiency
sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding, more energy-efficient than beam search
    max_tokens=64,
)
# Process multiple requests at once (batch size auto-optimized)
outputs = llm.generate(prompts, sampling_params)
# Energy measurement: using the CodeCarbon library
from codecarbon import EmissionsTracker
tracker = EmissionsTracker()
tracker.start()
# ... run inference ...
emissions = tracker.stop()
print(f"CO2 emissions: {emissions} kg")
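To compare runs on a per-token basis, a measured energy total in kWh (CodeCarbon also records energy consumed alongside emissions) can be converted to joules per generated token. A small helper sketch; the unit conversion is standard, and the function name is mine:

```python
def joules_per_token(energy_kwh: float, num_tokens: int) -> float:
    """Convert a measured energy total (kWh) into joules per token.
    1 kWh = 3.6e6 J."""
    return energy_kwh * 3.6e6 / num_tokens

# e.g. 0.001 kWh spent generating 1000 tokens works out to 3.6 J/token
print(round(joules_per_token(0.001, 1000), 6))  # 3.6
```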
Original Abstract
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.