Tailslayer: Library for reducing tail latency in RAM reads
TL;DR Highlight
This C++ library implements the hedged read technique, which reduces the worst-case (tail) latency of RAM reads caused by DRAM refresh timing conflicts by replicating data across independent DRAM channels and using the result from whichever channel responds first.
Who Should Read
Developers building high-performance systems (HFT, real-time processing, interrupt handlers, etc.) in C++ where nanosecond to microsecond latency is critical. Low-level systems engineers who tune DRAM-level memory access patterns or are interested in memory architecture.
Core Mechanics
- DRAM periodically performs a 'refresh' operation to maintain data, and if a read request overlaps with this timing, an additional delay (stall) of hundreds of nanoseconds occurs. This is one of the main causes of tail latency.
- Tailslayer replicates the same data across multiple independent DRAM channels and, when a read request arrives, sends reads to all channels simultaneously (hedged read). It reduces the probability of refresh stalls by using the result from the first responding channel.
- The key technique leverages the fact that the refresh schedules between DRAM channels are independent (uncorrelated). Even if one channel is refreshing, other channels are likely to respond normally.
- It reverse-engineers undocumented channel-scrambling offsets on AMD, Intel, and AWS Graviton hardware in order to place data in truly independent channels. This is the most technically challenging part of the work.
- The publicly available library currently only supports 2-way replication, but the benchmark code implements N-way replication. Usage involves passing a signal function (returning the index to read) and a work function (processing the read value) as template parameters.
- It features a C++ template-based API, where you create `HedgedReader<T, signal_fn, work_fn>`, insert data, and then call `start_workers()` to have background workers handle the hedged reads. Core pinning is also supported.
- There are clear trade-offs. Replicating data to N channels increases memory usage by up to N times. One comment mentioned that the base load latency itself increases to around 800 cycles, sacrificing median latency in favor of reducing tail latency.
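The hedged-read idea described above can be sketched in plain C++. This is a conceptual illustration only, not Tailslayer's implementation: the library keeps pinned background workers rather than spawning threads per read, and it places the replicas on distinct DRAM channels via the scrambling offsets, which this sketch does not do.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// 2-way hedged read, conceptually: the same data lives in two replicas,
// two workers load it concurrently, and the first to finish wins.
std::uint64_t hedged_read(const std::vector<std::uint64_t>& replica_a,
                          const std::vector<std::uint64_t>& replica_b,
                          std::size_t index) {
    std::atomic<std::uint64_t> result{0};
    std::atomic<bool> won{false};

    auto worker = [&](const std::vector<std::uint64_t>& replica) {
        std::uint64_t v = replica[index];   // the (possibly stalled) DRAM load
        bool expected = false;
        if (won.compare_exchange_strong(expected, true))
            result.store(v);                // only the first responder publishes
    };

    std::thread t1([&] { worker(replica_a); });
    std::thread t2([&] { worker(replica_b); });
    t1.join();
    t2.join();
    return result.load();
}
```

Without channel-aware placement, both loads can land on the same DRAM channel and stall together, so the replica placement is what makes the hedge pay off.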
Evidence
- There was feedback that the detailed explanation of how memory addresses are mapped to DRAM channels, ranks, and banks was good, as this low-level information is rarely covered.
- One comment pointed out the cache hit rate issue. Replicating data increases the working set, leading to more cache misses, and questioned whether the performance degradation due to this would offset the benefits of hedged read.
- There was sharp criticism that the README and header files do not mention the trade-offs at all, and that the roughly 800-cycle base load latency, which significantly increases median latency, is effectively hidden. A separate rebuttal called the video's statement that 'Graviton has no performance counters' completely false.
- The IBM zEnterprise platform hides refresh latency entirely by steering loads to non-refreshing banks, with a space overhead of only 50%. By comparison, it was pointed out that the Tailslayer approach can waste up to 92% of the space.
- The possibility of applying it to ML model inference was mentioned. Partitioning multiple parallel ML models by channel could guarantee that certain models always read fast data and others always read slow data. However, it was also noted that ML model weights often reside in the L3 cache, which may limit its effectiveness.
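To put the two space figures above on the same footing: under N-way replication, N copies store the data of one, so (N-1)/N of the footprint is redundant. 2-way costs 50%, and the quoted 92% would correspond to roughly 13-way replication. The arithmetic, as a tiny helper (the function name is mine):

```cpp
// Redundant fraction of the memory footprint under N-way replication:
// N copies store the data of one, so (N - 1) / N of the space is overhead.
double replication_waste(int n_way) {
    return static_cast<double>(n_way - 1) / static_cast<double>(n_way);
}
```

replication_waste(2) is 0.5, the 2-way case the public library ships; replication_waste(13) is about 0.923, the regime the 92% criticism appears to refer to.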
How to Apply
- If you are developing a high-frequency trading (HFT) system or a real-time interrupt handler in C++ where nanosecond-level tail latency is critical, you can copy `include/tailslayer` to your project and wrap frequently read small lookup tables with `HedgedReader`. It is suitable for workloads where a 2x memory usage increase is acceptable and reducing tail latency is more important than increasing median latency.
- Before applying it, you should first verify whether DRAM refresh stalls are the cause of tail latency in your actual workload. Most applications have high L1/L2/L3 cache hit rates, so DRAM access is rare. To see the effect of this library, you need a pattern of random access to data that does not fit in the cache.
- For platforms other than AMD/Intel/Graviton, the private channel scrambling offsets may be different. It is safe to first explore the channel mapping of the hardware using the code in the `discovery/` directory and then decide whether to apply it.
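One way to run the verification suggested above is a dependent pointer chase over a buffer much larger than the last-level cache, comparing median and tail load latency. A rough sketch under stated assumptions: the function names are mine, and `steady_clock` granularity is coarser than an `rdtsc`-based timer, so treat the numbers as indicative.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sattolo's algorithm: a random permutation with a single cycle of length n,
// so a pointer chase visits every slot before repeating and defeats caching.
std::vector<std::size_t> sattolo_cycle(std::size_t n, std::uint64_t seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    if (n < 2) return next;
    std::mt19937_64 rng{seed};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    return next;
}

// Time `samples` dependent loads through the chase; each load's address
// comes from the previous load, so the hardware cannot prefetch ahead.
// Returns per-load latencies in nanoseconds, sorted for easy percentiles.
std::vector<double> measure_load_latencies(const std::vector<std::size_t>& next,
                                           int samples) {
    std::vector<double> ns;
    ns.reserve(static_cast<std::size_t>(samples));
    std::size_t idx = 0;
    for (int i = 0; i < samples; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        idx = next[idx];
        auto t1 = std::chrono::steady_clock::now();
        ns.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    std::sort(ns.begin(), ns.end());
    return ns;
}
```

With a chase buffer well above LLC size (tens of millions of entries), compare `ns[n/2]` with `ns[n*999/1000]`: if p99.9 sits hundreds of nanoseconds above the median, DRAM-level effects such as refresh stalls may be worth chasing.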
Code Example
#include <tailslayer/hedged_reader.hpp>

// Signal function: wait for an event, then return the index to read
[[gnu::always_inline]] inline std::size_t my_signal() {
    return index_to_read;
}

// Work function: process the value that was read
template <typename T>
[[gnu::always_inline]] inline void my_work(T val) {
}

int main() {
    using T = uint8_t;
    // Pin the current thread to the main core
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);
    // Provide the signal function and work function as template parameters
    tailslayer::HedgedReader<T, my_signal, my_work<T>> reader{};
    // Replicate the same data to two DRAM channels
    reader.insert(0x43);
    reader.insert(0x44);
    // Start background workers (handle hedged reads)
    reader.start_workers();
}
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to take performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core computation of LLM training from scratch without frameworks and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine implemented without fsync, achieving about 65% higher write performance under identical conditions. The key is a structure that avoids fsync's metadata overhead by combining preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent, and it re-downloads even after deletion. Potential GDPR violations and the environmental cost of applying this across billions of devices have been raised.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.