● Hugging Face
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL
Published May 27, 2026
By Amine Dirhoussi (aminediroHF), Quentin Gallouédec (qgallouedec), Kashif Rasul (kashif), Lewis Tunstall (lewtun), Edward Beeching (edbeeching), Albert Villanova del Moral (albertvillanova), and Leandro von Werra (lvwerra)

TL;DR, because you have models to train and we respect that:

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B model in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.

It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98% in the worst case). The actual delta is tiny.

We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to between 20 and 35 MB.

The cherry on top: we ran a full disaggregated training where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.

Async RL just got a lot cheaper. Read on.

Figure: Two ways to ship the same weights. Red is wall-clock time during which no tokens are being generated.

1. The One Terabyte Problem

If you read our previous post on the landscape of async RL training, you already know the punchline. Every async RL library, regardless of how it spells "actor model" or which color its NCCL backend is painted, eventually trips over the same root: weight synchronization. The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy.
Weight synchronization sits on the critical path whether you are running sync or async: during a blocking transfer, the inference GPUs sit idle instead of generating tokens. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready: it just publishes "weights ready" and uploads the weights to the shared bucket the moment its optimizer step finishes, while the inference engine fetches on its own time.

Fireworks put a very memorable number on this in their post Frontier RL Is Cheaper Than You Think: for a frontier 1T-parameter checkpoint at fp8 (their setting), a full snapshot is 1024 GiB, and that is what conventional wisdom says you have to ship every time you update your rollout fleet. That is the kind of number that gets people to start drawing diagrams with mega-clusters, RDMA fabrics, and dedicated cross-region links. Their measured average delta between adjacent checkpoints lands at 20.3 GiB, or 1.98% of the full model, and "more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints".

Cursor's Composer 2 report tells a parallel story. They run training and inference in different regions and stitch them together with a shared S3 bucket (their exact words), into which the trainer uploads compressed weight diffs every training step. Each cluster independently downloads and reconstructs from the shared delta chain, "requiring no direct connectivity to the training cluster". The two sides never speak to each other about parameters directly. The bucket is the wire.

Both sources agree on three things, and we want to repeat them slowly, because the rest of this post is essentially a faithful open source translation:

Most of the weights have not actually changed between two adjacent RL steps.
If you send only the parts that changed, your bandwidth bill collapses by roughly two orders of magnitude.
If you route those tiny diffs through a shared object store, you no longer need the trainer and the inference cluster to live in the same data center.

The only thing missing was a version of this story that you can pip install. So we wrote one.

2. Why bf16 RL Weights Are Almost Always Sparse

Before we wire anything up, it is worth understanding why this whole game is even winnable. The "98% of weights do not change" claim sounds suspiciously like one of those numbers that works in the demo and falls apart in the wild. It is not. It falls out of how bf16 arithmetic works at the learning rates RL uses.

A bf16 number has 7 mantissa bits. Between two consecutive powers of two, there are exactly $2^7 = 128$ representable values, so the spacing between adjacent bf16 numbers around $|w|$ is roughly $|w| \cdot 2^{-7}$. An update gets absorbed by the bf16 cast whenever it sits below half of that spacing, i.e., when $|\Delta w| < |w|/256$. This is the "bf16 visibility threshold" that the PULSE paper plots in its Figure 3.

Now look at what Adam does. At an RL learning rate of, say, $3 \times 10^{-6}$, the update to a single weight is:

$$\Delta w = -\eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

The normalized step $\hat{m}/(\sqrt{\hat{v}}+\epsilon)$ is roughly order one, so $|\Delta w| \approx \eta \approx 3 \times 10^{-6}$. For most weights, $|w|$ sits somewhere around $10^{-2}$ to $10^{-1}$ (PULSE reports a median of 0.019 for representative LLM weights). The threshold $|w|/256$ at that magnitude is around $4 \times 10^{-5}$ to $4 \times 10^{-4}$, which is one to two orders of magnitude larger than the update.

In other words: the optimizer is whispering, and bf16 cannot hear it. The update gets absorbed by rounding, the bit pattern of $w$ does not change, and from the inference engine's perspective, this weight did not move. Multiply that by a few hundred million parameters, and you get the >99% sparsity number, for free, with zero approximation.
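The absorption argument is easy to check numerically. The sketch below is a minimal illustration and not code from TRL or PULSE: it emulates bf16 round-to-nearest-even in pure Python, and the helper names (`to_bf16_bits`, `round_bf16`, `bf16_spacing`) are ours. It assumes the weight is already stored in bf16 and that the update is added and cast back in one step, with no fp32 master copy accumulating sub-threshold updates.

```python
import math
import struct


def to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern of x (round-to-nearest-even).

    Illustrative helper: ordinary finite values only, no NaN/inf handling.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even bf16 neighbour
    return ((bits + bias) >> 16) & 0xFFFF


def round_bf16(x: float) -> float:
    """Round x to the nearest bf16 value and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<I", to_bf16_bits(x) << 16))[0]


def bf16_spacing(w: float) -> float:
    """Gap between adjacent bf16 values in the binade containing |w| (w != 0)."""
    _, exp = math.frexp(abs(w))  # |w| = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 8)      # binade starts at 2**(exp-1); 7 mantissa bits


lr = 3e-6              # RL-scale learning rate used in the text
w = round_bf16(0.019)  # median weight magnitude quoted from PULSE

half_gap = bf16_spacing(w) / 2
after_small = round_bf16(w - lr)    # Adam-sized step, |dw| ~ lr
after_large = round_bf16(w - 1e-3)  # deliberately oversized step (hypothetical)

print(f"w = {w!r}, half spacing = {half_gap:.3e}")
print("small step absorbed:", to_bf16_bits(after_small) == to_bf16_bits(w))
print("large step absorbed:", to_bf16_bits(after_large) == to_bf16_bits(w))
```

For this weight the Adam-sized step leaves the bit pattern untouched, while the oversized step does not. Note that the exact half spacing depends on which power-of-two interval $|w|$ falls in, so it ranges between $|w|/512$ and $|w|/256$; the $|w|/256$ figure in the text is the upper end of that range.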
This is exactly the argument made formal in the PULSE paper (Mihai & Belilovsky, 2026). They define two bounds on the size of an Adam update. The absorption bound $10\eta$ is the conservative worst case, and the effective bound $\eta$ is the regime you actually live in. Both are compared against the bf16 visibility threshold $|w|/256$: whenever the update sits below that threshold, it gets absorbed, and the bf16 bit pattern does not change. Their Figure 3 plots both bounds against a cloud of representative LLM weights, and the conclusion is unambiguous: at $\eta = 3 \times 10^{-6}$, the absorption bound itself already sits below the visibility threshold for almost every weight in the model. They measure this empirically across Qwen2.5 (0.5B/1.5B/7B), Llama-3.2-3B, and Gemma-3-4B, and consistently find a mean per-step sparsity of ~99%, with a standard deviation of 0.2 to 0.4% over 400 training steps. The worst-case step stays above 98%. So