
Training Design for Text-to-Image Models: Lessons from Ablations

By David Bertoin, Roman Frigg, and Jon Almazán (Photoroom). Published February 3, 2026.

Welcome back! This is the second part of our series on training efficient text-to-image models from scratch. In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model, PRX. We also released an early, small (1.2B-parameter) version of the model as a preview of what we are building (go try it if you haven't already 😉).

In this post, we shift our focus from architecture to training. The goal is to document what actually moved the needle for us when trying to make models train faster, converge more reliably, and learn better representations. The field is moving quickly and the list of "training tricks" keeps growing, so rather than attempting an exhaustive survey, we structured this as an experimental logbook: we reproduce (or adapt) a set of recent ideas, implement them in a consistent setup, and report how they affect optimization and convergence in practice. Finally, we do not only report these techniques in isolation; we also explore which ones remain useful when combined.

In the next post, we will publish the full training recipe as code, including the experiments in this post. We will also run and report on a public "speedrun" where we put the best pieces together into a single configuration and stress-test it end-to-end. This exercise will serve both as a validation of our current training pipeline and as a concrete demonstration of how far careful training design can go under tight constraints. If you haven't already, we invite you to join our Discord to continue the discussion.
A significant part of this project has been shaped by exchanges with community members, and we place a high value on external feedback, ablations, and alternative interpretations of the results.

The Baseline

Before introducing any training-efficiency techniques, we first establish a clean reference run. This baseline is intentionally simple: it uses standard components, avoids auxiliary objectives, and does not rely on architectural shortcuts or tricks to save compute. Its role is to serve as a stable point of comparison for all subsequent experiments.

Concretely, this is a pure Flow Matching (Lipman et al., 2022) training setup (as introduced in Part 1) with no extra objectives and no architectural speed hacks. We use the small PRX-1.2B model presented in the first post of this series (a single-stream architecture with global attention over the image and text tokens) as our baseline and train it in the Flux VAE latent space, keeping the configuration fixed across all comparisons unless stated otherwise. The baseline training setup is as follows:

| Setting | Value |
|---|---|
| Steps | 100k |
| Dataset | Public 1M synthetic images generated with MidJourneyV6 |
| Resolution | 256×256 |
| Global batch size | 256 |
| Optimizer | AdamW |
| lr | 1e-4 |
| weight_decay | 0.0 |
| eps | 1e-15 |
| betas | (0.9, 0.95) |
| Text encoder | GemmaT5 |
| Positional encoding | Rotary (RoPE) |
| Attention mask | Padding mask |
| EMA | Disabled |

This baseline configuration provides a transparent and reproducible anchor. It allows us to attribute observed improvements and regressions to specific training interventions, rather than to shifting hyperparameters or hidden setup changes. Throughout the remainder of this post, every technique is evaluated against this reference with a single guiding question in mind: does this modification improve convergence or training efficiency relative to the baseline?

Examples of baseline model generations after 100K training steps.
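To make the setup concrete, here is a minimal sketch of one baseline training step: plain rectified-flow matching with the AdamW settings from the table above. The tiny MLP and the `flow_matching_step` helper are illustrative stand-ins (the real PRX model is text-conditioned and much larger); only the optimizer hyperparameters come from the table.

```python
import torch

# Stand-in for the PRX-1.2B backbone: input is latent (16) + timestep (1).
denoiser = torch.nn.Sequential(
    torch.nn.Linear(17, 64), torch.nn.SiLU(), torch.nn.Linear(64, 16)
)

# Optimizer settings from the baseline table.
optimizer = torch.optim.AdamW(
    denoiser.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-15,
    weight_decay=0.0,
)

def flow_matching_step(x0: torch.Tensor) -> float:
    """One flow-matching step. x0: clean latents, shape (batch, dim)."""
    x1 = torch.randn_like(x0)                       # noise sample from the prior
    t = torch.rand(x0.shape[0], 1)                  # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1                     # linear interpolation path
    target = x1 - x0                                # rectified-flow velocity target
    v_pred = denoiser(torch.cat([x_t, t], dim=-1))  # condition on t by concatenation
    loss = torch.nn.functional.mse_loss(v_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = flow_matching_step(torch.randn(256, 16))
```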
Benchmarking Metrics

To keep this post grounded, we rely on a small set of metrics to monitor checkpoints over time. None of them is a perfect proxy for perceived image quality, but together they provide a practical scoreboard while we iterate.

- Fréchet Inception Distance (FID) (Heusel et al., 2017): measures how close the distributions of generated and real images are, using Inception-v3 feature statistics (mean and covariance). Lower values typically correlate with higher sample fidelity.
- CLIP Maximum Mean Discrepancy (CMMD) (Jayasumana et al., 2024): measures the distance between real and generated image distributions using CLIP image embeddings and Maximum Mean Discrepancy (MMD). Unlike FID, CMMD does not assume Gaussian feature distributions and can be more sample-efficient; in practice it often tracks perceptual quality better than FID, though it is still an imperfect proxy.
- DINOv2 Maximum Mean Discrepancy (DINO-MMD): the same MMD-based distance as CMMD, but computed on DINOv2 (Oquab et al., 2023) image embeddings instead of CLIP. This provides a complementary view of distribution shift under a self-supervised vision backbone.
- Network throughput: the average number of samples processed per second (samples/s), as a measure of end-to-end training efficiency.

With the scoreboard defined, we can now dive into the methods we explored, grouped into four buckets: Representation Alignment, Training Objectives, Token Routing and Sparsification, and Data.

Representation Alignment

Diffusion and flow models are typically trained with a single objective: predict a noise-like target (or vector field) from a corrupted input. Early in training, that one objective is doing two jobs at once: it must build a useful internal representation and learn to denoise on top of it. Representation alignment makes this split explicit by keeping the denoising objective and adding an auxiliary loss that directly supervises intermediate features using a strong, frozen vision encoder.
This tends to speed up early learning and bring the model's features closer to those of modern self-supervised encoders. As a result, you often need less compute to hit the same quality. A useful way to view it is to decompose the denoiser into an implicit encoder that produces intermediate hidden states, and a decoder that maps those states to the denoising target. The claim is that representation learning is the bottleneck: diffusion and flow transformers do learn discriminative features, but they lag behind foundation vision encoders when training is compute-limited. Borrowing a powerful representation space can therefore make the denoising problem easier.

REPA (Yu et al., 2024)

Representation alignment with a pre-trained visual encoder. Figure from arXiv:2410.06940.

REPA adds a representation-matching term on top of the base flow-matching objective. Let $x_0 \sim p_{\text{data}}$ be a clean sample and $x_1 \sim p_{\text{prior}}$ be the noise sample. The model is trained on an interpolated state $x_t$ (for $t \in [0,1]$) and predicts a vector field $v_\theta(x_t, t)$. In REPA, a pretrained vision encoder $f$ processes the clean sample $x_0$ to produce patch embeddings $y_0 = f(x_0) \in \mathbb{R}^{N \times D}$, where $N$ is the number of patch tokens and $D$ is the teacher embedding dimension. In parallel, the denoiser processes $x_t$ and produces intermediate hidden tokens $h_t$ (one token per patch).
A small projection head $h_\phi$ maps these student hidden tokens into the teacher embedding space, and an auxiliary loss maximizes patch-wise similarity between corresponding teacher and student tokens:

$$\mathcal{L}_{\text{REPA}}(\theta,\phi) = -\mathbb{E}_{x_0,x_1,t}\Big[\frac{1}{N}\sum_{n=1}^{N} \text{sim}\big(y_{0,[n]},\, h_\phi(h_{t,[n]})\big)\Big]$$

Here $n \in \{1,\dots,N\}$ indexes patch tokens, $y_{0,[n]}$ is the teacher embedding for patch $n$, $h_{t,[n]}$ is the corresponding student hidden token at time $t$, and $\text{sim}(\cdot,\cdot)$ is typically cosine similarity. This term is combined with the main flow-matching loss:

$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\,\mathcal{L}_{\text{REPA}}$$

with $\lambda$ controlling the trade-off. In practice, the student is trained to produce noise-robust, data-consistent patch representations from $x_t$, so later layers can focus on predicting the vector field and generating details rather than rediscovering a semantic scaffold from scratch.

What we observed

We ran REPA on top of our baseline PRX training, using two frozen teachers: DINOv2 and DINOv3 (Siméoni et al., 2025). The pattern was very consistent: adding alignment improves quality metrics, and the stronger teacher helps more, at the cost of a bit of speed.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| REPA-DINOv3 | 14.64 | 0.35 | 0.30 | 3.46 |
| REPA-DINOv2 | 16.60 | 0.39 | 0.31 | 3.66 |

On the quality metrics, both teachers improve over the baseline. The effect is strongest with DINOv3, which achieves the best overall numbers in this run. REPA is not free: we pay for an extra frozen-teacher forward pass and the patch-level similarity loss, which shows up as a throughput drop from 3.95 batches/s to 3.66 (DINOv2) or 3.46 (DINOv3).
In other words, DINOv3 prioritizes maximum representation quality at the cost of slower training, while DINOv2 offers a more efficient trade-off, still delivering substantial gains with a smaller slowdown.

Our practical takeaway is that REPA is a strong lever for text-to-image training. In our setup, the throughput trade-off is real, and the net speedup (time required to reach a given level of image quality) felt a bit less dramatic than what the authors of the paper report on ImageNet-style, class-conditioned generation. That said, the quality gains are still clearly significant. Qualitatively, we also saw the difference early: after ~100K steps, samples trained with alignment tended to lock in cleaner global structure and more coherent layouts, which makes it easy to see why REPA (and alignment variants more broadly) have become a go-to ingredient in modern T2I training recipes.

Sample comparisons: Baseline / REPA-DINOv2 / REPA-DINOv3.

iREPA (Singh et al., 2025)

A natural follow-up to REPA is: what exactly should we be aligning? iREPA argues that the answer is spatial structure, not global semantics. Across a large sweep of 27 vision encoders, the authors find that ImageNet-style "global" quality (e.g., linear-probe accuracy on patch tokens) is only weakly predictive of downstream generation quality under REPA, while simple measures of patch-token spatial self-similarity correlate much more strongly with FID. Based on that diagnosis, iREPA makes two tiny but targeted changes to the REPA recipe to better preserve and transfer spatial information:

1. Replace the usual MLP projection head with a lightweight 3×3 convolutional projection operating on the patch grid.
2. Apply a spatial normalization to teacher patch tokens that removes a global overlay (the mean across spatial locations) to increase local contrast.

Despite representing "less than 4 lines of code", these tweaks consistently speed up convergence and improve quality across encoders, model sizes, and even REPA-adjacent training recipes.
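The two tweaks above can be sketched in a few lines. This is our reading of the recipe, not the authors' code: shapes, module names, and the toy dimensions are illustrative, and the student tokens are assumed to come from a square patch grid.

```python
import torch

B, H, W, C_STUDENT, C_TEACHER = 2, 8, 8, 32, 24  # toy sizes, N = H * W patches

# (1) 3x3 convolutional projection head instead of the usual MLP.
conv_head = torch.nn.Conv2d(C_STUDENT, C_TEACHER, kernel_size=3, padding=1)

def project_student(h: torch.Tensor) -> torch.Tensor:
    """h: student hidden tokens, shape (B, N, C_STUDENT) with N = H * W."""
    grid = h.transpose(1, 2).reshape(B, C_STUDENT, H, W)  # tokens -> patch grid
    out = conv_head(grid)                                 # (B, C_TEACHER, H, W)
    return out.flatten(2).transpose(1, 2)                 # back to (B, N, C_TEACHER)

# (2) Spatial normalization of teacher patch tokens: remove the global
# overlay (mean over spatial locations) to increase local contrast.
def spatial_normalize(y: torch.Tensor) -> torch.Tensor:
    """y: teacher patch tokens, shape (B, N, C_TEACHER)."""
    return y - y.mean(dim=1, keepdim=True)

student = project_student(torch.randn(B, H * W, C_STUDENT))
teacher = spatial_normalize(torch.randn(B, H * W, C_TEACHER))
# The alignment loss itself is unchanged: patch-wise cosine similarity.
align_loss = -torch.nn.functional.cosine_similarity(student, teacher, dim=-1).mean()
```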
What we observed

In our setup, we observed a similar kind of boost when applying the iREPA spatial tweaks on top of DINOv2: convergence was a bit smoother and the metrics improved more steadily over the first 100K steps. Interestingly, the same changes did not transfer as cleanly when applied on top of a DINOv3 teacher; there, they tended to degrade performance rather than help. We do not want to over-interpret that result: it could easily be an interaction with our specific architecture, resolution/patching, loss weighting, or even small implementation details. Still, given this inconsistency across teachers, we will likely not include these tweaks in our default recipe, even if they remain an interesting option to revisit when tuning for a specific setup.

About Using REPA During the Full Training

The paper REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training (Wang et al., 2025) highlights a key caveat: REPA is a powerful early accelerator, but it can plateau or even become a brake later in training. The authors describe a capacity mismatch: once the generative model starts fitting the full data distribution (especially high-frequency details), forcing it to stay close to a frozen recognition encoder's lower-dimensional embedding manifold becomes constraining. Their practical takeaway is simple: keep alignment for the "burn-in" phase, then turn it off with a stage-wise schedule.

We observed the same qualitative pattern in our own runs. When training our preview model, removing REPA after ~200K steps noticeably improved the overall feel of image quality: textures, micro-contrast, and fine detail continued to sharpen instead of looking slightly muted. For that reason, we also recommend treating representation alignment as a transient scaffold: use it to get fast early progress, then drop it once the model's own generative features have caught up.
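A stage-wise schedule of this kind is trivial to implement. The sketch below is one way to do it; the cutoff of 200K steps matches what we used for our preview run, while the weight value of 0.5 is a hypothetical placeholder, not a recommendation.

```python
def repa_weight(step: int, burn_in_steps: int = 200_000, lam: float = 0.5) -> float:
    """Return the REPA loss coefficient for a given training step:
    full weight during the burn-in phase, then switched off."""
    return lam if step < burn_in_steps else 0.0

# In the training loop the total loss would then be:
# total_loss = fm_loss + repa_weight(step) * repa_loss
```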
Alignment in the Token Latent Space

So far, "alignment" meant regularizing the generator's internal features against a frozen teacher while treating the tokenizer / latent space as fixed. A more direct lever is to shape the latent space itself, so that the representation presented to the flow backbone is intrinsically easier to model, without sacrificing the reconstruction fidelity needed for editing and downstream workflows.

REPA-E (Leng et al., 2025) makes this concrete. Its starting point is a failure mode: if you simply backprop the diffusion / flow loss into the VAE, the tokenizer quickly learns a pathologically easy latent for the denoiser, which can even degrade final generation quality. REPA-E's fix is a two-signal training recipe:

1. Keep the diffusion loss, but apply a stop-gradient so it only updates the latent diffusion model (not the VAE).
2. Update both the VAE and the diffusion model using an end-to-end REPA alignment loss.

Thanks to these two tricks, the tokenizer is explicitly optimized to produce latents that yield higher alignment and empirically better generations.

In parallel, Black Forest Labs' FLUX.2 AE work frames latent design as a trade-off between learnability, quality, and compression. Their core argument is that improving learnability requires injecting semantic structure into the representation, rather than treating the tokenizer as a pure compression module. This motivates retraining the latent space to explicitly target "better learnability and higher image quality at the same time". They do not share the full recipe, but they do clearly state the key idea: make the AE's latent space more learnable by adding semantic or representation alignment, and they explicitly point to REPA-style alignment with a frozen vision encoder as the mechanism they build on and integrate into the FLUX.2 AE.
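The gradient routing of the two-signal recipe can be sketched on toy modules. This is our reading of the idea, not the REPA-E implementation: the flow-matching loss sees a detached latent (stop-gradient, so it cannot reshape the VAE), while the alignment loss backprops into both the VAE encoder and the denoiser. All modules and shapes are stand-ins.

```python
import torch

vae_encoder = torch.nn.Linear(16, 8)  # stand-in tokenizer encoder
denoiser = torch.nn.Linear(9, 8)      # stand-in flow backbone (latent + t)
proj_head = torch.nn.Linear(8, 4)     # student -> teacher embedding space
teacher = torch.nn.Linear(16, 4)      # stand-in frozen vision encoder
for p in teacher.parameters():
    p.requires_grad_(False)

x0 = torch.randn(32, 16)
z0 = vae_encoder(x0)                  # latent carries VAE gradients
z1 = torch.randn_like(z0)
t = torch.rand(32, 1)
z_t = (1 - t) * z0 + t * z1

# Signal 1: flow-matching loss on a detached latent -> updates only the denoiser.
h_fm = denoiser(torch.cat([z_t.detach(), t], dim=-1))
fm_loss = torch.nn.functional.mse_loss(h_fm, (z1 - z0).detach())

# Signal 2: REPA alignment loss without detach -> updates VAE and denoiser.
h_align = denoiser(torch.cat([z_t, t], dim=-1))
repa_loss = -torch.nn.functional.cosine_similarity(
    proj_head(h_align), teacher(x0), dim=-1
).mean()

loss = fm_loss + 0.5 * repa_loss      # 0.5 is an illustrative weight
loss.backward()
```

After `backward()`, the VAE encoder receives gradients only through the alignment term, while the frozen teacher receives none, which is exactly the routing the recipe prescribes.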
What we observed

To probe alignment in the latent space, we compared two pretrained autoencoders as drop-in tokenizers for the same flow backbone: a REPA-E-VAE (where we do add the REPA alignment objective, as in the paper) and the Flux2-AE (where we do not add REPA, following their recommendation). The results were, honestly, extremely impressive, both quantitatively and qualitatively. In samples, the gap is immediately visible: generations show more coherent global structure and cleaner layouts, with far fewer "early training" artifacts.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Flux2-AE | 12.07 | 0.09 | 0.08 | 1.79 |
| REPA-E-VAE | 12.08 | 0.26 | 0.18 | 3.39 |

A first striking point is that both latent-space interventions lower the FID by ~6 points (18.20 to ~12.08), a much larger jump than what we typically get from "just" aligning intermediate features. This strongly supports the core idea: if the tokenizer produces a representation that is intrinsically more learnable, the flow model benefits everywhere.

The two AEs then behave quite differently in the details. Flux2-AE dominates most metrics (very low CMMD and DINO-MMD), but it comes with a huge throughput penalty: batches/sec drops from 3.95 to 1.79. In our case this slowdown is explained by practical factors the FLUX.2 authors also emphasize: the model is simply heavier, and it produces a larger latent (32 channels), which increases the amount of work the diffusion backbone has to do per step. REPA-E-VAE is the "balanced" option: it reaches essentially the same FID as Flux2-AE while keeping throughput much closer to the baseline (3.39 batches/sec).

Sample comparisons: Baseline / Flux2-AE / REPA-E-VAE.

Training Objectives: Beyond Vanilla Flow Matching

Architecture gets you capacity, but the training objective decides how that capacity is used. In practice, small changes to the loss often have outsized effects on convergence speed, conditional fidelity, and how quickly a model "locks in" global structure.
In the sections below, we go through the objectives we tested on top of our baseline rectified-flow setup, starting with a simple but surprisingly effective modification: Contrastive Flow Matching.

Contrastive Flow Matching (Stoica et al., 2025)

Flow matching has a nice property in the unconditional case: trajectories are implicitly encouraged to be unique (flows should not intersect). But once we move to conditional generation (class- or text-conditioned), different conditions can still induce overlapping flows, which empirically shows up as "averaging" behavior: weaker conditional specificity and muddier global structure. Contrastive flow matching addresses this directly by adding a contrastive term that pushes conditional flows away from other flows in the batch.

Contrastive flow matching makes class-conditional flows more distinct, reducing the overlap seen in standard flow matching, and produces higher-quality images that better represent each class. Figure from arXiv:2506.05350.

For a given training triplet $(x, y, \varepsilon)$, standard conditional flow matching trains the model velocity $v_\theta(x_t,t,y)$ to match the target transport direction. Contrastive flow matching keeps that positive term, but additionally samples a negative pair $(\tilde{x}, \tilde{y}, \tilde{\varepsilon})$ from the batch and penalizes the model if its predicted flow is also compatible with that other trajectory. In the paper's notation, this becomes:

$$\mathcal{L}_{\Delta \text{FM}}(\theta) = \mathbb{E}\Big[ \|v_\theta(x_t,t,y)-(\dot{\alpha}_t x+\dot{\sigma}_t\varepsilon)\|^2 \;-\; \lambda \|v_\theta(x_t,t,y)-(\dot{\alpha}_t \tilde{x}+\dot{\sigma}_t\tilde{\varepsilon})\|^2 \Big]$$

where $\lambda \in [0,1)$ controls the strength of the "push-away" term.
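A minimal sketch of this objective, assuming the rectified-flow path ($\alpha_t = 1-t$, $\sigma_t = t$, so $\dot{\alpha}_t = -1$ and $\dot{\sigma}_t = 1$) and using a batch roll to pick the negative pairs; function and variable names are illustrative:

```python
import torch

def contrastive_fm_loss(v_pred: torch.Tensor, x: torch.Tensor,
                        eps: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """v_pred: predicted velocity; x: clean samples; eps: noise; all (B, D)."""
    target = eps - x                                       # own transport direction
    # Negative pair: another sample's trajectory from the same batch.
    neg_target = eps.roll(1, dims=0) - x.roll(1, dims=0)
    pos = (v_pred - target).pow(2).sum(dim=-1)             # match your own flow
    neg = (v_pred - neg_target).pow(2).sum(dim=-1)         # push away from theirs
    return (pos - lam * neg).mean()

loss = contrastive_fm_loss(torch.randn(256, 16),
                           torch.randn(256, 16),
                           torch.randn(256, 16))
```

Setting `lam = 0.0` recovers the plain flow-matching loss, which makes the term easy to ablate or schedule.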
Intuitively: match your own trajectory, and be incompatible with someone else's. The authors show that contrastive flow matching produces more discriminative trajectories, and that this translates into both quality and efficiency gains: faster convergence (reported up to 9× fewer training iterations to reach similar FID) and fewer sampling steps (reported up to 5× fewer denoising steps) in ImageNet (Deng et al., 2009) and CC3M (Sharma et al., 2018) experiments. A key advantage is that the objective is almost a drop-in replacement: you keep the usual flow-matching loss, then add a single contrastive "push-away" term using other samples in the same batch as negatives, which provides the extra supervision without introducing additional model passes.

What we observed

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Contrastive-FM | 20.03 | 0.40 | 0.36 | 3.75 |

On this run, contrastive flow matching yields a slight regression on FID (20.03 vs 18.20) in our setup, alongside marginal improvements on CMMD and DINO-MMD.