Excellent points about a critically undervalued paper!
This paper has received significantly less attention than it deserves, so let me shed a bit more light on it and describe why it's so good:
1. It turns out that the classical U-Net image diffusion backbone, which the entire community (Stable Diffusion included) has been happily building upon for the past ~3 years, has severe flaws in its training dynamics. If you track weight/activation statistics during training, you will observe a steady, malignant growth in their magnitudes. This turns out to impair convergence, and "simply" redesigning the architecture to incorporate a better normalization pipeline improves image quality by a staggering ~2.5x.
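To make the normalization idea concrete, here is a toy numpy sketch (my own illustration under simplifying assumptions, not the paper's actual magnitude-preserving layers): if a layer's effective weight is always re-normalized to unit magnitude per output channel, the raw parameter magnitudes simply cannot leak into the activations, no matter how much they grow during training.

```python
import numpy as np

def mp_linear(x, w, eps=1e-8):
    """Toy "magnitude-preserving" linear layer (illustrative sketch).

    The effective weight always has unit-norm rows, so however much the
    raw parameter `w` blows up during training, the layer's output
    magnitude cannot drift along with it.
    """
    w_hat = w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)
    return x @ w_hat.T

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 64))   # unit-variance activations
w_small = rng.standard_normal((32, 64))
w_big = w_small * 1e3                   # simulate runaway magnitude growth

out_small = mp_linear(x, w_small)
out_big = mp_linear(x, w_big)
# output statistics are unchanged despite the 1000x weight blow-up
print(out_small.std(), out_big.std())   # both stay around 1
```

The real architecture applies this idea much more carefully (convolutions, gains, attention, etc.), but the core mechanism is the same: keep effective magnitudes fixed by construction instead of hoping the optimizer behaves.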
2. If you've ever trained large neural networks, you might have found yourself ranting about EMA (Exponential Moving Average) parameter updates. This technique keeps an exponential moving average of the model weights during training and uses the EMA weights at inference time, throwing away the original network. I think it's one of the most mysterious and underexplored hacks in modern deep learning optimization, yet it significantly influences final performance (the EMA usually yields 2-3 times better quality than the raw model itself). Selecting a proper EMA width is pure pain, since we have almost no heuristics for it. Apparently, Karras et al. got fed up with this and developed a rigorous strategy for storing checkpoints in a way that lets you find the optimal EMA width post hoc, after training is complete. The nicest thing about this new EMA strategy is that it's applicable to any DL model (i.e., not just image diffusion), and, honestly, I would even expect it to be incorporated into some GPT-5 in the future.
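The core trick can be sketched in a few lines of numpy (a simplified illustration under my own assumptions: a toy random-walk weight trajectory, plain exponential averaging profiles rather than the power-function profiles the paper actually uses, and arbitrarily chosen decay rates). The key observation is that every EMA snapshot is a known linear combination of the parameter trajectory, so snapshots stored at just two decay rates can be recombined by least squares, after training, to approximate the EMA of a decay rate you never tracked:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 200, 8
theta = np.cumsum(rng.standard_normal((T, dim)), axis=0)  # toy weight trajectory

def ema_profile(beta, t_end, T):
    # weight that theta[t] contributes to an EMA (decay `beta`) read at step t_end
    w = np.zeros(T)
    t = np.arange(t_end)
    w[:t_end] = (1.0 - beta) * beta ** (t_end - 1 - t)
    return w

# during "training": store snapshots of two tracked EMAs at several checkpoints
profiles, snapshots = [], []
for beta in (0.93, 0.97):
    for t_end in (50, 100, 150, 200):
        p = ema_profile(beta, t_end, T)
        profiles.append(p)
        snapshots.append(p @ theta)   # equals the running EMA at that step
P, S = np.stack(profiles), np.stack(snapshots)

# post hoc: approximate an EMA decay we never tracked, e.g. beta = 0.95
target = ema_profile(0.95, T, T)
coef, *_ = np.linalg.lstsq(P.T, target, rcond=None)  # combine stored profiles
reconstructed = coef @ S
exact = target @ theta
rel_err = np.linalg.norm(reconstructed - exact) / np.linalg.norm(exact)
print(rel_err)  # relative error of the post-hoc reconstruction
```

In other words, you only pay for a couple of extra weight averages during training, and the expensive sweep over EMA widths turns into a cheap linear-algebra problem over saved snapshots.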