Discussion about this post

Neural Foundry:
The framing of residual connections as both enabler and bottleneck is spot on. That constant weight of 1.0 on the identity path being the core constraint is something most people overlook when celebrating ResNets. I've been curious how mHC handles the expressivity vs. stability tradeoff in practice, especially at scales where convergence gets finicky. If DeepSeek's approach really allows deeper transformations without blowup, it could reshape how we think about model capacity.
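The "constant weight of 1.0" the comment refers to can be sketched minimally. Below, `f` is a stand-in for any residual branch, and the learnable-weight variant is purely illustrative of the general idea of relaxing that constraint, not DeepSeek's actual mHC formulation:

```python
import numpy as np

def f(x):
    # stand-in for a residual branch (e.g. an attention or MLP sublayer)
    return np.tanh(x)

def standard_residual(x):
    # classic ResNet/transformer skip: the identity path has a fixed weight of 1.0
    return 1.0 * x + f(x)

def weighted_residual(x, alpha, beta):
    # hypothetical relaxation: scalar weights on both paths become learnable,
    # trading the stability of the fixed identity for extra expressivity
    return alpha * x + beta * f(x)

x = np.array([0.5, -1.0, 2.0])
y_fixed = standard_residual(x)
y_weighted = weighted_residual(x, alpha=0.9, beta=1.1)
```

With `alpha = beta = 1.0` the weighted form reduces exactly to the standard skip connection; the stability question the comment raises is what happens to training dynamics once those weights are free to drift away from 1.0.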
