Discussion about this post

Neural Foundry

This breakdown of multimodal fusion is incredibly well structured and timely. Your distinction between alignment and fusion really clarifies what often gets conflated in research discussions, and the point about fusion happening at multiple levels simultaneously resonates with what we're seeing in production systems. The MoS approach is particularly interesting because it suggests we might get better cross-modal reasoning without needing massive architectural overhauls. I wonder if this kind of state-level mixing could also help with the modality-specific biases that plague some current multimodal systems.

