TLDR: Researchers propose ‘Hourglass’ (wide-narrow-wide) MLP blocks that invert the traditional ‘narrow-wide-narrow’ design. The new architecture places skip connections in the higher-dimensional space and routes the residual computation through narrow bottlenecks. Experiments show Hourglass MLPs consistently achieve better performance-parameter trade-offs on generative tasks (generative classification, denoising, super-resolution) on MNIST and ImageNet-32. In addition, the input projection can stay frozen at its random initialization, which saves parameters and memory. The findings suggest re-evaluating where skip connections are placed in neural networks, with potential applications in Transformers and other residual architectures.
In the evolving landscape of artificial intelligence, Multi-layer Perceptrons (MLPs) have long served as fundamental building blocks for neural networks. Traditionally, these MLP blocks follow a ‘narrow-wide-narrow’ design, where input signals expand into a broader hidden space for processing before contracting back to an output dimension. Skip connections, crucial for stable training and incremental learning, typically operate at these narrower input and output dimensions.
However, a recent research paper titled ‘Rethinking the Shape Convention of an MLP’ by Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, and Da-shan Shiu from MediaTek Research and National Taiwan University challenges this long-standing convention. They propose a ‘wide-narrow-wide’ MLP block, which they term the ‘Hourglass’ design. This architecture inverts the traditional approach: skip connections operate within expanded, higher-dimensional spaces, while the core residual computations flow through narrow bottlenecks.
The Hourglass Advantage
The core idea behind the Hourglass MLP is to leverage higher-dimensional spaces for more effective incremental refinement of data representations. Instead of constraining residual updates to the narrower input dimensions, this design allows these updates to occur in a richer, expanded latent space. This is hypothesized to enable more potent learning and refinement, while still maintaining computational efficiency through carefully matched parameter designs.
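To make the two shapes concrete, here is a minimal pure-Python sketch of one residual block used both ways. It is an illustration under stated assumptions, not the authors' implementation: the GELU activation, the toy sizes, the initialization, and every function name are hypothetical, and any mapping back to the task's output size is left out.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias, with weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Element-wise GELU (tanh approximation). The activation is an assumption."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]


def init_linear(rng, out_dim, in_dim):
    """Small random weights and zero biases for an in_dim -> out_dim layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def residual_block(z, first, second):
    """Return z + second(act(first(z))). The skip connection carries z unchanged."""
    hidden = gelu(linear(z, *first))
    update = linear(hidden, *second)
    return [a + b for a, b in zip(z, update)]


rng = random.Random(0)
d_in, d_wide, d_narrow = 8, 32, 4  # toy sizes, chosen only for illustration
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

# Conventional (narrow-wide-narrow): the skip lives at d_in, the hidden layer at d_wide.
conv_out = residual_block(
    x, init_linear(rng, d_wide, d_in), init_linear(rng, d_in, d_wide)
)

# Hourglass (wide-narrow-wide): lift the input to d_wide once, then keep the skip
# at d_wide while the residual update squeezes through a d_narrow bottleneck.
proj = init_linear(rng, d_wide, d_in)
z = linear(x, *proj)
hour_out = residual_block(
    z, init_linear(rng, d_narrow, d_wide), init_linear(rng, d_wide, d_narrow)
)

print(len(conv_out), len(hour_out))
```

The same `residual_block` function serves both designs. The only difference is which dimension the skip connection carries and which one the inner layer uses.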
Implementing the Hourglass MLP requires an initial projection to elevate input signals to these expanded dimensions. Interestingly, the researchers propose that this initial projection can remain fixed at a random initialization throughout the entire training process. This concept, inspired by reservoir computing, offers significant practical benefits, including reduced parameter counts, lower memory bandwidth requirements, and decreased memory capacity needs, without a noticeable impact on performance, especially when the expansion factors are sufficiently large.
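A rough sense of what freezing the input projection saves can be had from simple parameter accounting. The sketch below is a back-of-the-envelope calculation, not the paper's bookkeeping: the layer sizes and block count are hypothetical, the projection is assumed to have no bias, and only the input size 3072 (32 × 32 × 3 for an ImageNet-32 image) comes from the dataset itself.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Number of weights (and optional biases) in one dense layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def hourglass_param_counts(d_in, d_skip, d_bottleneck, n_blocks, freeze_input_proj):
    """Total and trainable parameter counts for a stack of hourglass blocks."""
    # Input projection d_in -> d_skip, assumed bias-free for this sketch.
    input_proj = linear_params(d_in, d_skip, bias=False)
    # Each block maps d_skip -> d_bottleneck -> d_skip.
    per_block = linear_params(d_skip, d_bottleneck) + linear_params(d_bottleneck, d_skip)
    blocks = n_blocks * per_block
    total = input_proj + blocks
    trainable = blocks if freeze_input_proj else total
    return {"total": total, "trainable": trainable}


# Hypothetical configuration: flattened 32x32x3 input, wide skip, narrow bottleneck.
learned = hourglass_param_counts(3072, 8192, 512, n_blocks=4, freeze_input_proj=False)
frozen = hourglass_param_counts(3072, 8192, 512, n_blocks=4, freeze_input_proj=True)

print(learned)  # {'total': 58755072, 'trainable': 58755072}
print(frozen)   # {'total': 58755072, 'trainable': 33589248}
```

In this made-up configuration the projection accounts for about 25 million of the roughly 59 million parameters, so freezing it removes a large share of what the optimizer has to update.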
Empirical Validation and Key Findings
To validate their hypothesis, the researchers conducted extensive architectural comparisons between conventional and Hourglass MLP stacks. They evaluated both designs on various generative tasks, including generative classification, denoising, and super-resolution, using popular image datasets like MNIST and ImageNet-32.
The results were compelling: Hourglass architectures consistently achieved superior performance-parameter Pareto frontiers across all tested tasks. This means that for a given performance level, Hourglass MLPs required fewer parameters, or for a given parameter budget, they delivered better performance compared to conventional designs. For instance, in an ImageNet-32 denoising task, an Hourglass model achieved 22.31 dB PSNR with 66 million parameters, while the best conventional model needed 75 million parameters for the same score.
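For readers unfamiliar with the term, a performance-parameter Pareto frontier keeps only the models that no other model beats on both size and score. The helper below is a minimal sketch of that filter, with a hypothetical function name and tuple layout. It is applied to the two ImageNet-32 denoising figures quoted above, not to the paper's full set of results.

```python
def pareto_frontier(models):
    """Keep models that no other model matches or beats on both size and score.

    Each model is a (name, parameter_count, score) tuple, where a higher score
    is better and a lower parameter count is better.
    """
    frontier = []
    # Sort by size ascending; on ties, put the higher score first.
    for name, params, score in sorted(models, key=lambda m: (m[1], -m[2])):
        # A model survives only if it scores strictly higher than every
        # smaller-or-equal model seen so far.
        if not frontier or score > frontier[-1][2]:
            frontier.append((name, params, score))
    return frontier


# The two figures quoted above: (name, parameters, PSNR in dB).
quoted = [("hourglass", 66_000_000, 22.31), ("conventional", 75_000_000, 22.31)]
print(pareto_frontier(quoted))  # [('hourglass', 66000000, 22.31)]
```

Because both models reach the same PSNR, the larger one is dominated and drops off the frontier.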
Furthermore, the study revealed distinct scaling patterns for optimal Hourglass configurations. As parameter budgets increased, the best-performing Hourglass networks favored deeper structures with wider skip connections and narrower bottleneck dimensions. This contrasts sharply with conventional MLPs, which often rely on shallower depths and very wide hidden layers.
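Simple arithmetic helps explain why narrower bottlenecks pair naturally with deeper stacks. The sketch below counts how many residual blocks fit in a fixed budget. The budget and layer sizes are hypothetical rather than configurations from the paper, and the cost of the input projection is ignored.

```python
def block_params(skip_dim, inner_dim):
    """Weights plus biases of one residual block: skip_dim -> inner_dim -> skip_dim."""
    return 2 * skip_dim * inner_dim + skip_dim + inner_dim


def max_depth(budget, skip_dim, inner_dim):
    """Largest number of identical blocks whose parameters fit in the budget."""
    return budget // block_params(skip_dim, inner_dim)


budget = 20_000_000  # hypothetical parameter budget for the block stack

shapes = {
    "conventional (skip 1024, hidden 4096)": (1024, 4096),
    "hourglass (skip 4096, bottleneck 1024)": (4096, 1024),
    "hourglass (skip 4096, bottleneck 256)": (4096, 256),
}

for label, (skip_dim, inner_dim) in shapes.items():
    print(label, "->", max_depth(budget, skip_dim, inner_dim), "blocks")
# conventional (skip 1024, hidden 4096) -> 2 blocks
# hourglass (skip 4096, bottleneck 1024) -> 2 blocks
# hourglass (skip 4096, bottleneck 256) -> 9 blocks
```

Swapping the two widths leaves the per-block cost unchanged, since the count is symmetric in the skip and inner dimensions. What buys extra depth is shrinking the bottleneck while the skip connection stays wide.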
Implications for Future Architectures
The findings suggest a significant reconsideration of skip connection placement in modern neural network architectures. The principles demonstrated by the Hourglass MLP could extend beyond simple MLPs to more complex residual networks, including Transformers and U-Net architectures. For Transformers, adapting this ‘wide-narrow-wide’ intuition would involve coordinated modifications to both self-attention and feed-forward layers, potentially leading to more compute-optimal designs with reduced parameter counts.
This research opens up new avenues for designing more efficient and powerful neural networks, pushing the boundaries of what’s possible in deep learning by rethinking fundamental architectural conventions.


