TLDR: Celestial AI’s Photonic Fabric Appliance (PFA) is a hardware platform that uses photonics to provide high-bandwidth, low-latency, energy-efficient shared memory and switching for AI accelerators. It targets limitations of current hardware, such as fixed memory-to-compute ratios, by offering up to 32 TB of shared memory and 115 Tbps of switching. Simulations show up to 7.04x higher throughput for LLM inference and 60-90% lower data-movement energy for heavy collective operations in LLM training, compared with traditional GPU setups.
As Artificial Intelligence (AI) models, especially Generative AI, continue to grow exponentially in size, the hardware designed to run them faces significant challenges. Traditional accelerator designs often hit a ‘silicon beachfront constraint,’ limiting the amount of memory directly attached to a processor and creating bottlenecks in data movement. This can lead to higher latency, lower bandwidth, and increased energy consumption, hindering the efficient scaling of large AI workloads like training and inference for Large Language Models (LLMs).
To address these issues, Celestial AI introduces the Photonic Fabric™ and the Photonic Fabric Appliance™ (PFA). The platform uses photonics, meaning light rather than electrical signaling for data transfer, to build an efficient and scalable memory and switching subsystem for AI accelerators.
What is the Photonic Fabric Appliance (PFA)?
The PFA is a rack-mountable system that integrates HBM3E memory, an on-module photonic switch, and external DDR5 memory within a compact 2.5D electro-optical system-in-package. This design lets the PFA offer 32 terabytes (TB) of shared memory capacity and 115 terabits per second (Tbps) of all-to-all digital switching. In effect, it creates a large shared memory pool that multiple AI processors (XPUs) can access at high bandwidth and low latency.
A core advantage of the Photonic Fabric is its ability to disaggregate memory from compute. This means that instead of being limited by the fixed memory capacity on an individual XPU, processors can tap into a much larger, flexible pool of memory. For instance, an XPU can seamlessly expand its memory capacity from its on-package HBM to up to 2 TB, and even further to 4 TB or 6 TB as more modules are added. This flexibility is crucial for handling the ever-growing memory demands of modern AI models.
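To make the capacity argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not from Celestial AI: the 2-byte (FP16/BF16) weight format, the 80 GB per-device figure (the HBM capacity of an H100 SXM), and the assumption that each added module contributes 2 TB are illustrative choices, and all function names are hypothetical.

```python
import math

GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes

def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Memory for model weights only (no KV cache, activations, or optimizer state)."""
    return num_params * bytes_per_param

def devices_needed(total_bytes: int, bytes_per_device: int) -> int:
    """Minimum number of fixed-capacity devices required just to hold total_bytes."""
    return math.ceil(total_bytes / bytes_per_device)

def fabric_capacity_bytes(num_modules: int, tb_per_module: int = 2) -> int:
    """Assumed expansion: each module adds tb_per_module TB (2, 4, 6 TB, ...)."""
    return num_modules * tb_per_module * TB

for name, params in [("405B", 405 * 10**9), ("1T", 10**12)]:
    w = weight_bytes(params)
    n = devices_needed(w, 80 * GB)
    print(f"{name}: {w / TB:.2f} TB of weights, at least {n} devices of 80 GB")

for modules in (1, 2, 3):
    print(f"{modules} module(s): {fabric_capacity_bytes(modules) // TB} TB")
```

Under these assumptions, the weights of a 405-billion parameter model alone occupy 0.81 TB, which already spans 11 devices of 80 GB each, while a single 2 TB expansion would hold them with room left for KV cache and activations.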
Simulated Performance and Energy Savings
To evaluate the PFA’s impact, Celestial AI developed CelestiSim, a lightweight analytical simulator validated against real-world NVIDIA H100 and H200 systems. The simulation results are compelling, demonstrating significant performance improvements and energy savings across various AI workloads.
For LLM inference, the PFA shows remarkable gains:
- For a 405-billion parameter model, it achieves up to 3.66 times higher throughput and 1.40 times lower latency.
- For a projected 1-trillion parameter model, these benefits are even more pronounced, with up to 7.04 times higher throughput and 1.41 times lower latency.
These improvements are largely due to the PFA’s ample memory capacity, which allows for larger batch sizes and eliminates the overhead associated with inter-GPU communication and redundant memory accesses often seen in traditional setups that rely on techniques like tensor parallelism.
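The link between memory capacity and batch size can be sketched with a rough KV-cache estimate. The code below is an illustration under stated assumptions, not CelestiSim or the paper's method: the model shape (126 layers, 8 key-value heads, head dimension 128), the FP16 precision, the 8,192-token context, and the two capacity figures are all hypothetical inputs chosen for the example.

```python
GB = 10**9
TB = 10**12

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes stored per token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_batch(capacity_bytes: int, weights_bytes: int,
              context_len: int, per_token_bytes: int) -> int:
    """Largest number of full-length sequences whose KV cache fits beside the weights."""
    free = capacity_bytes - weights_bytes
    if free <= 0:
        return 0  # the weights alone do not fit
    return free // (context_len * per_token_bytes)

# Hypothetical 405B-parameter dense model with grouped-query attention.
per_token = kv_bytes_per_token(layers=126, kv_heads=8, head_dim=128)
weights = 405 * 10**9 * 2  # FP16 weights, in bytes

for label, capacity in [("16 x 80 GB aggregate", 16 * 80 * GB),
                        ("2 TB shared pool", 2 * TB)]:
    batch = max_batch(capacity, weights, context_len=8192, per_token_bytes=per_token)
    print(f"{label}: up to {batch} concurrent sequences")
```

With these inputs, each sequence needs about 4.2 GB of KV cache, so the larger pool leaves room for roughly 2.5 times as many concurrent sequences. The estimate ignores activations, fragmentation, and any replication that parallelism schemes introduce, so it is an upper bound rather than a prediction.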
Beyond performance, the PFA also delivers substantial energy efficiency. For heavy collective operations in LLM training scenarios, the Photonic Fabric can reduce energy consumption in data movement by 60-90%. This is particularly impactful for bandwidth-intensive operations like tensor parallelism and memory offloading, where the photonic network drastically lowers the energy cost per bit transferred.
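A percentage reduction and an "N times lower" factor describe the same thing in two ways, and converting between them makes the figures easier to compare. The short sketch below only restates the article's numbers; the helper names are hypothetical.

```python
def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional reduction (0.6 means 60% less) to an 'N times lower' factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1 / (1 - reduction)

def factor_to_reduction(factor: float) -> float:
    """Convert an 'N times lower' factor to a fractional reduction."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return 1 - 1 / factor

# 60-90% less data-movement energy, expressed as a factor.
for r in (0.60, 0.90):
    print(f"{r:.0%} reduction = {reduction_to_factor(r):.1f}x lower energy")

# The reported latency factors, expressed as percentage reductions.
for f in (1.40, 1.41):
    print(f"{f:.2f}x lower latency = {factor_to_reduction(f):.1%} reduction")
```

Read this way, a 60-90% saving means data movement costs between 2.5 and 10 times less energy, and a 1.40 times lower latency is a reduction of roughly 29%.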
The benefits extend to Deep Learning Recommendation Models (DLRM) as well. For tasks like embedding pooling, which involve massive embedding tables and low arithmetic intensity, the PFA demonstrates an average performance improvement of 22.8 times compared to GPUs linked via NVLink. This is attributed to the PFA’s shared storage and low per-bit photonic energy costs.
Looking Ahead
The Photonic Fabric Appliance is a notable step forward in AI hardware, using photonic interconnects to make large AI deployments more scalable and energy-efficient. Celestial AI plans to keep developing the Photonic Fabric: future generations are expected to increase the number of photonic ports, the number of wavelengths, and the per-link bandwidth, and to support next-generation memory technologies such as HBM4. The aim is to ease the scaling challenges in AI and to encourage closer hardware-software co-design for large-scale machine learning. More detail is available in the accompanying research paper.


