AMD Instinct MI355X

Posted on 14 August, 2025

AMD’s Instinct MI355X is one of the first accelerators built on the new CDNA 4 architecture (introduced alongside the MI350X), and it marks a decisive step forward in AMD’s strategy for data centre AI and HPC. Rather than being a minor iteration, the MI355X has been engineered from the ground up to address the requirements of large-scale training and inference workloads, as well as mixed AI/HPC applications where both low-precision tensor math and high-precision FP64 performance matter.

The design focus is clear: combine extreme on-package memory capacity with massive memory bandwidth, so that even 400-billion-parameter models can be trained with minimal sharding, while at the same time pushing low-precision (FP4/FP6/FP8) throughput to new heights. These AI-centric improvements do not come at the expense of traditional HPC capability: FP64 throughput has been strengthened and interconnect latency reduced to support scientific computing codes, weather modelling, CFD and simulation workloads.

Architectural Overview

With CDNA 4, AMD has overhauled its Matrix Core engines, adding native support for ultra-low-precision FP4 and FP6 formats. These new cores can execute far more operations per cycle than their predecessors, effectively doubling dense AI throughput compared with CDNA 3 at equivalent clock speeds. Inside each MI355X, eight CDNA 4 compute dies are linked together over fourth-generation Infinity Fabric, forming a single high-bandwidth complex. Surrounding these dies are eight stacks of HBM3E, providing 288 GB of on-package memory and an aggregate bandwidth of 8 TB/s. To keep that bandwidth efficient under irregular access patterns, a 256 MB last-level Infinity Cache sits between the fabric and the memory controllers, smoothing out bursts and reducing latency for less predictable workloads.
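As a rough sense check on how compute and bandwidth balance out, a back-of-the-envelope roofline calculation using the peak figures quoted for the MI355X (marketing numbers, not measured ones) shows the arithmetic intensity a kernel needs before it stops being limited by the memory system:

# Back-of-the-envelope roofline balance point, using the peak figures quoted
# for the MI355X in the table below. Illustrative only; these are not
# measured numbers.
peak_fp8_flops = 10e15        # ~10 PFLOPS (FP8, 2:1 sparse)
peak_hbm_bandwidth = 8e12     # 8 TB/s of HBM3E bandwidth

balance_point = peak_fp8_flops / peak_hbm_bandwidth
print(f"Compute-bound only above ~{balance_point:.0f} FLOPs per byte of HBM traffic")
# Kernels below that arithmetic intensity (attention, embedding lookups,
# sparse HPC codes) live or die by the 8 TB/s memory system and the cache.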

READ WHITEPAPER - INTRODUCING AMD CDNA™ 4 ARCHITECTURE

Key Specifications

GPU      Architecture   Memory         Bandwidth (TB/s)   Peak AI* (PFLOPS)   FP64 (TFLOPS)   TBP (kW)
MI355X   CDNA 4         288 GB HBM3E   8.0                10                  78.6            1.4
MI300X   CDNA 3         192 GB HBM3    5.3                5.2                 81.7            0.75
B200     Blackwell      192 GB HBM3E   8.0                20                  ≈40†            1.0

*Peak AI refers to FP8 throughput with 2:1 sparsity unless otherwise noted. 

†NVIDIA has not published B200 FP64 numbers; 40 TFLOPS is an analyst estimate. 

Generation-over-Generation Progress 

Relative to the MI300X, the MI355X represents a substantial generational leap rather than a small refresh. Memory capacity grows from 192 GB to 288 GB, a 50% increase, meaning that models with hundreds of billions of parameters can stay resident on the accelerators, reducing offloading requirements. The memory subsystem also gets a matching bandwidth boost, jumping from 5.3 TB/s to 8 TB/s, a roughly 51% uplift that keeps those extra gigabytes fed even under bandwidth-hungry workloads.
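To put those capacity figures in context, here is a minimal sketch that counts weights only (activations, KV cache and optimiser state are deliberately ignored, so real footprints are larger) and estimates how many accelerators are needed just to hold a model's parameters at different precisions:

import math

# Weight-only footprint: parameters x bytes per parameter. Illustrative only;
# activations, KV cache and optimiser state would add to these numbers.
def gpus_needed(params_billion, bits_per_param, gpu_memory_gb):
    weight_gb = params_billion * 1e9 * bits_per_param / 8 / 1e9
    return weight_gb, math.ceil(weight_gb / gpu_memory_gb)

for bits in (16, 8, 4):
    gb, n_mi355x = gpus_needed(400, bits, 288)   # MI355X: 288 GB per GPU
    _, n_192gb = gpus_needed(400, bits, 192)     # MI300X / B200 class: 192 GB
    print(f"400B params @ {bits}-bit: {gb:.0f} GB of weights -> "
          f"{n_mi355x} x 288 GB GPUs vs {n_192gb} x 192 GB GPUs")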

On the compute side, dense FP4 and FP6 throughput effectively doubles thanks to CDNA 4’s redesigned matrix cores and more efficient dataflow scheduling. This is particularly important for transformer-based language models, where low-precision math dominates training and inference. 
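Upstream PyTorch does not yet expose FP4 or FP6 tensor types, but its FP8 dtypes give a feel for what low-precision weight handling looks like at the framework level. The sketch below is purely illustrative and is not the CDNA 4 matrix-core path itself:

import torch

# Illustrative FP8 round-trip using upstream PyTorch (2.1 or later). This is a
# framework-level sketch, not AMD's hardware matrix-core code path.
torch.manual_seed(0)
w = torch.randn(1024, 1024)
x = torch.randn(16, 1024)

# Scale into the FP8 e4m3 representable range, cast down, then cast back up.
scale = w.abs().max() / 448.0                 # 448 = max normal value of e4m3
w_fp8 = (w / scale).to(torch.float8_e4m3fn)   # 1 byte per weight
w_deq = w_fp8.to(torch.float32) * scale

rel_err = (x @ w - x @ w_deq).norm() / (x @ w).norm()
print(f"Relative matmul error after the FP8 round-trip: {rel_err:.4f}")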

These improvements do come with a higher power envelope, however: 750 W on the MI300X versus approximately 1.4 kW on the MI355X. Efficiency per watt is nonetheless significantly improved; even with the extra power draw, inference runs of Llama 3 in FP4 mode show roughly 30% more tokens per watt than the MI300X. That gain comes from the chip doing more work per joule, which in practice translates into higher throughput per rack unit and better overall energy economics for dense AI deployments.
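The arithmetic behind that perf-per-watt claim is simple enough to sketch with normalised, illustrative numbers (not measured data):

# Normalised, illustrative numbers only; the point is the shape of the sum,
# not the absolute values.
mi300x_power_kw = 0.75
mi355x_power_kw = 1.4
perf_per_watt_gain = 1.30          # "roughly 30% more tokens per watt"

# Throughput multiple the MI355X must reach to justify its power budget:
required_speedup = perf_per_watt_gain * (mi355x_power_kw / mi300x_power_kw)
print(f"MI355X needs ~{required_speedup:.1f}x the MI300X's throughput "
      f"to deliver ~30% more tokens per watt at 1.4 kW")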

Competitive Landscape 

The table above compares AMD’s MI355X and MI300X with NVIDIA’s B200 GPU in isolation. In reality, however, NVIDIA’s most relevant production deployment platform is the GB200 Grace-Blackwell Superchip, not the standalone B200.

While the B200 is a discrete GPU (similar in scope to the MI355X), the GB200 pairs two B200 GPUs with an Arm-based Grace CPU over NVLink-C2C in a single module, creating a CPU-GPU superchip with shared memory access and very low interconnect latency. This approach removes the need for an external x86 CPU, giving NVIDIA an integrated solution optimised for large AI and HPC workloads. By contrast, the MI355X is GPU-only, designed to pair with external EPYC CPUs over PCIe or Infinity Fabric. This offers flexibility, but it means CPU-GPU communication depends on the host interconnect.

This shows up in raw FP8 compute, where the GB200 platform inevitably has the edge: the sheer number of B200 dies stitched together gives it very high dense throughput for well-tuned CUDA workloads.

However, the MI355X takes a different approach: each accelerator ships with 288 GB of HBM3E and 8 TB/s of bandwidth. That 50% bump in memory per device means a single GPU can hold larger model segments, reducing the complexity and synchronisation costs of pipeline and tensor parallelism. For extremely large models, think 400 billion parameters and above, this can be just as important as raw FLOPS. Fewer shards mean less communication overhead, fewer idle bubbles in the training schedule and ultimately higher end-to-end throughput.
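To make the sharding argument concrete, here is a minimal sketch of column-wise tensor parallelism, simulated on a single device; on real hardware each shard lives on its own GPU and the final concatenation becomes an all-gather over the interconnect:

import torch

# Column-wise tensor parallelism, simulated on one device for illustration.
def column_parallel_matmul(x, w, num_shards):
    shards = torch.chunk(w, num_shards, dim=1)     # one weight slice per "GPU"
    partials = [x @ shard for shard in shards]     # independent local matmuls
    return torch.cat(partials, dim=1)              # the communication step

x = torch.randn(4, 1024)
w = torch.randn(1024, 4096)
reference = x @ w
for shards in (2, 4, 8):
    assert torch.allclose(reference, column_parallel_matmul(x, w, shards), atol=1e-4)
    print(f"{shards} shards: same result, but {shards} partial outputs to gather")

Every extra shard is another partial result that has to be gathered on every step, which is exactly the overhead a larger per-GPU memory footprint lets you avoid.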

For organisations that care about scaling efficiency rather than only peak compute, the MI355X’s generous memory footprint is a key differentiator that can offset GB200’s brute force advantage.

Recommended Workloads 

  • Large-language-model training and inference up to 400 B parameters without off-chip parallelism. 
  • Graph neural networks that benefit from high memory bandwidth and capacity. 
  • Mixed AI/HPC workflows requiring both FP64 and low-precision acceleration (e.g., seismic inversion + transformer surrogate). 
  • Finite-element solvers where 288 GB per GPU reduces checkpointing overhead.

Power, Cooling and Deployment 

The MI355X is delivered in the OCP OAM/UBB 2.0 form factor and is designed for direct-to-chip (cold plate) liquid cooling to sustain its full 1.4 kW per module. The accelerators alone in a standard 8-GPU UBB 2.0 node draw around 11 kW at full power. While liquid cooling is the norm for the MI355X, air-cooled variants (MI350X) exist in the same form factor with reduced power limits. Infinity Fabric provides a fully connected 8-GPU mesh inside the chassis, eliminating the need for external switches for common eight-GPU training topologies.
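For facility planning, the back-of-the-envelope arithmetic looks roughly like this (the non-GPU figures are assumptions for illustration only):

# Illustrative power arithmetic for an 8-GPU UBB 2.0 node.
gpus = 8
gpu_tbp_kw = 1.4
host_overhead_kw = 2.0            # assumed: dual EPYC CPUs, NICs, fans, losses
rack_budget_kw = 50               # assumed facility budget per rack

gpu_kw = gpus * gpu_tbp_kw        # ~11.2 kW from the accelerators alone
node_kw = gpu_kw + host_overhead_kw
print(f"Accelerators: {gpu_kw:.1f} kW, whole node: ~{node_kw:.1f} kW")
print(f"Roughly {int(rack_budget_kw // node_kw)} such nodes per {rack_budget_kw} kW rack")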

When it comes to deployment, these servers will require deep cabinets (as well as deep pockets!). The Supermicro AS-4126GS-NMR-LCC, for example, is a 4U server with a depth of nearly 900 mm (895.35 mm to be exact), supporting dual EPYC CPUs alongside eight onboard MI355X accelerators.

Software Ecosystem 

Historically, software has been AMD’s biggest challenge. With the MI355X, ROCm 7 takes a major step forward by expanding framework compatibility: upstream PyTorch and TensorFlow support, ONNX Runtime integration, Triton 3.1 backend support and first-class FP4/FP6 datatypes in HIPRT and MIOpen. The HIP interface continues to ease migration from CUDA, with HIP-NVCC tooling now covering about 92 % of CUDA 12.5 device APIs out-of-the-box. ROCm 7 also adds graph-capture primitives aligned with PyTorch 2.4 and offers ready-to-use container images (PyTorch, TensorFlow, JAX) on Docker Hub within 24 hours of each minor release. 
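One practical upshot of the upstream framework support is that most existing PyTorch code runs unchanged, since the ROCm build reuses the familiar torch.cuda namespace. A minimal capability check, assuming a ROCm build of PyTorch is installed, might look like this:

import torch

# On a ROCm build of PyTorch the familiar torch.cuda API is backed by HIP,
# so unmodified CUDA-era scripts generally run as-is on Instinct GPUs.
print("ROCm/HIP build:", torch.version.hip is not None)  # None on CUDA-only builds
print("Accelerators visible:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = x @ x                          # executed through ROCm's BLAS libraries
    torch.cuda.synchronize()
    print("Matmul OK:", tuple(y.shape))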

While CUDA still dominates in ecosystem maturity, tooling and pretrained model availability, there is growing recognition that the AI industry cannot remain tied to a single vendor.

ROCm 7 is AMD’s clearest signal yet that the gap is narrowing.

Verdict 

With CDNA 4, 288 GB of HBM3E, and ROCm 7, the MI355X positions AMD as a credible alternative to NVIDIA’s Blackwell generation. Its chief advantages are superior on-package memory capacity and competitive FP4/FP6 efficiency, balanced against a higher 1.4 kW thermal budget that mandates liquid cooling. For deployments that value memory-per-GPU and double-precision throughput as much as peak FP8 TOPS, the MI355X merits serious consideration.

Benchmarks are where the MI355X starts to separate itself from its predecessors and NVIDIA’s latest GPUs. In large-language-model inference, early third-party testing has shown up to 2x the throughput of NVIDIA’s B200 on models such as Llama 3.1 405B when using FP4 precision. While these figures come from limited engineering samples, they hint at the efficiency gains available when FP4/FP6 datatypes are combined with massive on-package memory bandwidth.
 
Training workloads, such as GPT-style models at scales beyond 100B parameters, have also benefited from the increased memory (288 GB vs. 192 GB on B200). This allows entire model stages to remain resident on a single GPU, significantly reducing pipeline bubbles. Dense FP8 throughput sees roughly a 30% improvement over MI300X, while FP16/FP32 workloads are closer to a 20% uplift. 
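The pipeline-bubble effect is easy to quantify with the standard GPipe-style estimate, in which the idle fraction of a step is roughly (p − 1) / (m + p − 1) for p pipeline stages and m micro-batches; fewer, fatter stages per GPU shrink it directly:

# Classic GPipe-style estimate of pipeline idle time ("bubble"):
#   bubble fraction ~= (p - 1) / (m + p - 1)
# for p pipeline stages and m micro-batches per optimiser step.
def bubble_fraction(stages, microbatches):
    return (stages - 1) / (microbatches + stages - 1)

microbatches = 32
for stages in (4, 8, 16):
    print(f"{stages:>2} stages: ~{bubble_fraction(stages, microbatches):.1%} of the step idle")
# More memory per GPU means the same model fits in fewer, fatter stages,
# so a larger share of each step is spent doing useful work.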
 
Where the MI355X really shines is in mixed workloads that require a combination of tensor-heavy operations and occasional double-precision math. CFD, seismic modelling and financial Monte Carlo simulations show near-linear scaling across two GPUs, aided by the upgraded Infinity Fabric interconnect.

Compared to NVIDIA’s B200, the MI355X lags slightly in per-watt efficiency but gains an edge in memory-bound tasks. With large parameter models, MI355X avoids offloading to slower system memory, a factor that often dwarfs raw TFLOPS. For customers training giant LLMs, this alone can offset any slight gap in dense FP8 compute.

Test drives of the MI355X will be available soon via Boston Labs, our onsite R&D and test facility.

The team are ready to enable customers to test-drive the latest technology on-premises or remotely via our fast internet connectivity. Call us on 01727 876100 and one of our experienced sales engineers will happily guide you to your perfect tailored solution and invite you in for a demo, or fill out the form below to register your interest and join our waiting list.

Author

Hemant Mistry

Solutions Delivery Manager

Boston Limited

Tags: Boston labs, MI355X, AMD, GPU, Supermicro, Liquid Cooled, Direct to Chip, Test Drive, Remote Testing
