Among Nvidia's products sold in China, the H20 artificial intelligence chip has garnered the most attention.
It is reported that this AI chip, which Nvidia developed for the Chinese market, has gotten off to a weak start, with supply outstripping demand. Nvidia is now planning to cut the price of the H20 chips it supplies to the Chinese market.
The Chinese market contributed 17% of Nvidia's revenue in fiscal year 2024. The downward pressure on its AI chip prices highlights the challenges facing Nvidia's business in China and casts a shadow over its future in the Chinese market.
Growing competitive pressure in China has also sounded the alarm for Nvidia investors, even though the company's stock price continued its impressive upward momentum after it announced a strong revenue forecast on May 22.
The H20 is the most powerful of the three AI chips (HGX H20, L20 PCIe, and L2 PCIe) that Nvidia has developed for the Chinese market, but its computing power is lower than that of Nvidia's flagship H100 and of the H800, the earlier chip Nvidia developed specifically for the Chinese market.
Looking at the three models H20, L20, and L2, the H20 appears to be a training card, while the L20 and L2 appear to be inference cards. The H20 is based on the latest Hopper architecture, while the L20 and L2 are based on the Ada Lovelace architecture.

Based on previously disclosed specifications, the H20 has 96 GB of memory with a bandwidth of up to 4.0 TB/s and delivers 296 TFLOPS of compute. It uses the GH100 die, but its performance density (TFLOPS per unit of die size) is only 2.9. This means the H20's AI compute is less than 15% of the H100's.
The H20 has a larger cache and higher memory bandwidth than Huawei's Ascend 910B, with roughly twice the 910B's bandwidth. That gives the H20 an advantage in interconnect speed, which determines how fast data moves between chips. In application environments where large numbers of chips must be linked together to work as one system, the H20 therefore remains competitive against the 910B, and training large models is exactly such a scenario.
Currently, the Huawei Ascend community has publicly released three models of the Atlas 300T product, corresponding to the Ascend 910A, 910B, and 910 Pro B, each with a maximum power consumption of 300 W. The first two deliver 256 TFLOPS of AI compute each, while the 910 Pro B reaches 280 TFLOPS (FP16).
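As a rough illustration of that comparison, here is a minimal Python sketch using the figures quoted above. The 910B's bandwidth is not stated directly, so it is assumed here from the claim that the H20 offers roughly twice the 910B's bandwidth.

```python
# Figures as quoted in the article; the 910B bandwidth is an assumption
# derived from "roughly twice the 910B's bandwidth".
specs = {
    "H20":         {"fp16_tflops": 296, "bandwidth_tb_s": 4.0},
    "Ascend 910B": {"fp16_tflops": 256, "bandwidth_tb_s": 2.0},  # bandwidth assumed
}

for name, s in specs.items():
    print(f"{name:>12}: {s['fp16_tflops']} TFLOPS, {s['bandwidth_tb_s']} TB/s")

# Bandwidth largely determines how fast data moves between chips when many
# of them are linked into one training system.
ratio = specs["H20"]["bandwidth_tb_s"] / specs["Ascend 910B"]["bandwidth_tb_s"]
print(f"H20 bandwidth is about {ratio:.1f}x the 910B's")
```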
For comparison, the H100 has 80 GB of HBM3 memory with a bandwidth of 3.4 TB/s and a theoretical peak of 1979 TFLOPS. Its performance density (TFLOPS per unit of die size) is as high as 19.4, making it the most powerful GPU in Nvidia's current product line.
The H20 has 96 GB of HBM3 memory with a bandwidth of up to 4.0 TB/s, both higher than the H100's, but its compute is only 296 TFLOPS, with a performance density of 2.9, far below the H100. In theory, the H100 is 6.68 times faster than the H20. It is worth noting, however, that this comparison is based on FP16 Tensor Core floating-point throughput with sparsity enabled (which greatly reduces the amount of computation and therefore inflates the speed figure), so it does not fully reflect either chip's real-world computing power.
In addition, the H20's thermal design power is 400 W, lower than the H100's 700 W. In the HGX platform (Nvidia's GPU server solution), it can be configured with 8 GPUs. It also retains the 900 GB/s NVLink high-speed interconnect and supports 7 MIG (Multi-Instance GPU) instances.
H100 SXM, FP16 Tensor Core (with sparsity): 1979 TFLOPS
H20 SXM, FP16 Tensor Core (with sparsity): 296 TFLOPS
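As a quick sanity check on the 6.68x figure, here is a minimal sketch using the sparsity-enabled FP16 Tensor Core numbers listed above.

```python
# Peak FP16 Tensor Core throughput with sparsity, as listed above.
h100_tflops = 1979
h20_tflops = 296

print(f"H100 / H20: {h100_tflops / h20_tflops:.2f}x")             # ~6.7x, the article's 6.68 figure
print(f"H20 as a share of H100: {h20_tflops / h100_tflops:.1%}")  # ~15%, matching "less than 15%"
```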
According to Peta's LLM performance comparison model, the H20 achieves a peak tokens-per-second throughput at moderate batch sizes that is 20% higher than the H100's, and its token-to-token latency at low batch sizes is 25% lower than the H100's. This is because the number of chips required for inference drops from two to one: with 8-bit quantization, the LLaMA 70B model can run effectively on a single H20 instead of requiring two H100s.
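To make the memory arithmetic behind that claim concrete, here is a rough, weights-only sketch. It ignores KV cache and activation overhead, so the figures are illustrative rather than exact.

```python
# Approximate GPU memory needed just to hold a model's weights.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes, expressed in GB

fp16_gb = weight_memory_gb(70, 2)  # ~140 GB: exceeds one H100 (80 GB), so two are needed
int8_gb = weight_memory_gb(70, 1)  # ~70 GB: fits in a single H20's 96 GB with headroom

print(f"FP16 weights: {fp16_gb:.0f} GB -> needs 2x H100 (80 GB each)")
print(f"INT8 weights: {int8_gb:.0f} GB -> fits on 1x H20 (96 GB)")
```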
It is also worth noting that although the H20's 296 TFLOPS is far below the H100's 1979, the H100 currently achieves a model FLOPS utilization (MFU) of only 38.1%. If the H20 achieves a high MFU, it could sustain roughly 270 TFLOPS in practice, and its performance in a real multi-card interconnect environment would then approach 50% of the H100's.
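A minimal sketch of that MFU reasoning, where effective throughput is peak compute multiplied by achieved utilization. The H20 utilization value here is an assumption chosen to reproduce the roughly 270 TFLOPS figure above.

```python
# Effective throughput = peak TFLOPS x model FLOPS utilization (MFU).
def effective_tflops(peak_tflops: float, mfu: float) -> float:
    return peak_tflops * mfu

h20_effective = effective_tflops(296, 0.91)  # assumed ~91% MFU -> ~270 TFLOPS
h100_mfu = 0.381                             # H100 utilization figure quoted above

print(f"H20 effective throughput: ~{h20_effective:.0f} TFLOPS")
print(f"H100 MFU for reference: {h100_mfu:.1%}")
```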
From the perspective of traditional computing, the H20 is somewhat cut down relative to the H100. In LLM inference, however, the H20 is actually more than 20% faster than the H100, because in some respects the H20 resembles the H200 due for release next year. Note that the H200 is the successor to the H100, a super chip for complex AI and HPC workloads.

Meanwhile, the L20 is equipped with 48 GB of memory and 239 TFLOPS of compute, while the L2 comes with 24 GB of memory and 193 TFLOPS. The L20 is derived from the L40 and the L2 from the L4, but these two chips are not commonly used for LLM inference and training.
Both the L20 and L2 adopt the PCIe form factor, which is suitable for workstations and servers. Compared with higher-end models such as the Hopper-based H800 and the Ampere-based A800, their configurations are more streamlined.
However, Nvidia's software stack for AI and high-performance computing is so valuable to some customers that they are unwilling to give up the Hopper architecture, even if it means downgrading the specifications.
L40, FP16 Tensor Core (with sparsity): 362 TFLOPS
L20, FP16 Tensor Core (with sparsity): 239 TFLOPS
L4, FP16 Tensor Core (with sparsity): 242 TFLOPS
L2, FP16 Tensor Core (with sparsity): 193 TFLOPS
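A quick check, using the sparsity figures listed above, of how far the China-specific PCIe cards are cut down relative to the parts they derive from:

```python
# (cut-down card, base card) peak TFLOPS pairs from the list above.
pairs = {"L20 vs L40": (239, 362), "L2 vs L4": (193, 242)}

for name, (cut_down, base) in pairs.items():
    print(f"{name}: {cut_down / base:.0%} of the base card's throughput")
# L20 retains roughly 66% of the L40; L2 retains roughly 80% of the L4.
```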
Turning to the mass-production progress of the H200: in March of this year, Nvidia announced that it had begun shipping the cutting-edge H200 GPU. The H200 is a chip designed for artificial intelligence, with performance surpassing the current flagship H100, and Nvidia keeps launching its latest AI chips with the aim of maintaining a high market share. Subsequently, in April, OpenAI President and Co-founder Greg Brockman revealed on the social media platform X that Nvidia had delivered the world's first DGX H200 to OpenAI, posting a photo of himself with OpenAI CEO Sam Altman and Nvidia CEO Jensen Huang at the handover. Brockman said the machine, presented by Huang, "will advance AI, computing, and human civilization." However, Nvidia has not disclosed the H200's price.