Technical Challenges to Scale Beyond GPT4 to 100K H100s | NextBigFuture.com
Up until late 2024, no one had been able to massively increase the amount of compute dedicated to a single model beyond the level of OpenAI's GPT-4. This information is from SemiAnalysis and the EIA.
Google’s Gemini Ultra, Nvidia Nemotron 340B, and Meta LLAMA 3 405B had similar or slightly more compute than GPT-4, but inferior architectures were used. Those models did not unlock new capabilities.
A 100,000 GPU cluster:
needs about 150 MW of datacenter capacity
uses 1.59 terawatt-hours of electricity in a single year
incurs energy costs of $123.9 million per year at a standard rate of $0.078/kWh (a quick arithmetic check follows below)
costs about $4 billion for the 100,000 H100 GPU servers
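As a sanity check on those figures, here is a minimal arithmetic sketch. The implied PUE (facility overhead) value is an inference used to reconcile the 150 MW of capacity with the 1.59 TWh annual energy figure; it is not a number from the source.

```python
# Quick arithmetic check of the power and cost figures above.
# Assumption: the gap between 150 MW of IT capacity and the implied average
# draw is facility overhead (PUE); that interpretation is ours, not the source's.
annual_twh = 1.59
price_per_kwh = 0.078
it_capacity_mw = 150

annual_kwh = annual_twh * 1e9                 # 1 TWh = 1e9 kWh
energy_cost = annual_kwh * price_per_kwh      # ~ $124 million per year
avg_draw_mw = annual_twh * 1e6 / 8760         # ~181 MW average draw
implied_pue = avg_draw_mw / it_capacity_mw    # ~1.21 facility overhead

print(f"${energy_cost/1e6:.1f}M per year, {avg_draw_mw:.0f} MW avg, PUE ~{implied_pue:.2f}")
```

Running this gives roughly $124 million per year, in line with the $123.9 million figure quoted above.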
OpenAI began training GPT-5 around May 2024.
OpenAI’s GPT-4 training run used roughly 21.5 million ExaFLOPs of BF16 compute on ~20,000 A100s over 90 to 100 days. A 100k H100 cluster will have 15 to 31 times that compute.
A 100k H100 cluster training run over 100 days can deliver roughly 600 million ExaFLOPs of effective compute, because hardware reliability problems and other inefficiencies hold effective compute to about 35% of the theoretical peak.
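A back-of-the-envelope check ties these numbers together. The sketch below assumes NVIDIA's published dense throughput figures (roughly 989 BF16 TFLOPS and 1,979 FP8 TFLOPS per H100 SXM, and 312 BF16 TFLOPS per A100); under those assumptions, the 15-31x range corresponds to running in BF16 versus FP8, and 600 million ExaFLOPs works out to about 35% of the 100-day FP8 peak.

```python
# Back-of-the-envelope check of the compute claims above.
# Assumptions: dense per-GPU throughput of ~989 BF16 / ~1,979 FP8 TFLOPS (H100 SXM)
# and ~312 BF16 TFLOPS (A100), per NVIDIA's published specifications.
H100_BF16, H100_FP8, A100_BF16 = 989e12, 1979e12, 312e12

# 15-31x claim: 100k H100s vs ~20k A100s, in BF16 and FP8 respectively.
ratio_bf16 = (100_000 * H100_BF16) / (20_000 * A100_BF16)   # ~15.9x
ratio_fp8  = (100_000 * H100_FP8)  / (20_000 * A100_BF16)   # ~31.7x

# 600 million ExaFLOPs claim: 100 days at ~35% effective utilization in FP8.
seconds = 100 * 24 * 3600
effective_exaflops = 100_000 * H100_FP8 * seconds * 0.35 / 1e18

print(f"{ratio_bf16:.1f}x, {ratio_fp8:.1f}x, {effective_exaflops/1e6:.0f} million ExaFLOPs")
# -> roughly 15.9x, 31.7x, and ~600 million ExaFLOPs, matching the text.
```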
To understand network design, topology, reliability concerns, and checkpointing strategies, we need to understand how LLMs handle data and minimize data movement during training.
There are 3 different types of parallelism used in trillion parameter training – Data Parallelism, Tensor Parallelism, and Pipeline Parallelism.
Data Parallelism is the simplest form of parallelism, in which each GPU holds a full copy of the model weights and each GPU (rank) receives a different subset of the data. This type of parallelism has the lowest communication volume, since only the gradients need to be summed (all-reduce) across GPUs. It only works if each GPU has enough memory to store the entire model weights, activations, and optimizer state. For a GPT-4 scale model, the model weights and optimizer state can take as much as 10.8 terabytes of memory during training.
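As a concrete illustration, here is a minimal data-parallel training step in PyTorch. It assumes a multi-GPU job launched with torchrun (so NCCL and the RANK/LOCAL_RANK environment variables are available); the tiny Linear model and batch sizes are placeholders, not GPT-4 settings.

```python
# Minimal data-parallelism sketch: every rank holds the full model,
# sees different data, and gradients are averaged with an all-reduce.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{os.environ.get('LOCAL_RANK', 0)}")

    # Every rank holds a full copy of the model weights and optimizer state.
    model = torch.nn.Linear(4096, 4096).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Stand-in for "each rank loads a different shard of the data".
    torch.manual_seed(rank)
    x = torch.randn(8, 4096, device=device)

    loss = model(x).pow(2).mean()
    loss.backward()

    # All-reduce: sum gradients across ranks, then average, so every
    # replica applies the same update and the weights stay in sync.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```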
Tensor Parallelism divides the memory used per GPU by the number of tensor-parallel ranks. For example, it is common today to use 8 tensor-parallel ranks across NVLink, which reduces the memory used per GPU by a factor of 8.
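Here is a minimal sketch of tensor parallelism in the spirit of a column-parallel linear layer (as popularized by Megatron-LM): each rank stores only its column shard of the weight matrix, computes a partial output, and the shards are all-gathered afterwards. The layer sizes and the torchrun/NCCL launch assumptions are illustrative, not taken from the source.

```python
# Minimal tensor-parallelism sketch: one Linear layer split column-wise
# across tensor-parallel ranks, so each GPU stores 1/tp_size of the weights.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    tp_rank = dist.get_rank()
    tp_size = dist.get_world_size()          # e.g. 8 ranks across NVLink
    device = torch.device(f"cuda:{os.environ.get('LOCAL_RANK', 0)}")

    d_model, d_ff = 4096, 16384
    shard = d_ff // tp_size                  # columns owned by this rank

    torch.manual_seed(1234 + tp_rank)        # each rank has its own weight shard
    w_shard = torch.randn(shard, d_model, device=device) * 0.02

    torch.manual_seed(1234)                  # the input is replicated on every rank
    x = torch.randn(8, d_model, device=device)

    y_shard = x @ w_shard.t()                # local partial output: [8, shard]

    # All-gather the output shards to reassemble the full [8, d_ff] activation.
    gathered = [torch.empty_like(y_shard) for _ in range(tp_size)]
    dist.all_gather(gathered, y_shard)
    y = torch.cat(gathered, dim=-1)

    assert y.shape == (8, d_ff)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```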
With Pipeline Parallelism, each GPU holds only a subset of the layers, does the computation only for those layers, and passes the output to the next GPU.
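Below is a minimal forward-only sketch of pipeline parallelism: one stage of layers per rank, with activations handed to the next rank via point-to-point send/recv. A real schedule (e.g. GPipe or 1F1B) would also interleave micro-batches and backward passes; the layer sizes and shapes here are placeholders.

```python
# Minimal pipeline-parallelism sketch: each rank owns one pipeline stage
# and forwards its activations to the next rank (forward pass only).
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{os.environ.get('LOCAL_RANK', 0)}")

    # Each rank holds only its own slice of the layer stack.
    stage = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU()
    ).to(device)

    batch = torch.randn(16, 1024, device=device)

    if rank == 0:
        activations = stage(batch)                 # first stage consumes the raw batch
    else:
        activations = torch.empty(16, 1024, device=device)
        dist.recv(activations, src=rank - 1)       # wait for the previous stage's output
        activations = stage(activations)

    if rank < world - 1:
        dist.send(activations, dst=rank + 1)       # hand the result to the next stage

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```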