
GPT-4: Details Leaked

GPT-4: A Closer Look at Parameters, Training, and Inference

https://news.ycombinator.com/item?id=36674905

OpenAI’s latest language model, GPT-4, has been making waves in the AI community with its impressive size and capabilities. In this article, we will delve into the details of GPT-4’s parameters, training process, and inference architecture, shedding light on the advancements made by OpenAI.

    Parameter Count: Scaling Up Significantly

    GPT-4 boasts a remarkable size, with more than 10 times the number of parameters compared to its predecessor, GPT-3. It is estimated to have approximately 1.8 trillion parameters across 120 layers. This increase in parameter count signifies a substantial leap forward in model size and complexity.

    Mixture of Experts (MoE): Cost-Effective Solution

    To manage the costs associated with such a massive model, OpenAI implemented a mixture-of-experts (MoE) architecture in GPT-4. This approach uses 16 experts, each with roughly 111 billion parameters in its MLP (multi-layer perceptron) block. On each forward pass, every token is routed to two of these experts, a simpler scheme than the more elaborate routing algorithms discussed in the research literature.
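    The routing described above can be illustrated with a small top-2 MoE layer. The sketch below is a toy PyTorch implementation with made-up dimensions; it is not OpenAI’s code, and GPT-4’s real experts are reportedly ~111B-parameter MLPs rather than two small linear layers.

```python
# Minimal top-2 mixture-of-experts layer (illustrative toy, not OpenAI's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # routing scores per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(8, 512)
print(Top2MoE()(tokens).shape)   # torch.Size([8, 512])
```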

    MoE Routing: A Simpler Approach

    OpenAI’s routing algorithm for GPT-4 is reportedly straightforward compared to the more intricate methods proposed in the research literature. In addition to the expert MLPs, the model carries roughly 55 billion parameters for attention that are shared across all experts.

    Inference Efficiency: Optimized Parameter Usage

    One notable aspect of GPT-4 is its inference efficiency. Each forward pass that generates a single token activates only around 280 billion parameters and costs roughly 560 TFLOPs (trillions of floating-point operations). A purely dense 1.8-trillion-parameter model, by contrast, would need roughly 3,700 TFLOPs per forward pass. This sparse activation of parameters is what keeps GPT-4’s per-token compute and cost manageable.
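    As a quick sanity check on those figures, the common rule of thumb of roughly 2 FLOPs per active parameter per generated token reproduces both numbers; the short calculation below is ours, not part of the leak.

```python
# Back-of-the-envelope check of the per-token compute figures above,
# using ~2 FLOPs per active parameter per generated token.
active_params = 280e9          # parameters activated per forward pass (MoE)
dense_params  = 1.8e12         # total parameters if the model were dense

flops_moe   = 2 * active_params    # ~5.6e11 FLOPs  = ~560 TFLOPs
flops_dense = 2 * dense_params     # ~3.6e12 FLOPs  = ~3,600 TFLOPs

print(f"MoE forward pass:   ~{flops_moe / 1e12:.0f} TFLOPs per token")
print(f"Dense forward pass: ~{flops_dense / 1e12:,.0f} TFLOPs per token")
```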

    Dataset: Extensive Training Corpus

    GPT-4 is trained on a corpus of roughly 13 trillion tokens. Note that this figure counts repeated tokens across epochs rather than unique tokens: text-based data is seen for two epochs and code-based data for four. In addition, OpenAI incorporated millions of rows of fine-tuning data from ScaleAI and internal sources to improve the model’s performance.

    GPT-4 32K: Fine-Tuning the Context Length

    During pre-training, GPT-4 uses an 8k context length (seqlen). The 32k-seqlen version of GPT-4 is obtained by fine-tuning the 8k model after the initial pre-training, which allows it to handle longer sequences and capture more extensive context.

    Batch Size: Striking a Balance

    OpenAI gradually ramped up the batch size over several days on the cluster, eventually reaching 60 million tokens. Because each token is routed to only 2 of the 16 experts, no single expert sees the full batch, and the effective batch size works out to about 7.5 million tokens per expert.
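    The per-expert figure follows directly from top-2 routing over 16 experts; the arithmetic below simply spells out that division.

```python
# How the ~7.5M-tokens-per-expert figure follows from top-2 routing over 16 experts.
global_batch_tokens = 60_000_000
n_experts, experts_per_token = 16, 2

tokens_per_expert = global_batch_tokens * experts_per_token / n_experts
print(f"{tokens_per_expert / 1e6:.1f}M tokens per expert")   # 7.5M
```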

    Parallelism Strategies: Maximizing GPU Utilization

    To parallelize computation across multiple A100 GPUs, OpenAI used 8-way tensor parallelism, as this is the maximum supported by NVLink, combined with 15-way pipeline parallelism. It is speculated that OpenAI employed ZeRO Stage 1 data parallelism, and possibly block-level FSDP (Fully Sharded Data Parallel) for parts of the training process.
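    Multiplying the two parallelism degrees gives the size of one model replica; dividing the reported GPU count by that number is our own inference of the data-parallel degree, which the leak does not state.

```python
# Rough cluster layout implied by the reported parallelism settings.
# The data-parallel degree is inferred here, not stated in the source.
tensor_parallel   = 8      # max NVLink group within an A100 node
pipeline_parallel = 15
total_gpus        = 25_000

gpus_per_replica = tensor_parallel * pipeline_parallel   # 120 GPUs per model replica
data_parallel    = total_gpus // gpus_per_replica        # ~208 replicas
print(gpus_per_replica, data_parallel)
```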

    Training Cost: An Expensive Endeavor

    Training GPT-4 came at a substantial cost. The total training compute was approximately 2.15e25 FLOPs (floating-point operations), carried out on around 25,000 A100 GPUs for 90 to 100 days at a utilization of 32% to 36% MFU (model FLOPs utilization). Utilization stayed low in part because of numerous failures that required restarting from checkpoints. At an estimated cloud cost of $1 per A100-hour, this run alone would come to roughly $63 million. For comparison, pre-training on ~8,192 H100 GPUs could have been done in around 55 days for about $21.5 million at $2 per H100-hour.
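    The compute and cost figures are straightforward to reproduce from the GPU count, run length, MFU, and the A100’s peak BF16 throughput. The sketch below uses values inside the reported ranges and lands close to the quoted numbers; small differences come from rounding.

```python
# Sanity check of the reported training compute and cost
# (values chosen within the reported ranges).
gpus      = 25_000        # A100s
days      = 100           # reported 90-100 days
mfu       = 0.32          # reported 32-36% model FLOPs utilization
a100_peak = 312e12        # A100 peak BF16 FLOP/s
price_hr  = 1.00          # assumed $ per A100-hour

total_flops = gpus * days * 24 * 3600 * a100_peak * mfu
cost        = gpus * days * 24 * price_hr
print(f"~{total_flops:.2e} FLOPs, ~${cost/1e6:.0f}M")   # ~2.16e25 FLOPs, ~$60M
```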

    Mixture of Expert Tradeoffs: Finding the Right Balance

    OpenAI made specific tradeoffs in implementing the mixture of experts model. While research has shown that using 64 to 128 experts achieves better loss compared to 16 experts, OpenAI chose a more conservative approach. The decision to stick with 16 experts was driven by the challenges associated with generalization across multiple tasks and the difficulty in achieving convergence with a larger number of experts.

    GPT-4 Inference Cost: Scaling with Complexity

    Inference on GPT-4 costs roughly three times as much as on the 175B-parameter GPT-3 model (Davinci). The increase stems mainly from the larger clusters GPT-4 requires and the lower utilization achieved during inference. The estimated cost is $0.0049 (0.49 cents) per 1,000 tokens when running GPT-4 at 8k seqlen on 128 A100 GPUs, falling to $0.0021 (0.21 cents) per 1,000 tokens on 128 H100 GPUs. These estimates assume high utilization and large batch sizes.
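    One way to read the per-token price is as a throughput requirement: at an assumed $1 per A100-hour, a 128-GPU cluster costs $128 per hour, which pins down how many tokens per second it must serve to hit $0.0049 per 1,000 tokens. The derivation below is ours, not from the leak.

```python
# Throughput implied by the quoted A100 price point, assuming full utilization.
gpus, dollars_per_gpu_hour = 128, 1.00     # assumed A100 rental price
dollars_per_1k_tokens = 0.0049

cluster_dollars_per_hour = gpus * dollars_per_gpu_hour              # $128/hour
tokens_per_hour = cluster_dollars_per_hour / dollars_per_1k_tokens * 1000
print(f"~{tokens_per_hour / 3600:,.0f} tokens/second across the cluster")  # ~7,256
```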

    Multi-Query Attention (MQA): Streamlining Processing

    Like many other recent models, GPT-4 uses multi-query attention (MQA), in which all query heads share a single key/value head. This sharply reduces the memory needed for the key-value (KV) cache. Even so, given the model’s size, the 32k-seqlen version of GPT-4 cannot run on 40GB A100 GPUs, and the 8k-seqlen version is capped in its maximum batch size.
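    The KV-cache savings from MQA are easy to see with some illustrative arithmetic. The head count and head dimension below are assumptions made for the sake of the example; the leak does not specify them.

```python
# Illustrative KV-cache arithmetic showing why MQA matters at 32k context.
# Head count and head dim are assumptions; the source does not specify them.
n_layers, n_heads, head_dim = 120, 96, 128
seqlen, bytes_per_elem = 32_768, 2        # fp16/bf16 cache

def kv_cache_gb(kv_heads):
    # 2x for keys and values, per layer, per position
    return 2 * n_layers * kv_heads * head_dim * seqlen * bytes_per_elem / 1e9

print(f"MHA ({n_heads} KV heads): {kv_cache_gb(n_heads):.0f} GB per sequence")
print(f"MQA (1 KV head):          {kv_cache_gb(1):.1f} GB per sequence")
```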

    Continuous Batching: Balancing Latency and Inference Costs

    OpenAI uses variable batch sizes and continuous batching for GPT-4’s inference. This lets them cap worst-case latency while keeping overall inference costs down, trading some response time for better utilization of the serving hardware.
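    The idea behind continuous batching can be sketched as a scheduler loop: finished sequences leave the batch and queued requests join immediately, rather than waiting for the whole batch to complete. The toy loop below uses a hypothetical step_fn and is illustrative only; it is not OpenAI’s serving code.

```python
# Minimal sketch of continuous batching (illustrative; not OpenAI's serving stack).
# step_fn is a hypothetical callable that, given the list of active sequences,
# returns one newly decoded token per sequence.
from collections import deque

def serve(requests, step_fn, max_batch=8, eos=0):
    queue, active, finished = deque(requests), [], []
    while queue or active:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain (the "continuous" part).
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))
        new_tokens = step_fn(active)            # one decode step for every active sequence
        still_active = []
        for seq, tok in zip(active, new_tokens):
            seq.append(tok)
            (finished if tok == eos else still_active).append(seq)
        active = still_active
    return finished
```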

    Vision Multi-Modal: Enhancing Capabilities

    GPT-4 includes a separate vision encoder in addition to the text model, connected via cross-attention. The architecture is similar to Flamingo and adds further parameters on top of GPT-4’s 1.8 trillion. The vision model is fine-tuned with roughly 2 trillion additional tokens after the text-only pre-training. The primary purpose of this vision capability is to let autonomous agents read web pages, transcribe the content of images and video, and interact with multi-modal inputs.

    Speculative Decoding: Enhancing Inference Efficiency

    There are indications that OpenAI may be using speculative decoding for GPT-4’s inference, although this is unconfirmed. In speculative decoding, a smaller, faster draft model generates several tokens ahead of time, and these are fed into the larger oracle model in a single batched forward pass. Wherever the oracle agrees with the draft, several tokens are accepted at once; at the first token the oracle rejects, the rest of the draft is discarded and generation continues from the oracle’s own prediction. This can improve throughput, though depending on how it is tuned it may affect the quality of the generated text.
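    The accept/reject loop described above can be sketched in a few lines. The version below is a simplified greedy variant with hypothetical draft_model and oracle_model callables; it is illustrative only and, as noted, it is not confirmed that OpenAI uses this technique at all.

```python
# Simplified greedy speculative decoding loop (illustrative, not OpenAI's code).
# draft_model and oracle_model are hypothetical callables that map a token
# sequence to the next token (greedy argmax).

def speculative_decode(prompt, draft_model, oracle_model, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify the draft with the large model (shown sequentially here for
        #    clarity; in practice this is a single batched forward pass).
        oracle_preds = [oracle_model(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Accept draft tokens while they match the oracle; at the first
        #    disagreement, keep the oracle's token and throw the rest away.
        for i in range(k):
            if draft[i] == oracle_preds[i]:
                tokens.append(draft[i])
            else:
                tokens.append(oracle_preds[i])
                break
        else:
            tokens.append(oracle_preds[k])   # all k accepted: take one bonus token
    return tokens
```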

    Inference Architecture: Distributed Computing Power

    GPT-4’s inference is carried out on a cluster comprising 128 GPUs. Multiple clusters are distributed across various data centers in different locations. The model utilizes 8-way tensor parallelism and 16-way pipeline parallelism to effectively leverage the available computational resources. Each node of 8 GPUs handles approximately 130 billion parameters, accommodating the model’s overall size. The design of the architecture allows for efficient parallel processing and scalability.
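    For a rough sense of the per-node weight footprint, a naive even split of 1.8 trillion parameters across 16 pipeline stages gives about 112B parameters per 8-GPU node; the ~130B figure quoted above suggests the split is not perfectly uniform. The arithmetic below is our own.

```python
# Naive even pipeline split of the weights across 16 stages of 8 GPUs each.
total_params, stages, gpus_per_node = 1.8e12, 16, 8

params_per_node = total_params / stages                      # ~112B
fp16_gb_per_gpu = params_per_node * 2 / gpus_per_node / 1e9  # ~28 GB of weights per GPU
print(f"~{params_per_node/1e9:.0f}B params/node, ~{fp16_gb_per_gpu:.0f} GB of fp16 weights per GPU")
```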

    Dataset Mixture: A Diverse Training Corpus

    GPT-4’s training corpus is a mixture of datasets totaling roughly 13 trillion tokens, with CommonCrawl and RefinedWeb each contributing about 5 trillion. Once the tokens repeated across multiple epochs are accounted for, the remaining “unaccounted for” tokens, often referred to as the “secret” data, shrink to a more reasonable amount. Speculation suggests that parts of this data come from sources such as Twitter, Reddit, and YouTube, and from collections like LibGen, Sci-Hub, and GitHub.

    GPT-4’s Training and Knowledge: A Reflective Illusion

    GPT-4’s training on an extensive collection of college textbooks, coupled with its ability to answer questions from various domains, creates the illusion of intelligence. This illusion can make it appear knowledgeable in disciplines ranging from computer science to philosophy. Researchers have even attempted to extract memorized parts of books from GPT-4 to better understand its training data. Rumors suggest that the model possesses exceptional familiarity with unique IDs of Project Euler exercises.

    In conclusion, GPT-4 represents a significant leap forward in terms of scale, capabilities, and training techniques. Its massive parameter count, mixture of experts model, and efficient inference architecture contribute to its enhanced performance. While the model’s training process involves a substantial investment, the use of diverse datasets and fine-tuning across multiple domains equips GPT-4 with a broad knowledge base. OpenAI’s approach to scaling language models demonstrates their commitment to pushing the boundaries of AI research and development.
