Cost-Saving Strategies for Training Large Language Models like ChatGPT / GPT-4

Training large language models like ChatGPT or GPT-4 is an expensive task that requires a lot of computational resources, time, and money. In this blog post, we will discuss some cost-saving strategies that can be used while training these models.

Why Is Training Large Language Models So Expensive?

Large language models are expensive to train because they need vast amounts of compute, chiefly GPUs, along with the memory and storage to feed them. Even a modest model can take days or weeks to train on a single machine, depending on the size of the dataset and the complexity of the model, and the largest models are trained on clusters of hundreds or thousands of GPUs. Renting that capacity from a cloud provider like Amazon Web Services (AWS) adds up quickly. To put it in perspective, one widely cited third-party estimate put a single training run of GPT-3, which has 175 billion parameters, at approximately $4.6 million in rented cloud GPU time.
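As a rough illustration of how a figure of that size arises, the sketch below multiplies total GPU time by an hourly rental price. The inputs (355 GPU-years at $1.50 per GPU-hour) are illustrative assumptions chosen to land near the estimate quoted above; they are not measurements from this post, and real prices vary by provider, region, and GPU type.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def training_cost_usd(gpu_years: float, price_per_gpu_hour: float) -> float:
    """Return the rental cost of a run given total GPU time and an hourly price."""
    return gpu_years * HOURS_PER_YEAR * price_per_gpu_hour


# Illustrative assumptions only: 355 GPU-years at $1.50 per GPU-hour.
estimate = training_cost_usd(gpu_years=355, price_per_gpu_hour=1.50)
print(f"${estimate:,.0f}")  # prints $4,664,700
```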

Spot Instances as a Solution

AWS offers a cost-effective way to reduce the cost of training large language models: Spot instances. Spot instances run on spare EC2 capacity that AWS sells at a significantly lower price than on-demand instances, with discounts of up to 90%. The trade-off is that AWS can reclaim the capacity at short notice. Spot instances are a good fit for training large language models when the job is not time-sensitive and can be divided into smaller chunks, for example by saving checkpoints at regular intervals.
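The saving is easy to estimate once you account for work that has to be redone after an interruption. The function below is a minimal sketch; the hourly price, the discount, and the `rework_fraction` parameter are all hypothetical inputs, and the actual discount depends on instance type, region, and time.

```python
def spot_cost_usd(on_demand_hourly: float, hours: float,
                  discount: float, rework_fraction: float = 0.0) -> float:
    """Estimate the cost of a job run on Spot instances.

    discount: fraction off the on-demand price (0.7 means 70% off).
    rework_fraction: extra compute repeated after interruptions,
        as a fraction of the job's nominal hours (0.1 means 10% extra).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if rework_fraction < 0:
        raise ValueError("rework_fraction must be non-negative")
    spot_hourly = on_demand_hourly * (1 - discount)
    return spot_hourly * hours * (1 + rework_fraction)


# Hypothetical job: 1,000 hours on an instance that costs $10/hour on demand.
on_demand_total = 10.0 * 1000
spot_total = spot_cost_usd(10.0, 1000, discount=0.7, rework_fraction=0.1)
print(f"on-demand ${on_demand_total:,.0f}, Spot ${spot_total:,.0f}")
```

Even with 10% of the work repeated, the Spot run in this example costs about a third of the on-demand run.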

How to Use Spot Instances

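To use Spot instances, you request them through the AWS Management Console, the AWS CLI, or an AWS SDK. You pay the current Spot price, and you can optionally set a maximum price you are willing to pay. The older bidding model described in many tutorials no longer applies: an instance is interrupted when EC2 needs the capacity back or when the Spot price rises above your maximum price, and AWS gives a two-minute warning before it does so. This is the main challenge of Spot instances. The instance can be interrupted at any time, and unless progress has been saved somewhere that outlives the instance, the work is lost and training starts from scratch.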
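As a sketch of what an SDK request can look like, the helper below builds the keyword arguments for the EC2 `run_instances` call in boto3, using its `InstanceMarketOptions` parameter to ask for Spot capacity. The helper name, the AMI ID, and the instance type are placeholders made up for this example. The code only builds the dictionary; the actual API call is left as a comment because it needs boto3 and AWS credentials.

```python
from typing import Optional


def spot_launch_params(ami_id: str, instance_type: str,
                       max_price: Optional[float] = None) -> dict:
    """Build keyword arguments for an EC2 run_instances call that requests Spot."""
    spot_options = {
        "SpotInstanceType": "one-time",
        "InstanceInterruptionBehavior": "terminate",
    }
    if max_price is not None:
        # The API expects the maximum hourly price as a string.
        spot_options["MaxPrice"] = str(max_price)
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": spot_options,
        },
    }


# Placeholder AMI ID and instance type, for illustration only.
params = spot_launch_params("ami-0123456789abcdef0", "p3.2xlarge")
# With boto3 installed and credentials configured, the request would be:
#   import boto3
#   boto3.client("ec2").run_instances(**params)
```

Leaving `max_price` unset means the request is capped at the on-demand price, which is usually what you want.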

Creating a Boot Snapshot Volume

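To prevent losing work when a Spot instance is terminated, you can create a boot snapshot volume, that is, an EBS snapshot of the instance's boot volume. The snapshot contains the operating system, the application, and any data that was on the volume at the moment the snapshot was taken. When a new instance is created, a volume restored from this snapshot can be attached (or the instance can be launched from an AMI built on it), allowing you to resume from the state the snapshot captured. Anything written after the snapshot is not included, so snapshot regularly or write training checkpoints to storage that persists independently of the instance. By doing this, you can significantly reduce the time and money spent on training large language models.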
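Whichever storage holds the state, the training job itself has to be able to pick up where it left off. The sketch below shows the general checkpoint-and-resume pattern with a JSON file standing in for a real model checkpoint. The function names and the `checkpoint_every` interval are assumptions made for illustration, and the training step is a placeholder.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(path: str, total_steps: int, checkpoint_every: int = 100) -> dict:
    """Run (or resume) training up to total_steps, checkpointing as it goes."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # Placeholder: one real training step would run here.
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the instance disappears at step 950 with `checkpoint_every=100`, a replacement instance that sees the same checkpoint file resumes from step 900 and repeats only 50 steps.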

Challenges with Spot Instances and GPU Limits

While Spot instances are a cost-effective solution for training large language models, AWS and other cloud providers impose quotas on GPU instances, including GPU Spot instances. On AWS these quotas are set per instance family and counted in vCPUs, and new accounts often start with a quota too low to launch even one large GPU instance. This makes it challenging to procure the resources needed to train these models, and it can significantly affect the training time and, consequently, the overall cost of the project.

Requesting a Limit Increase

To overcome this challenge, you can request a quota increase for GPU instances from AWS or whichever cloud provider you are using. The process involves submitting a request for the specific instance family or type, stating the new limit you need. On AWS this is done through the Service Quotas console, which streamlines the process; a support case is the fallback if a request needs discussion.

Tips for Getting Your Limit Increase Approved

Getting a limit increase approved can be a daunting task, but there are ways to increase the chances of approval. Here are some tips to follow:

Justify your request – provide a detailed explanation of why you need a limit increase, including the nature of your workload, the size of your dataset, and the complexity of the model.
Demonstrate cost optimization – show that using Spot instances is a cost-effective solution compared to using on-demand instances, which can help justify the increase in limit.
Highlight past successes – showcase past successes in training large language models, using similar or the same resources, to demonstrate your ability to manage the workload effectively.
Be specific – provide detailed information on the exact type of instance you need and the number required to complete the task.
Plan ahead – request the limit increase well in advance to allow enough time for approval and procurement of the required resources.
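Being specific is easier with the numbers in hand. Because AWS counts these quotas in vCPUs, a request should state how many vCPUs the planned fleet needs. The sketch below does that arithmetic for a few GPU instance types; the table holds the vCPU and GPU counts as I understand them, so verify them against the current AWS instance specifications before quoting them in a request.

```python
# vCPU and GPU counts per instance type (verify against current AWS specs).
INSTANCE_SPECS = {
    "p3.2xlarge": {"vcpus": 8, "gpus": 1},
    "p3.16xlarge": {"vcpus": 64, "gpus": 8},
    "p4d.24xlarge": {"vcpus": 96, "gpus": 8},
}


def required_vcpu_quota(instance_type: str, count: int) -> int:
    """Return the vCPU quota needed to run `count` instances of one type."""
    return INSTANCE_SPECS[instance_type]["vcpus"] * count


def total_gpus(instance_type: str, count: int) -> int:
    """Return the number of GPUs that fleet provides."""
    return INSTANCE_SPECS[instance_type]["gpus"] * count


# Example: four p4d.24xlarge instances.
print(required_vcpu_quota("p4d.24xlarge", 4), "vCPUs for",
      total_gpus("p4d.24xlarge", 4), "GPUs")  # prints 384 vCPUs for 32 GPUs
```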

The Size of Large Language Models (LLMs)

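Large Language Models (LLMs) like GPT-3 or the models behind ChatGPT can get extremely large, often requiring hundreds of gigabytes or even terabytes of storage space. For example, GPT-3 has 175 billion parameters and is often quoted as requiring around 800 GB of storage. The weights alone come to about 350 GB at 16-bit precision or 700 GB at 32-bit precision, and training checkpoints that also store optimizer state are larger still. The size of these models makes it challenging to move data from one instance to another, increasing the overall cost and time required to train the models.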
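The storage needed for the weights follows directly from the parameter count and the numeric precision. A minimal sketch of that arithmetic, counting weights only and ignoring optimizer state, tokenizer files, and file-format overhead:

```python
def weights_size_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return the size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


GPT3_PARAMETERS = 175_000_000_000

print(weights_size_gb(GPT3_PARAMETERS, 2))  # 16-bit weights: 350.0
print(weights_size_gb(GPT3_PARAMETERS, 4))  # 32-bit weights: 700.0
```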

Challenges of Moving Data between Instances

Moving data between instances can be a time-consuming and expensive process. It involves copying large amounts of data over the network, which can take hours or even days, depending on the size of the data and the network speed. Copying can also incur data transfer charges, for example when the data crosses Availability Zones or regions, and these add up quickly.
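A quick estimate shows how much the network speed matters. The function below is a back-of-the-envelope sketch; the `efficiency` parameter is a hypothetical fudge factor for protocol overhead and contention, and the example bandwidths are assumptions rather than guaranteed figures for any instance type.

```python
def transfer_seconds(size_gb: float, bandwidth_gbps: float,
                     efficiency: float = 1.0) -> float:
    """Estimate the time to copy size_gb gigabytes over a link.

    bandwidth_gbps is in gigabits per second; efficiency (0 to 1] scales
    it down to the throughput actually achieved.
    """
    if bandwidth_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    size_gigabits = size_gb * 8
    return size_gigabits / (bandwidth_gbps * efficiency)


# Copying 700 GB at different link speeds, assuming 70% efficiency.
for gbps in (1, 10, 100):
    hours = transfer_seconds(700, gbps, efficiency=0.7) / 3600
    print(f"{gbps:>3} Gbps: {hours:.2f} hours")
```

At 1 Gbps the copy takes over two hours; at 10 Gbps it takes a little over 13 minutes.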

Use object storage

Storing datasets and checkpoints in an object storage service like Amazon S3 or Google Cloud Storage makes it easier to move data between instances. The data lives independently of any one instance, so it survives a Spot termination, and a replacement instance can download what it needs directly instead of copying it from another machine. This reduces the time and cost of moving data between instances.
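One practical detail is naming checkpoints so that a fresh instance can find the most recent one. The sketch below uses a made-up key layout (`<prefix>/step-<number>.pt`) and picks the latest key from a listing. The layout and function names are assumptions for illustration; the listing itself would come from the storage service's SDK, for example boto3's `list_objects_v2` for S3.

```python
import re
from typing import Iterable, Optional

_STEP_PATTERN = re.compile(r"step-(\d+)\.pt$")


def checkpoint_key(prefix: str, step: int) -> str:
    """Build an object key with a zero-padded step number."""
    return f"{prefix}/step-{step:09d}.pt"


def latest_checkpoint_key(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the highest step number, or None if there is none."""
    best_key, best_step = None, -1
    for key in keys:
        match = _STEP_PATTERN.search(key)
        if match and int(match.group(1)) > best_step:
            best_key, best_step = key, int(match.group(1))
    return best_key


# Example listing, as it might be returned for a hypothetical prefix.
keys = [
    checkpoint_key("run-1/checkpoints", 100),
    checkpoint_key("run-1/checkpoints", 2500),
    "run-1/checkpoints/readme.txt",
]
print(latest_checkpoint_key(keys))  # prints run-1/checkpoints/step-000002500.pt
```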

Connecting to a Spot Instance using SSH: Challenges and Solutions

When using Spot instances for training large language models, connecting to the instance using SSH can be a challenge.

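Firewall settings: Firewall settings can prevent you from connecting to your Spot instance using SSH. If inbound SSH traffic is blocked, the connection attempt typically hangs and ends in a “Connection timed out” error. A “Connection refused” error, by contrast, usually means the instance was reached but nothing is listening on the SSH port yet, for example because it is still booting.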

Solution: To overcome this challenge, ensure that the security group attached to your Spot instance allows incoming SSH traffic. You can do this by adding an inbound rule for TCP port 22, which is the default port used for SSH. Restrict the rule's source to your own IP address or network rather than opening the port to the whole internet.
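For those who script their setup, the rule can be expressed as the `IpPermissions` entry that boto3's `authorize_security_group_ingress` call accepts. The helper below is a sketch that validates the source range and builds that entry; the helper name, the example address (from a documentation-only range), and the security group ID in the comment are placeholders.

```python
import ipaddress


def ssh_ingress_rule(source_cidr: str) -> dict:
    """Build an inbound rule allowing SSH (TCP port 22) from one IPv4 range."""
    # Raises ValueError if the CIDR is malformed or has host bits set.
    network = ipaddress.ip_network(source_cidr)
    if network.version != 4:
        raise ValueError("this sketch handles IPv4 ranges only")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [
            {"CidrIp": str(network), "Description": "SSH from my workstation"}
        ],
    }


# Placeholder address: allow SSH from a single workstation.
rule = ssh_ingress_rule("203.0.113.5/32")
# With boto3 installed and credentials configured, the rule would be applied as:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[rule])
```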

Max Spot Request Count Exceeded: Causes and Solutions

Another challenge that can occur when using Spot instances for training large language models is the “Max Spot Request Count Exceeded” error. This error occurs when you have submitted the maximum number of Spot instance requests that you can submit in a specific period.

Here are some common causes of the “Max Spot Request Count Exceeded” error and ways to avoid it:

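AWS account limits: Your AWS account has a quota on Spot capacity. AWS documents Spot quotas in terms of vCPUs per instance family, counting both running Spot instances and open requests, so a handful of large GPU instances can exhaust it.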

Solution: To avoid this error, you can request a limit increase from AWS. You may need to provide information on how you plan to use the additional requests and the expected workload. If your request is approved, you can submit more requests and continue your training.

Spot instance launch frequency: When using Spot instances, there is a limit on how frequently you can launch new instances. If you exceed this limit, you may receive the “Max Spot Request Count Exceeded” error.

Solution: To avoid this error, you can adjust the frequency of Spot instance launches. This can include launching instances less frequently or using tools like EC2 Auto Scaling to launch new instances based on workload and availability.
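One common way to launch less frequently without babysitting the job is to retry failed requests with exponential backoff and jitter. The sketch below is generic: `operation` stands for whatever function submits your Spot request, and `is_retryable` decides which errors are worth retrying. Both, along with the `ThrottledError` class, are hypothetical names for this example and not part of any AWS SDK.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error standing in for a rate-limit or quota response."""


def call_with_backoff(operation, is_retryable, max_attempts=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying retryable failures with growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random time up to the ceiling so that
            # many clients retrying at once do not all hit the API together.
            sleep(random.uniform(0, ceiling))


# Usage with a hypothetical request function:
#   result = call_with_backoff(
#       submit_spot_request,
#       is_retryable=lambda exc: isinstance(exc, ThrottledError),
#   )
```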

Spot instance termination: Frequent Spot interruptions can also trigger the “Max Spot Request Count Exceeded” error. Each time an instance is terminated and you launch a replacement, the replacement counts as a new request, so a run of interruptions can use up your allowance quickly.

Solution: To avoid this error, reduce how often interruptions force you to submit new requests. This can include launching a mix of Spot and On-Demand instances or using tools like EC2 Auto Scaling to maintain a minimum number of running instances.

Picking the Right AMI with Drivers Installed: Challenges and Solutions

When training large language models using Spot instances, it’s important to choose the right Amazon Machine Image (AMI) that includes the necessary drivers installed. This can be a challenging task, but there are solutions to help streamline the process.

Here are some common challenges when picking the right AMI with drivers installed and ways to overcome them:

Finding the right AMI: With so many different AMIs available on the AWS Marketplace, it can be challenging to find the one that includes the specific drivers required for your workload.

Solution: One solution is to use the AWS Deep Learning AMIs, which include popular deep learning frameworks like TensorFlow, PyTorch, and MXNet. These AMIs also come with NVIDIA GPU drivers and CUDA libraries pre-installed, making it easier to get started with training large language models. Framework coverage varies by AMI version, so check the release notes of the one you pick.
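When several versions of an AMI match your search, you usually want the most recent one. The sketch below picks it from a list shaped like the `Images` field of an EC2 `describe_images` response, relying on its `CreationDate` and `State` fields. The sample records, including the IDs and names, are invented for illustration.

```python
from typing import Optional


def newest_image(images: list) -> Optional[dict]:
    """Return the most recently created available image, or None."""
    available = [img for img in images if img.get("State") == "available"]
    if not available:
        return None
    # CreationDate is an ISO 8601 UTC timestamp, so string order is date order.
    return max(available, key=lambda img: img["CreationDate"])


# Invented sample records in the shape of a describe_images response.
sample_images = [
    {"ImageId": "ami-aaaa1111", "Name": "example-dl-ami-v1",
     "CreationDate": "2023-01-15T10:00:00.000Z", "State": "available"},
    {"ImageId": "ami-bbbb2222", "Name": "example-dl-ami-v2",
     "CreationDate": "2023-03-02T08:30:00.000Z", "State": "available"},
    {"ImageId": "ami-cccc3333", "Name": "example-dl-ami-v3",
     "CreationDate": "2023-04-01T09:00:00.000Z", "State": "pending"},
]
print(newest_image(sample_images)["ImageId"])  # prints ami-bbbb2222
```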

Customizing the AMI: In some cases, you may need to customize the AMI to include additional drivers or software required for your specific workload.

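Solution: To customize an AMI, launch an instance from a base AMI, install the additional drivers and software, and create a new image from that instance. AWS Systems Manager Automation or EC2 Image Builder can script these steps so the image is reproducible. The resulting image can then be used to launch Spot instances for your training workload.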

Keeping drivers up to date: As new versions of drivers are released, it can be challenging to keep your AMI up to date with the latest drivers.

Solution: To keep your AMI up to date with the latest drivers, you can use automation tools like AWS Systems Manager Automation to automate the process of updating the drivers. This can help ensure that your AMI is always up to date and optimized for your training workload.

Conclusion

In this post, we discussed the cost-saving strategies for training large language models like ChatGPT/GPT-4 using AWS Spot instances. We first explained why training large language models can be expensive and gave examples of how expensive it can get. We then talked about how Spot instances can be an effective solution for cost-saving and how creating a boot snapshot volume can help resume work when a new instance is created.

We also addressed challenges like the quotas AWS imposes on GPU Spot instances, connecting to a Spot instance using SSH, and avoiding the “Max Spot Instance Count Exceeded” and “Max Spot Request Count Exceeded” errors. Lastly, we talked about the importance of choosing the right AMI with drivers installed and provided solutions to overcome the challenges of finding the right AMI and keeping drivers up to date.

Training large language models can be a costly process, but using AWS Spot instances, creating a boot snapshot volume, and choosing the right AMI with drivers installed can help optimize cost and performance. It is also important to plan for the availability of Spot instances and to minimize the risk of errors to ensure a smooth training process.