Practical guide for training or fine-tuning large language models

Imagine if you could connect to a friend's computer and use it to speed up the training of your model.

Training ML models is expensive, and because ML techniques require many iterations, training speed matters.

The Solution: Distributed Computing

Distributed training is a technique that allows you to train machine learning models using multiple computers. This can be useful when you have large datasets or models that are too computationally intensive to train on a single computer.
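
One common way to use several computers is to give each of them a different slice of the dataset. The sketch below shows a round-robin split; the function name and the numbering of computers by "rank" are assumptions made for illustration, not part of any particular library.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices assigned to one worker (round-robin split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # Three hypothetical computers sharing a dataset of 10 samples
    for rank in range(3):
        print(rank, shard_indices(10, rank, 3))
```

Every sample lands on exactly one computer, so no work is duplicated and the shards differ in size by at most one sample.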

The challenge is setting up the infrastructure that enables multiple computers to work together. One way to do this is to use the Common Internet File System (CIFS) to share data and resources between the computers.

To set up CIFS, you can use a single script that is capable of running on multiple operating systems (OS). This script can be used to install the necessary software and configure CIFS on each computer. For example, on a Linux computer, the script can install the CIFS utilities and configure the Samba server that hosts the share, while on a macOS or Windows computer, the script can use the system's SMB command-line utilities to mount the shared directory.

Once the CIFS is set up, you can start the model training process on each computer. By starting the same training process on different computers, you can connect the various training processes and allow them to communicate with each other.

The various training processes communicate with each other using a process called message passing. This involves sending messages between the computers to exchange information about the model training, such as the gradients of the model parameters. The computers also discover each other using a process called process discovery. This involves identifying the other computers that are part of the distributed training process and determining how to communicate with them.
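
To make the two ideas concrete, here is a minimal sketch that uses only the Python standard library. It assumes that discovery happens through small JSON files written to the shared directory and that message passing boils down to averaging the gradient vectors of all workers. The file naming scheme, function names, and addresses are hypothetical, and this is not how PyTorch implements either step internally.

```python
import json
import os
import tempfile


def register_worker(shared_dir, name, address):
    """Announce this worker by writing a small JSON file to the shared directory."""
    with open(os.path.join(shared_dir, f"{name}.worker.json"), "w") as f:
        json.dump({"name": name, "address": address}, f)


def discover_workers(shared_dir):
    """Return the announcements of all registered workers, sorted by file name."""
    workers = []
    for filename in sorted(os.listdir(shared_dir)):
        if filename.endswith(".worker.json"):
            with open(os.path.join(shared_dir, filename)) as f:
                workers.append(json.load(f))
    return workers


def average_gradients(gradients_per_worker):
    """Average gradient vectors element-wise, as an all-reduce step would."""
    count = len(gradients_per_worker)
    return [sum(values) / count for values in zip(*gradients_per_worker)]


if __name__ == "__main__":
    # A temporary directory stands in for the CIFS mount point
    with tempfile.TemporaryDirectory() as shared_dir:
        register_worker(shared_dir, "laptop", "10.0.0.2:29500")
        register_worker(shared_dir, "desktop", "10.0.0.3:29500")
        print(discover_workers(shared_dir))
    print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))
```

After the averaging step every worker applies the same update, which keeps the model copies on all computers identical.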

In summary, distributed training allows you to train machine learning models using multiple computers. By setting up a CIFS and starting the same training process on different computers, you can connect the various training processes and enable them to communicate and discover each other. This can be useful when you have large datasets or models that are too computationally intensive to train on a single computer.

How to start and stop training on the various computers

To start and stop PyTorch model training on computers that share a Common Internet File System (CIFS) mount, you can use the following techniques:

Start training on all computers simultaneously: You can start training on all computers simultaneously by running the training script on each computer at the same time. This can be done manually by starting the script on each computer, or automatically using a tool like fabric or ansible.
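Start training on one computer and then add more computers: You can start training on one computer and then add more computers to the training process as needed. Note that the torch.nn.DataParallel wrapper cannot do this: it only splits each batch across the GPUs inside a single computer. To train across computers, use torch.distributed with the DistributedDataParallel wrapper and start the same script on every computer with torchrun. For example:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.datasets import DatasetFolder

# Set the data directory to the mount point
data_dir = "/path/to/mount/point"

# Join the process group; torchrun supplies the rank, world size and master address.
# "gloo" works on CPUs; use "nccl" when every computer has NVIDIA GPUs.
dist.init_process_group(backend="gloo")

# Load the dataset (fill in the loader arguments for your data)
dataset = DatasetFolder(data_dir, ...)

# Give each process its own shard of the dataset
loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

# MyModel, loss_fn and the optimizer settings are placeholders for your own code
model = DistributedDataParallel(MyModel())
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Gradients are averaged across all computers during backward()
for input, target in loader:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

dist.destroy_process_group()

When the script is launched with a node range such as torchrun --nnodes=1:3 together with a rendezvous endpoint (--rdzv_backend=c10d --rdzv_endpoint=host:port), the job can start on one computer and accept more computers later. The workers are restarted whenever the group changes, so the script should save and reload checkpoints.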

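Stop training on one or more computers: To stop training on one or more computers, you can simply stop the training script on those computers. In a synchronized job the remaining processes wait for the stopped one at the next gradient exchange, so either stop the script on every computer or use an elastic launch that lets the group shrink.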
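
The start and stop techniques above can be sketched with two small helpers. The first builds the torchrun command line that each computer would run in a fixed-size job; the host address 10.0.0.1, the script name training.py, and the port are placeholders. The second uses a marker file on the shared directory as a stop signal, which is an assumption made for illustration (the file name STOP is hypothetical and not a PyTorch convention).

```python
import os
import tempfile

STOP_FILE_NAME = "STOP"  # hypothetical marker file name, not a PyTorch convention


def torchrun_command(node_rank, num_nodes, master_addr,
                     script="training.py", master_port=29500):
    """Build the torchrun command line for one computer in a fixed-size job."""
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        "--nproc_per_node=1",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


def request_stop(shared_dir):
    """Ask every worker to stop by creating the marker file on the share."""
    with open(os.path.join(shared_dir, STOP_FILE_NAME), "w") as f:
        f.write("stop requested\n")


def stop_requested(shared_dir):
    """Return True once the marker file exists."""
    return os.path.exists(os.path.join(shared_dir, STOP_FILE_NAME))


def train(shared_dir, max_steps):
    """Run placeholder training steps, checking for the marker before each one."""
    steps_done = 0
    for _ in range(max_steps):
        if stop_requested(shared_dir):
            break
        # A real loop would run forward, backward and optimizer.step() here
        steps_done += 1
    return steps_done


if __name__ == "__main__":
    print(" ".join(torchrun_command(0, 3, "10.0.0.1")))
    # A temporary directory stands in for the CIFS mount point
    with tempfile.TemporaryDirectory() as shared_dir:
        print(train(shared_dir, max_steps=5))
        request_stop(shared_dir)
        print(train(shared_dir, max_steps=5))
```

Because each computer notices the marker file at a slightly different moment, a real synchronized job would also need the workers to agree on the step at which they stop.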

Code that demonstrates and tests the CIFS setup

To demonstrate and test the Common Internet File System (CIFS) setup, you can write a Python script that performs the following tasks:

Create a test file on one of the computers.
Check if the test file is accessible from the other computers.
Modify the test file on one of the computers.
Check if the modifications are reflected on the other computers.

Here is an example of a Python script that demonstrates and tests the CIFS set up:

import os
import sys

# Run each step on the computer named in its comment, passing the step name and
# that computer's path to the share, e.g. "create Z:/" or "check /path/to/mount/point"
step, share_path = sys.argv[1], sys.argv[2]
test_file = os.path.join(share_path, "test.txt")

if step == "create":
    # Create a test file on one of the computers
    with open(test_file, "w") as f:
        f.write("This is a test file.")

elif step == "check":
    # Check if the test file is accessible from the other computers
    if os.path.exists(test_file):
        print("Test file is accessible from the other computers.")
    else:
        print("Test file is not accessible from the other computers.")

elif step == "modify":
    # Modify the test file on one of the computers
    with open(test_file, "a") as f:
        f.write("\nThis line was added later.")

elif step == "verify":
    # Check if the modifications are reflected on the other computers
    with open(test_file, "r") as f:
        content = f.read()
    if "This line was added later." in content:
        print("Modifications are reflected on the other computers.")
    else:
        print("Modifications are not reflected on the other computers.")

Setting up CIFS

Here is a Python script that you can use to detect the operating system and IP address and to set up a Common Internet File System (CIFS) on each computer:

import os
import platform
import subprocess
import sys

# Detect the operating system
os_name = platform.system()

# Get the IP address of the Linux computer that hosts the share.
# The host detects its own address; clients receive it as the first argument.
if os_name == "Linux":
    ip_address = subprocess.run(
        ["hostname", "-I"], capture_output=True, text=True, check=True
    ).stdout.split()[0]
else:
    ip_address = sys.argv[1]

# Set up the CIFS file system
if os_name == "Linux":
    # Install the CIFS utilities and the Samba server (Debian/Ubuntu)
    subprocess.run(["sudo", "apt-get", "install", "-y", "cifs-utils", "samba"], check=True)

    # Create a shared directory
    os.makedirs("/path/to/shared/directory", exist_ok=True)

    # Append the share to the Samba configuration file (needs root, hence sudo tee)
    share_config = (
        "\n[shared]\n"
        "path = /path/to/shared/directory\n"
        "writable = yes\n"
        "guest ok = yes\n"
    )
    subprocess.run(
        ["sudo", "tee", "-a", "/etc/samba/smb.conf"],
        input=share_config, text=True, stdout=subprocess.DEVNULL, check=True,
    )

    # Restart the Samba server
    subprocess.run(["sudo", "systemctl", "restart", "smbd"], check=True)

elif os_name == "Darwin":
    # macOS ships with mount_smbfs, so nothing needs to be installed

    # Create a mount point
    os.makedirs("/path/to/mount/point", exist_ok=True)

    # Mount the shared directory
    subprocess.run(
        ["mount_smbfs", "//guest@{}/shared".format(ip_address), "/path/to/mount/point"],
        check=True,
    )

elif os_name == "Windows":
    # Mount the shared directory using the "net use" command
    subprocess.run(["net", "use", "Z:", "\\\\{}\\shared".format(ip_address)], check=True)

else:
    sys.exit("Unsupported operating system.")

print("CIFS file system setup complete.")
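
Before running privileged commands on someone else's computer, it can help to print what would be executed. The sketch below only builds the client-side mount command for each operating system and does not run anything. The function name, the default share name, mount point, and drive letter are assumptions carried over from the placeholders in this guide, and the Linux branch assumes a guest mount through mount -t cifs.

```python
def build_mount_command(os_name, server_ip, share="shared",
                        mount_point="/path/to/mount/point", drive="Z:"):
    """Return the command that mounts the share on a client, without running it."""
    if os_name == "Linux":
        return ["sudo", "mount", "-t", "cifs",
                "//{}/{}".format(server_ip, share), mount_point, "-o", "guest"]
    if os_name == "Darwin":
        return ["mount_smbfs", "//guest@{}/{}".format(server_ip, share), mount_point]
    if os_name == "Windows":
        return ["net", "use", drive, "\\\\{}\\{}".format(server_ip, share)]
    raise ValueError("Unsupported operating system: {}".format(os_name))


if __name__ == "__main__":
    # Show the command for each supported client without executing it
    for name in ("Linux", "Darwin", "Windows"):
        print(name, " ".join(build_mount_command(name, "10.0.0.1")))
```

Passing the returned list to subprocess.run would execute the command; keeping the two steps separate makes the script easy to review first.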

Tying it all together

To set up a distributed training environment on multiple computers using a main computer and the username and password provided by the computer owners, you can use a tool that automates the installation and configuration of software on remote servers. One such tool is ansible, which is a configuration management tool that allows you to define the desired state of your infrastructure and then automatically enforce that state.

Here is an example of how you can use ansible to set up a distributed training environment on multiple computers, including the Common Internet File System (CIFS), PyTorch, training script, dataset, etc., and then execute the training:

Write an ansible playbook that specifies the required steps to set up the training environment. For example:

---
- name: Set up distributed training environment
  hosts: all
  tasks:
    - name: Install Python
      package:
        name: python3
      become: yes

    - name: Install PyTorch
      pip:
        name: torch
      become: yes

    - name: Install CIFS utilities
      package:
        name: cifs-utils
      become: yes

    - name: Mount shared directory
      mount:
        path: /path/to/mount/point
        src: //10.0.0.1/shared
        fstype: cifs
        opts: username=guest,password=password
        state: mounted
      become: yes

    - name: Create training directory
      file:
        path: /opt/training
        state: directory
      become: yes

    - name: Install training code
      copy:
        src: training.py
        dest: /opt/training/training.py
      become: yes

    - name: Copy dataset
      copy:
        src: dataset.zip
        dest: /opt/training/dataset.zip
      become: yes

    - name: Unpack dataset
      unarchive:
        src: /opt/training/dataset.zip
        dest: /opt/training/
        remote_src: true
      become: yes

    - name: Execute training
      command: python3 /opt/training/training.py
      become: yes

Create an ansible inventory file that specifies the connection information for the computers. For example:

[windows]
windows_computer ansible_user=username ansible_password=password

[linux]
linux_computer ansible_user=username ansible_password=password

[mac]
mac_computer ansible_user=username ansible_ssh_private_key_file=/path/to/keyfile

Use ansible to apply the playbook to the computers. For example:

ansible-playbook -i inventory.ini playbook.yml
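
The inventory file can also be generated from a few answers instead of being edited by hand. The sketch below renders the same INI layout as the example inventory; the function names and dictionary keys are hypothetical, while ansible_host, ansible_user, ansible_password, and ansible_ssh_private_key_file are standard Ansible inventory variables.

```python
def inventory_line(host):
    """Format one host entry from a dictionary of connection details."""
    parts = [host["name"]]
    if "address" in host:
        parts.append("ansible_host={}".format(host["address"]))
    parts.append("ansible_user={}".format(host["user"]))
    if "password" in host:
        parts.append("ansible_password={}".format(host["password"]))
    if "key_file" in host:
        parts.append("ansible_ssh_private_key_file={}".format(host["key_file"]))
    return " ".join(parts)


def build_inventory(groups):
    """Render an INI inventory from a mapping of group name to host entries."""
    sections = []
    for group, hosts in groups.items():
        lines = ["[{}]".format(group)] + [inventory_line(h) for h in hosts]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    # Placeholder values in the style of the example inventory
    print(build_inventory({
        "linux": [{"name": "linux_computer", "user": "username", "password": "password"}],
        "mac": [{"name": "mac_computer", "user": "username", "key_file": "/path/to/keyfile"}],
    }))
```

Writing the returned string to inventory.ini gives a file that can be passed to ansible-playbook with the -i option.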

To allow a non-technical person to provide the information that ansible requires to connect to their computer, you can follow these steps:

Explain to the person what ansible is and why you need access to their computer.
Provide the person with the ansible inventory file template that you will use to connect to their computer. This should include the hostname or IP address, username, and password or private key for the computer.
Ask the person to fill in the ansible inventory file template with the appropriate information for their computer.
To find the hostname or IP address of the computer, the person can do the following:

On Windows:
1. Press the Windows + R keys to open the Run dialog.
2. Type cmd and press Enter to open the command prompt.
3. Type ipconfig and press Enter.
4. Look for the IPv4 Address field. This is the IP address of the computer.
On macOS:
1. Click the Apple menu and select System Preferences.
2. Click the Network icon.
3. Select the active network connection (e.g., Wi-Fi) from the list on the left.
4. Click the Advanced button.
5. Click the TCP/IP tab.
6. The IP address field displays the IP address of the computer.
On Linux:
1. Open a terminal.
2. Type ip addr and press Enter (on older systems, use ifconfig instead).
3. Look for the inet field. This is the IP address of the computer.
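
If Python is already installed on the computer, the same information can be read with a short script that works on all three operating systems. This is a sketch under the assumption that the address used for outgoing traffic is the one the other computers should connect to; the target address 192.0.2.1 is a reserved documentation address chosen only to select a route, and the function name is hypothetical.

```python
import ipaddress
import socket


def local_ip_address():
    """Return the IPv4 address this computer would use for outgoing traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route
        sock.connect(("192.0.2.1", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable network route, fall back to the loopback address
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    address = local_ip_address()
    # Validate the result before handing it to anyone
    print(ipaddress.ip_address(address))
```

On a computer with several network interfaces the result depends on the routing table, so it is worth comparing it with the address shown by the steps above.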

To find the username and password of the computer, the person can do the following:

On Windows:
1. Press the Windows + R keys to open the Run dialog.
2. Type control userpasswords2 and press Enter to open the User Accounts dialog.
3. Find the account that you want to use to connect to the computer in the list of users. The User Name column shows the username.
4. The existing password cannot be displayed. If the owner does not remember it, select the account and click the Reset Password button to set a new one.
On macOS:
1. Click the Apple menu and select System Preferences.
2. Click the Users & Groups icon.
3. Click the Lock icon and enter the administrator password to make changes.
4. Click the user account that you want to use to connect to the computer.
5. The existing password cannot be displayed. If the owner does not remember it, click the Change Password button to set a new one.
On Linux:
1. Open a terminal.
2. Type id and press Enter to view the current user and