Imagine you could connect to a friend's computer and use it to speed up the training of your model.
Training ML models is computationally expensive, and because ML techniques require many iterations, training speed matters.
The Solution: Distributed Computing
Distributed training is a technique that allows you to train machine learning models using multiple computers. This can be useful when you have large datasets or models that are too computationally intensive to train on a single computer.
The challenge is setting up the infrastructure that enables multiple computers to work together. One way to do this is to use the Common Internet File System (CIFS), a dialect of the SMB file-sharing protocol, to share data and resources between the computers.
To set up a CIFS share, you can use a single script that is capable of running on multiple operating systems (OS). This script can be used to install the necessary software and configure the share on each computer. For example, on a Linux computer, the script can install the CIFS utilities and configure the Samba server, while on a macOS or Windows computer, which already includes an SMB client (mount_smbfs and net use, respectively), the script only needs to mount the shared directory.
Once the CIFS is set up, you can start the model training process on each computer. By starting the same training process on different computers, you can connect the various training processes and allow them to communicate with each other.
The various training processes communicate with each other using a process called message passing. This involves sending messages between the computers to exchange information about the model training, such as the gradients of the model parameters. The computers also discover each other using a process called process discovery. This involves identifying the other computers that are part of the distributed training process and determining how to communicate with them.
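To make the message-passing idea concrete, here is a minimal sketch of the averaging step that workers typically apply to their gradients. It simulates the workers inside a single Python process, so nothing is actually sent over a network. The function name average_gradients and the example values are assumptions made for illustration; they are not part of PyTorch.

```python
from typing import List


def average_gradients(worker_gradients: List[List[float]]) -> List[float]:
    """Average per-parameter gradients reported by several workers."""
    if not worker_gradients:
        raise ValueError("need gradients from at least one worker")
    num_params = len(worker_gradients[0])
    if any(len(grads) != num_params for grads in worker_gradients):
        raise ValueError("all workers must report the same number of gradients")
    num_workers = len(worker_gradients)
    # Sum each parameter's gradient across workers, then divide by the worker count.
    return [
        sum(grads[i] for grads in worker_gradients) / num_workers
        for i in range(num_params)
    ]


# Three simulated workers, each holding gradients for two parameters.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = average_gradients(gradients)
print(averaged)
```

In a real setup each worker would receive the other workers' gradients as messages, apply the same average, and so keep its copy of the model in step with the rest.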
In summary, distributed training allows you to train machine learning models using multiple computers. By setting up a CIFS and starting the same training process on different computers, you can connect the various training processes and enable them to communicate and discover each other. This can be useful when you have large datasets or models that are too computationally intensive to train on a single computer.
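Because every computer sees the same shared directory, one simple way to approach discovery is for each worker to announce itself with a small file on the share. The sketch below is an assumption about how this could be done, not a description of what PyTorch does internally; the function names, the worker-*.json file naming, and the addresses are all hypothetical, and a temporary directory stands in for the mounted share.

```python
import json
import os
import tempfile
from typing import Dict


def register_worker(shared_dir: str, name: str, address: str) -> None:
    """Announce this worker by writing a small JSON file to the share."""
    path = os.path.join(shared_dir, "worker-{}.json".format(name))
    with open(path, "w") as f:
        json.dump({"name": name, "address": address}, f)


def discover_workers(shared_dir: str) -> Dict[str, str]:
    """Return a name -> address map of every worker that has registered."""
    workers = {}
    for filename in sorted(os.listdir(shared_dir)):
        if filename.startswith("worker-") and filename.endswith(".json"):
            with open(os.path.join(shared_dir, filename)) as f:
                info = json.load(f)
            workers[info["name"]] = info["address"]
    return workers


# Demo: a temporary directory stands in for the mounted share.
with tempfile.TemporaryDirectory() as shared_dir:
    register_worker(shared_dir, "laptop", "10.0.0.2")
    register_worker(shared_dir, "desktop", "10.0.0.3")
    found = discover_workers(shared_dir)
    print(found)
```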
How do you start and stop training on the various computers?
To start and stop model training on the computers connected to a Common Internet File System (CIFS) using PyTorch, you can use the following techniques:
- Start training on all computers simultaneously: You can start training on all computers simultaneously by running the training script on each computer at the same time. This can be done manually by starting the script on each computer, or automatically using a tool like fabric or ansible.
- Start training on one computer and then add more resources: You can start small and scale up as needed. On a single computer, the torch.nn.DataParallel wrapper splits each batch across that computer's GPUs, and you can use more GPUs by passing more device IDs to the DataParallel wrapper. DataParallel does not span computers; for that, PyTorch provides torch.nn.parallel.DistributedDataParallel, with one training process started on each computer. For example, the single-computer case:
import torch
import torchvision

# Set the data directory to the mount point
data_dir = "/path/to/mount/point"
# Load the dataset (DatasetFolder lives in torchvision, not torch.utils.data)
dataset = torchvision.datasets.DatasetFolder(data_dir, ...)
# Batch size is an example value
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Set up the model and DataParallel wrapper with one GPU
# (MyModel is a placeholder for your own torch.nn.Module; device_ids are GPU indices)
model = MyModel().to("cuda:0")
model = torch.nn.DataParallel(model, device_ids=[0])
# Example loss and optimizer; substitute your own
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Use DataParallel to parallelize the model training
for input, target in loader:
    optimizer.zero_grad()
    output = model(input.to("cuda:0"))
    loss = loss_fn(output, target.to("cuda:0"))
    loss.backward()
    optimizer.step()

# Add more GPUs by re-wrapping the underlying model
model = torch.nn.DataParallel(model.module, device_ids=[0, 1, 2])
This lets you start training on one GPU and then use more GPUs on the same computer as needed.
- Stop training on one or more computers: To stop training on one or more computers, you can simply stop the training script on those computers.
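Stopping each script by hand works, but with a shared directory in place there is another option: every training loop can check the share for a stop file between steps, so creating one file stops all computers. This is only a sketch under stated assumptions. The file name STOP, the function names, and the step callback are hypothetical, and a temporary directory stands in for the mounted share.

```python
import os
import tempfile
from typing import Callable

STOP_FILE = "STOP"  # hypothetical sentinel file name on the share


def should_stop(shared_dir: str) -> bool:
    """Return True once someone has created the stop file on the share."""
    return os.path.exists(os.path.join(shared_dir, STOP_FILE))


def train(shared_dir: str, max_steps: int, step_fn: Callable[[int], None]) -> int:
    """Run up to max_steps steps, checking for the stop file before each one."""
    steps_done = 0
    for step in range(max_steps):
        if should_stop(shared_dir):
            break
        step_fn(step)  # a real script would run one training step here
        steps_done += 1
    return steps_done


# Demo: a temporary directory stands in for the mounted share.
with tempfile.TemporaryDirectory() as shared_dir:

    def demo_step(step: int) -> None:
        # Stand-in for a training step. During the third step it simulates
        # another computer creating the stop file.
        if step == 2:
            open(os.path.join(shared_dir, STOP_FILE), "w").close()

    completed = train(shared_dir, max_steps=10, step_fn=demo_step)
    print(completed)
```

Checking between steps means a worker always finishes the step it is on, which leaves a natural place to save a checkpoint before exiting.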
Code that demonstrates and tests the CIFS setup
To demonstrate and test the Common Internet File System (CIFS) set up, you can write a Python script that performs the following tasks:
- Create a test file on one of the computers.
- Check if the test file is accessible from the other computers.
- Modify the test file on one of the computers.
- Check if the modifications are reflected on the other computers.
Here is an example of a Python script that demonstrates and tests the CIFS set up:
import os
import sys

# Usage: python test_cifs.py <mount point> <write|append|check>
# (test_cifs.py is a placeholder name for this script)
# The mount point is where this computer mounts the share, for example
# "Z:/" on Windows or "/path/to/mount/point" on Linux and macOS.
mount_point, mode = sys.argv[1], sys.argv[2]
test_file = os.path.join(mount_point, "test.txt")

if mode == "write":
    # Create a test file on one of the computers
    with open(test_file, "w") as f:
        f.write("This is a test file.")
elif mode == "append":
    # Modify the test file on one of the computers
    with open(test_file, "a") as f:
        f.write("\nThis line was added later.")
elif mode == "check":
    # Run this on the other computers: check if the test file is accessible
    if os.path.exists(test_file):
        print("Test file is accessible from this computer.")
        # Check if the modifications are reflected on this computer
        with open(test_file, "r") as f:
            content = f.read()
        if "This line was added later." in content:
            print("Modifications are reflected on this computer.")
        else:
            print("Modifications are not reflected on this computer.")
    else:
        print("Test file is not accessible from this computer.")
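Network file systems may cache file contents on the client, so a change made on one computer is not always visible on another right away. A check like the one in the test script above can therefore be wrapped in a short polling loop. The helper below is a sketch; the name wait_for_text and the timeout values are assumptions, and a temporary directory stands in for the mounted share.

```python
import os
import tempfile
import time


def wait_for_text(path: str, text: str, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll path until it contains text, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            with open(path, "r") as f:
                if text in f.read():
                    return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo: a temporary directory stands in for the mounted share.
with tempfile.TemporaryDirectory() as shared_dir:
    test_file = os.path.join(shared_dir, "test.txt")
    with open(test_file, "w") as f:
        f.write("This is a test file.")
    found = wait_for_text(test_file, "test file", timeout=1.0, interval=0.1)
    missing = wait_for_text(test_file, "not there", timeout=0.3, interval=0.1)
    print(found, missing)
```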
Setting up CIFS
Here is a Python script that you can use to detect the operating system and set up a Common Internet File System (CIFS) on each computer. Run it first on the Linux computer that hosts the share, where it prints that computer's IP address, and then on each macOS or Windows computer with that IP address as the first argument:
import os
import platform
import subprocess
import sys

# Detect the operating system
os_name = platform.system()
# Set up the CIFS file system
if os_name == "Linux":
    # Get the IP address of this computer, which hosts the share
    # ("hostname -I" may list several addresses; use the first)
    ip_address = subprocess.check_output(["hostname", "-I"], text=True).split()[0]
    # Install the Samba server and the CIFS utilities
    subprocess.run(["sudo", "apt-get", "install", "-y", "samba", "cifs-utils"], check=True)
    # Create a shared directory
    os.makedirs("/path/to/shared/directory", exist_ok=True)
    # Edit the Samba configuration file (requires running this script as root)
    with open("/etc/samba/smb.conf", "a") as f:
        f.write("\n[shared]\n")
        f.write("path = /path/to/shared/directory\n")
        f.write("writable = yes\n")
        f.write("guest ok = yes\n")
    # Restart the Samba server
    subprocess.run(["sudo", "systemctl", "restart", "smbd"], check=True)
    print("Sharing //{}/shared".format(ip_address))
elif os_name == "Darwin":
    # IP address of the Linux computer that hosts the share
    server_ip = sys.argv[1]
    # mount_smbfs ships with macOS, so nothing needs to be installed
    # Create a mount point
    os.makedirs("/path/to/mount/point", exist_ok=True)
    # Mount the shared directory
    subprocess.run(["sudo", "mount_smbfs", "//guest@{}/shared".format(server_ip), "/path/to/mount/point"], check=True)
elif os_name == "Windows":
    # IP address of the Linux computer that hosts the share
    server_ip = sys.argv[1]
    # Mount the shared directory using the "net use" command
    subprocess.run(["net", "use", "Z:", "\\\\{}\\shared".format(server_ip)], check=True)
else:
    sys.exit("Unsupported operating system.")
print("CIFS file system set up complete.")
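Because a setup script like this runs privileged commands, it can help to build each command first and inspect it before executing anything. The sketch below only returns the mount command for a given operating system and never runs it. The function name build_mount_command, the example mount points, and the Linux client branch (mount -t cifs with the guest option) are assumptions added for illustration.

```python
from typing import List


def build_mount_command(os_name: str, server_ip: str, share: str, mount_point: str) -> List[str]:
    """Return the command that would mount the share; nothing is executed."""
    if os_name == "Linux":
        # Assumes cifs-utils is installed on the client.
        return ["sudo", "mount", "-t", "cifs",
                "//{}/{}".format(server_ip, share), mount_point, "-o", "guest"]
    if os_name == "Darwin":
        return ["mount_smbfs", "//guest@{}/{}".format(server_ip, share), mount_point]
    if os_name == "Windows":
        # On Windows the "mount point" is a drive letter such as "Z:".
        return ["net", "use", mount_point, "\\\\{}\\{}".format(server_ip, share)]
    raise ValueError("Unsupported operating system: {}".format(os_name))


# Print the command for each platform so it can be reviewed before running.
for name, target in [("Linux", "/mnt/shared"), ("Darwin", "/Volumes/shared"), ("Windows", "Z:")]:
    print(" ".join(build_mount_command(name, "10.0.0.1", "shared", target)))
```

The returned list can be passed to subprocess.run once it looks right, and platform.system() supplies the os_name value on the computer where the script runs.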
Tying it all together
To set up a distributed training environment on multiple computers using a main computer and the username and password provided by the computer owners, you can use a tool that automates the installation and configuration of software on remote servers. One such tool is ansible, which is a configuration management tool that allows you to define the desired state of your infrastructure and then automatically enforce that state.
Here is an example of how you can use ansible to set up a distributed training environment on multiple computers, including the Common Internet File System (CIFS), PyTorch, training script, dataset, etc., and then execute the training:
- Write an ansible playbook that specifies the required steps to set up the training environment. For example:
---
- name: Set up distributed training environment
  # These modules target Linux hosts; Windows and macOS computers need their own plays.
  hosts: linux
  tasks:
    - name: Install Python and pip
      package:
        name:
          - python3
          - python3-pip  # package name on Debian/Ubuntu; may differ elsewhere
      become: yes
    - name: Install PyTorch
      pip:
        name: torch
      become: yes
    - name: Install CIFS utilities
      package:
        name: cifs-utils
      become: yes
    - name: Mount shared directory
      mount:
        path: /path/to/mount/point
        src: //10.0.0.1/shared
        fstype: cifs
        opts: username=guest,password=password
        state: mounted
      become: yes
    - name: Create training directory
      file:
        path: /opt/training
        state: directory
      become: yes
    - name: Install training code
      copy:
        src: training.py
        dest: /opt/training/training.py
      become: yes
    - name: Copy dataset
      copy:
        src: dataset.zip
        dest: /opt/training/dataset.zip
      become: yes
    - name: Extract dataset
      unarchive:
        src: /opt/training/dataset.zip
        dest: /opt/training/
        remote_src: true
      become: yes
    - name: Execute training
      command: python3 /opt/training/training.py
      become: yes
- Create an ansible inventory file that specifies the connection information for the computers. For example:
[windows]
windows_computer ansible_user=username ansible_password=password
[linux]
linux_computer ansible_user=username ansible_password=password
[mac]
mac_computer ansible_user=username ansible_ssh_private_key_file=/path/to/keyfile
- Use ansible to apply the playbook to the computers. For example:
ansible-playbook -i inventory.ini playbook.yml
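If the connection details are collected from several people, the inventory file can also be generated rather than edited by hand. The following is a minimal sketch: the function name render_inventory and the example hosts and addresses are hypothetical, and it assumes that no value contains spaces, since it does no quoting.

```python
from typing import Dict, List


def render_inventory(groups: Dict[str, List[Dict[str, str]]]) -> str:
    """Render an INI-style inventory from group -> list of host dictionaries."""
    lines = []
    for group, hosts in groups.items():
        lines.append("[{}]".format(group))
        for host in hosts:
            # Every key except "name" becomes a key=value host variable.
            variables = " ".join(
                "{}={}".format(key, value)
                for key, value in host.items()
                if key != "name"
            )
            lines.append("{} {}".format(host["name"], variables).strip())
    return "\n".join(lines) + "\n"


# Example details as they might be collected from the computer owners.
details = {
    "linux": [
        {"name": "linux_computer", "ansible_host": "10.0.0.2", "ansible_user": "username"},
    ],
    "mac": [
        {"name": "mac_computer", "ansible_host": "10.0.0.3", "ansible_user": "username",
         "ansible_ssh_private_key_file": "/path/to/keyfile"},
    ],
}
inventory_text = render_inventory(details)
print(inventory_text)
```

Writing inventory_text to inventory.ini produces a file that can be passed to the ansible-playbook command shown above.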
To allow a non-technical person to provide the information that ansible requires to connect to their computer, you can follow these steps:
- Explain to the person what ansible is and why you need access to their computer.
- Provide the person with the ansible inventory file template that you will use to connect to their computer. This should include the hostname or IP address, username, and password or private key for the computer.
- Ask the person to fill in the ansible inventory file template with the appropriate information for their computer.
- To find the hostname or IP address of the computer, the person can do the following:
- On Windows:
  - Press the Windows+R keys to open the Run dialog.
  - Type cmd and press Enter to open the command prompt.
  - Type ipconfig and press Enter.
  - Look for the IPv4 Address field. This is the IP address of the computer.
- On macOS:
  - Click the Apple menu and select System Preferences.
  - Click the Network icon.
  - Select the active network connection (e.g., Wi-Fi) from the list on the left.
  - Click the Advanced button.
  - Click the TCP/IP tab.
  - The IP address field displays the IP address of the computer.
- On Linux:
  - Open a terminal.
  - Type ifconfig (or ip addr on systems where ifconfig is not installed) and press Enter.
  - Look for the inet field. This is the IP address of the computer.
- To find the username and password of the computer, the person can do the following:
- On Windows:
  - Press the Windows+R keys to open the Run dialog.
  - Type control userpasswords2 and press Enter to open the User Accounts dialog.
  - Click the Manage another account link.
  - Click the account that you want to use to connect to the computer.
  - Click the Change the account name or Change the password button to change the username or password for the account. The existing password is not displayed, so the person must already know it or set a new one.
- On macOS:
  - Click the Apple menu and select System Preferences.
  - Click the Users & Groups icon.
  - Click the Lock icon and enter the administrator password to make changes.
  - Click the user account that you want to use to connect to the computer.
  - Click the Change Password button to set a new password for the account. The existing password is not displayed.
- On Linux:
  - Open a terminal.
  - Type id and press Enter to view the current user and