The PPO loss function is mainly composed of two losses:

- Policy Loss
- Value Loss

#### Story of the two losses

PPO makes the language model generate responses that are highly rated, while forcing it not to change the generated responses too much. The policy loss does both jobs: it pushes the model towards highly rated responses and limits how far each update can move it from the old policy. The value loss trains the model to predict the rating a response will receive.

So PPO makes the smallest modifications needed to turn a response with a low rating into a response with a high rating.

For example, if the non-PPO model answered the query “What is the capital of France?” with “The capital of France is Paris”, which has a high rating from a human, then PPO will leave it as is. It would not pick the response “The capital of Spain is Madrid”, even though that response also has a high human rating, because the two responses are very different and we are forcing the model to stay close to its original response.

#### Policy Loss

Policy loss compares the probabilities that the old policy and the new policy assign to the generated tokens. It measures how well the current policy (i.e., the model’s strategy for choosing words in a review) aligns with the old policy. For example, if the old policy was to always use positive words (such as “amazing” and “excellent”) in the reviews, and the new policy sometimes uses positive words and sometimes uses neutral or negative words, the policy loss would be high because the two policies are not very similar.

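`ratio = torch.exp(logprob - old_logprobs)`

This ratio of new to old probabilities is the building block of the policy loss: it is multiplied by the advantages to give the loss.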
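As a rough illustration of the line above, here is a minimal pure-Python sketch of the same exponentiated difference of log probabilities. The function name `probability_ratio` and the probability values are made up for the example; the original code works on PyTorch tensors instead of lists.

```python
import math

def probability_ratio(new_logprobs, old_logprobs):
    """Per-token ratio pi_new(token) / pi_old(token), computed from log probabilities."""
    return [math.exp(new - old) for new, old in zip(new_logprobs, old_logprobs)]

# Hypothetical probabilities for two generated tokens.
old_logprobs = [math.log(0.50), math.log(0.20)]
new_logprobs = [math.log(0.50), math.log(0.40)]

ratios = probability_ratio(new_logprobs, old_logprobs)
# First token: probability unchanged, so the ratio is about 1.
# Second token: probability doubled, so the ratio is about 2.
```

A ratio close to 1 means the new policy treats that token the same way the old policy did.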

#### Value Loss

It measures how well the model predicts the expected reward (i.e., the rating) of each response generated. For example, if the model predicts that a particular review will receive a high rating, but in reality it receives a low rating, the value loss would be high because the model’s prediction was incorrect.

**Value loss** is calculated as the squared difference between the predicted rewards and the actual rewards.

**Actual rewards** are the scores coming directly from the training set. They are the scores given to each query and response sample. Here is an example of a training sample.

```
query: "What is the capital of France?"
response: "The capital of France is Paris."
score: 1.0
```

Note 1: The actual reward in this case would be 1.0. These scores can come from a variety of sources, such as a reward function or a human evaluator. In practice, a reward model is trained to generate the scores, given a query and response pair.

Note 2: A penalty is subtracted from the actual reward. This penalty is high if the tokens generated by the PPO-trained model are very different from those of the original reference model, and low if they are the same. It is known as the KL penalty, because it is based on the KL divergence between the two models.

**Predicted reward** is simply the value coming from the value head added to the GPT-2 model being trained. This value head contains an estimate of the rewards that the model is predicting at this training step.

The predicted reward is calculated as follows:

Predicted reward = expected reward + advantage

where:

- expected reward is the expected value of the reward that the agent will receive for a particular action, based on the current policy
- advantage is the difference between the expected reward for the action and the expected reward for the current policy

In mathematical notation, the predicted reward can be written as:

`R̂(s, a) = ∑_r r · p(r|s, a) + A(s, a)`

where:

```
R̂(s, a) is the predicted reward for action a in state s
p(r|s, a) is the probability of receiving reward r for action a in state s
A(s, a) is the advantage of action a in state s
```

#### Advantages

**Advantages** are a measure of how good an action is compared to the average action taken by the current policy. To understand the role of advantages in PPO, it is helpful to consider a simple example. Imagine that you are trying to decide which of two paths to take through a maze to reach a treasure chest. One path is longer and more winding, but has a higher probability of leading to the treasure. The other path is shorter and more direct, but has a lower probability of leading to the treasure.

If you were using the PPO algorithm to make this decision, you would calculate the predicted reward for each path, which is the expected value of the treasure that you would find at the end of that path. You could then calculate the advantage of a path by subtracting a baseline from its predicted reward. The baseline here is simply the average predicted reward over all the paths that can be taken.

The advantage function is calculated as follows:

`A(s, a) = Q(s, a) - V(s)`

where:

```
Q(s, a) is the action-value function, which estimates the expected reward for action a in state s
V(s) is the state-value function, which estimates the expected reward for being in state s
```
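The maze example can be made concrete with a small sketch. It assumes V(s) is the average of Q(s, a) weighted by how often the policy picks each path; the path names, Q values and probabilities are invented for illustration.

```python
def advantages_from_q(q_values, action_probs):
    """Return V(s) and a dict of A(s, a) = Q(s, a) - V(s).

    V(s) is taken to be the policy-weighted average of Q(s, a).
    """
    state_value = sum(action_probs[a] * q_values[a] for a in q_values)
    advantages = {a: q_values[a] - state_value for a in q_values}
    return state_value, advantages

# Hypothetical expected treasure for each path, and a policy that picks either path equally often.
q_values = {"long_path": 8.0, "short_path": 4.0}
action_probs = {"long_path": 0.5, "short_path": 0.5}

state_value, advantages = advantages_from_q(q_values, action_probs)
# state_value is 6.0; the long path has advantage +2.0 and the short path -2.0.
```

A positive advantage marks a path that is better than what the policy achieves on average, a negative one a path that is worse.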

**State-value function**

There are several ways to calculate the state-value function, depending on the specific problem and the available information. Some common approaches include:

- Monte Carlo evaluation: In this approach, the state-value function is calculated by averaging the rewards that are received over a number of episodes or interactions with the environment. For example, if the agent is in state s and takes a number of actions, the state-value function can be calculated as the average of the rewards that are received for those actions.
- Temporal Difference (TD) learning: In this approach, the state-value function is updated based on the difference between the expected reward and the actual reward that is received. For example, if the agent is in state s and takes an action that leads to a reward of r and a new state s’, the state-value function can be updated using the following equation:

```
V(s) = V(s) + α(r + γV(s') - V(s))
where α is the learning rate and γ is the discount factor.
```

- Neural networks: The state-value function can also be approximated using a neural network, which is trained to predict the expected reward for a given state. The neural network can be trained using supervised learning, by providing it with a dataset of states and corresponding rewards, or using reinforcement learning, by providing it with a reward signal as it interacts with the environment.
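The first two approaches in the list above can be sketched in a few lines. This is only an illustration: the function names, the sample returns, and the values of α and γ are assumptions chosen for the example.

```python
def monte_carlo_value(episode_returns):
    """Monte Carlo estimate of V(s): the average return observed after visiting s."""
    return sum(episode_returns) / len(episode_returns)

def td_update(v_s, reward, v_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update: V(s) + alpha * (r + gamma * V(s') - V(s))."""
    return v_s + alpha * (reward + gamma * v_next - v_s)

# Monte Carlo: four hypothetical episodes that started in state s.
mc_estimate = monte_carlo_value([1.0, 0.0, 1.0, 1.0])

# TD: the current estimate is 0.5, the agent receives reward 1.0 and lands in a
# terminal state whose value is 0.0, so the estimate moves a little towards 1.0.
new_value = td_update(v_s=0.5, reward=1.0, v_next=0.0, alpha=0.1, gamma=0.99)
```

Monte Carlo waits for complete episodes before updating, while the TD update can be applied after every single step.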

**Action-value function**

Log probabilities are used because we can then subtract the old policy’s log probability from the new policy’s and take the exponent to get the ratio of the two probabilities. Raw probabilities can be very small numbers. Working in log space keeps them in a range where they can be subtracted safely, and `exp` turns the difference back into the ratio between the new and the old policy.

What is a value head?

The value head predicts the future rewards of a review based on the current sequence of words in the review. It is an additional output head on top of the language model.

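What is GPT2HeadWithValueModel?

Judging by its name and by how the model is called in the code further down, it is the GPT-2 language model with the value head attached, so a single forward pass returns both the logits and the predicted values (`vpred`). The predicted values feed into the value loss: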

```
advantages = rewards[:, t]
returns = advantages + values  # value targets: advantages plus the old value predictions
vf_losses1 = (vpred - returns)**2
```

#### KL Controllers

In the context of a Proximal Policy Optimization (PPO) algorithm, a KL controller is a mechanism for controlling how far the trained policy is allowed to drift from the reference model, which in turn affects the trade-off between exploration and exploitation. It does this by setting the coefficient (`kl_ctl.value` in the code below) that scales the KL penalty subtracted from the reward.

```
kl = logprob - ref_logprob
non_score_reward = -self.kl_ctl.value * kl
non_score_rewards.append(non_score_reward)
reward = non_score_reward.clone()
reward[-1] += score
rewards.append(reward)
```

An Adaptive KL Controller adjusts the penalty coefficient based on how much the policy has actually changed. If the measured KL divergence is above a target value, the controller raises the coefficient, which restricts further change and keeps the model focused on exploiting what it already does well. If the measured KL divergence is below the target, the controller lowers the coefficient, which allows more change and lets the policy explore new areas of the solution space.

A Fixed KL Controller, on the other hand, uses the same penalty coefficient for every optimization step. This means that the algorithm will always penalize change in the same way, regardless of how far the policy has moved. This can make the algorithm less efficient, because it cannot adapt to changing circumstances. However, it is also simpler and more predictable, because the strength of the penalty never changes during training.
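The two controllers can be sketched as follows. This is a minimal illustration of one common proportional scheme, not necessarily the exact code behind `kl_ctl`: the class layout, the `update` signature, the clipping bound of 0.2, and the numbers in the usage lines are all assumptions.

```python
class FixedKLController:
    """Keeps the KL penalty coefficient constant."""

    def __init__(self, kl_coef):
        self.value = kl_coef

    def update(self, current_kl, n_steps):
        pass  # nothing to adapt


class AdaptiveKLController:
    """Nudges the coefficient so that the measured KL moves towards a target."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error between measured and target KL, clipped to +/- 0.2
        # so that one update cannot change the coefficient too abruptly.
        proportional_error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + proportional_error * n_steps / self.horizon


# Hypothetical settings: the measured KL (12.0) is above the target (6.0),
# so the coefficient grows slightly and the penalty becomes stronger.
adaptive = AdaptiveKLController(init_kl_coef=0.2, target=6.0, horizon=10000)
adaptive.update(current_kl=12.0, n_steps=256)

fixed = FixedKLController(kl_coef=0.2)
fixed.update(current_kl=12.0, n_steps=256)  # value stays at 0.2
```

In both cases the current coefficient is read from `.value`, which matches how `self.kl_ctl.value` is used in the reward snippet above.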

The actual reward is simply the reward for the current training step, reduced by a penalty that keeps the PPO-trained model from moving too far from the original reference model.

The actual reward is calculated by a function that takes three arguments: `scores`, `logprobs`, and `ref_logprobs`. These are likely the score for each response, the log probabilities of the tokens given the current policy, and the log probabilities of the tokens given the reference policy, respectively. The two sets of log probabilities are compared to ensure that the model does not start generating responses that are very different from the ones it generated before training. Scores are computed anew in each training step. They come from the reward model (or another scorer), not from the value head we added to GPT-2; the value head only predicts the reward for the corresponding tokens.

The function then iterates over each of these values and calculates the reward for each action. The reward is composed of two parts: the KL penalty and the score. The KL penalty is a measure of the difference between the current policy and the reference policy, and is calculated by subtracting the reference log probability from the current log probability. This difference is then multiplied by the negative of the factor `kl_ctl.value` to calculate the KL-penalty reward.

The final reward for each action is the KL-penalty reward, and the score is added to the reward of the last action in the response. The function returns a list of rewards and a list of KL-penalty rewards.
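The reward calculation described here can be sketched for a single response in plain Python. It mirrors the snippet in the KL Controllers section, where `reward[-1] += score` adds the score to the last token only; the function signature, the list-based inputs and the numbers are assumptions for illustration.

```python
def compute_rewards(score, logprobs, ref_logprobs, kl_coef):
    """Per-token rewards for one response: KL penalty everywhere, score on the last token."""
    # Per-token estimate of how far the current policy is from the reference policy.
    kl = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # The penalty is negative whenever the current policy is more confident than the reference.
    non_score_reward = [-kl_coef * k for k in kl]
    reward = list(non_score_reward)
    reward[-1] += score  # the score applies to the response as a whole
    return reward, non_score_reward

# Hypothetical three-token response with a score of 1.0.
rewards, non_score_rewards = compute_rewards(
    score=1.0,
    logprobs=[-1.0, -2.0, -0.5],
    ref_logprobs=[-1.5, -2.0, -1.0],
    kl_coef=0.2,
)
# rewards is approximately [-0.1, 0.0, 0.9]
```

Only the last token carries the score, so every earlier token is rewarded purely for staying close to the reference model.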

Policy loss compares the probabilities that the current policy and the older policy assign to the generated words. It is built this way because we don’t want the policy to change too drastically with each step.

`vf_losses1 = (vpred - returns)**2`

How is the predicted value calculated?

`vpred` is the predicted value of the current state, which is the expected future reward if the agent were to follow its current policy from that state. `model_input` is the combined query and response tokens. The logits are **the unnormalized final scores of the model**; applying softmax to them gives a probability distribution over the vocabulary.

To calculate the value loss of a state in a language model trained with PPO, we would first need to calculate the returns of the state. The returns serve as the actual value of the state, since they estimate its expected future rewards. Then, we would need to predict the value of the state using the model. The predicted value is the output of the model’s “value function”, which estimates the expected future rewards of the state. Finally, we would calculate the value loss as the squared difference between the predicted value and the returns.

What is the value function?

The value function of the agent would be a function that estimates the expected value of the rewards the agent will receive if it follows its current policy from a given state.

What is meant by actual value of the state?

“Actual value” of a state refers to the expected future rewards of the state. The actual value of a state is the “ground truth” value that the model is trying to predict. For instance, if the agent’s current policy is to always move the rook to the left, and this leads to a win with high probability, the actual value of the current state would be high. On the other hand, if the current policy leads to a loss with high probability, the actual value of the state would be low.

`logits, _, vpred = self.model(model_input)`

In the code above, `vpred` stands for “values predicted”. It is the output of the value head: the model’s prediction of the rewards the response will get, not the rewards it actually got. It should not be confused with `logits`, which are the outputs of the layer before softmax is applied.

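`vf_losses1 = (vpred - returns)**2`

Value loss is the squared difference between the predicted rewards and the actual rewards. The returns are calculated as the advantages plus the value predictions made when the response was generated (`returns = advantages + values`).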
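To make the value loss concrete, here is a small sketch that builds the returns from advantages and old value predictions and then takes the squared difference. Averaging over tokens is an assumption made here to get a single number, and all values are invented.

```python
def value_loss(vpred, returns):
    """Mean squared difference between predicted values and returns."""
    squared_errors = [(p - r) ** 2 for p, r in zip(vpred, returns)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical per-token numbers for a three-token response.
values = [0.2, 0.4, 0.7]        # value head output when the response was generated
advantages = [0.1, -0.1, 0.3]   # how much better or worse each token turned out
returns = [a + v for a, v in zip(advantages, values)]  # targets: about [0.3, 0.3, 1.0]

vpred = [0.25, 0.35, 0.9]       # value head output at the current training step
loss = value_loss(vpred, returns)
```

The closer `vpred` gets to `returns`, the smaller the loss, which is what training the value head aims for.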

What is state?

The state of the model at a given time would be the sequence of words that have been generated so far in the review. For instance, if the first few words of the review are “The movie was”, the state of the model at this point would be “The movie was”.

What is expected rewards of a given state?

The expected future rewards of a state would be the expected value of the ratings that the review will receive if the model follows its current policy (i.e., its strategy for choosing words) from that state. Expected future rewards is the expected value of the ratings (i.e., the rewards) that the review will receive from viewers. For instance, if the model’s current policy is to always use positive words (such as “amazing” and “excellent”) in the reviews, and this leads to high ratings with high probability, the expected future rewards of the current state would be high. On the other hand, if the current policy leads to low ratings with high probability, the expected future rewards would be low.

In the line of code below, `delta` is calculated as the reward at time `t` plus the product of the `gamma` parameter and the `nextvalues` variable, minus the `values` at time `t`.

`delta = rewards[:, t] + self.ppo_params['gamma'] * nextvalues - values[:, t]`

The `gamma` parameter is the discount factor, which determines the importance of future rewards. It is typically a value between 0 and 1, where 0 indicates that future rewards are not important and 1 indicates that future rewards are just as important as current rewards.

The `nextvalues` variable is likely a tensor holding the predicted value of the next time step, `t + 1`. The `rewards` and `values` tensors likely contain the observed rewards and predicted values at each time step, respectively.

By subtracting the `values` at time `t` from the sum of the rewards at time `t` and the product of `gamma` and `nextvalues`, the `delta` variable measures the difference between what was observed (the reward plus the discounted value of the next step) and what was predicted (the value of the current step). This difference, known as the temporal-difference error, is used to compute the advantages, which in turn are used in the policy gradient loss.

`lam` is a parameter in the PPO algorithm that stands for lambda. It is used to compute the advantages of each action taken in the environment. The advantage of an action is a measure of how good the action is compared to the average action in the current state. The value of `lam` determines how much the advantage of an action depends on the advantage of future actions. A high value of `lam` means that the advantages of future actions will have a large impact on the current action’s advantage, while a low value of `lam` means that the advantages of future actions will have a small impact on the current action’s advantage.
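The way `gamma`, `lam` and `delta` combine can be sketched with a backward loop over the tokens of one response, in the style of generalized advantage estimation. The function name, the default values and the toy inputs are assumptions; the original code operates on batched tensors.

```python
def compute_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Return per-token advantages and returns for a single response."""
    advantages = [0.0] * len(rewards)
    last_advantage = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_advantage = delta + gamma * lam * last_advantage
        advantages[t] = last_advantage
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Toy response: only the last token is rewarded, and the value head predicts 0.5 everywhere.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]

# lam = 0: each advantage is just its own delta, so only the last token gets credit.
adv_short, _ = compute_advantages(rewards, values, gamma=1.0, lam=0.0)
# lam = 1: future deltas are passed back in full, so every token shares the credit.
adv_long, _ = compute_advantages(rewards, values, gamma=1.0, lam=1.0)
```

With these toy inputs `adv_short` is `[0.0, 0.0, 0.5]` and `adv_long` is `[0.5, 0.5, 0.5]`, which shows how a higher `lam` lets later outcomes influence earlier tokens.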

In the context of training a language model using the Proximal Policy Optimization (PPO) algorithm, the “current action” is the word the model generates at the current position in the response, and “future actions” are the words it generates after that position. The “current reward” is the reward assigned to the current action, and “future rewards” are the rewards assigned to the future actions.

For example, suppose we have the following query and response:

Query: “How are you feeling today?”

Response: “I’m feeling great!”

If the model generates the response one word at a time, given the query and the previous words in the response, then at the step where it has produced “I’m” and is choosing “feeling”, the “current action” is the word “feeling” and the “future actions” are the words “great” and “!”. The “current reward” is the reward assigned to “feeling”, and the “future rewards” are the rewards assigned to “great” and “!”.

Policy loss is a measure of how well the model’s choices maximize the rewards. It is calculated from the probabilities the model assigns to the actions that were actually taken, weighted by how good those actions turned out to be, and the model’s parameters are adjusted to make good actions more likely.

Value loss is a measure of how well the model predicts the future rewards for each action. It is calculated by comparing the model’s predicted values with the actual rewards received for each action and adjusting the model’s parameters to better align the two.

For example, if we have a language model that is being trained using PPO to generate responses to a query, we can calculate the policy loss from the responses the model itself generated. If a response received a higher reward than the model predicted, the policy loss pushes the model to make the words in that response more likely. If it received a lower reward than predicted, the loss pushes the model to make them less likely.

Similarly, we can calculate the value loss by comparing the model’s predicted rewards for each response with the actual rewards received. If the model’s predicted rewards are very different from the actual rewards, then the value loss will be high, indicating that the model is not very good at predicting the future rewards for each action.

Policy loss is calculated by multiplying the advantages (which represent how much better the actions turned out than expected) by the ratio of the new and old probabilities of the actions taken by the model, and negating the result. Minimizing it makes actions with a positive advantage more likely under the new policy and actions with a negative advantage less likely.
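Here is a minimal sketch of this calculation. It also includes the clipping of the ratio that PPO is known for, which the text above does not spell out; the function name, the `cliprange` value of 0.2, and the inputs are assumptions for illustration.

```python
import math

def policy_loss(new_logprobs, old_logprobs, advantages, cliprange=0.2):
    """Clipped PPO policy loss, averaged over the tokens of one response."""
    losses = []
    for new, old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new - old)
        clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
        # Take the more pessimistic of the two losses, so that moving the ratio
        # beyond the clip range brings no extra benefit.
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)

# Unchanged policy: every ratio is 1, so the loss is minus the mean advantage (here 0).
unchanged = policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0])

# One token whose probability doubled with a positive advantage:
# the ratio of about 2 is clipped to 1.2, so the loss is about -1.2 instead of -2.0.
doubled = policy_loss([math.log(0.4)], [math.log(0.2)], [1.0])
```

The clipping is what keeps a single update from rewarding very large changes to the policy.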

We iterate through each action (or word) in the generated response and calculate the advantage of adding that word. The advantage is calculated from the rewards observed and the rewards predicted. The predicted part is the difference between the predicted rewards after adding the next word and the predicted rewards after adding the current word.

If we ignore for a moment the predicted rewards after adding the word that follows the current word, then we are iterating through each word in the generated response and calculating the advantage simply by subtracting the predicted rewards from the observed rewards.

`delta = rewards[:, t] - values[:, t]`

So `delta` is close to zero if the predicted rewards are close to the observed rewards, and large otherwise. We then add to `delta` the predicted rewards for the word after the current word. How much of this future value we add depends on the `gamma` variable.

`delta += self.ppo_params['gamma'] * values[:, t + 1]`

The ratio is the change between the probability of a word under the new policy and its probability under the old policy. The bigger the change, the further the ratio is from 1, and the more the loss is amplified. The advantage is the difference between the rewards observed for the response and the rewards the model predicted.

A forward pass on the query and response returns the logits and the log probabilities. The logits are the values before softmax is applied, and the log probabilities are the logarithm of the values after softmax is applied.
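The relationship between the values before softmax (the logits) and the log probabilities can be shown with a short sketch. The three logits stand in for a tiny hypothetical vocabulary, and the helper `log_softmax` is written out by hand here instead of calling a library function.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_sum_exp for x in logits]

# Hypothetical logits for a vocabulary of three tokens.
logits = [2.0, 1.0, 0.1]
logprobs = log_softmax(logits)
probs = [math.exp(lp) for lp in logprobs]
# probs sums to 1, and the token with the largest logit has the largest probability.
```

The log probability of the token that was actually generated is the value that goes into `logprob` and `old_logprobs` in the ratio.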