
RLHF (Reinforcement Learning with Human Feedback) Python tutorial using TRLX

What is TRLX?

TRLX is a framework built on top of the Hugging Face transformers library for fine-tuning language models with RLHF.

Transformer Reinforcement Learning X (TRLX), developed by CarperAI, combines the Transformer architecture with reinforcement learning. Transformers are the dominant technique in natural language processing tasks such as language translation or text summarization, while reinforcement learning improves a model over time by rewarding or penalizing it based on its outputs.

By combining these two techniques, TRLX lets a language model learn and adapt its behaviour based on feedback. For example, a model might be trained to write movie reviews or answer questions; as it generates text, it receives a reward signal from a scoring model or from human raters and uses that signal to adjust its outputs and improve its performance.

TRLX can improve the performance and usefulness of language models in a wide range of applications, such as steering a model toward helpful, on-topic, or non-toxic text, where the desired behaviour is easier to score than to specify with hand-written rules.

What is RLHF?

RLHF stands for Reinforcement Learning from Human Feedback.

RLHF, or Reinforcement Learning from Human Feedback, is a technique that lets a model learn and improve its behaviour based on feedback from human users. This is different from a system that follows a fixed, pre-programmed set of rules and never adapts to its environment or its users.

RLHF uses reinforcement learning, which rewards the model for outputs that are beneficial to the user and penalizes it for outputs that are harmful or undesirable. Over time, this reinforcement process helps the model adapt its behaviour to better meet the needs and preferences of its human users.

Step 1: Installation

!git clone https://github.com/CarperAI/trlx.git
%cd trlx
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
!pip install -e .

Step 2: Setting up the environment

from datasets import load_dataset
from transformers import pipeline
import os
import yaml

import trlx
import torch
from typing import List
from trlx.data.configs import TRLConfig
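
The sentiment pipeline created in Step 4 below takes a device argument that is never defined elsewhere in the snippets. A minimal way to set it, assuming a single GPU with a CPU fallback:

# Use GPU 0 if available, otherwise fall back to CPU
# (-1 is the CPU index expected by Hugging Face pipelines)
device = 0 if torch.cuda.is_available() else -1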

Step 3: Fine Tune your Language Model (optional)

This is an optional step but highly recommended for serious applications. We pick GPT because it is particularly good at text generation tasks. Here is an example of code that shows how to fine-tune a Hugging Face GPT-2 language model:

# Import the libraries needed for fine-tuning
import torch
import transformers
from torch.utils.data import DataLoader

# Load the GPT-2 language model and tokenizer that you want to fine-tune
model = transformers.GPT2LMHeadModel.from_pretrained('<model_name>')
tokenizer = transformers.GPT2Tokenizer.from_pretrained('<model_name>')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, reuse EOS for padding

# Set the training parameters, such as the batch size and number of epochs
batch_size = 16
num_epochs = 5

# Define the dataset, a language-modelling collator, and the dataloader for the model
# ('<path_to_training_file>' is a placeholder for your own training text file)
dataset = transformers.LineByLineTextDataset(
    tokenizer=tokenizer, file_path='<path_to_training_file>', block_size=128
)
collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collator)

# Set the optimizer and learning rate for the model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Train the model for the specified number of epochs
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        # Passing labels makes the model return a language-modelling loss
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Save the fine-tuned model and tokenizer
# ('<output_dir>' is a placeholder for the directory you want to save to)
model.save_pretrained('<output_dir>')
tokenizer.save_pretrained('<output_dir>')

This code shows the basic steps for fine-tuning a Hugging Face language model. First, the model and its tokenizer are loaded from a pre-trained checkpoint. Next, the training parameters, such as the batch size and number of epochs, are set. The dataset, collator, and dataloader are then defined, and the optimizer and learning rate are set for the model.

Finally, the model is trained for the specified number of epochs on the data provided by the dataloader; passing the labels to the model makes it return the language-modelling loss that is backpropagated. After training, the fine-tuned model is saved so it can be loaded in the next step.
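
If you did fine-tune your own model, you can load it back from the saved directory and quickly sanity-check it. A minimal sketch, assuming the '<output_dir>' placeholder used above:

# Load the fine-tuned model and tokenizer back from the saved directory
finetuned = transformers.GPT2LMHeadModel.from_pretrained('<output_dir>')
finetuned_tokenizer = transformers.GPT2Tokenizer.from_pretrained('<output_dir>')

# Wrap it in a text-generation pipeline and generate a quick sample
generator = pipeline('text-generation', model=finetuned, tokenizer=finetuned_tokenizer)
print(generator('This movie was'))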

Step 4: Create a pipeline

sentiment_fn = pipeline(
    "sentiment-analysis",
    "lvwerra/distilbert-imdb",
    top_k=2,
    truncation=True,
    batch_size=256,
    device=device,
)

In the code above we used the pre-trained model lvwerra/distilbert-imdb and the sentiment-analysis pipeline, but it could just as well have been your fine-tuned model and a different pipeline. The Hugging Face transformers library provides a variety of pipelines that can be used for different natural language processing tasks. Some of the pipelines that are available include:

  • Sentiment analysis: This pipeline can be used to predict the sentiment (positive, neutral, or negative) of a given text.
  • Text generation: This pipeline can be used to generate new text based on a given prompt or input.
  • Text classification: This pipeline can be used to classify a given text into one or more pre-defined categories.
  • Named entity recognition: This pipeline can be used to identify and extract named entities (such as people, organizations, or locations) from a given text.
  • Question answering: This pipeline can be used to generate answers to questions based on a given text or document.
  • Summarization: This pipeline can be used to generate a concise summary of a given text or document.
  • Translation: This pipeline can be used to translate a given text from one language to another.

These are just a few examples of the pipelines that are available in the Hugging Face transformers library. There are many others, and new ones are being added all the time.
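
To make the reward function in Step 6 easier to follow, it helps to look at what the sentiment pipeline returns. With top_k=2 it returns, for each input text, a list of label/score dictionaries (the scores shown in the comment are illustrative, not actual model output):

example_output = sentiment_fn(["This movie was wonderful!"])
print(example_output)
# Roughly: [[{'label': 'POSITIVE', 'score': 0.98}, {'label': 'NEGATIVE', 'score': 0.02}]]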

Step 5: Create prompts

# Take a few words from movie reviews to use as prompts
imdb = load_dataset("imdb", split="train+test")
prompts = [" ".join(review.split()[:4]) for review in imdb["text"]]

There are several different ways that you can create prompts. Some of the most common methods include:

  • Using a pre-defined prompt: Many language models come with a set of pre-defined prompts that can be used to generate text or answers to questions. For example, a language model might include prompts such as “Tell me a story about a robot” or “Explain the concept of reinforcement learning in simple terms.”
  • Providing your own prompt: In many cases, you can also create your own custom prompts by providing a sentence, paragraph, or other text as input to the language model. For example, you might provide a prompt such as “Describe the benefits of using a large language model” or “Write a poem about the beauty of nature.”

In the movie review dataset we simply truncated each review and used the first four words as the prompt. The prompt should reflect whatever task you have fine-tuned your model to do.
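
Before training, you can sanity-check what the prompts look like (the exact reviews you see will depend on the dataset ordering):

# Each prompt is just the first four words of an IMDB review
print(len(prompts))
print(prompts[:3])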

Step 6: Reward Function

def get_positive_score(scores):
    "Extract value associated with a positive sentiment from pipeline's output"
    return dict(map(lambda x: tuple(x.values()), scores))["POSITIVE"]

def reward_fn(samples: List[str]) -> List[float]:
    sentiments = list(map(get_positive_score, sentiment_fn(samples)))
    return sentiments

This code takes the text that the model generates from the prompts you created earlier, runs it through the sentiment pipeline, and scores it. Since in this example we are using RLHF to generate positive reviews, we return only the confidence score of the positive sentiment class, and that score becomes the reward. This is the key step.

To illustrate with another example: imagine you have a text classification model that classifies text into several classes (for instance Factual, Opinionated, Impactful, and Useful), and your goal is to train a model to generate more Useful text. In that case you simply return the Useful score of the classification.

Note that in the example above the output of the sentiment analysis pipeline is itself a classification (Positive or Negative), but with a text generation model like GPT you could have taken the generated output, fed it into a different text classification model, taken the score of the class you care about (e.g. the Useful class), and used that score as the reward.
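
As a hedged sketch of that idea: assuming a hypothetical four-class classifier published at '<your-org>/usefulness-classifier' (the model name and labels below are made up for illustration), the reward function would look almost identical, only keyed on a different label:

# Hypothetical classifier and label names, for illustration only
usefulness_fn = pipeline(
    "text-classification",
    "<your-org>/usefulness-classifier",
    top_k=4,              # return scores for all four hypothetical classes
    truncation=True,
    batch_size=256,
    device=device,
)

def get_useful_score(scores):
    "Extract the score of the hypothetical 'Useful' class from the pipeline output"
    return dict(map(lambda x: tuple(x.values()), scores))["Useful"]

def useful_reward_fn(samples: List[str]) -> List[float]:
    return list(map(get_useful_score, usefulness_fn(samples)))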

I hope you can see the immense power behind this training technique. The rewards could come from another model, or directly from human raters (e.g. via Amazon Mechanical Turk). For ChatGPT, OpenAI first collected ratings from human raters and then fine-tuned a reward model to predict those ratings given a prompt and the generated text.
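
When the rewards come straight from human raters rather than a model, the reward function can be as simple as looking up the scores they assigned. A purely illustrative sketch (the ratings dictionary is hypothetical; in practice it would be filled from a labeling platform such as Mechanical Turk):

# Hypothetical store of human ratings for generated samples,
# keyed by sample text and valued by an averaged rater score
human_ratings = {}  # e.g. filled from a labeling platform

def human_reward_fn(samples: List[str]) -> List[float]:
    # Unrated samples get a neutral score of 0.0
    return [human_ratings.get(sample, 0.0) for sample in samples]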

Step 7: Train using TRLX library

# Load the default PPO config that ships with the trlx repository
# (the exact file name/path may differ between trlx versions)
config = TRLConfig.load_yaml("configs/ppo_config.yml")

model = trlx.train(
    reward_fn=reward_fn,
    prompts=prompts,
    eval_prompts=["I don't know much about Hungarian underground"] * 64,
    config=config,
)

eval_prompts are prompts used to periodically validate training. The idea is that these prompts are never used for learning, but purely as a way to evaluate how training is progressing. The example above takes the primitive approach of repeating a single prompt 64 times, but in a real application these would be carefully selected, held-out prompts.
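
For illustration only, a more realistic eval_prompts list might look something like this (the prompts below are made up for the example); you would then pass this list as eval_prompts instead of the repeated string:

# Illustrative held-out prompts, never used for training updates
eval_prompts = [
    "I was pleasantly surprised by",
    "The acting in this film",
    "I don't know much about Hungarian underground",
    "Halfway through the movie I",
] * 16  # repeat to fill the evaluation batch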

Conclusion

So there you have it: in this post I covered the logical structure of how to do RLHF (Reinforcement Learning from Human Feedback) training. I intentionally skipped many details to keep this tutorial intuitive. In the next part I am going to cover how to actually run this training, since it can be computationally expensive. I will be focusing on the real challenges of running this training in a cost-effective manner, trying to make it run on Google Colab, and sharing my learnings. So stay tuned or subscribe.
