RLHF and how ChatGPT works

Tram Ho

Also a topic related to ChatGPT, but this time purely about technology.

Today, I will talk about the technique that makes ChatGPT as successful as it is today, a new technique for building chatbots that can communicate as smoothly as humans: Reinforcement Learning from Human Feedback (RLHF for short). From there, everyone can get an overview of how this chatbot works (based on OpenAI’s public documentation, so rest assured, it is reputable).

One small prerequisite: everyone should read up on Reinforcement Learning before diving into this technique. Here, I only note the keywords and how they map onto RLHF for this chatbot:

  • Agent: the language model (LLM) used to generate text for the chatbot.
  • Environment: the chatbot’s conversation.
  • Policy: the “strategy” that teaches the chatbot to generate text as smoothly and as humanly as possible.
  • Action: here, the LLM performs next-token prediction.
  • Reward: the better the generated answer, the higher the reward.
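These roles can be sketched as a toy Python loop. Everything here (the tiny vocabulary, the uniformly random "policy", the length-based reward) is an illustrative stand-in, not OpenAI’s implementation:

```python
import random

# Toy illustration of the RL roles above: the "agent" is a next-token
# policy, one "episode" is one generated reply, and the reward scores
# the finished text. All names and scoring rules are hypothetical.

VOCAB = ["hello", "world", "how", "are", "you", "<eos>"]

def policy_next_token(context):
    # Agent/policy: pick the next token given the context
    # (here: uniformly at random, standing in for an LLM).
    return random.choice(VOCAB)

def reward(text_tokens):
    # Reward: a stand-in scorer -- replies that terminate with <eos>
    # score higher, mimicking "the better the answer, the higher the reward".
    score = len(text_tokens)
    if text_tokens and text_tokens[-1] == "<eos>":
        score += 10
    return score

def generate_episode(prompt, max_len=10):
    # Environment: the conversation; each action appends one token.
    tokens = []
    context = list(prompt)
    for _ in range(max_len):
        tok = policy_next_token(context)
        tokens.append(tok)
        context.append(tok)
        if tok == "<eos>":
            break
    return tokens, reward(tokens)

random.seed(0)
tokens, r = generate_episode(["hi"])
print(tokens, r)
```

An RL algorithm would then adjust the policy to make high-reward episodes more likely; the rest of the article describes where that reward actually comes from.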

Let’s start…

1. What is RLHF?

RLHF stands for Reinforcement Learning from Human Feedback . The idea behind this technique stems from the need to improve how the quality of text generated by language models is assessed.

Accordingly, before this technique came along, language models (e.g. machine translation models) mainly evaluated the quality of generated text using loss functions . Later, people used more advanced metrics such as BLEU or ROUGE , which compare the model’s output against human-written reference samples using simple calculations. However, language is inherently diverse and complex, so why not use human feedback itself to evaluate machine-generated text, and go one step further and use that same feedback to optimize the model? The idea of RLHF was born from there.


2. How RLHF Works

Ok, let’s get to the main part. This technique is difficult, because it relies on training several models at the same time and goes through many stages of implementation, but it boils down to 3 main steps:

  • Start from a pre-trained Language Model (LM).
  • Collect data and train a reward model .
  • Fine-tune the LM above with RL, using the newly trained reward model.

Now, let’s analyze it step by step:

a. Pretraining Language Models

This step basically trains an LM as usual (available data, available architectures for each task, available optimizers, available labels, and so on). Depending on the goal, an appropriate model is selected; there is no single standard. For example, for ChatGPT, OpenAI fine-tunes part of GPT-3 for the text generation task. In short, at this stage we just need to customize our model until it works well.

So what is the role of this model? Simply to generate text. Because this is a supervised learning process, the resulting model is also called the Supervised Fine-Tuned model ( SFT for short). Okay, once fine-tuning is done, put this model aside for a while; we will use it later.
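What SFT optimizes can be illustrated with the next-token cross-entropy loss on a human-written demonstration. This is a minimal sketch with made-up logits over a 4-token vocabulary, not real GPT-3 outputs:

```python
import math

# Hedged sketch: SFT minimizes the average negative log-likelihood of the
# human demonstration's tokens. "logits" below are illustrative numbers.

def log_softmax(logits):
    # Numerically stable log-softmax over one step's logits.
    m = max(logits)
    z = math.log(sum(math.exp(l - m) for l in logits)) + m
    return [l - z for l in logits]

def sft_loss(logits_per_step, target_ids):
    # Average cross-entropy of the demonstration tokens under the model.
    total = 0.0
    for logits, t in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[t]
    return total / len(target_ids)

# The demonstration is the token sequence [2, 0]; the model already
# favors the right token at each step, so the loss is small.
logits_per_step = [[0.1, 0.2, 2.0, -1.0],
                   [1.5, 0.0, 0.0, 0.0]]
print(round(sft_loss(logits_per_step, [2, 0]), 4))
```

Gradient descent on this loss is what turns the base LM into the SFT model.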

But evaluating generated text against an existing dataset is… boring, and unreliable from a human perspective. So what if humans themselves evaluated the text produced by the LM? Let’s move on to the next step, which is to build and train a reward model .

b. Collect data and train the Reward model

The creation of a Reward Model (RM) (also known as a preference model ) is seen as the starting point of further research on RLHF, and this stage is what sets RLHF apart from previous text evaluation techniques. Its goal is to provide the reward function to be optimized in the RL problem.

So how does the RM training process go? First, data must be collected using various LMs, whether pre-trained or trained from scratch. These models are used to generate a series of texts from the same prompt (roughly, an instruction for generating text). And that’s… not all. With many generated texts in hand, humans now intervene in the process by evaluating them, i.e. ranking the texts from best to worst. This ranking data is then used to train our RM (a bit confusing, I know). In other words, we use one model to create training data for another model, with the labeling done by humans.

The Reward Model training process takes place as follows. The training data for the RM includes the following components:

  • Prompt x: the instruction for the pre-trained model (mentioned above) to generate text, drawn from a larger dataset D. For example, x could be a Reddit forum post (taken from the Reddit TL;DR dataset).
  • A pair of responses y_0 and y_1 generated from that prompt, together with a human label i indicating which of the two responses is preferred.

Then the loss function for our RM will have the form:

$$\text{loss}(r_\theta) = -\mathbb{E}_{(x,\, y_0,\, y_1,\, i)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_i) - r_\theta(x, y_{1-i})\right)\right)\right]$$


where r_θ(x, y) is the scalar score the RM assigns to response y for prompt x, σ is the sigmoid function, and i marks which response in the pair the human labeler preferred.
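For a single comparison, this pairwise loss reduces to -log(sigmoid(r_chosen - r_rejected)), which can be checked numerically (scalar scores stand in for the RM here):

```python
import math

# One-sample version of the RM loss: the RM is pushed to score the
# human-preferred response higher than the rejected one.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rm_pairwise_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# Small loss when the RM already agrees with the human preference...
print(round(rm_pairwise_loss(2.0, -1.0), 4))   # -> 0.0486
# ...large loss when it prefers the rejected answer.
print(round(rm_pairwise_loss(-1.0, 2.0), 4))   # -> 3.0486
```

Note that only the score difference matters, so the RM is free to shift all its scores by a constant.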

c. Fine-tuning RL model with RM model

At this point, with the RM in hand to provide the reward, we proceed to train with RL. So here we need 2 LMs: the one from step a, and another one. This “other” LM has a special feature: it is trained with a policy optimization algorithm called Proximal Policy Optimization ( PPO for short), so let’s call this second LM the PPO model. Fine-tuning with RL proceeds as follows:

  • First, a new prompt is entered as input to the process.
  • The SFT model from step a, kept frozen, serves as the base policy π^{SFT} (also called the reference policy).
  • At the same time, the PPO model generates text from the newly entered prompt (in the first iteration, its weights are initialized from the SFT model above).
  • After the text has been generated, the (trained) RM evaluates the newly generated text to update the reward for the PPO model. The reward is built on top of the Kullback-Leibler Divergence (KL divergence) , so it has the form r = r_θ − β·r_KL. More specifically:

$$R(x, y) = r_\theta(x, y) - \beta \log\left[\frac{\pi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}\right]$$

where r_θ(x, y) is the output of the RM , and the log-ratio term is the KL penalty that keeps the RL policy π^{RL} from drifting too far away from the SFT model π^{SFT} (β controls the strength of this penalty).
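The combined reward is just arithmetic on log-probabilities, so it can be checked with a small sketch (all numbers are illustrative, not real model outputs):

```python
# Sketch of the KL-penalized PPO reward: the RM score minus
# beta * log(pi_RL(y|x) / pi_SFT(y|x)), computed from log-probs.

def ppo_reward(rm_score, logp_rl, logp_sft, beta=0.2):
    return rm_score - beta * (logp_rl - logp_sft)

# If the RL policy assigns the same probability as the SFT model,
# the penalty vanishes and the reward is just the RM score:
print(ppo_reward(1.0, -2.0, -2.0))  # -> 1.0
# If it drifts toward text the SFT model finds unlikely, the reward shrinks:
print(ppo_reward(1.0, -1.0, -4.0))
```

This is what stops the PPO model from "reward hacking" the RM with degenerate text: gibberish might score well under the RM, but it would be very unlikely under the SFT model and thus heavily penalized.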

Phew, it’s finally over, it’s a bit confusing. But, it’s also how ChatGPT works.

3. How ChatGPT Works

Basically… it is almost exactly what I wrote above. If you don’t believe it, just compare OpenAI’s published diagram with what I wrote above. Because it is based on human feedback, this chatbot is able to talk like a human. The process can be summarized in three steps:

  • First, fine-tune an LLM (here, GPT-3) to create GPT-3.5, a supervised-learning model trained on the available dataset (the SFT model).
  • Next, a series of samples is generated, and humans evaluate and rank each sample to create RM training data.
  • Finally, with the SFT model and RM in hand, optimize the policy using the PPO algorithm so that the output looks as “real” as possible.

Done, so I have shared about RLHF and how ChatGPT works. It’s a bit complicated, but it’s a good way to train a chatbot to respond like a real person. If anyone has any comments or corrections, please let me know. Cheers.


References:

  • Understanding Reinforcement Learning from Human Feedback (RLHF): Part 1
  • Illustrating Reinforcement Learning from Human Feedback (RLHF)
  • Learning to Summarize with Human Feedback
  • ChatGPT: Optimizing Language Models for Dialogue


Source: Viblo