### Introduction

As language models continue to gain popularity, it is important to have a reliable method of measuring their performance.

Perplexity and entropy are commonly used metrics to evaluate the performance of language models, including GPT-3 and GPT-4.

### Perplexity: A Measure of Good Guessing

Perplexity is calculated as the inverse probability of a test sequence, normalized by the number of words in the sequence — in other words, the geometric mean of the inverse per-word probabilities.
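As a quick illustration, here is a minimal Python sketch of that definition. The per-word probabilities are made-up values, standing in for whatever a real model would assign:

```python
import math

# Hypothetical probabilities a model assigned to each word of a
# 4-word test sentence (illustrative values only).
probs = [0.1, 0.25, 0.5, 0.2]

n = len(probs)
# Inverse probability of the whole sequence, normalized by its
# length: the geometric mean of the inverse per-word probabilities.
perplexity = math.prod(probs) ** (-1 / n)
print(round(perplexity, 3))  # ≈ 4.472
```

A perplexity of about 4.5 means the model is, on average, about as uncertain as if it were choosing uniformly among 4–5 equally likely words at each step.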

Imagine you have a really good friend who loves to play guessing games with you. You close your eyes, and your friend says a word, and you have to guess what word they said. Sometimes, your friend says words that are easy to guess, like “cat” or “dog”. Other times, they say words that are harder to guess, like “rhinoceros” or “platypus”.

The game is more fun when you can make better guesses, right? If your friend always says easy words, the game is too easy and not as fun. But if your friend always says hard words, the game is too hard and not as fun either.

Now, imagine that we have a computer program that can play this game too. We give it a bunch of words, and it has to guess the next word in the sequence. Just like you, sometimes the program will guess right, and sometimes it will guess wrong.

Perplexity is a way to measure how good the program is at guessing the next word. It’s like a score that tells us how often the program makes good guesses. A low perplexity means the program is very good at guessing, and a high perplexity means the program isn’t very good.

### Entropy: A Measure of Difficulty

Entropy is another way to measure how hard the game is. It’s like a measure of how hard it is to guess the right answer. If your friend says easy words, the game has low entropy, but if your friend says hard words, the game has high entropy.

So, perplexity and entropy are two closely related ways of measuring the same thing: how hard the guessing game is, and how well the program plays it. When the game is easy, the entropy of the text is low and a good program can achieve low perplexity. When the game is hard, the entropy is high and the perplexity is high too.

### How to Calculate Entropy

Once we have a language model that assigns probabilities to each word, we can compute the entropy of the sequence as follows:

Iterate over each word in the sequence W.

For each word, calculate the probability of that word given the previous words in the sequence, using the language model.

Use the probability to compute the entropy of that word as follows:

H(w_i) = -log2(P(w_i | w_1, …, w_{i-1}))

where w_i is the ith word in the sequence, P(w_i | w_1, …, w_{i-1}) is the probability of the ith word given the preceding words, and log2 is the base-2 logarithm.

Average the per-word entropies over all N words in the sequence to get the entropy per word:

H(W) = (1/N) * sum(H(w_i)) for all i

The entropy of the sequence gives us a measure of how unpredictable the sequence is according to the language model. A sequence with low entropy is more predictable and has a lower level of uncertainty, while a sequence with high entropy is less predictable and has a higher level of uncertainty.

We calculate the per-word entropy H using the formula above, and then calculate the perplexity using the formula

Perplexity = 2^H.
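The procedure above can be sketched in a few lines of Python. The probabilities are again made-up values standing in for a real model's conditional predictions:

```python
import math

# Hypothetical probabilities the model assigned to each word,
# given the words before it (illustrative values only).
word_probs = [0.1, 0.25, 0.5, 0.2]

# Per-word entropy: H(w_i) = -log2(P(w_i | w_1, ..., w_{i-1}))
entropies = [-math.log2(p) for p in word_probs]

# Average entropy per word, then Perplexity = 2^H.
avg_entropy = sum(entropies) / len(entropies)
perplexity = 2 ** avg_entropy

print(round(avg_entropy, 3))  # ≈ 2.161 bits per word
print(round(perplexity, 3))   # ≈ 4.472
```

Note that this gives exactly the same number as computing the normalized inverse probability of the sequence directly, which is why the two definitions of perplexity are interchangeable.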

### Cross Entropy: A Measure of Difference

Cross entropy is a type of entropy that’s used to compare two probability distributions. To understand the difference between cross entropy and entropy, consider a jar of candy. Suppose we count how many candies of each color are in the jar. Working out that distribution of colors is like calculating the entropy of the candy jar – we’re figuring out how much uncertainty there is in the distribution of colors.

Now, let’s say we have another jar of candy that’s very similar to the first jar, but we’re not sure if it’s exactly the same. We want to compare the distribution of colors in the two jars to see if they’re similar or different. This is where cross entropy comes in – we can use cross entropy to compare the distribution of colors in one jar to the distribution of colors in the other jar. The cross entropy will be lower if the two distributions are similar, and higher if they’re different.

The equation for cross entropy looks like this:

Cross entropy = -Σ (p(x) * log2(q(x)))
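Here is a small Python sketch of that formula, using two made-up candy-color distributions. The entropy of the true distribution is included for comparison, since cross entropy can never be lower than it:

```python
import math

# Two hypothetical candy-color distributions (each sums to 1).
p = {"red": 0.5, "green": 0.3, "blue": 0.2}   # true distribution
q = {"red": 0.4, "green": 0.4, "blue": 0.2}   # our estimate of it

# Cross entropy: -sum over x of p(x) * log2(q(x))
cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)

# Entropy of p itself, for comparison.
entropy_p = -sum(p[x] * math.log2(p[x]) for x in p)

print(round(cross_entropy, 3))  # ≈ 1.522
print(round(entropy_p, 3))      # ≈ 1.485
```

The cross entropy (≈1.522 bits) is slightly higher than the entropy of the true distribution (≈1.485 bits); the gap shrinks to zero only when the estimate q exactly matches p.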

### The Candy Jar Game: An Analogy for Cross Entropy

Cross entropy is like a game where you have to guess which candies your friend likes best. Your friend has a true, hidden set of preferences, and you have your own guesses about what those preferences are. Cross entropy keeps score of how far your guessed preferences are from your friend’s actual favorites.

The equation for cross entropy is like a way to keep score in this game. The “p(x)” part of the equation represents the true probability that your friend likes a certain candy. The “q(x)” part of the equation represents the probability that you guessed for that candy. In language modeling, p(x) is the true distribution of the text and q(x) is the distribution the model predicts; the cross entropy is lowest when the model’s predictions match the truth.

### Conclusion

As language models like GPT-3 and GPT-4 become more prevalent, it is essential to have reliable methods to measure their performance. Perplexity and entropy are two common metrics used to evaluate language models.

### Key takeaways:

- Perplexity and entropy are two ways to measure the performance of language models.
- Perplexity measures how well a language model predicts the next word, while entropy measures how hard the text is to predict in the first place.
- Cross-entropy is used to compare two probability distributions and is a measure of how different they are from each other.
- Perplexity and entropy can be used together to provide a more complete picture of a language model’s performance.