Prompt Perplexity
Measure and optimize prompt quality using Galileo’s Prompt Perplexity Metric to improve model performance and response generation.
Prompt Perplexity measures how predictable or familiar a prompt is to a language model, using the log probabilities provided by the model.
How it Works
Prompt Perplexity is a continuous metric ranging from 0 to ∞:
Low Perplexity: The model is highly certain about predicting the tokens in the prompt.
High Perplexity: The model finds the prompt less predictable.
This metric helps evaluate how well your prompts are tuned to your chosen model, which research has shown correlates with better response generation.
Calculation Method
Prompt Perplexity is computed from the log probabilities the model assigns to each token in the prompt, measuring how difficult the prompt is for the model to predict:
Calculate Token Probabilities
- The model processes the prompt token by token
- For each position, it computes the probability distribution over the next token
- We extract the log probability of the actual next token that appears in the prompt
Average Log Probabilities
- Sum all the log probabilities across the entire prompt
- Divide by the total number of tokens to get the average
- This gives us the average log probability per token
Apply Exponential
- Take the negative of the average log probability
- Apply the exponential function to this value
- This converts log probabilities to a more interpretable perplexity score
Final Formula
Perplexity = exp(-average(log_probabilities))
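The following is a minimal, self-contained sketch of this calculation in Python; the token log probabilities are made-up values for illustration only.

```python
import math

def prompt_perplexity(log_probs: list[float]) -> float:
    """Compute perplexity from the per-token log probabilities of a prompt."""
    avg_log_prob = sum(log_probs) / len(log_probs)  # average log probability per token
    return math.exp(-avg_log_prob)                  # Perplexity = exp(-average(log_probabilities))

# Illustrative values only: a predictable prompt versus a less predictable one.
print(prompt_perplexity([-0.2, -0.1, -0.3, -0.2]))  # ~1.22 (low perplexity)
print(prompt_perplexity([-2.5, -3.1, -1.8, -2.6]))  # ~12.2 (high perplexity)
```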
Key Properties
Understanding the mathematical properties of perplexity:
Range: Always positive, with lower values indicating better predictability
Scale: Exponential scale means small changes in log probabilities can lead to large perplexity differences
Length independence: Using the average makes the metric comparable across prompts of different lengths
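To make the exponential scale concrete (illustrative numbers only): a 0.5 improvement in the average log probability shrinks perplexity by a factor of e^0.5, roughly 1.65.

```python
import math

# Illustrative numbers only: improving the average log probability from -2.5
# to -2.0 cuts perplexity from about 12.2 to about 7.4.
print(math.exp(2.5))  # average log probability of -2.5 -> perplexity ~12.18
print(math.exp(2.0))  # average log probability of -2.0 -> perplexity ~7.39
```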
Availability
Prompt Perplexity can only be calculated with LLM integrations that provide log probabilities:
OpenAI
- Any Evaluate runs created from the Galileo Playground or with pq.run(...), using the chosen model
- Any Evaluate workflow runs using davinci-001
- Any Observe workflows using davinci-001

Azure OpenAI
- Any Evaluate runs created from the Galileo Playground or with pq.run(...), using the chosen model
- Any Evaluate workflow runs using text-davinci-003 or text-curie-001, if available in your Azure deployment
- Any Observe workflows using text-davinci-003 or text-curie-001, if available in your Azure deployment
To calculate the Prompt Perplexity metric, we require models that provide log probabilities. This typically includes older models like davinci-001, text-davinci-003, or text-curie-001.
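As an illustration of what "providing log probabilities" means in practice, here is a hedged sketch of retrieving per-token log probabilities from an OpenAI-style completions endpoint and turning them into a perplexity score. This is not how Galileo computes the metric internally; the model name is a placeholder, and support for the echo and logprobs parameters varies by model and deployment.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Translate the following sentence into French: Hello, how are you?"

# Ask the completions endpoint to echo the prompt with per-token log probabilities
# and generate no new tokens, so only the prompt itself is scored.
response = client.completions.create(
    model="davinci-002",  # placeholder: any completions model that supports echo + logprobs
    prompt=prompt,
    max_tokens=0,
    echo=True,
    logprobs=0,
)

token_logprobs = response.choices[0].logprobs.token_logprobs
scored = [lp for lp in token_logprobs if lp is not None]  # the first token has no context, so its logprob is None
perplexity = math.exp(-sum(scored) / len(scored))
print(perplexity)
```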
Understanding Perplexity
Interpreting Perplexity Scores
Lower Prompt Perplexity scores generally indicate better prompt quality:
Lower perplexity: Suggests your prompt and model are well matched, since the model can more easily predict each next token in the prompt.
Research findings: The paper “Demystifying Prompts in Language Models via Perplexity Estimation” has shown that lower perplexity values in prompts lead to better outcomes in the generated responses.
Monitoring value: Tracking perplexity can help you iteratively improve your prompts.
Optimizing Your AI System
Use Familiar Language
Phrase prompts using language patterns similar to the model’s training data to reduce perplexity.
Provide Clear Context
Include sufficient context that helps the model predict what comes next in the prompt.
Avoid Unusual Formatting
Use standard formatting and avoid unusual syntax that might confuse the model.
Test Variations
Experiment with different phrasings of the same prompt to find lower-perplexity versions, as shown in the sketch below.
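A minimal sketch of that experiment, assuming the same kind of completions endpoint with echo and logprobs support shown earlier; the model name, the score_prompt helper, and the candidate prompts are all placeholders.

```python
import math
from openai import OpenAI

client = OpenAI()

def score_prompt(prompt: str, model: str = "davinci-002") -> float:
    """Return the prompt's perplexity under a completions model that exposes log probabilities."""
    response = client.completions.create(
        model=model, prompt=prompt, max_tokens=0, echo=True, logprobs=0
    )
    log_probs = [lp for lp in response.choices[0].logprobs.token_logprobs if lp is not None]
    return math.exp(-sum(log_probs) / len(log_probs))

candidates = [
    "Summarize the article below in three bullet points.",
    "Produce a tripartite enumeration distilling the subsequent text.",
]

# Keep the phrasing the model finds most predictable (lowest perplexity).
best = min(candidates, key=score_prompt)
print(best)
```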
When optimizing for Prompt Perplexity, remember that the goal isn’t always to minimize perplexity at all costs. Sometimes a slightly higher perplexity prompt might be necessary to communicate specific or technical requirements. The key is finding the right balance for your use case.