Uncertainty measures how much the model is deciding randomly between multiple possible ways of continuing its output; in other words, it reflects how confident the model is in its response.

Uncertainty is measured at both the token level and the response level:

  • Token-level uncertainty: Indicates how confident the model is about each individual token given the preceding tokens
  • Response-level uncertainty: Represents the maximum token-level uncertainty across all tokens in the model’s response
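
As a rough illustration of how the two levels relate, here is a minimal sketch. The helper names are hypothetical, and the simple "1 - exp(logprob)" per-token score is an assumption for the example, not Galileo's exact formula.

```python
import math

def token_uncertainty(logprob: float) -> float:
    # Illustrative per-token score: 1 - p, where p = exp(logprob).
    # A very negative log probability (unlikely token) gives a score near 1.
    return 1.0 - math.exp(logprob)

def response_uncertainty(token_logprobs: list[float]) -> float:
    # Response-level uncertainty is the maximum token-level uncertainty.
    return max(token_uncertainty(lp) for lp in token_logprobs)

# Example: three confident tokens and one low-confidence token.
print(response_uncertainty([-0.01, -0.05, -2.3, -0.02]))  # dominated by the -2.3 token
```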

Higher uncertainty scores indicate the model is less certain about its output, which often correlates with:

  • Hallucinations
  • Made-up facts or citations
  • Areas where the model is struggling with the content

Calculation Method

Uncertainty is calculated using log probabilities from the model:

1. Token Analysis: For each token in the sequence, the model calculates its confidence in predicting that token based on all preceding tokens in the context.

2. Response Aggregation: The system identifies the highest uncertainty value across all tokens in the response to determine the overall response-level uncertainty.

3. Model Integration: The calculation leverages log probabilities from OpenAI's Davinci models or Chat Completion models, available through both OpenAI and Azure platforms.
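
The sketch below ties these steps together using the OpenAI Python SDK's Chat Completions endpoint with log probabilities enabled. The model name and the "1 - exp(logprob)" per-token score are illustrative assumptions, not Galileo's exact implementation.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat model that returns log probabilities
    messages=[{"role": "user", "content": "Who wrote The Selfish Gene?"}],
    logprobs=True,
)

# Step 1 (Token Analysis): one log probability per generated token.
token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]

# Illustrative per-token uncertainty: 1 - p, where p = exp(logprob).
token_uncertainties = [1.0 - math.exp(lp) for lp in token_logprobs]

# Step 2 (Response Aggregation): take the maximum token-level uncertainty.
response_level = max(token_uncertainties)
print(f"Response-level uncertainty: {response_level:.3f}")
```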

Uncertainty can only be calculated with LLM integrations that provide log probabilities:

OpenAI

  • Any Evaluate runs created from the Galileo Playground or with pq.run(...), using the chosen model
  • Any Evaluate workflow runs using davinci-001
  • Any Observe workflows using davinci-001

Azure OpenAI

  • Any Evaluate runs created from the Galileo Playground or with pq.run(...), using the chosen model
  • Any Evaluate workflow runs using text-davinci-003 or text-curie-001, if available in your Azure deployment
  • Any Observe workflows using text-davinci-003 or text-curie-001, if available in your Azure deployment

To calculate the Uncertainty metric on Azure, the text-curie-001 or text-davinci-003 model must be available in your Azure environment so that log probabilities can be fetched. For Galileo’s Guardrail metrics that rely on GPT calls (Factuality and Groundedness), version 0613 or later of gpt-3.5-turbo is required.
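
For Azure OpenAI, a hedged sketch of fetching log probabilities from a legacy Completions deployment might look like the following. The endpoint, API version, and deployment name are placeholders for whatever is available in your own Azure environment.

```python
from openai import AzureOpenAI

# Placeholders: endpoint, key, API version, and deployment name depend on your Azure setup.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-AZURE-OPENAI-KEY",
    api_version="2023-05-15",
)

resp = client.completions.create(
    model="text-davinci-003",  # your Azure deployment name for a Davinci/Curie model
    prompt="Who wrote The Selfish Gene?",
    max_tokens=32,
    logprobs=1,  # the legacy Completions API returns per-token log probabilities
)

# Per-token log probabilities for the generated text.
print(resp.choices[0].logprobs.token_logprobs)
```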

Optimizing Your AI System

Addressing High Uncertainty

When responses show high uncertainty scores, your model is likely struggling with the content. To improve your system:

Identify uncertainty patterns: Analyze where in responses uncertainty spikes occur (a small sketch follows these recommendations).

Enhance knowledge sources: Provide better context or retrieval results for topics with high uncertainty.

Refine prompts: Add more specific instructions or constraints for areas where the model shows uncertainty.

Consider model selection: Some models may be more confident in specific domains.
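
As a simple illustration of spotting uncertainty patterns, the following sketch flags tokens whose uncertainty exceeds a chosen threshold. The token data, scores, and threshold are made up for the example.

```python
# Hypothetical helper: flag tokens whose uncertainty exceeds a threshold,
# showing where in the response the model starts to struggle.
def find_uncertainty_spikes(tokens, uncertainties, threshold=0.5):
    return [
        (i, tok, u)
        for i, (tok, u) in enumerate(zip(tokens, uncertainties))
        if u >= threshold
    ]

# Made-up example data: the model is confident except around the year.
tokens = ["The", " paper", " was", " published", " in", " 19", "87", "."]
uncertainties = [0.02, 0.05, 0.03, 0.10, 0.04, 0.81, 0.76, 0.02]

for i, tok, u in find_uncertainty_spikes(tokens, uncertainties):
    print(f"token {i} ({tok!r}): uncertainty {u:.2f}")
```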

Best Practices

Monitor Uncertainty Hotspots

Track tokens and phrases that consistently trigger high uncertainty to identify knowledge gaps.

Implement Confidence Thresholds

Set uncertainty thresholds to flag or reject responses that exceed acceptable uncertainty levels.
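
One possible shape for such a gate, with an arbitrary example threshold:

```python
# Hypothetical gate: flag responses whose response-level uncertainty
# exceeds what your application tolerates.
UNCERTAINTY_THRESHOLD = 0.6

def gate_response(text: str, response_uncertainty: float) -> dict:
    if response_uncertainty > UNCERTAINTY_THRESHOLD:
        return {"status": "flagged", "reason": "high uncertainty", "response": text}
    return {"status": "ok", "response": text}

print(gate_response("Paris is the capital of France.", 0.12))
print(gate_response("The treaty was signed in 1843 by ...", 0.83))
```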

Compare Across Models

Evaluate how different models perform on the same inputs to identify which ones have lower uncertainty in your domain.

Combine with Factual Metrics

Use Uncertainty alongside Correctness metrics to identify correlations between model confidence and factual accuracy.
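
A quick way to check that relationship on a batch of evaluated responses, using made-up metric values and hypothetical column names:

```python
import pandas as pd

# Made-up metric values for five responses.
df = pd.DataFrame({
    "uncertainty": [0.12, 0.45, 0.78, 0.09, 0.66],
    "correctness": [0.95, 0.70, 0.30, 0.98, 0.42],
})

# A strongly negative correlation suggests that uncertain responses tend to be wrong.
print(df["uncertainty"].corr(df["correctness"]))
```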

When analyzing Uncertainty, remember that some level of uncertainty is normal and even desirable in certain contexts. Very low uncertainty might indicate the model is being overly deterministic or repeating memorized patterns rather than reasoning about the content.