Metrics SDK Reference
A quick lookup for integrating and interpreting metrics in your workflows.
Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. This guide is a reference for those preset metrics, including their descriptions and SDK slugs.
Metrics Reference
The table below summarizes each metric’s purpose and the exact slug to reference in the SDK.
Metric | SDK Slug(s) | Description |
---|---|---|
Action Advancement | agentic_workflow_success | Measures whether a step or span moves the user closer to their overall goal within the session. |
Action Completion | agentic_session_success | Assesses if the user’s goal was ultimately achieved at the session or trace level. |
BLEU | bleu | A case-sensitive measure of the difference between a model generation and the target generation at the sentence level. |
Chunk Attribution | chunk_attribution_utilization_gpt | Measures whether or not each chunk retrieved in a RAG pipeline had an effect on the model’s response. |
Chunk Utilization | chunk_attribution_utilization_gpt | Measures the fraction of text in each retrieved chunk that had an impact on the model’s response in a RAG pipeline. |
Completeness | completeness_gpt | Assesses whether the response covers all necessary aspects of the prompt or question. |
Context Adherence | context_adherence_gpt | Evaluates if the LLM output is consistent with and grounded in the provided context. |
Context Relevance (Query Adherence) | context_relevance | Measures whether the retrieved context has enough information to answer the user’s query. |
Correctness (factuality) | correctness | Evaluates whether the output is factually correct based on available information. |
Ground Truth Adherence | ground_truth_adherence | Measures semantic equivalence between model output and ground truth, typically using LLM-based judgment. |
Instruction Adherence | instruction_adherence | Checks if the LLM output follows the explicit instructions given in the prompt. |
Prompt Injection | prompt_injection_gpt | Measures the presence of prompt injection attacks in inputs to the LLM. |
Prompt Perplexity | prompt_perplexity | Indicates how “surprising” or difficult the prompt is for the model. |
ROUGE | rouge | Measures the unigram overlap between model generation and target generation as a single F-1 score. |
Sexism / Bias | input_sexist_gpt, output_sexist_gpt | Measures how 'sexist' an input or output is likely to be perceived, as a value between 0 and 1 (1 being more sexist). |
Tone | input_tone_gpt, output_tone_gpt | Detects the tone (e.g., polite, neutral, aggressive) of the input/output. |
Tool Errors | tool_error_rate | Flags errors that occur when an agent or LLM calls a tool (e.g., API or function call fails). |
Tool Selection Quality | tool_selection_quality | Determines if the agent/LLM selected the correct tool(s) and provided appropriate arguments. |
Toxicity | input_toxicity_gpt, output_toxicity_gpt | Measures the presence and severity of harmful, offensive, or abusive language. |
How do I use metrics in the SDK?
The run_experiment function (Python, TypeScript) takes a list of metrics as part of its arguments.
Preset metrics
Supply a list of one or more metric names to the run_experiment function, as shown below:
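The sketch below is a minimal Python example. The import path, parameter names, experiment/project names, and dataset schema are assumptions rather than an exact SDK signature, so check the Python SDK reference for the precise arguments.

```python
# Minimal sketch: passing preset metric slugs to run_experiment.
# Import path, parameter names, and the dataset schema are assumptions --
# confirm them against the Python SDK reference.
from galileo.experiments import run_experiment

def my_app_function(input: str) -> str:
    """Hypothetical application function: call your LLM and return its output."""
    return f"Echoed response for: {input}"

results = run_experiment(
    "preset-metrics-example",                 # hypothetical experiment name
    dataset=[{"input": "What is Galileo?"}],  # inline dataset (hypothetical schema)
    function=my_app_function,
    metrics=[                                 # preset slugs from the table above
        "correctness",
        "context_adherence_gpt",
        "instruction_adherence",
    ],
    project="my-project",                     # hypothetical project name
)
```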
For more information, read about running experiments with the Python or the TypeScript SDK.
Custom metrics
You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:
- Create your metric in the Galileo console (or in code). Your custom metric will return a numerical score based on its input.
- Pass the name of your new metric into the run_experiment function, as in the example below.
In this example, we reference a custom metric that was saved in the console with the name My custom metric.
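As before, this is a minimal Python sketch: the import path and parameter names are assumptions, and the metric named My custom metric must already exist in your Galileo console.

```python
# Minimal sketch: mixing a console-defined custom metric with preset slugs.
# Import path and parameter names are assumptions; see the Python SDK reference.
from galileo.experiments import run_experiment

def my_app_function(input: str) -> str:
    """Hypothetical application function: call your LLM and return its output."""
    return f"Echoed response for: {input}"

results = run_experiment(
    "custom-metric-example",                  # hypothetical experiment name
    dataset=[{"input": "What is Galileo?"}],  # inline dataset (hypothetical schema)
    function=my_app_function,
    metrics=[
        "My custom metric",  # custom metric saved in the console under this name
        "correctness",       # preset metrics can be mixed in alongside it
    ],
    project="my-project",                     # hypothetical project name
)
```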
Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
Which metrics require ground truth data?
Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared.
The following metrics require ground truth data to compute their scores, as they involve direct comparison to a reference answer, label, or fact:
- agentic_session_success (Action Completion)
- chunk_attribution_utilization_gpt (Chunk Attribution, Chunk Utilization)
- completeness_gpt (Completeness)
- correctness (Correctness (factuality))
- ground_truth_adherence (Ground Truth Adherence)
- tool_selection_quality (Tool Selection Quality)
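As a sketch, a dataset used with these metrics needs to pair each input with a reference answer. The field names below are assumptions; confirm the expected schema in the datasets documentation.

```python
# Sketch of a dataset that carries ground truth ("output") alongside each input,
# so metrics such as correctness and ground_truth_adherence have a reference
# answer to compare against. Field names are assumptions, not a confirmed schema.
dataset_with_ground_truth = [
    {
        "input": "What year was the Hubble Space Telescope launched?",
        "output": "1990",  # ground truth reference answer
    },
    {
        "input": "Who wrote Pride and Prejudice?",
        "output": "Jane Austen",
    },
]
```

You would then pass a dataset like this to run_experiment together with one or more of the slugs above.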
Are metrics LLM-agnostic?
Yes, all metrics are designed to work across any LLM integrated with Galileo.