How it’s organized
- Curated examples: You’ll find pre-populated traces that demonstrate how each metric scores different cases.
- Drill-down friendly: Open a row to compare its input/output with the metric explanation side by side.
- Designed for contrast: Use sorting and filtering to compare strong vs. weak examples for the same metric (a programmatic version of this workflow is sketched below).
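If you export the examples, the same contrast workflow is easy to reproduce in code. Below is a minimal sketch in Python, assuming a hypothetical CSV export with trace_id, metric, score, and explanation columns; the real export schema, file name, and metric names may differ.

```python
import pandas as pd

# Hypothetical export with columns: trace_id, metric, score, explanation.
# Adjust the path and column names to match your actual export.
df = pd.read_csv("preset_metric_examples.csv")

# Focus on a single metric, then contrast the extremes.
groundedness = df[df["metric"] == "groundedness"].sort_values("score")

weak = groundedness.head(5)    # lowest-scoring examples
strong = groundedness.tail(5)  # highest-scoring examples

for label, rows in [("WEAK", weak), ("STRONG", strong)]:
    print(f"--- {label} examples ---")
    for _, row in rows.iterrows():
        print(f"{row['trace_id']}: score={row['score']:.2f}")
        print(f"  explanation: {row['explanation'][:120]}")
```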
What to look for
- Score distribution: Look at the range of scores across traces to calibrate what “good” and “bad” look like for that metric.
- Explanations: Open a handful of rows and read the metric explanation carefully — it’s often the quickest way to learn the rubric the judge is applying.
- Edge cases: Pay special attention to traces that surprise you (high score when you expected low, or vice versa). These are the best starting points for refining prompts, tools, or evaluation criteria.
- Metric interplay: Some failures show up across multiple metrics. Use the examples to learn when to monitor a second metric alongside your primary one (the sketch after this list shows one quick way to check for this).
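Simple summary statistics can help with both calibration and interplay outside the UI. A minimal sketch under the same hypothetical export assumptions as above (column and metric names are illustrative):

```python
import pandas as pd

df = pd.read_csv("preset_metric_examples.csv")  # hypothetical export, as above

# Score distribution: quantiles calibrate what "good" and "bad" look like.
scores = df[df["metric"] == "groundedness"]["score"]
print(scores.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))

# Metric interplay: pivot to one column per metric, then correlate.
by_trace = df.pivot_table(index="trace_id", columns="metric", values="score")
print(by_trace.corr())
```

A strong correlation between two metrics’ scores on the same traces is a hint that alerting on only one of them could miss part of a shared failure mode.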
A quick tour
1. Pick one metric you care about: Start from the relevant metric documentation page, then jump into the corresponding examples in Preset Metric Examples.
2. Review the best and worst traces: Sort by the metric value and open a few of the highest- and lowest-scoring rows.
3. Extract reusable patterns: Keep track of 2–3 patterns that correlate with strong scores (and 2–3 that correlate with weak scores). These become concrete hypotheses you can test in your own app; a lightweight way to record them is sketched below.
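There is no prescribed format for capturing these patterns; any structure that ties a pattern to its direction and a few example traces works. One illustrative sketch (all names and trace IDs are hypothetical):

```python
from dataclasses import dataclass, field

# Illustrative note-taking structure for step 3; names are hypothetical.
@dataclass
class Pattern:
    description: str
    direction: str  # "strong" or "weak"
    example_trace_ids: list[str] = field(default_factory=list)

patterns = [
    Pattern("Answer cites the retrieved passage verbatim", "strong",
            ["trace-0012", "trace-0047"]),
    Pattern("Answer introduces entities absent from the context", "weak",
            ["trace-0003"]),
]

# Each pattern doubles as a testable hypothesis for your own app.
for p in patterns:
    print(f"[{p.direction}] {p.description} "
          f"(examples: {', '.join(p.example_trace_ids)})")
```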
Jump into metric documentation
- Response Quality metrics: Explore metrics focused on answer quality and grounding.
- Agentic AI metrics: Explore metrics for multi-step agents, tool use, and trajectories.
- Safety and Compliance metrics: Explore metrics focused on harmful content and prompt attacks.
- Text-to-SQL metrics: Explore metrics for query correctness, adherence, efficiency, and safety.
Next steps
- Learn how to enable metrics on your own Log Streams: Configure metrics
- Browse all out-of-the-box metrics: Metrics overview
- Compare metrics and decide what to monitor: Metric comparison