2025-06-18
Luna 2 Available in Enterprise

Key New Features

Luna 2 is now available for Enterprise Customers. Luna 2 is a major upgrade that brings purpose-built intelligence to every evaluation and guardrail use case. With a redesigned architecture and rigorous RLAIF training pipelines, Luna 2 delivers:

  • Higher-quality evaluation across 8+ dimensions, including helpfulness, correctness, coherence, verbosity, maliciousness, and hallucination.
  • Granular binary and scalar scoring: Flexible outputs for both detection (binary pass/fail) and precise scoring (e.g., 1–5 scale), ready to plug into your pipelines or dashboards.
  • Context-aware comparisons: Optimized for pairwise and multi-turn comparisons, with better discernment in edge cases.
  • Consistency and reproducibility: More stable than traditional LLM-as-judge methods, with high agreement across similar prompts and contexts.

Read the Research that went into Luna 2.

2025-06-13
Timeline View, Conversation View, and Graph View for Agent Observability

Key New Features

More powerful agent observability with updates to three complementary views—Timeline, Conversation, and Graph—designed to help you debug faster, detect issues earlier, and understand agent performance from every angle.

Trace Agent Execution in Real-Time with Timeline View

Galileo’s new Timeline View lets you step through your agent’s full execution path, making it easier to pinpoint delays and spot bottlenecks at a glance. No more digging through scattered logs—see how long each tool or agent step takes and where latency builds up.

Click on any step to inspect metadata, inputs/outputs, and nested actions, giving you full visibility into what’s slowing things down.

Debug from the User’s Perspective with Conversation View

The new Conversation View recreates the exact exchange your users experienced—from inputs to outputs—side by side with system decisions. This helps you debug how your agent logic feels in practice, not just how it functions under the hood.

Use it to:

  • Spot confusing or off-track responses

  • Validate that the system matches user intent

  • Reproduce and resolve edge cases faster

Combine with Graph View for End-to-End Observability

These new views pair well with last week’s Graph View release, which transforms traditional logs into interactive, inspectable agent flows.

Use the full trio to:

  • Graph View: Visualize decision paths and tool usage
  • Timeline View: Identify performance issues and slowdowns
  • Conversation View: Understand the user experience start to finish

With these improvements, you can get a more holistic view of agent behavior.

2025-06-05
Sessions as Graph, Log Stream Insights, Playground History, and Local Metrics

Key New Features

Faster debugging, smarter issue detection, seamless experiment saving, and custom metric support for streamlined GenAI evaluation.

Visualize Sessions with Graph View

Galileo’s new Graph View replaces traditional tree-based log visualization, enabling you to analyze complex sessions quickly. Instead of digging through a deeply nested tree with hundreds of logs, you can now explore each trace as an interactive graph.

Click any node to inspect inputs, outputs, metrics, and intermediate actions, making it easier to identify bottlenecks, trace failures, and debug long-running workflows.

Detect Issues Automatically with Log Stream Insights (Beta)

Galileo’s Log Stream Insights automatically scans your logs to surface common failure patterns and recurring issues, saving you hours of manual review. For each surfaced issue, users receive:

  • Descriptions of the detected pattern
  • Concrete examples across traces
  • Suggested remediation strategies
  • Frequency trends over time

This helps teams reduce MTTD (mean time to detect) and rapidly address performance regressions.

Preserve Work and Experiment Freely with Playground Saving & History

Galileo now automatically saves your Playground session state, so you never lose work in progress. You can:

  • Resume where you left off without manual saves
  • Save multiple sessions to explore variations in prompts and workflows
  • Access run history and log experiments for repeatability

This feature enables your team to iterate faster and collaborate more effectively within a single project environment.

Evaluate with Your Own Metrics Using Local Scorers

With Local Custom Metrics, you can now define and compute custom evaluation metrics locally using your existing Python workflows and evaluation logic. These metrics can be uploaded directly into your Galileo experiments for side-by-side comparison with built-in metrics.

This gives you complete control over your evaluation criteria while centralizing metric tracking inside Galileo experiments. Use it to:

  • Seamlessly integrate with local libraries and tools
  • Rapidly iterate on evaluation logic
  • Gain full metric visibility within your evaluations
  • Compare experiments at a glance to determine the best results
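
For illustration, here is a minimal sketch of a local scorer passed into an experiment. The import path, the run_experiment parameters, and the scorer signature are assumptions rather than the documented API, so check the SDK reference for the exact interface.

```python
# Hypothetical sketch of a Local Custom Metric; names and signatures are
# assumptions, not the documented Galileo API.
from galileo.experiments import run_experiment  # assumed import path


def my_app(input: str) -> str:
    """Placeholder for the application or prompt being evaluated."""
    return f"Echo: {input}"


def conciseness(*, input: str, output: str, **kwargs) -> float:
    """Hypothetical local scorer: rewards shorter answers on a 0-1 scale."""
    return min(1.0, 200 / max(len(output), 1))


run_experiment(
    "conciseness-check",        # experiment name (placeholder)
    project="my-project",       # assumed parameter name
    dataset="state-capitals",   # placeholder dataset name
    function=my_app,            # the workflow under evaluation
    metrics=[conciseness],      # local scorer runs in your own Python process
)
```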

2025-05-13
Sessions, CLHF, and Playground improvements

Key New Features

Sessions

The free version of Galileo now supports Sessions. Sessions provide users with a coherent view of multi-turn interactions, and the traces from each turn of a conversation can be viewed under the session.

To create a session, developers can use the Galileo Logger’s start_session method in Python or the startSession method in TypeScript.

Here is a multi-turn conversation about state capitals of the US:
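
Below is a rough Python sketch of how such a conversation could be logged under a session. The GalileoLogger import path and every method other than start_session are assumptions based on the current SDK, so check the docs for the exact signatures.

```python
# Hedged sketch: import path and methods other than start_session are assumptions.
from galileo import GalileoLogger


def my_llm(question: str) -> str:
    """Placeholder for your own model call."""
    return f"(model answer to: {question})"


logger = GalileoLogger(project="my-project", log_stream="my-log-stream")

# Group every turn of the conversation under a single session.
logger.start_session(name="state-capitals-chat")

for question in ["What is the capital of California?", "What about Texas?"]:
    # Each turn is logged as its own trace within the session.
    logger.start_trace(input=question)
    answer = my_llm(question)
    logger.add_llm_span(input=question, output=answer, model="gpt-4o")
    logger.conclude(output=answer)

logger.flush()  # send the session and its traces to Galileo
```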

Adapting LLM metrics with CLHF

The free Galileo offering now supports Continuous Learning for Human Feedback (CLHF), which helps users easily adapt LLM metrics for their app by providing human feedback. As you start using Galileo Preset LLM-powered metrics (e.g., Context Adherence or Instruction Adherence), or create your own LLM-powered metrics, you might not always agree with the results; CLHF helps you close that gap.

As you identify mistakes in your metrics, you can provide feedback to auto-improve them. Your feedback gets translated (by LLMs) into few-shot examples that are appended to the metric’s prompt.

This process has been shown to increase metric accuracy by 20–30%.
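
Purely as a conceptual illustration (not Galileo’s implementation, and with the LLM translation step omitted), a piece of feedback turning into a few-shot example appended to a metric’s prompt might look like this:

```python
# Conceptual sketch only: shows feedback becoming a few-shot example appended to
# a metric's prompt. Galileo performs this translation with LLMs behind the scenes.
METRIC_PROMPT = (
    "You are judging Context Adherence. Given the context and the response, "
    "answer 'adherent' or 'not adherent'.\n"
)


def append_feedback_as_few_shot(prompt: str, context: str, response: str,
                                corrected_label: str, explanation: str) -> str:
    """Turn one piece of human feedback into a few-shot example on the prompt."""
    example = (
        "\nExample:\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        f"Correct judgment: {corrected_label}\n"
        f"Reasoning: {explanation}\n"
    )
    return prompt + example


updated_prompt = append_feedback_as_few_shot(
    METRIC_PROMPT,
    context="The capital of Texas is Austin.",
    response="The capital of Texas is Houston.",
    corrected_label="not adherent",
    explanation="The response contradicts the provided context.",
)
```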

Playground improvements

The Playground now has an updated layout and shows a preview of the input prompt that will be run when your prompt template uses variable slots, whether those variables are entered manually or pulled from a dataset.

2025-05-02
Metrics on experiments UI, public APIs, and more

Key New Features

Metrics on experiments UI

You can now compute additional metrics for logged experiments directly within the experiments UI. Previously, there was no way to compute more metrics for already-logged experiments from the UI or SDK.

Public APIs

Released public APIs that allow developers to manage log streams, experiments, and trace data programmatically. While these can already be managed through the TypeScript and Python SDKs, the public APIs let users interact with these components in any language. Sample use cases include logging data from a production AI app, running experiments, and retrieving evaluation results.

Aggregate metrics and ranking criteria for Experiments

Aggregate metrics and ranking criteria have been added to the All Experiments page. Aggregate metrics compile the metric values from the individual traces in an experiment to show a combined value for each metric, so you can quickly assess the performance of the underlying traces. Ranking criteria let you determine which experiments were most successful by specifying a weighted average of the underlying metrics for each experiment.

Reference output and metadata availability

The reference output and metadata from datasets are now available in the corresponding experiment traces so they can be easily referenced.

Datasets and Playground

Enhanced playground inputs

Playground inputs now show the complete dataset input rather than only the variables, so you can more flexibly define variable inputs.

Flatten to text in dataset upload

When uploading datasets from a CSV or JSON file, a column’s contents are automatically flattened to text instead of being stored as JSON when only one file column is mapped to an input, output, or dataset column.

New Model in Playground and Metrics

Added support for the new GPT-4.1 model in the Playground and in metrics.

SDK

G2.0 TypeScript SDK improvements

Exported types are now available at the top level (galileo/types), and a method was added to access the singleton logger.

General Usability

Performance optimization

Resolved performance issues causing occasional UI slowdowns, ensuring smoother and faster navigation.

Extended session durations

Session durations have been extended to reduce repetitive Google sign-ins, improving user convenience.

Support chat icon control

You now have the option to show or hide the support chat icon, customizing your interface according to your preferences. Previously, the support chat icon would overlap and cover key user interface elements. This change makes it easier to access the full user interface without the chat icon getting in the way.