As you progress from initial testing to systematic evaluation, you’ll want to run experiments to validate your application’s performance and behavior. Here are several ways to structure your experiments, starting from the simplest approaches and moving to more sophisticated implementations.

Configure an LLM Integration

To calculate LLM-powered metrics such as correctness, you will need to configure an integration with an LLM provider. Visit the relevant API platform to obtain an API key, then add it on the integrations page in the Galileo console.
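
Your application code also needs credentials at run time: a Galileo API key so experiment results can be logged, and a provider key for the examples that call the model directly. Here is a minimal sketch, assuming you pass these as environment variables (GALILEO_API_KEY and OPENAI_API_KEY are the usual conventions; the values below are placeholders, so use whatever secret management your environment provides):

import os

# Placeholder values for illustration; replace with your real keys.
os.environ["GALILEO_API_KEY"] = "your-galileo-api-key"  # created in the Galileo console
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"    # used by the openai client in the custom-function examples below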

Working with Prompts

The simplest way to get started with experimentation is by evaluating prompts directly against datasets. This is especially valuable during the initial prompt development and refinement phase, where you want to test different prompt variations. Assuming you’ve previously created a dataset, you can use the following code to run an experiment:

from galileo import Message, MessageRole
from galileo.prompts import get_prompt_template, create_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset

project = "my-project"

prompt_template = get_prompt_template(name="geography-prompt", project=project)
# If the prompt template doesn't exist, create it
if prompt_template is None:
    prompt_template = create_prompt_template(
        name="geography-prompt",
        project=project,
        messages=[
            Message(role=MessageRole.system, content="You are a geography expert. Respond with only the continent name."),
            Message(role=MessageRole.user, content="{{input}}")  # placeholder filled from the dataset's "input" column
        ]
    )

results = run_experiment(
    "geography-experiment",
    dataset=get_dataset(name="countries"),
    prompt_template=prompt_template,
    # Optional
    prompt_settings={
        "max_tokens": 256,
        "model_alias": "GPT-4o",  # Make sure you have an integration set up for the model alias you're using
        "temperature": 0.8,
    },
    metrics=["correctness"],
    project=project,
)
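
If the countries dataset doesn't exist yet, you can create one from code before running the experiment. Here is a minimal sketch, assuming a create_dataset helper in galileo.datasets and an "input" column that the user message in the prompt template refers to (the rows below are illustrative):

from galileo.datasets import create_dataset

# A small dataset where each row's "input" is a country name.
create_dataset(
    name="countries",
    content=[
        {"input": "France"},
        {"input": "Japan"},
        {"input": "Brazil"},
    ],
)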

Running Experiments with Custom Functions

Once you’re comfortable with basic prompt testing, you might want to evaluate more complex parts of your app using your datasets. This approach is particularly useful when you have a generation function in your app that takes a set of inputs, which you can model with a dataset:

This example uses OpenAI both as the LLM being evaluated and as the LLM for generating metrics.

Galileo is model-agnostic and supports leading LLM providers, including OpenAI, Azure OpenAI, Anthropic, and LLaMA.

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="countries")

def llm_call(input):
    # The galileo.openai wrapper logs the request and response automatically.
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a geography expert."},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"},
        ],
    ).choices[0].message.content

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["correctness"],
    project="my-project",
)

Custom Dataset Evaluation

As your testing needs become more specific, you might need to work with custom or local datasets. This approach is perfect for focused testing of edge cases or when building up your test suite with specific scenarios:

from galileo.experiments import run_experiment
from galileo.openai import openai

# A local dataset: a list of rows, each with an "input" value.
dataset = [
    {
        "input": "Spain"
    }
]

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a geography expert"},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"}
        ],
    ).choices[0].message.content

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["correctness"],
    project="my-project"
)

Custom Metrics for Deep Analysis

For the most sophisticated level of testing, you might need to track specific aspects of your application’s behavior. Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement:

from galileo.experiments import run_experiment
from galileo.openai import openai

dataset = [
    {
        "input": "Spain"
    }
]

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a geography expert."},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"}
        ],
    ).choices[0].message.content

# Custom metrics receive the row's input, the function's output, and the expected value (if any),
# and return a score. This one returns 1 when the response avoids the word "delve".
def check_for_delve(input, output, expected) -> int:
    return 1 if "delve" not in output else 0

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[check_for_delve],
    project="my-project"
)

Each of these experimentation approaches fits a different stage of your development and testing workflow. As you progress from simple prompt testing to sophisticated custom metrics, Galileo's experimentation framework provides the tools you need to gather insights and improve your application's performance at every level of complexity.

Experimenting with Agentic and RAG Applications

The experimentation framework extends naturally to more complex applications like agentic AI systems and RAG (Retrieval-Augmented Generation) applications. When working with agents, you can evaluate various aspects of their behavior, from decision-making capabilities to tool usage patterns. This is particularly valuable when testing how agents handle complex workflows, multi-step reasoning, or tool selection.
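
For example, you might wrap your agent's entry point in a function and add a custom metric that checks whether it actually invoked a tool rather than answering from memory. The sketch below is illustrative: run_agent stands in for your real agent loop, and the "[tool:...]" tagging convention is an assumption made for this example, not anything Galileo requires.

from galileo.experiments import run_experiment

# Queries that should push the agent toward using a tool.
agent_dataset = [
    {"input": "What is 23 * 17?"},
    {"input": "What's the weather in Madrid right now?"},
]

def run_agent(input):
    # Placeholder for a real agent loop: decide on a tool, call it, and return the answer.
    # Here the response is simply tagged with the tool that would have been chosen.
    question = input["input"]
    tool = "calculator" if any(ch.isdigit() for ch in question) else "weather_api"
    return f"[tool:{tool}] (answer to: {question})"

def used_a_tool(input, output, expected) -> int:
    # Score 1 when the agent's response shows that a tool was invoked.
    return 1 if "[tool:" in output else 0

results = run_experiment(
    "agent-tool-usage-experiment",
    dataset=agent_dataset,
    function=run_agent,
    metrics=[used_a_tool],
    project="my-project",
)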

For RAG applications, experimentation helps validate both the retrieval and generation components of your system. You can assess the quality of retrieved context, measure response relevance, and ensure that your RAG pipeline maintains high accuracy across different types of queries. This is especially important when fine-tuning retrieval parameters or testing different reranking strategies.
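
As a concrete sketch of the same pattern, the example below wraps a toy retrieve-then-generate pipeline in a function and scores it with a custom metric that checks the answer stays grounded in the retrieved snippet. The in-memory retrieve helper, the document snippets, and the grounding check are simplified stand-ins for illustration; in practice you would query your own vector store and may prefer Galileo's built-in RAG metrics to a hand-rolled check.

from galileo.experiments import run_experiment
from galileo.openai import openai

# A toy corpus standing in for a vector store.
documents = {
    "spain": "Spain is a country in southwestern Europe, on the Iberian Peninsula.",
    "japan": "Japan is an island country in East Asia, in the Pacific Ocean.",
}

def retrieve(query):
    # Simplified retrieval: a real pipeline would query a vector store here.
    return next((doc for key, doc in documents.items() if key in query.lower()), "")

def rag_answer(input):
    context = retrieve(input["input"])
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {input['input']}"},
        ],
    ).choices[0].message.content

def grounded_in_context(input, output, expected) -> int:
    # Crude grounding check: the answer should mention the continent named in the retrieved context.
    context = retrieve(input["input"])
    region = "Europe" if "Europe" in context else "Asia" if "Asia" in context else ""
    return 1 if region and region.lower() in output.lower() else 0

results = run_experiment(
    "rag-grounding-experiment",
    dataset=[{"input": "Which continent is Spain on?"}],
    function=rag_answer,
    metrics=[grounded_in_context],
    project="my-project",
)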

The same experimentation patterns shown above apply to these more complex systems. You can use predefined datasets to benchmark performance, create custom datasets for specific edge cases, and define specialized metrics that capture the unique aspects of agent behavior or RAG performance. This systematic approach to testing helps ensure that your advanced AI applications maintain high quality and reliability in production environments.