As you progress from initial testing to systematic evaluation, you’ll want to run experiments to validate your application’s performance and behavior. Here are several ways to structure your experiments, starting from the simplest approaches and moving to more sophisticated implementations:

Working with Prompts

The simplest way to get started with experimentation is by evaluating prompts directly against datasets. This is especially valuable during the initial prompt development and refinement phase, where you want to test different prompt variations:

from galileo.prompts import get_prompt_template, create_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.resources.models import MessageRole, Message

prompt_template = create_prompt_template(
	name="storyteller-prompt",
	project="my-project",
	messages=[
		Message(role=MessageRole.SYSTEM, content="You are a great storyteller."),
		Message(role=MessageRole.USER, content="Write a story about {{topic}}")
	]
)


results = run_experiment(
	"story-experiment",
	dataset=get_dataset(name="storyteller-dataset"),
	prompt_template=prompt_template,
	# Optional
	prompt_settings={
		"max_tokens": 256,
		"model_alias": "GPT-4o",  # Make sure you have an integration set up for the model alias you're using
		"temperature": 0.8
	},
	metrics=["correctness"],
	project="my-project"
)
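
If the template already exists from an earlier run, you can fetch it by name instead of recreating it. The sketch below assumes get_prompt_template (imported above) accepts the same name and project arguments as create_prompt_template:

# Reuse a previously created template on later runs, then pass it to run_experiment as above
prompt_template = get_prompt_template(
    name="storyteller-prompt",
    project="my-project"
)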

Running Experiments with Custom Functions

Once you’re comfortable with basic prompt testing, you might want to evaluate more complex parts of your application against your datasets. This approach is particularly useful when you have a generation function that takes a set of inputs, which you can model with a dataset:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    # Generate a story for the given dataset row using the Galileo-wrapped OpenAI client
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"}
        ],
    ).choices[0].message.content

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
	"story-function-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project",
)
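
Here, run_experiment invokes llm_call once per dataset row and passes the row to the function as input. For this example to work, each row of storyteller-dataset is assumed to expose a topic field, along the lines of:

{"topic": "a lighthouse keeper who befriends a whale"}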

Custom Dataset Evaluation

As your testing needs become more specific, you might need to work with custom or local datasets. This approach is perfect for focused testing of edge cases or when building up your test suite with specific scenarios:

from galileo.experiments import run_experiment
from galileo import log, openai

def llm_call(input):
    # Ask the model which continent the row's country belongs to
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a geography expert"},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['country']}"}
        ],
    ).choices[0].message.content

dataset = [
    {
        "country": "Spain"
    }
]

results = run_experiment(
	"geography-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project"
)

Custom Metrics for Deep Analysis

For the most sophisticated level of testing, you might need to track specific aspects of your application’s behavior. Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    # Same storyteller call, but prompt the model for prose that sounds human-written
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']} and make it sound like a human wrote it."}
        ],
    ).choices[0].message.content

def check_for_delve(input, output, expected) -> int:
    # Score 1 if the generated story avoids the word "delve" (a common LLM tell), 0 otherwise
    return 1 if "delve" not in output.lower() else 0

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
	"custom-metric-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=[check_for_delve],
	project="my-project"
)

Each of these experimentation approaches fits into different stages of your development and testing workflow. As you progress from simple prompt testing to sophisticated custom metrics, Galileo’s experimentation framework provides the tools you need to gather insights and improve your application’s performance at every level of complexity.

Experimenting with Agentic and RAG Applications

The experimentation framework extends naturally to more complex systems such as agentic AI and RAG (Retrieval-Augmented Generation) applications. When working with agents, you can evaluate various aspects of their behavior, from decision-making capabilities to tool usage patterns. This is particularly valuable when testing how agents handle complex workflows, multi-step reasoning, or tool selection.
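
For example, you can wrap your agent's entry point in a function and evaluate it against a dataset of tasks, exactly like the generation functions above. The sketch below is illustrative rather than prescriptive: the lookup_weather tool, the agent-tasks-dataset name, and the task field on each row are hypothetical stand-ins for your own agent, tools, and data:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def lookup_weather(city: str) -> str:
    # Hypothetical tool; replace with a real API call
    return f"The weather in {city} is sunny."

def agent(input):
    # Minimal single-step agent: let the model decide whether a tool is needed,
    # call it if so, then produce the final answer from the tool output
    task = input["task"]
    decision = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "If the task requires current weather, reply with only the city name. Otherwise reply NONE."},
            {"role": "user", "content": task}
        ],
    ).choices[0].message.content.strip()
    tool_output = lookup_weather(decision) if decision != "NONE" else ""
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{task}\n\nTool output: {tool_output}"}
        ],
    ).choices[0].message.content

results = run_experiment(
    "agent-experiment",
    dataset=get_dataset(name="agent-tasks-dataset"),  # hypothetical dataset of agent tasks
    function=agent,
    metrics=["correctness"],
    project="my-project"
)

Keeping the tool-selection step inside the function means the experiment exercises the full decision path, not just the final generation.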

For RAG applications, experimentation helps validate both the retrieval and generation components of your system. You can assess the quality of retrieved context, measure response relevance, and ensure that your RAG pipeline maintains high accuracy across different types of queries. This is especially important when fine-tuning retrieval parameters or testing different reranking strategies.
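
A RAG pipeline can be evaluated the same way: wrap retrieval and generation in a single function and run it against a dataset of questions. In the sketch below, retrieve_documents, the rag-questions-dataset name, and the question field are hypothetical placeholders for your own retriever and data:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def retrieve_documents(question: str) -> list[str]:
    # Hypothetical retriever; replace with your vector store or search call
    return ["Madrid is the capital of Spain.", "Spain is on the Iberian Peninsula."]

def rag_pipeline(input):
    # Retrieve context for the question, then generate an answer grounded in it
    question = input["question"]
    context = "\n".join(retrieve_documents(question))
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
    ).choices[0].message.content

results = run_experiment(
    "rag-experiment",
    dataset=get_dataset(name="rag-questions-dataset"),  # hypothetical dataset of questions
    function=rag_pipeline,
    metrics=["correctness"],
    project="my-project"
)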

The same experimentation patterns shown above apply to these more complex systems. You can use predefined datasets to benchmark performance, create custom datasets for specific edge cases, and define specialized metrics that capture the unique aspects of agent behavior or RAG performance. This systematic approach to testing helps ensure that your advanced AI applications maintain high quality and reliability in production environments.
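
As a concrete example, a function-based metric in the same style as check_for_delve above can flag RAG responses that fall back to "I don't know"-style answers, which usually points to a retrieval miss. This is a rough illustration rather than a production groundedness metric, and it assumes the metric receives the generated answer as the output string, as in the earlier example:

def answered_from_context(input, output, expected) -> int:
    # Rough heuristic: score 0 when the answer looks like a retrieval miss, 1 otherwise
    refusal_phrases = ["i don't know", "not enough information", "cannot answer"]
    return 0 if any(phrase in output.lower() for phrase in refusal_phrases) else 1

Pass it to run_experiment via metrics=[answered_from_context], exactly as in the custom-metric example above.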