Experiments in Galileo allow you to evaluate and compare different prompts, models, and configurations using datasets. This helps you identify the best approach for your specific use case.

Running Experiments with Prompt Templates

The simplest way to get started is by using prompt templates:

from galileo.prompts import get_prompt_template, create_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.resources.models import MessageRole, Message

prompt_template = create_prompt_template(
    name="storyteller-prompt",
    project="my-project",
    messages=[
        Message(role=MessageRole.SYSTEM, content="You are a great storyteller."),
        Message(role=MessageRole.USER, content="Write a story about {{topic}}"),
    ],
)

results = run_experiment(
    "story-experiment",
    dataset=get_dataset(name="storyteller-dataset"),
    prompt_template=prompt_template,
    # Optional generation settings
    prompt_settings={
        "max_tokens": 256,
        "model_alias": "GPT-4o",  # Make sure you have an integration set up for the model alias you're using
        "temperature": 0.8,
    },
    metrics=["correctness"],
    project="my-project",
)
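
If the template already exists in your project, you can fetch it with get_prompt_template (imported above) instead of recreating it. A minimal sketch, assuming it accepts the same name and project arguments as create_prompt_template:

# Reuse an existing template by name (assumed signature)
prompt_template = get_prompt_template(
    name="storyteller-prompt",
    project="my-project",
)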

Running Experiments with Custom Functions

For more complex scenarios, you can pass a custom function that calls the model through Galileo's OpenAI wrapper:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
	"story-function-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project",
)
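
Each dataset row is passed to your function as a dictionary keyed by column name (here, "topic"), so you can sanity-check the function locally before launching the experiment. The example topic below is arbitrary:

# Quick local check before running the full experiment
print(llm_call({"topic": "a lighthouse keeper"}))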

Custom Dataset Evaluation

When you need to test specific scenarios, you can pass a list of dictionaries directly as the dataset:

from galileo.experiments import run_experiment
from galileo.openai import openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a geography expert"},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['country']}"},
        ],
    ).choices[0].message.content

dataset = [
    {"country": "Spain"},
]

results = run_experiment(
	"geography-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project"
)
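
Each dictionary in the list becomes one row of the experiment, so you can cover several scenarios in a single run. The extra countries below are just illustrative:

dataset = [
    {"country": "Spain"},
    {"country": "Japan"},
    {"country": "Brazil"},
]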

Custom Metrics for Deep Analysis

When the built-in metrics don't cover your use case, you can pass your own metric functions:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']} and make it sound like a human wrote it."},
        ],
    ).choices[0].message.content

def check_for_delve(input, output, expected) -> int:
    # Score 1 if the generated story avoids the word "delve", 0 otherwise
    return 1 if "delve" not in output.lower() else 0

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
	"custom-metric-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=[check_for_delve],
	project="my-project"
)
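
Because metrics takes a list, you can define additional scorers with the same (input, output, expected) signature and pass several of them together. A sketch of an extra, hypothetical metric (not a built-in):

def story_length(input, output, expected) -> int:
    # Hypothetical example metric: number of words in the generated story
    return len(output.split())

# e.g. metrics=[check_for_delve, story_length]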

Best Practices

  1. Use consistent datasets: Use the same dataset when comparing different prompts or models to ensure fair comparisons.

  2. Test multiple variations: Run experiments with different prompt variations to find the best approach (see the sketch after this list).

  3. Use appropriate metrics: Choose metrics that are relevant to your specific use case.

  4. Start small: Begin with a small dataset to quickly iterate and refine your approach before scaling up.

  5. Document your experiments: Keep track of what you’re testing and why to make it easier to interpret results.
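
For example, here is a sketch of comparing two prompt variations against the same dataset using the APIs shown above; the variation texts and template names are illustrative:

from galileo.prompts import create_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.resources.models import MessageRole, Message

dataset = get_dataset(name="storyteller-dataset")

# Illustrative prompt variations to compare
variations = {
    "concise": "Write a short story about {{topic}}",
    "detailed": "Write a detailed, multi-paragraph story about {{topic}}",
}

for variant, user_message in variations.items():
    template = create_prompt_template(
        name=f"storyteller-prompt-{variant}",
        project="my-project",
        messages=[
            Message(role=MessageRole.SYSTEM, content="You are a great storyteller."),
            Message(role=MessageRole.USER, content=user_message),
        ],
    )
    run_experiment(
        f"story-experiment-{variant}",
        dataset=dataset,  # same dataset for a fair comparison
        prompt_template=template,
        metrics=["correctness"],
        project="my-project",
    )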

Next Steps

  • Datasets - Creating and managing datasets for experiments
  • Prompts - Creating and using prompt templates