Experiments in Galileo allow you to evaluate and compare different prompts, models, and configurations using datasets. This helps you identify the best approach for your specific use case.

Running an Experiment with a Prompt Template

The simplest way to get started is by using a prompt template.

  • If you have an existing prompt template, you can fetch it by importing and using the get_prompt_template function from galileo.prompts.
  • The get_dataset function below expects a dataset that you created through either the console or the SDK. Make sure you have saved a dataset before running the experiment; a minimal sketch for creating one with the SDK follows below.
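
If you have not saved a dataset yet, you can create one from the SDK first. This is a minimal sketch, assuming the create_dataset helper from galileo.datasets; the dataset name and rows here are illustrative:

from galileo.datasets import create_dataset

# Create and save a small dataset named "countries" (example content).
# Each row becomes one experiment input; the "input" field is what the
# prompt template and custom functions below read.
create_dataset(
	name="countries",
	content=[
		{"input": "Spain"},
		{"input": "Japan"},
	],
)

With a dataset saved, create (or fetch) the prompt template and run the experiment: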
from galileo import Message, MessageRole
from galileo.prompts import create_prompt_template, get_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset

project = "my-project"

# 1a. If the prompt template does not exist, create it:
prompt_template = create_prompt_template(
	name="geography-prompt",
	project=project,
	messages=[
		Message(role=MessageRole.system, content="You are a geography expert. Respond with only the continent name."),
		Message(role=MessageRole.user, content="{{input}}")  # the {{input}} placeholder is filled from each dataset row's "input" field
	]
)

# 1b. (OPTIONAL) If the prompt template already exists, fetch it:
# prompt_template = get_prompt_template(name="geography-prompt", project=project)  

# 2. Run the experiment and get results
results = run_experiment(
	"geography-experiment",
	dataset=get_dataset(name="countries"), # Name of a dataset you created
	prompt_template=prompt_template,
	# Optional
	prompt_settings={
		"max_tokens": 256,
		"model_alias": "GPT-4o", # Make sure you have an integration set up for the model alias you're using
		"temperature": 0.8
	},
	metrics=["correctness"],
	project=project
)

Running Experiments with Custom Functions

For more complex scenarios, you can run the experiment against a custom function that calls the model through Galileo's OpenAI wrapper. You can use either a saved dataset or a custom one.

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="countries")

# The galileo.openai wrapper logs each completion call, so every row
# processed by this function is captured in the experiment's traces.
def llm_call(input):
	return openai.chat.completions.create(
		model="gpt-4o",
		messages=[
			{"role": "system", "content": "You are a geography expert."},
			{"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"}
		],
	).choices[0].message.content

results = run_experiment(
	"geography-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project",
)

Custom Dataset Evaluation

When you need to test specific scenarios, you can pass a list of dictionaries directly instead of a saved dataset:

from galileo.experiments import run_experiment
from galileo.openai import openai

dataset = [
	{"input": "Spain"}
]

def llm_call(input):
	return openai.chat.completions.create(
		model="gpt-4",
		messages=[
			{"role": "system", "content": "You are a geography expert."},
			{"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"}
		],
	).choices[0].message.content

results = run_experiment(
	"geography-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=["correctness"],
	project="my-project"
)

Custom Metrics for Deep Analysis

For more sophisticated evaluation needs, you can define your own metric as a Python function. It receives each row's input, the model's output, and the expected value, and returns a score:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = [
	{"input": "Spain"}
]

# Or use a saved dataset instead of the inline one:
# dataset = get_dataset(name="storyteller-dataset")

def llm_call(input):
	return openai.chat.completions.create(
		model="gpt-4o",
		messages=[
			{"role": "system", "content": "You are a geography expert."},
			{"role": "user", "content": f"Which continent does the following country belong to: {input['input']}"}
		],
	).choices[0].message.content

# Custom metric: scores 1 when the model's output avoids the word "delve", 0 otherwise.
# Galileo passes each row's input, the generated output, and the expected value.
def check_for_delve(input, output, expected) -> int:
	return 1 if "delve" not in output else 0

results = run_experiment(
	"geography-experiment",
	dataset=dataset,
	function=llm_call,
	metrics=[check_for_delve],
	project="my-project"
)

Best Practices

  1. Use consistent datasets: Use the same dataset when comparing different prompts or models to ensure fair comparisons.

  2. Test multiple variations: Run experiments with different prompt variations and model settings to find the best approach (a sketch comparing several configurations follows this list).

  3. Use appropriate metrics: Choose metrics that are relevant to your specific use case.

  4. Start small: Begin with a small dataset to quickly iterate and refine your approach before scaling up.

  5. Document your experiments: Keep track of what you’re testing and why to make it easier to interpret results.
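
As an illustration of points 1 and 2, you can loop over several configurations against the same saved dataset. This is a minimal sketch reusing run_experiment and the prompt_template created in step 1a above; the experiment names and model aliases are illustrative, and each alias must match an integration you have configured:

from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

dataset = get_dataset(name="countries")

# Compare the same prompt template across several model configurations
# on one dataset so the results are directly comparable.
results = {}
for model_alias in ["GPT-4o", "GPT-4o mini"]:  # example aliases
	results[model_alias] = run_experiment(
		f"geography-experiment-{model_alias}",
		dataset=dataset,
		prompt_template=prompt_template,  # created in step 1a above
		prompt_settings={
			"max_tokens": 256,
			"model_alias": model_alias,
			"temperature": 0.8
		},
		metrics=["correctness"],
		project="my-project",
	)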