Experiments in Galileo allow you to evaluate and compare different prompts, models, and configurations using datasets. This helps you identify the best approach for your specific use case.
## Running Experiments with Prompt Templates
The simplest way to get started is by using prompt templates:
```python
from galileo.prompts import get_prompt_template, create_prompt_template
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.resources.models import MessageRole, Message

prompt_template = create_prompt_template(
    name="storyteller-prompt",
    project="my-project",
    messages=[
        Message(role=MessageRole.SYSTEM, content="You are a great storyteller."),
        Message(role=MessageRole.USER, content="Write a story about {{topic}}"),
    ],
)

results = run_experiment(
    "story-experiment",
    dataset=get_dataset(name="storyteller-dataset"),
    prompt_template=prompt_template,
    prompt_settings={
        "max_tokens": 256,
        "model_alias": "GPT-4o",
        "temperature": 0.8,
    },
    metrics=["correctness"],
    project="my-project",
)
```
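On later runs you typically want to reuse the stored template rather than recreate it, which is what the `get_prompt_template` import is for. A minimal sketch, assuming it accepts the same `name` and `project` arguments as `create_prompt_template` (check your SDK version for the exact signature):

```python
# Fetch the template created earlier instead of recreating it (assumed signature)
prompt_template = get_prompt_template(name="storyteller-prompt", project="my-project")
```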
## Running Experiments with Custom Functions
For more complex scenarios, you can use custom functions with the OpenAI wrapper:
```python
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
    "story-function-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["correctness"],
    project="my-project",
)
```
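Because `llm_call` is a plain Python function that receives each dataset row as its `input` argument, you can sanity-check it locally before handing it to `run_experiment`. The topic below is just an illustration:

```python
# Quick local check: pass a dict shaped like a dataset row
print(llm_call({"topic": "a lighthouse keeper who befriends a storm"}))
```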
## Custom Dataset Evaluation
When you need to test specific scenarios, you can pass a local dataset (a list of dictionaries) directly instead of a stored one:
```python
from galileo.experiments import run_experiment
from galileo import log, openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a geography expert"},
            {"role": "user", "content": f"Which continent does the following country belong to: {input['country']}"},
        ],
    ).choices[0].message.content

dataset = [
    {
        "country": "Spain",
    }
]

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["correctness"],
    project="my-project",
)
```
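A local dataset is just a list of dictionaries, so you can grow it with more rows or extra keys as your scenarios expand. The additional countries below are arbitrary examples:

```python
# Each dict becomes one experiment row; the keys are whatever your function reads from `input`
dataset = [
    {"country": "Spain"},
    {"country": "Kenya"},
    {"country": "Japan"},
]
```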
## Custom Metrics for Deep Analysis
For evaluation needs that the built-in metrics don't cover, you can pass your own metric function. It receives the row input, the model output, and the expected output, and returns a score:
```python
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

def llm_call(input):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']} and make it sound like a human wrote it."},
        ],
    ).choices[0].message.content

def check_for_delve(input, output, expected) -> int:
    # Score 1 if the model's output avoids the telltale word "delve", 0 otherwise
    return 1 if "delve" not in output else 0

dataset = get_dataset(name="storyteller-dataset")

results = run_experiment(
    "custom-metric-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[check_for_delve],
    project="my-project",
)
```
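Because custom metrics are ordinary Python callables, you can sanity-check them directly before wiring them into an experiment:

```python
# The metric takes (input, output, expected) and returns an int score
assert check_for_delve("a pirate", "Once upon a time, a pirate set sail...", None) == 1
assert check_for_delve("a pirate", "Let us delve into the tale of a pirate...", None) == 0
```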
## Best Practices
- **Use consistent datasets**: Use the same dataset when comparing different prompts or models to ensure fair comparisons.
- **Test multiple variations**: Run experiments with different prompt variations to find the best approach (see the sketch after this list).
- **Use appropriate metrics**: Choose metrics that are relevant to your specific use case.
- **Start small**: Begin with a small dataset to quickly iterate and refine your approach before scaling up.
- **Document your experiments**: Keep track of what you're testing and why to make it easier to interpret results.
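As a concrete illustration of testing multiple variations, you can sweep over prompt wordings by calling `run_experiment` once per variation against the same dataset and metrics. This sketch reuses the custom-function pattern from above; the variation names and system prompts are placeholders:

```python
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

# Hypothetical prompt variations to compare against the same dataset
variations = {
    "concise": "You are a great storyteller. Keep stories under 100 words.",
    "vivid": "You are a great storyteller. Use rich sensory detail.",
}

dataset = get_dataset(name="storyteller-dataset")

def make_llm_call(system_prompt):
    # Build an llm_call function bound to one system prompt
    def llm_call(input):
        return openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Write a story about {input['topic']}"},
            ],
        ).choices[0].message.content
    return llm_call

# One experiment per variation, all scored with the same metric
for name, system_prompt in variations.items():
    run_experiment(
        f"story-variation-{name}",
        dataset=dataset,
        function=make_llm_call(system_prompt),
        metrics=["correctness"],
        project="my-project",
    )
```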
For more detail on the building blocks used here, see:

- **Datasets** - Creating and managing datasets for experiments
- **Prompts** - Creating and using prompt templates