Galileo Experiments allow you to evaluate and improve your LLM applications by running tests against datasets and measuring performance using various metrics.

Running an Experiment with a Prompt Template

The simplest way to get started is by using a prompt template. In the runExperiment options below, datasetName expects a dataset that you created through either the console or the SDK. Ensure you have saved a dataset before running the experiment!
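
If you haven't saved a dataset yet, you can create one with the SDK. The following is a minimal sketch that assumes createDataset takes the dataset rows followed by a dataset name; check the SDK reference for the exact signature and column names.

import { createDataset } from "galileo";

async function createGeographyDataset() {
  // Assumed signature: createDataset(rows, name); verify against the SDK reference.
  await createDataset(
    [
      { input: "Spain", output: "Europe" },
      { input: "Japan", output: "Asia" },
    ],
    "geography-dataset"
  );
}

createGeographyDataset();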

import { createPromptTemplate, runExperiment } from "galileo";
import { MessageRole } from "galileo/dist/types/message.types";

async function runPromptTemplateExperiment() {
  const projectName = "my-project";

  const template = await createPromptTemplate({
    template: [
      { role: MessageRole.system, content: "You are a geography expert. Respond with only the continent name." },
      { role: MessageRole.user, content: "{{input}}" },
    ],
    projectName: projectName,
    name: "geography-prompt",
  });

  await runExperiment({
    name: "geography-experiment",
    datasetName: "geography-dataset", // Make sure you have a dataset created first
    promptTemplate: template,
    promptSettings: {
      max_tokens: 256,
      model_alias: "GPT-4o",
      temperature: 0.8,
    },
    metrics: ["correctness"],
    projectName: projectName,
  });
}

// Run the experiment
runPromptTemplateExperiment();

Running an Experiment with a Runner Function

For more complex scenarios, you can use a runner function. A runner function works with either a saved dataset, as in this example, or a custom in-memory dataset, as shown in the next section.

import { runExperiment } from "galileo";
import { OpenAI } from "openai";

async function runFunctionExperiment() {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  // The runner is called once per dataset row; this example expects each row
  // to have a "topic" field.
  const runner = async (input: any) => {
    const result = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        { role: "system", content: "You are a great storyteller." },
        { role: "user", content: `Write a story about ${input["topic"]}` },
      ],
    });
    return [result.choices[0].message.content];
  };

  await runExperiment({
    name: "story-function-experiment",
    datasetName: "storyteller-dataset",
    function: runner,
    metrics: ["correctness"],
    projectName: "my-project",
  });
}

// Run the experiment
runFunctionExperiment();

Running an Experiment with a Custom Dataset

When you need to test specific scenarios, you can define the dataset inline and pass it to runExperiment directly instead of referencing a saved dataset:

import { runExperiment } from "galileo";
import { OpenAI } from "openai";

async function runCustomDatasetExperiment() {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  // Inline dataset: each row is passed to the runner as its input argument.
  const dataset = [{ input: "Spain", output: "Europe" }];

  const runner = async (input: any) => {
    const result = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        { role: "system", content: "You are a geography expert" },
        {
          role: "user",
          content: `Which continent does the following country belong to: ${input["input"]}`,
        },
      ],
    });
    return [result.choices[0].message.content];
  };

  await runExperiment({
    name: "geography-experiment",
    dataset: dataset,
    function: runner,
    metrics: ["correctness"],
    projectName: "my-project",
  });
}

// Run the experiment
runCustomDatasetExperiment();

Running an Experiment with Custom Metrics

For sophisticated evaluation needs:

// Coming soon…