Galileo Experiments allow you to evaluate and improve your LLM applications by running tests against datasets and measuring performance using various metrics.

Running an Experiment with a Prompt Template

The simplest way to get started is by using a prompt template:

import { createPromptTemplate, runExperiment } from "galileo";

async function runPromptTemplateExperiment() {
  const template = await createPromptTemplate({
    template: [
      { role: "system", content: "You are a great storyteller." },
      { role: "user", content: "Write a story about {{topic}}" },
    ],
    projectName: "my-project",
    name: "storyteller-prompt",
  });

  await runExperiment({
    name: "story-experiment",
    datasetName: "storyteller-dataset",
    promptTemplate: template,
    metrics: ["correctness"],
    projectName: "my-project",
  });
}

// Run the experiment
runPromptTemplateExperiment();
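
Each row of the dataset referenced by datasetName is expected to supply a value for every template variable. For the template above that means a topic column; a minimal sketch of what one row might look like (the value is purely illustrative):

// One row of "storyteller-dataset": the key matches the {{topic}} placeholder
const exampleRow = {
  topic: "a lighthouse keeper who befriends a storm",
};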

Running an Experiment with a Runner Function

For more complex scenarios, such as calling the model yourself or wrapping it in your own logic, you can use a runner function:

import { runExperiment } from "galileo";
import { OpenAI } from "openai";

async function runFunctionExperiment() {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const runner = async (input) => {
    const result = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        { role: "system", content: "You are a great storyteller." },
        { role: "user", content: `Write a story about ${input["topic"]}` },
      ],
    });
    return result;
  };

  await runExperiment({
    name: "story-function-experiment",
    datasetName: "storyteller-dataset",
    function: runner,
    metrics: ["correctness"],
    projectName: "my-project",
  });
}

// Run the experiment
runFunctionExperiment();
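
If your metrics expect plain text rather than the full completion object, you may want to return just the generated message. A minimal variant of the runner above (a sketch, assuming the runner's return value is what gets recorded as the row's output):

const textRunner = async (input) => {
  const result = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are a great storyteller." },
      { role: "user", content: `Write a story about ${input["topic"]}` },
    ],
  });
  // Return only the assistant's text so the logged output is a plain string
  return result.choices[0].message.content ?? "";
};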

Running an Experiment with a Custom Dataset

When you need to test specific scenarios, you can define the dataset directly in code instead of referencing one stored in Galileo:

import { runExperiment } from "galileo";
import { OpenAI } from "openai";

async function runCustomDatasetExperiment() {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const dataset = [{ input: "Spain", expected: "Europe" }];

  const runner = async (input) => {
    const result = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        { role: "system", content: "You are a geography expert" },
        {
          role: "user",
          content: `Which continent does the following country belong to: ${input["input"]}`,
        },
      ],
    });
    return result;
  };

  await runExperiment({
    name: "geography-experiment",
    dataset: dataset,
    function: runner,
    metrics: ["correctness"],
    projectName: "my-project",
  });
}

// Run the experiment
runCustomDatasetExperiment();
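
Because the runner receives whatever shape your dataset rows have, typing that shape explicitly lets TypeScript catch a misspelled key at compile time. A sketch, assuming the runner is passed the row's input object as in the example above (the CountryRow name is illustrative):

interface CountryRow {
  input: string; // country name, e.g. "Spain"
}

const typedRunner = async (input: CountryRow) => {
  const result = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are a geography expert" },
      {
        role: "user",
        content: `Which continent does the following country belong to: ${input.input}`,
      },
    ],
  });
  return result.choices[0].message.content ?? "";
};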

Running an Experiment with Custom Metrics

When the built-in metrics don't cover your evaluation needs, you can define custom metrics:

// Coming soon…