Experiments allow you to test your model, prompts, or application against a set of metrics using pre-defined inputs. This guide explains how to run an experiment against a RAG application implemented using tool calling. The experiment measures various metrics associated with tools and RAG.

The core of this experiment is an application you can run to get a mock horoscope from a mocked RAG system, using Galileo logging to trace the LLM, tool, and RAG interactions. This example shows how you can take an existing application and run it through an experiment with a defined dataset of inputs.

In this guide you will:
  1. Set up your project with Galileo
  2. Create a basic RAG application
  3. Add logging with Galileo
  4. Run your app using an experiment

Before you start

To complete this how-to, you will need:

  * A Galileo account with an API key
  * An OpenAI API key

Install dependencies

To use Galileo, you need to install some package dependencies and configure environment variables.
1. Install Required Dependencies

Install the required dependencies for your app. If you are using Python, create a virtual environment using your preferred method, then install dependencies inside that environment:
pip install "galileo[openai]" python-dotenv
2. Create a .env file and add the following values:

# Your Galileo API key
GALILEO_API_KEY="your-galileo-api-key"

# Your Galileo project name
GALILEO_PROJECT="your-galileo-project-name"

# The name of the Log stream you want to use for logging
GALILEO_LOG_STREAM="your-galileo-log-stream"

# Provide the console URL below if you are using a custom
# deployment rather than the free tier at app.galileo.ai.
# This will look something like "console.galileo.yourcompany.com".
# GALILEO_CONSOLE_URL="your-galileo-console-url"

# OpenAI properties
OPENAI_API_KEY="your-openai-api-key"

# Optional. The base URL of your OpenAI deployment.
# Leave this commented out if you are using the default OpenAI API.
# OPENAI_BASE_URL="your-openai-base-url-here"

# Optional. Your OpenAI organization.
# OPENAI_ORGANIZATION="your-openai-organization-here"
This assumes you are using a free Galileo account. If you are using a custom deployment, you will also need to add the URL of your Galileo Console:
.env
GALILEO_CONSOLE_URL="your-galileo-console-url"
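Before moving on, you can optionally sanity-check that the variables loaded. The helper below is not part of the Galileo SDK; it is a small hypothetical check using only the standard library, with the variable names taken from the .env file above:

```python
import os

# The variables this guide's .env file defines (hypothetical helper,
# not part of the Galileo SDK)
REQUIRED_VARS = [
    "GALILEO_API_KEY",
    "GALILEO_PROJECT",
    "GALILEO_LOG_STREAM",
    "OPENAI_API_KEY",
]

def missing_vars(env):
    """Return the names of any required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run this after load_dotenv (as in app.py below) so values from .env have been loaded into os.environ.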

Create a basic RAG application

In this section you will create a basic RAG application. This will use tool calling to call a function that simulates a RAG system by returning some pre-defined documents.
1. Create a file called app.py (Python) or app.ts (TypeScript)

2. Add imports

Add the following imports to your application file:
import json
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv(override=True)
3. Add a RAG function

Add a function that simulates RAG. This function takes a star sign and returns a list of dummy horoscope documents.
# A mock RAG retriever function
def retrieve_horoscope_data(sign):
    """
    Mock function to simulate retrieving horoscope data for a given sign.
    """
    horoscopes = {
        "Aquarius": [
            "Next Tuesday you will befriend a baby otter.",
            "Next Tuesday you will find a dollar on the ground.",
        ],
        "Taurus": [
            "Next Tuesday you will find a four-leaf clover.",
            "Next Tuesday you will have a great conversation with a stranger.",
        ],
        "Gemini": [
            "Next Tuesday you will learn to juggle.",
            "Next Tuesday you will discover a new favorite book.",
        ],
    }
    return horoscopes.get(sign, ["No horoscope available."])
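One behavior worth noting: any sign missing from the dictionary falls back to a placeholder document, which matters once you run signs like Leo through the app. A standalone sketch of the same lookup, trimmed to one sign:

```python
# Standalone copy of the mock retriever, trimmed to one sign,
# to show the known-sign and fallback behavior
def retrieve_horoscope_data(sign):
    horoscopes = {
        "Aquarius": [
            "Next Tuesday you will befriend a baby otter.",
            "Next Tuesday you will find a dollar on the ground.",
        ],
    }
    return horoscopes.get(sign, ["No horoscope available."])

print(retrieve_horoscope_data("Aquarius")[0])
# Next Tuesday you will befriend a baby otter.
print(retrieve_horoscope_data("Leo"))
# ['No horoscope available.']
```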
4. Add a tool to get the horoscope

Add code for a tool that can get the horoscope. This consists of a function that uses the RAG code to get horoscope information, and a tool definition that the LLM can use.
def get_horoscope(sign):
    """
    Tool function to get a horoscope for a given astrological sign.
    """
    return "\n".join(retrieve_horoscope_data(sign))

# Define a list of callable tools for the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_horoscope",
            "description": "Get today's horoscope for an astrological sign.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sign": {
                        "type": "string",
                        "description": "An astrological sign like Taurus or Aquarius",
                    },
                },
                "required": ["sign"],
            },
        },
    },
]

# Map tool names to their implementations
available_tools = {"get_horoscope": get_horoscope}
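Later, the app will look up the model's requested tool in available_tools and call it with JSON-decoded arguments. Here is a standalone sketch of that dispatch step; the tool name and argument string are hard-coded stand-ins for values the model would return, and get_horoscope is a trimmed stub:

```python
import json

# Trimmed stub standing in for the real tool function
def get_horoscope(sign):
    return f"Horoscope for {sign}"

available_tools = {"get_horoscope": get_horoscope}

# The shape of a tool call as the model returns it: a function
# name plus a JSON-encoded argument string (hard-coded here)
tool_name = "get_horoscope"
tool_arguments = '{"sign": "Taurus"}'

# Decode the arguments and dispatch to the mapped implementation
args = json.loads(tool_arguments)
result = available_tools[tool_name](**args)
print(result)  # Horoscope for Taurus
```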
5. Add code to interact with an LLM

Add the following code to interact with OpenAI:
# Create the OpenAI client
client = OpenAI()

def call_llm(messages):
    """
    Call the LLM with the provided messages and tools.
    """
    return client.chat.completions.create(
        model="gpt-5.1",
        tools=tools,
        messages=messages,
    )
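The next step navigates the completion via response.choices[0].message. As a rough, SDK-free sketch of the attribute shape that code relies on (SimpleNamespace objects stand in for the real response types):

```python
from types import SimpleNamespace

# A stand-in for a chat completion response; the real object is
# returned by client.chat.completions.create, but the attribute
# path the app reads is the same
response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                content=None,  # no text yet; the model asked for a tool instead
                tool_calls=[
                    SimpleNamespace(
                        id="call_123",
                        function=SimpleNamespace(
                            name="get_horoscope",
                            arguments='{"sign": "Aquarius"}',
                        ),
                    )
                ],
            )
        )
    ]
)

# The same navigation the app performs
tool_calls = response.choices[0].message.tool_calls
print(tool_calls[0].function.name)  # get_horoscope
```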
6. Add a function to generate the horoscope using the LLM and tools

Add the following code to request the horoscope from OpenAI, calling tools as required:
def get_users_horoscope(sign: str) -> str:
    """
    Get the user's horoscope
    """
    # Create a running message history list we will add to over time
    message_history = [
        {
            "role": "system",
            "content": """
            You are a helpful assistant that provides horoscopes.
            Provide a flowery response based off any information retrieved.
            Include typical horoscope phrases, and characteristics of
            the sign in question.
            """,
        },
        {"role": "user", "content": f"What is my horoscope? I am {sign}."},
    ]

    # Prompt the model with tools defined
    response = call_llm(message_history)

    # Call any tools the model requested
    completion_tool_calls = response.choices[0].message.tool_calls

    if completion_tool_calls:
        # Add any tool calls to the message history
        message_history.append(
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": call.id,
                        "type": "function",
                        "function": {
                            "name": call.function.name,
                            "arguments": call.function.arguments,
                        },
                    }
                    for call in completion_tool_calls
                ],
            }
        )

        for call in completion_tool_calls:
            # Get the tool to call and its arguments
            tool_to_call = available_tools[call.function.name]
            args = json.loads(call.function.arguments)

            # Call the tool
            tool_result = tool_to_call(**args)

            # Add the tool result to the message history
            message_history.append(
                {
                    "role": "tool",
                    "content": tool_result,
                    "tool_call_id": call.id,
                    "name": call.function.name,
                }
            )

        # Now we call the model again, with the tool results included
        response = call_llm(message_history)

    # Return the final response from the model
    return response.choices[0].message.content
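You can exercise the same two-call pattern without hitting the API. The sketch below substitutes a fake LLM that requests a tool on the first call and answers on the second; fake_llm and the message shapes are illustrative stand-ins, and the sketch is simplified (it skips appending the assistant tool-call message that the real code adds):

```python
import json
from types import SimpleNamespace

def fake_llm(messages):
    """Request a tool on the first call, answer once a tool result is present."""
    if any(m.get("role") == "tool" for m in messages):
        msg = SimpleNamespace(content="Your horoscope awaits!", tool_calls=None)
    else:
        msg = SimpleNamespace(
            content=None,
            tool_calls=[
                SimpleNamespace(
                    id="call_1",
                    function=SimpleNamespace(
                        name="get_horoscope", arguments='{"sign": "Taurus"}'
                    ),
                )
            ],
        )
    return SimpleNamespace(choices=[SimpleNamespace(message=msg)])

available_tools = {"get_horoscope": lambda sign: f"Horoscope for {sign}"}

history = [{"role": "user", "content": "What is my horoscope? I am Taurus."}]
response = fake_llm(history)

# Run each requested tool and append its result to the history
for call in response.choices[0].message.tool_calls:
    args = json.loads(call.function.arguments)
    history.append(
        {
            "role": "tool",
            "content": available_tools[call.function.name](**args),
            "tool_call_id": call.id,
        }
    )

# The second call sees the tool result and produces the final answer
response = fake_llm(history)
print(response.choices[0].message.content)  # Your horoscope awaits!
```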
7. Add a main function

Add a main function to run the code:
def main():
    """
    Get the user's horoscope
    """
    response = get_users_horoscope("Aquarius")

    print(response)

if __name__ == "__main__":
    main()
8. Run your code

Run your app to ensure it is all working.
python app.py
You should see a fake horoscope in the output:
Aquarius, star-sparked visionary and airy water-bearer, your week shimmers with the quirky electricity of Uranus, your modern ruler. You thrive on fresh ideas, freedom, and the thrill of uplifting the collective—and the cosmos is conspiring to delight your inventive, humanitarian heart.

Expect a playful burst of serendipity by Tuesday. An encounter that feels like befriending a baby otter—sweet, curious, and disarmingly genuine—invites you to lead with wonder and soften any cool detachment. You may even stumble upon a tiny windfall, like a dollar on the ground, a winking reminder that value can arrive in small, unexpected packages. Take these omens as permission to say yes to spontaneity and to cherish the little joys that keep your big ideas buoyant.

In love and friendship: Share your eccentric sparkle without overexplaining. Your independence is magnetic when paired with a touch of warmth. Let your community know what you're building; allies are ready to rally.

In work and creativity: A future-forward insight wants to land. Prototype boldly, network widely, and trust your unconventional approach. If a door opens suddenly, step through—your sign excels at surfing the unexpected.

Soul care: Water rituals nourish now—walk by a river, take a long bath, or journal near a window after dusk. Give your brilliant mind a soft place to play.

Lucky glimmers: Tuesday's spontaneity, electric blue, silver accents, the numbers 1 and 11.
Aquarian mantra: “I pour possibility into the world, and the world pours wonder back to me.”

Add logging with Galileo

You now have a basic AI application with a tool that uses RAG. Let’s add logging with Galileo.
1. Add imports for Galileo

Add the following imports to the top of your application file:
from galileo import log, galileo_context
2. Create a session and trace

Change the main function to create a session and trace before generating the horoscope, then conclude the trace and flush the logger afterwards:
def main():
    """
    Get the user's horoscope
    """
    # Start a session and trace
    galileo_logger = galileo_context.get_logger_instance()
    galileo_logger.start_session("RAG with Tools Example")
    galileo_logger.start_trace(
        input="What is my horoscope? I am Aquarius.",
        name="Calling LLM with Tool"
    )

    response = get_users_horoscope("Aquarius")

    # Conclude the trace and flush
    galileo_logger.conclude(response)
    galileo_logger.flush()

    print(response)
3. Use the Galileo OpenAI wrapper to log an LLM span

Galileo has a wrapper for the OpenAI SDK that logs all LLM interactions as LLM spans. If you are using Python, update the OpenAI import to the following. If you are using TypeScript, update the creation of the OpenAI client instead.
from galileo.openai import OpenAI
4. Log the RAG call as a retriever span

Update the mock RAG function to log the call and result as a retriever span:
@log(span_type="retriever")
def retrieve_horoscope_data(sign):
    ...
5. Log the tool call as a tool span

Update the get_horoscope tool function to log the call and result as a tool span:
@log(span_type="tool")
def get_horoscope(sign):
    ...
6. Run your code

Run your app to ensure it is all working and logs to Galileo.
python app.py
View your Log stream in Galileo. You should see a session with a trace that has two LLM spans, a tool span, and a retriever span.

Run your app using an experiment

Now that your app is running, let's create an experiment file to run the app as an experiment. In this guide you'll create a new file to run manually, but in the real world you would probably implement this as a unit test using your preferred testing framework.
1. Create a file called experiment.py (Python) or experiment.ts (TypeScript)

2. Add code to run the experiment

Add the following code to the experiment file to run the get_users_horoscope function in an experiment using a dataset of star signs:
import os

from galileo import GalileoScorers
from galileo.experiments import run_experiment

from app import get_users_horoscope

def main():
    """
    Run the horoscope experiment
    """
    # Define a dataset of astrological signs to use
    # in the experiment
    dataset = [
        {"input": "Aquarius"},
        {"input": "Taurus"},
        {"input": "Gemini"},
        {"input": "Leo"},
    ]

    # Run the experiment
    results = run_experiment(
        "horoscope-experiment",
        dataset=dataset,
        function=get_users_horoscope,
        metrics=[
            GalileoScorers.tool_error_rate,
            GalileoScorers.tool_selection_quality,
            GalileoScorers.chunk_attribution_utilization,
            GalileoScorers.context_adherence,
        ],
        project=os.environ["GALILEO_PROJECT"],
    )

    # Print a link to the experiment results
    print("Experiment Results:")
    print(results["link"])

if __name__ == "__main__":
    main()
This code runs the experiment, and calculates tool error rate, tool selection quality, chunk attribution and utilization, and context adherence.

The run_experiment function runs the get_users_horoscope code in the experiment. This code has logging for the LLM call, tool call, and RAG retriever code, but does not create a trace or a session; those are created in the application's main function.

This is the correct way to write code that you want to run in an experiment. For each row in the dataset used in the experiment, a new trace is created. If you try to create other traces inside the code that the experiment runs, the experiment will raise an error. You should either create sessions and traces outside the code that is run in the experiment, or check inside your code whether you are in an experiment and, if so, skip creating a session or trace. See our documentation on running experiments against custom functions for more details.
3. Run your code

Run your experiment code to create the experiment in Galileo.
python experiment.py
The terminal output will contain a link to the experiment:
Experiment Results:
https://app.galileo.ai/project/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx/experiments/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
Follow the link to see the results of the experiment. The results page shows four rows, one for each sign in the dataset.

Next steps