Datasets are a fundamental building block in Galileo’s experimentation workflow. They provide a structured way to organize, version, and manage your test cases. Whether you’re evaluating prompts, testing application functionality, or analyzing model behavior, having well-organized datasets is crucial for systematic testing and continuous improvement.

Work with datasets

Datasets can be used in two ways:
  1. Using the Galileo Console
    • Create and manage datasets directly through the Galileo Console
    • Visually organize and track test cases
    • No coding required
  2. Using the Galileo SDK
    • Programmatically create and manage datasets using Python
    • Integrate dataset management into your existing workflows
    • Automate dataset operations
Choose the approach that best fits your workflow and team’s needs. Many users combine both approaches, using code for bulk operations and the console for visualization and quick edits. Each record in a Galileo dataset can have three top-level fields:
  1. Input - Input variables that can be passed to your application to recreate a test case.
  2. Reference Output - Reference outputs to evaluate your application. These can be the ground truth for BLEU, ROUGE, and Ground Truth Adherence metrics, or reference outputs for manual reference.
  3. Metadata - Additional data you can use to filter or group your dataset.

Create and manage datasets via the console

Create a new dataset

The dataset creation button, is your starting point for organizing test cases in Galileo’s interface. From the Datasets page of your project, click the + Create Dataset button. Dataset create dataset button You can also create a dataset from a Playground page. Click the Add Dataset button, then select + Create new dataset. Dataset creation from Playground You can choose to create a dataset by: Dataset dialog options

Dataset file uploads

An uploaded file can be in CSV, JSON/JSONC, or Feather format. The file needs to have at least one column that maps to the input values. These columns can have any name. Once you have uploaded the file, you can name the dataset, and map the columns in the file to the dataset’s input, reference output, and metadata by dragging them from the Original uploaded dataset column to the relevant dataset column. Select the Save dataset button when you are done. Mapping a column called input to the input column

Synthetic data generation

You can utilize Large Language Models (LLMs) to generate datasets that you can use to test your AI applications. These test datasets can be used before and after your app is deployed to production. This feature requires an integration with a supported LLM provider (for example, OpenAI, Azure, Mistral). To configure an integration, visit the LLM provider’s platform to obtain an API key, then add the key from the model selection dialog, or from Galileo’s integrations page. Synthetic data generation - LLM integrations To generate data, provide Input Examples for the AI model. At least one example is required, though more examples can help improve the synthetic data. Synthetic data generation - Generated Data After data generation is completed, select Save Dataset to continue working with the data (including editing, exporting, and sharing the data). You can also customize the generated data by setting:
  • The number of rows that you ask the LLM to generate.
  • The LLM model that you’re utilizing.
  • Your AI app’s use case (Optional): What task is your AI app doing? For example, chatbot to answer customer service questions.
  • Special instructions (Optional): Additional guidance to further refine the generated output.
  • The generated data types (Optional): Customize data types that the generated data should follow.
    Data types can be used for testing specific scenarios. For example, testing your app’s resilience to prompt injection scenarios where attackers try to get your app to produce harmful output.
Synthetic data generation - Customized Data Synthetically generated data can be used in many scenarios — expanding upon your existing datasets to increase test coverage and help you more quickly improve your AI applications.

Manual dataset creation

The console allows you to manually add and edit data rows. Select the Save dataset button when you are done. Dataset manual creation

Add rows to your dataset

You can manually add new rows to your dataset through the console, allowing you to capture problematic inputs or edge cases as you discover them. Adding a new row to an existing dataset After making changes to your dataset, select the Save changes button to create a new version that preserves your modifications while maintaining the history of previous versions.

View version history

The version history view allows you to track changes to your dataset over time, see when modifications were made, and access previous versions for comparison or regression testing. Dataset versions After we add a new row to the dataset, we can see the version history by clicking the Version History tab. Creating a new version of your dataset

Create and manage datasets via code

Create and grow your dataset

When building your test suite programmatically, you can create datasets using the Galileo SDK.
from galileo.datasets import create_dataset

test_data = [
    {
        "input": "Which continent is Spain in?",
        "output": "Europe",
    },
    {
        "input": "Which continent is Japan in?",
        "output": "Asia",
    },
]

dataset = create_dataset(
    name="countries",
    content=test_data
)
As you discover new test cases, you can add them to your dataset by running the following:
from galileo.datasets import get_dataset

dataset = get_dataset(
    name="countries"
)

dataset.add_rows([
    {
        "input": "Which continent is Morocco in?",
        "output": "Africa",
    },
    {
        "input": "Which continent is Australia in?",
        "output": "Oceania",
    },
])

Version management and history

One of the benefits of Galileo’s dataset management is automatic versioning. Automatic versioning allows you to track how your test suite evolves over time as well as ensures reproducibility of your experiments. You can always reference specific versions of a dataset or work with the latest version:
from galileo.datasets import get_dataset

# Get the latest version by default
dataset = get_dataset(
    name="countries"
)

# Check when this version was last modified
print(dataset.modified_at)

Access dataset variables in prompts

When you use datasets in Galileo, the attributes stored in the input in your dataset are made available to your prompts using mustache templating. This allows you to create dynamic prompts that adapt to the data in each row.

Example dataset

Suppose you have the following dataset:
test_data = [
    { "input": { "city": "Rome, Italy", "days": "5" } },
    { "input": { "city": "Paris, France", "days": "3" } },
]

Example prompt template

To reference fields from your dataset in your prompt, use double curly braces:
Prompt Template
Plan a {{ days }}-day travel itinerary for a trip to {{ city }}.
Include daily sightseeing activities, dining suggestions, and local experiences.
  • {{ city }} will be replaced with the value of the city field inside the input dictionary.
  • {{ days }} will be replaced with the value of the days field inside the input dictionary.

How it works

For each row in your dataset, Galileo will render the prompt template, replacing the variables with the corresponding values from the row. If a field is missing in a row, the variable will be empty, so ensure your dataset is consistent.

Create focus sets

When you find problems, you can create focused subsets of data:
  1. Create subsets of data that trigger specific issues
  2. Track how well your fixes work on these subsets
  3. Make sure fixes don’t cause new problems
  4. Build a library of test cases for future testing
This can be done either through the console or programmatically, depending on your workflow.

Best practices for dataset management

When working with datasets consider these tips:
  1. Start Small and Representative
  • Why: Beginning with a core set of representative test cases helps you quickly validate your workflow and catch obvious issues before scaling up.
  • How: Select a handful of diverse, meaningful examples that reflect the range of inputs your model will see.
  1. Grow Incrementally
  • Why: Add new test cases as you discover edge cases or failure modes. This ensures your dataset evolves alongside your understanding of the problem.
  • How: Whenever you encounter a new bug, edge case, or user scenario, add it to your dataset.
  1. Version Thoughtfully
  • Why: Versioning lets you track major changes, reproduce past experiments, and understand how your test suite evolves.
  • How: Create a new version when you make significant changes, and use version history to compare results over time.
  1. Document Changes
  • Why: Keeping a record of why you added certain test cases or created new versions helps future you (and your teammates) understand the reasoning behind your dataset’s evolution.
  • How: Use comments, changelogs, or dataset descriptions to note the purpose of additions or modifications.
  1. Organize by Purpose
  • Why: Separate datasets for different types of tests (e.g., basic functionality, edge cases, regression tests) make it easier to target specific goals and analyze results.
  • How: Create and name datasets according to their intended use.
  1. Choose the Right Approach
  • Why: The console is great for quick edits and visual exploration, while code is better for automation and bulk operations.
  • How: Use both as needed. Use the console for ad hoc changes, SDK for systematic or large-scale updates.
  1. Track Progress
  • Why: Monitoring how changes affect both specific issues and overall performance helps you measure improvement and catch regressions.
  • How: Use metrics, dashboards, or manual review to assess the impact of dataset and prompt changes.
  1. Keep History
  • Why: Saving problematic inputs and maintaining version history prevents regressions and helps you understand past issues.
  • How: Never delete old test cases—archive or version them instead.
  1. Keep Your Dataset Schema Consistent
  • Why: Inconsistent schemas (e.g., missing fields) can cause prompt rendering errors or unexpected results.
  • How: Ensure every row contains all fields referenced in your prompt templates.
  1. Use Nested Access for Dictionaries
  • Why: Many real-world datasets have nested structures. Dot notation (e.g., input.metadata.days) lets you access these fields cleanly in your prompt templates.
  • How: Reference nested fields using dot notation in your prompt templates.
  1. Test Your Prompt Templates
  • Why: Testing ensures that variables are replaced as expected and helps catch typos or missing fields before running large experiments.
  • How: Render your prompt with a row input to verify correct variable substitution.
  1. Document Your Prompt Templates
  • Why: Clear documentation of which fields are used in your prompt templates helps maintainers and collaborators understand dependencies between your data and prompts.
  • How: Add comments or documentation near your prompt templates explaining expected input fields.
By following these practices and using Galileo’s dataset management features, you can build a robust and maintainable test suite that grows with your application’s needs.

Summary

Galileo’s dataset management capabilities provide a foundation for systematic testing and continuous improvement of your AI applications. With two distinct paths for creating and managing datasets—through the console or programmatically—choose the approach that best fits your workflow and team’s needs. By leveraging datasets and the best practices provided, you can build a comprehensive test suite that helps you identify and address issues before they impact your users.