Datasets

Creating Datasets

You can create a new dataset using the create_dataset function:

from galileo.datasets import create_dataset

# Create a dataset with test data
test_data = [
    {
        "input": "Which continent is Spain in?",
        "output": "Europe",
    },
    {
        "input": "Which continent is Japan in?",
        "output": "Asia",
    },
]

dataset = create_dataset(
    name="countries",
    content=test_data
)
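When test cases already live in a spreadsheet export, the rows can be built in the same shape as test_data before being passed to create_dataset. The following is a minimal sketch using only the Python standard library; the helper name rows_from_csv and the CSV column names are assumptions made for illustration and are not part of the Galileo SDK.

```python
import csv
import io


def rows_from_csv(csv_text, input_column="input", output_column="output"):
    """Convert CSV text into a list of {"input": ..., "output": ...} rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        rows.append(
            {
                "input": record[input_column].strip(),
                "output": record[output_column].strip(),
            }
        )
    return rows


# Hypothetical CSV content; in practice this would be read from a file.
csv_text = """input,output
Which continent is Spain in?,Europe
Which continent is Japan in?,Asia
"""

test_data = rows_from_csv(csv_text)
```

The resulting test_data list has the same structure as the hand-written list above, so it could be supplied as the content argument in the same way.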

Getting Existing Datasets

You can retrieve an existing dataset using the get_dataset function:

from galileo.datasets import get_dataset

# Get a dataset by name
dataset = get_dataset(
    name="countries"
)

# Get a dataset by ID
dataset = get_dataset(
    id="dataset-id"
)

# Get its content
dataset.get_content()

Adding to Existing Datasets

You can add rows to an existing dataset using the add_rows method:

from galileo.datasets import get_dataset

# Get an existing dataset
dataset = get_dataset(
    name="countries"
)

# Add new rows to the dataset
dataset.add_rows([
    {
        "input": "Which continent is Morocco in?",
        "output": "Africa",
    },
    {
        "input": "Which continent is Australia in?",
        "output": "Oceania",
    },
])
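As a dataset grows, the same test case can easily be added twice. One way to guard against that is to filter candidate rows before calling add_rows. The sketch below works on plain lists of dictionaries and makes no assumption about what dataset.get_content() returns; the helper name new_rows_only is hypothetical, and matching on the "input" key is one possible choice of duplicate rule.

```python
def new_rows_only(existing_rows, candidate_rows, key="input"):
    """Return candidate rows whose `key` value is not already present."""
    seen = {row[key] for row in existing_rows}
    unique = []
    for row in candidate_rows:
        if row[key] not in seen:
            unique.append(row)
            seen.add(row[key])  # also drops repeats within candidate_rows
    return unique


# Rows assumed to be in the dataset already (illustrative data).
existing = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
]

# Rows we would like to add, including duplicates.
candidates = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
    {"input": "Which continent is Morocco in?", "output": "Africa"},
    {"input": "Which continent is Morocco in?", "output": "Africa"},
]

rows_to_add = new_rows_only(existing, candidates)
```

Here rows_to_add contains only the Morocco row once, and that list is what would be handed to add_rows.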

Listing Datasets

You can list all available datasets using the list_datasets function:

from galileo.datasets import list_datasets

# List all available datasets
datasets = list_datasets()

# List datasets with a custom limit
datasets = list_datasets(
    limit=50,
)

Deleting Datasets

You can delete a dataset using the delete_dataset function:

from galileo.datasets import delete_dataset

# Delete a dataset by name
delete_dataset(
    name="countries",
    project="my-project",
)

# Delete a dataset by ID
delete_dataset(
    id="dataset-id",
    project="my-project",
)

Working with Dataset Versions

Galileo automatically creates a new version of a dataset each time it is modified. By default, get_dataset returns the latest version, and you can check when that version was last modified:

from galileo.datasets import get_dataset

# Get the latest version by default
dataset = get_dataset(
    name="countries"
)

# Check when this version was last modified
print(dataset.modified_at)

Using Datasets in Experiments

Datasets are primarily used for running experiments to evaluate the performance of your LLM applications:

from galileo.datasets import get_dataset
from galileo.experiments import run_experiment
from galileo.prompts import get_prompt_template

# Get an existing dataset
dataset = get_dataset(
    name="countries"
)

# Get an existing prompt template
prompt_template = get_prompt_template(
    project="my-project",
    name="geography-prompt"
)

# Run an experiment with the dataset and prompt
results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    prompt_template=prompt_template,
    metrics=["correctness"],
    project="my-project",
)

Best Practices for Dataset Management

When working with datasets in Galileo, consider these tips:

Start Small: Begin with a core set of representative test cases
Grow Incrementally: Add new test cases as you discover edge cases or failure modes
Use Consistent Formats: Maintain a consistent format for your datasets to make them easier to use
Include Expected Outputs: Always include expected outputs for evaluation
Document Your Datasets: Add descriptions and metadata to make it clear what each dataset is for
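The tips on consistent formats and expected outputs can be checked mechanically before rows are uploaded. The following is a minimal sketch, assuming rows use only the "input" and "output" keys shown in the examples on this page; the helper name find_row_problems is hypothetical and not part of the Galileo SDK, and datasets with additional fields would need a different set of required keys.

```python
def find_row_problems(rows, required_keys=("input", "output")):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for index, row in enumerate(rows):
        # Every required key must hold a non-empty string.
        for key in required_keys:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {index}: missing or empty '{key}'")
        # Flag keys outside the agreed format (an assumption of this sketch).
        extra = sorted(set(row) - set(required_keys))
        if extra:
            problems.append(f"row {index}: unexpected keys {extra}")
    return problems


rows = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
    {"input": "Which continent is Japan in?", "output": ""},
    {"input": "Which continent is Peru in?", "answer": "South America"},
]

problems = find_row_problems(rows)
for problem in problems:
    print(problem)
```

An empty list from find_row_problems means every row has the expected shape; anything else is worth fixing before the rows are added to a dataset.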
