Creating Datasets

You can create a new dataset using the create_dataset function:

from galileo.datasets import create_dataset

# Create a dataset with test data
test_data = [
    {
        "input": "Which continent is Spain in?",
        "output": "Europe",
    },
    {
        "input": "Which continent is Japan in?",
        "output": "Asia",
    },
]

dataset = create_dataset(
    name="countries",
    content=test_data
)
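
Test data often starts life in a spreadsheet rather than as hand-written dictionaries. The sketch below is one possible way to turn CSV text into the same list-of-rows shape used above. The rows_from_csv helper and the CSV column names are hypothetical and not part of the Galileo SDK; only the standard library is used.

```python
import csv
import io


def rows_from_csv(csv_text):
    """Convert CSV text with 'input' and 'output' columns into row dicts.

    Hypothetical helper (not part of the Galileo SDK).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        # Strip stray whitespace so rows compare cleanly later on
        rows.append(
            {
                "input": record["input"].strip(),
                "output": record["output"].strip(),
            }
        )
    return rows


csv_text = """input,output
Which continent is Spain in?,Europe
Which continent is Japan in?,Asia
"""

# Same shape as the hand-written test_data list above
test_data = rows_from_csv(csv_text)
```

The resulting list could then be passed as the content argument of create_dataset, as in the example above.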

Getting Existing Datasets

You can retrieve an existing dataset using the get_dataset function:

from galileo.datasets import get_dataset

# Get a dataset by name
dataset = get_dataset(
    name="countries"
)

# Get a dataset by ID
dataset = get_dataset(
    id="dataset-id"
)

# Retrieve the dataset's content and keep it for later use
content = dataset.get_content()

Adding to Existing Datasets

You can add rows to an existing dataset using the add_rows method:

from galileo.datasets import get_dataset

# Get an existing dataset
dataset = get_dataset(
    name="countries"
)

# Add new rows to the dataset
dataset.add_rows([
    {
        "input": "Which continent is Morocco in?",
        "output": "Africa",
    },
    {
        "input": "Which continent is Australia in?",
        "output": "Oceania",
    },
])
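
When a dataset grows over time, it is easy to add the same question twice. The following is a minimal sketch of filtering candidate rows against rows you already have before calling add_rows. It assumes the existing rows are available as a list of dictionaries with an "input" key; how to obtain that list from get_content() is not shown in this document, and filter_new_rows is a hypothetical helper rather than an SDK function.

```python
def filter_new_rows(existing_rows, candidate_rows):
    """Return the candidate rows whose 'input' is not already present.

    Hypothetical helper (not part of the Galileo SDK). Inputs are compared
    case-insensitively after trimming whitespace.
    """
    seen = {row["input"].strip().lower() for row in existing_rows}
    fresh = []
    for row in candidate_rows:
        key = row["input"].strip().lower()
        if key not in seen:
            # Record the key so duplicates within the candidates are skipped too
            seen.add(key)
            fresh.append(row)
    return fresh


existing_rows = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
    {"input": "Which continent is Japan in?", "output": "Asia"},
]

candidate_rows = [
    {"input": "Which continent is Morocco in?", "output": "Africa"},
    {"input": "which continent is japan in?", "output": "Asia"},
    {"input": "Which continent is Australia in?", "output": "Oceania"},
]

new_rows = filter_new_rows(existing_rows, candidate_rows)
```

Here new_rows holds only the Morocco and Australia rows, which could then be passed to dataset.add_rows.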

Listing Datasets

You can list all available datasets using the list_datasets function:

from galileo.datasets import list_datasets

# List all available datasets
datasets = list_datasets()

# List datasets with a custom limit
datasets = list_datasets(
    limit=50,
)

Deleting Datasets

You can delete a dataset using the delete_dataset function:

from galileo.datasets import delete_dataset

# Delete a dataset by name
delete_dataset(
    name="countries",
    project="my-project",
)

# Delete a dataset by ID
delete_dataset(
    id="dataset-id",
    project="my-project",
)

Working with Dataset Versions

Galileo automatically creates a new version of a dataset each time it is modified. By default, get_dataset returns the latest version, and you can check when that version was last modified:

from galileo.datasets import get_dataset

# Get the latest version by default
dataset = get_dataset(
    name="countries"
)

# Check when this version was last modified
print(dataset.modified_at)

Using Datasets in Experiments

Datasets are primarily used for running experiments to evaluate the performance of your LLM applications:

from galileo.datasets import get_dataset
from galileo.experiments import run_experiment
from galileo.prompts import get_prompt_template

# Get an existing dataset
dataset = get_dataset(
    name="countries"
)

# Get an existing prompt template
prompt_template = get_prompt_template(
    project="my-project",
    name="geography-prompt"
)

# Run an experiment with the dataset and prompt
results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    prompt_template=prompt_template,
    metrics=["correctness"],
    project="my-project",
)

Best Practices for Dataset Management

When working with datasets in Galileo, consider these tips:

  1. Start Small: Begin with a core set of representative test cases.
  2. Grow Incrementally: Add new test cases as you discover edge cases or failure modes.
  3. Use Consistent Formats: Use the same field names (for example, input and output) in every row so that datasets are easier to reuse.
  4. Include Expected Outputs: Always include expected outputs so that results can be evaluated against them.
  5. Document Your Datasets: Add descriptions and metadata to make it clear what each dataset is for.
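
Tips 3 and 4 above can be checked mechanically before any data is uploaded. The sketch below is one possible way to do that, assuming rows are dictionaries with "input" and "output" keys as in the earlier examples. The validate_rows helper is hypothetical and not part of the Galileo SDK.

```python
def validate_rows(rows, required_keys=("input", "output")):
    """Return a list of human-readable problems found in the rows.

    Hypothetical helper (not part of the Galileo SDK). An empty list
    means every row has all required keys with non-empty values.
    """
    problems = []
    for index, row in enumerate(rows):
        missing = [key for key in required_keys if key not in row]
        if missing:
            problems.append(f"row {index}: missing keys {missing}")
            continue
        # Treat whitespace-only values as empty
        empty = [key for key in required_keys if not str(row[key]).strip()]
        if empty:
            problems.append(f"row {index}: empty values for {empty}")
    return problems


rows = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
    {"input": "Which continent is Japan in?"},
    {"input": "Which continent is Morocco in?", "output": "  "},
]

problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Running a check like this before creating a dataset or adding rows catches rows without an expected output early, rather than during an experiment.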