Day 1 of Kaggle Gen AI 5-Day Sprint – Foundational LLMs

Back in April 2025, I joined Kaggle and Google’s 5-Day Gen AI Intensive Course, a live, hands-on event designed to break down the key technologies and techniques behind Generative AI. I’ve been meaning to reflect on the experience, and I’m finally writing it all up in a series of posts, one for each day.

Here’s a quick overview of what the course covered:

  • Day 1: Foundations of Large Language Models (LLMs) & Prompt Engineering
  • Day 2: Embeddings & Vector Databases
  • Day 3: Generative AI Agents
  • Day 4: Domain-Specific LLMs (like SecLM and Med-PaLM)
  • Day 5: MLOps for Generative AI

Day 1 focused on the foundations of LLMs, guided by two white papers: Foundational Large Language Models & Text Generation and Prompt Engineering. I’m diving into everything I picked up on Day 1 – my notes, key takeaways, and a few reflections of my own along the way.



Core Components of LLMs: From Embeddings to Transformers

According to the whitepaper on foundational LLMs, every LLM relies on a robust pipeline:

  • Tokenization: Text is broken into smaller units (tokens) using methods like byte-pair encoding or unigram tokenization.
  • Embedding: Each token is mapped to a high-dimensional vector to represent semantic meaning.
  • Positional Encoding: Adds information about the position of each token, enabling the model to interpret sequence order.
  • Self-Attention & Multi-Head Attention: Lets the model weigh the relevance of every other token when encoding each position. Multi-head attention runs several attention heads in parallel, enriching the model’s ability to capture nuanced relationships.
  • Feedforward Networks: Add non-linearity and complexity to each position’s representation.
  • Layer Normalization & Residual Connections: Improve gradient flow and model stability during training.

This entire system is encapsulated in the Transformer architecture, which replaced RNNs due to its parallelizability and superior handling of long-range dependencies.
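
To make the attention step more concrete, here’s a minimal NumPy sketch of single-head scaled dot-product self-attention. Random matrices stand in for the learned query/key/value projections; this illustrates the mechanism only, not any particular model’s implementation:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k=16, seed=0):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    rng = np.random.default_rng(seed)
    d_model = x.shape[-1]
    # Random matrices stand in for the learned query/key/value projections.
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_k)        # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted mix of the value vectors

# A toy "embedded" sequence: 5 tokens, embedding dimension 32.
tokens = np.random.default_rng(1).normal(size=(5, 32))
print(self_attention(tokens).shape)   # (5, 16)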


Evolution of GPT and Other Notable LLMs

The whitepaper offers a concise timeline of LLM evolution:

  • GPT-1: Introduced unsupervised pre-training with decoder-only transformers.
  • GPT-2: Scaled up to 1.5B parameters; demonstrated zero-shot learning.
  • GPT-3: 175B parameters; excelled in few-shot tasks.
  • GPT-3.5 and GPT-4: Enhanced dialogue capabilities and introduced multimodal support.

Also covered are:

  • PaLM & PaLM 2: Optimized for multilingual and reasoning tasks.
  • Chinchilla: Showed that, for a fixed compute budget, a smaller model trained on more data can outperform a larger but under-trained one (a quick back-of-the-envelope calculation follows this list).
  • Gemini: Google’s flagship model, pushing boundaries in reasoning and multimodality.
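
To put a rough number on the Chinchilla finding: the paper’s compute-optimal recipe works out to roughly 20 training tokens per parameter (Chinchilla itself is a 70B-parameter model trained on about 1.4T tokens). Treating that 20:1 ratio as a rule of thumb rather than an exact law:

TOKENS_PER_PARAM = 20  # Chinchilla-style rule of thumb, not an exact law

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training-token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

for params in (1e9, 7e9, 70e9):
    print(f"{params / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(params) / 1e12:.2f}T tokens")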

Fine-Tuning Strategies

Training doesn’t stop at pretraining. The whitepaper outlines several refinement methods:

  • Supervised Fine-Tuning: Uses labeled data to specialize the model for specific tasks.
  • RLHF (Reinforcement Learning from Human Feedback): Uses human preference data to reward desirable outputs.
  • PEFT (Parameter-Efficient Fine-Tuning): Tunes only selected layers (e.g., adapters or LoRA modules) for efficiency.
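
To illustrate the LoRA idea from the PEFT bullet, here’s a minimal NumPy sketch: the pretrained weight matrix W stays frozen and only the two small matrices A and B are trained, so the effective weight is W + B·A. This is just the core idea, not any library’s actual API:

import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 512, 512, 8            # rank is much smaller than d_in and d_out
W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, rank))                # starts at zero, so the initial update is zero

def lora_forward(x):
    """Forward pass with the low-rank update applied: (W + B @ A) @ x."""
    return W @ x + B @ (A @ x)

print("Trainable params:", A.size + B.size, "vs full fine-tuning:", W.size)
print(lora_forward(rng.normal(size=d_in)).shape)   # (512,)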

Prompt Engineering: Techniques & Tactics

Prompting is now a primary interface for interacting with LLMs. The prompt engineering whitepaper categorizes key techniques:

  • Zero-shot, One-shot, and Few-shot: Varying levels of example-based learning.
  • Chain-of-Thought (CoT) and Tree-of-Thought (ToT): Encourage step-by-step reasoning.
  • System / Role / Contextual Prompts: Help the model adopt a persona or understand its task.

Sampling Parameters:

  • Temperature: Controls randomness (0 = deterministic).
  • Top-k / Top-p: Filter candidate tokens to manage diversity.
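
As a toy illustration of how these knobs interact, here’s a sketch of temperature, top-k, and top-p (nucleus) sampling over a pretend 4-token vocabulary; real decoders implement this inside the model’s sampling loop:

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Toy temperature / top-k / top-p (nucleus) sampling over next-token logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    if temperature == 0:                       # greedy: always take the most likely token
        return int(np.argmax(logits))

    probs = np.exp(logits / temperature)
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids, most likely first
    if top_k is not None:
        order = order[:top_k]                  # keep only the k most likely tokens
    if top_p is not None:
        cum = np.cumsum(probs[order])
        order = order[: np.searchsorted(cum, top_p) + 1]   # smallest set covering top_p mass

    kept = probs[order] / probs[order].sum()   # renormalise over the kept tokens
    return int(rng.choice(order, p=kept))

logits = [2.0, 1.0, 0.5, -1.0]                 # pretend vocabulary of 4 tokens
print(sample_next_token(logits, temperature=0))               # always token 0
print(sample_next_token(logits, temperature=1.0, top_p=0.95))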

Best practices include being specific about the format and intent of your prompt, and iterating on prompt designs with an evaluation framework.


Making Inference Efficient

As LLMs grow in size, inference becomes a bottleneck. Several techniques aim to reduce cost and latency:

  • Quantization: Uses lower precision (e.g., int8) to reduce model size.
  • Distillation: Trains smaller student models to replicate larger ones.
  • Speculative Decoding, Flash Attention, and Prefix Caching: Optimize attention computation during inference.
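
As a toy illustration of the quantization idea, here’s symmetric per-tensor int8 quantization of a weight array (real frameworks use more sophisticated schemes such as per-channel scales or quantization-aware training):

import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store int8 values plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)

print("bytes fp32:", w.nbytes, "-> int8:", q.nbytes)                  # 4x smaller
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))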

Template: Documenting Prompts

Prompting is iterative, and tracking versions is essential. The prompt engineering whitepaper recommends documenting each attempt in a format like this:

Name: [Prompt name/version]
Goal: [What you're trying to achieve]
Model: [Model used]
Temperature: [0–1]
TokenLimit: [Integer]
Top-k: [Integer]
Top-p: [Float]
Prompt: [The prompt itself]
Output: [Expected output(s)]

This helps maintain clarity and reproducibility.
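
For example, the low-temperature colour prompt used later in this post could be logged like this (a filled-in illustration of my own, not taken from the whitepapers):

Name: random-colour-v1
Goal: Check how deterministic the model is at temperature 0
Model: gemini-2.0-flash
Temperature: 0.0
TokenLimit: (not set)
Top-k: (default)
Top-p: (default)
Prompt: Pick a random colour... (respond in a single word)
Output: Azure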


Real-World Applications

LLMs are already powering:

  • Code generation
  • Legal document review
  • Chatbots and customer service
  • Multimodal content creation
  • Sentiment analysis

Their versatility spans nearly every text-based domain, making them indispensable tools for digital transformation.


Final Thoughts

Day 1 has been a powerful start. These whitepapers deepened my conceptual understanding of LLM internals—from architecture and training to inference and prompt design. Tomorrow, I’ll continue building on this foundation.

If you’re also doing the sprint, drop me a comment or DM—let’s connect and learn together!


🧪 Try It Yourself: Practice Code from Day 1

I want to go through a few examples from the original Day 1 – Prompting notebook.

Calling Gemini with Google API

from google import genai

# GOOGLE_API_KEY holds your Gemini API key (e.g. loaded from an environment variable or a Kaggle secret).
client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Explain AI to me like I'm a kid.")

print(response.text)

Sample partial output:

Imagine you have a really, REALLY smart robot friend. That robot friend is like AI!

Normally, robots just do what you tell them, like turning on the lights. But AI is different! AI is like teaching your robot friend how to learn and think for itself.

Here's how:

* **Learning from Examples:** Imagine you show your robot friend lots and lots of pictures of cats. You tell it, "This is a cat, this is a cat, this is a cat!" Eventually, the robot friend can learn what a cat looks like, even if it sees a new kind of cat it's never seen before!
...

Starting a Chat

chat = client.chats.create(model='gemini-2.0-flash', history=[])
response = chat.send_message('Hello! My name is Zlork.')
print(response.text)

Response:

Greetings, Zlork! It's nice to meet you. I'm glad to know your name. How can I help you today?

Keeping the same chat, let’s follow up:

response = chat.send_message('Can you tell me something interesting about dinosaurs?')
print(response.text)

Response:

Okay, here's something interesting about dinosaurs that might surprise you:

**Many dinosaurs likely had feathers, not just scales!**

While we often picture dinosaurs as scaly reptiles, evidence from fossil discoveries, especially in China, shows that many dinosaurs, including some theropods (the group that includes T. rex!), had feathers. These feathers weren't necessarily for flight; they could have been used for insulation, display, or even camouflage.
...

Generation Parameters

Output Length

The input and output lengths of an LLM affect cost and performance: more tokens mean more computation, which leads to higher energy consumption, latency, and cost. While you can ask for a shorter output in the prompt itself, the configuration parameter below enforces a hard limit on output tokens.

from google.genai import types

short_config = types.GenerateContentConfig(max_output_tokens=200)

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=short_config,
    contents='Write a 1000 word essay on the importance of olives in modern society.')

print(response.text)

Temperature and Top-P

Temperature and top-P parameters are used to control the diversity of the model’s output.

  • Temperature controls the degree of randomness in token selection. Temperature = 0 is essentially greedy search: predictable, and a bit boring. Temperature = 1 is creative and full of surprises, for better or for worse.
  • Top-P sets a cumulative probability threshold: token selection stops once the most likely candidates reach that threshold. Top-P = 0 amounts to greedy search, while top-P = 1 allows every token in the model’s vocabulary.

Let’s try an example of picking a random color with low temperature:

low_temp_config = types.GenerateContentConfig(temperature=0.0)

for _ in range(5):
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=low_temp_config,
      contents='Pick a random colour... (respond in a single word)')

  if response.text:
    print(response.text, '-' * 25)

And the “random” colors are… all the same:

Azure
-------------------------
Azure
-------------------------
Azure
-------------------------
Azure
-------------------------
Azure
-------------------------

Let’s get creative with high settings for both parameters:

model_config = types.GenerateContentConfig(
    # These are the default values for gemini-2.0-flash.
    temperature=1.0,
    top_p=0.95,
)

story_prompt = "You are a creative writer. Write a short story about a cat who goes on an adventure."
response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=model_config,
    contents=story_prompt)

print(response.text)

And the result is… LOL:

Whiskers twitched, emerald eyes narrowed. Jasper, a ginger tabby of discerning taste and even more discerning naps, considered the situation. The sunbeam, usually a prime napping location, had shifted. Inconvenient. But beyond the shifted sunbeam, something else was afoot. A scent, a wild, untamed scent, wafted in from the open back door. Not the usual gardenia perfume of Mrs. Higgins, nor the faint whiff of grilling burgers next door. This was… different. This was adventure.
...

Prompting and Output Configurations

Prompting

Zero-shot prompts directly describe the request for the model. One-shot and few-shot prompts provide examples. These are pretty straightforward.

Zero-shot prompt example:

zero_shot_prompt = """Classify movie reviews as POSITIVE, NEUTRAL or NEGATIVE.
Review: "Her" is a disturbing study revealing the direction
humanity is headed if AI is allowed to keep evolving,
unchecked. I wish there were more movies like this masterpiece.
Sentiment: """

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=model_config,
    contents=zero_shot_prompt)

print(response.text)

One-shot or few-shot prompt example:

few_shot_prompt = """Parse a customer's pizza order into valid JSON:

EXAMPLE:
I want a small pizza with cheese, tomato sauce, and pepperoni.
JSON Response:
```
{
"size": "small",
"type": "normal",
"ingredients": ["cheese", "tomato sauce", "pepperoni"]
}
```

EXAMPLE:
Can I get a large pizza with tomato sauce, basil and mozzarella
JSON Response:
```
{
"size": "large",
"type": "normal",
"ingredients": ["tomato sauce", "basil", "mozzarella"]
}
```

ORDER:
"""

customer_order = "Give me a large with cheese & pineapple"

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=types.GenerateContentConfig(
        temperature=0.1,
        top_p=1,
        max_output_tokens=250,
    ),
    contents=[few_shot_prompt, customer_order])

print(response.text)

Models can sometimes produce more text than you’d like, so there are a few ways to constrain the output beyond just asking in the prompt.

Enum mode

Let’s constrain the output to be positive/negative/neutral for the sentiment classification of movie reviews. Notice the response_mime_type and response_schema parameters in configuration:

import enum

model_config = types.GenerateContentConfig(
    temperature=0.1,
    top_p=1,
    max_output_tokens=5,
)

zero_shot_prompt = """Classify movie reviews as POSITIVE, NEUTRAL or NEGATIVE.
Review: "Her" is a disturbing study revealing the direction
humanity is headed if AI is allowed to keep evolving,
unchecked. I wish there were more movies like this masterpiece.
Sentiment: """

class Sentiment(enum.Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=Sentiment
    ),
    contents=zero_shot_prompt)

print(response.text)

Output:

positive

JSON mode

To enforce output to be in JSON mode:

import typing_extensions as typing

class PizzaOrder(typing.TypedDict):
    size: str
    ingredients: list[str]
    type: str


response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=types.GenerateContentConfig(
        temperature=0.1,
        response_mime_type="application/json",
        response_schema=PizzaOrder,
    ),
    contents="Can I have a large dessert pizza with apple and chocolate")

print(response.text)

Output:

{
"size": "large",
"ingredients": ["apple", "chocolate"],
"type": "dessert"
}

Chain of Thought (CoT)

One way to reduce hallucinations and improve accuracy is Chain-of-Thought prompting: instruct the model to output its intermediate reasoning steps. This typically gives better results, especially when combined with few-shot examples, though it costs more to run due to the increased token count.

Let’s start with a direct answer:

prompt = """When I was 4 years old, my partner was 3 times my age. Now, I
am 20 years old. How old is my partner? Return the answer directly."""

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=prompt)

print(response.text)

Response:

52

Uhhh… incorrect. When the author was 4, the partner was 4*3=12. The difference in age is 12 – 4 = 8, so the partner is now 20 + 8 = 28. Let’s instruct the model to think step by step:

prompt = """When I was 4 years old, my partner was 3 times my age. Now,
I am 20 years old. How old is my partner? Let's think step by step."""

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=prompt)

from IPython.display import Markdown, display  # for rendered output in a notebook environment

Markdown(response.text)

Response (much better):

Here's how to solve the problem step-by-step:

1. Find the age difference: When you were 4, your partner was 3 times your age, meaning they were 4 * 3 = 12 years old.

2. Calculate the age gap: The age difference between you and your partner is 12 - 4 = 8 years.

3. Determine the partner's current age: Since the age difference remains constant, your partner is always 8 years older than you. Now that you are 20, your partner is 20 + 8 = 28 years old.

Therefore, your partner is 28 years old.

Evaluation

Prompting is all about experimentation, and we need to know how well each iteration performs. The first step is to define an evaluator: figure out what you want to assess. For example:

  • How well the model followed the prompt (instruction following)
  • Whether the response sticks to the information provided in the prompt (groundedness)
  • How easy the text is to read (fluency)
  • Verbosity
  • Quality
  • Other factors.

Then, we instruct an LLM to perform this assessment, much like a human grader working from a rubric.

Btw, Google has a bunch of metric prompt templates for model-based evaluation!

Ok, let’s walk through the sample code step by step.

Step 1: Define the Task

We want to summarize a document. First, let’s download the document:

!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

Define a prompt to get a summarization of the document:

request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
  """Execute the request on the uploaded document."""
  # Set the temperature low to stabilise the output.
  config = types.GenerateContentConfig(temperature=0.0)
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[request, document_file],
  )

  return response.text
  
# run it to get summary
summary = summarise_doc(request)
Markdown(summary)

Sample response:

Certainly! Let's break down the training process used for Gemini 1.5 Pro, based on the information provided in the document.

Key Aspects of the Training Process:

1. Model Architecture:

- Gemini 1.5 Pro is a sparse Mixture-of-Experts (MoE) Transformer-based model.
- It builds upon the research advances of Gemini 1.0 and a longer history of MoE research at Google.
- MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This allows for a larger total parameter count while keeping the number of activated parameters constant.
...

Step 2: Define an Evaluator

Here’s an example of a prompt that evaluates the quality of a response against a rating rubric. It takes the original user prompt and the AI-generated response as inputs:

import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

Let’s use an enum class to capture the result:

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

Evaluate function:

def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval

Run it: (remember that the summary came from Step 1)

text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

Response:

Evaluation
STEP 1: The AI response describes the training process used for Gemini 1.5 Pro. The response is grounded in the document provided. The response does a good job of describing the key aspects of the training process used. STEP 2: I think that the response deserves a 4 out of 5. The response is good but it could be more concise.

Rating
4

Evaluating in Practice

In practice, this kind of evaluation lets you:

  • Quickly iterate on a prompt with a small set of test documents.
  • Compare different models to find what works best for your cases.
  • Verify system quality when pushing changes to a model or prompt in a production system.

There are two evaluation approaches: pointwise evaluation and pairwise evaluation.

Pointwise Evaluation

Pointwise evaluation scores a single input/output pair against some criteria, such as “Was this output good or bad?” This is useful when you have a clear idea of what a good output looks like.

Let’s first create a question-answer task where you can decide the style of the answer:

import functools

# Different styles of answering a question
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."

guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

# questions to ask
questions = [
    "How does the model perform on code tasks?",
    "How many layers does it have?",
]

Then let’s create a question-answer function, and answer the first question in a “terse” manner:

@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
  """Generate an answer to the question using the uploaded document and guidance."""
  config = types.GenerateContentConfig(
      temperature=0.0,
      system_instruction=guidance,
  )
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[question, document_file],
  )

  return response.text

# Try it out with the first question and answer it in a "terse" manner
answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Response:

Gemini 1.5 Pro performs well on code tasks, surpassing Gemini 1.0 Ultra on Natural2Code and showing improvements in coding capabilities compared to previous Gemini models.

Next, let’s create a question-answer evaluator with a 5-point output, based on Google’s pointwise QA evaluation prompt template:

import enum

QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated responses.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

Define the eval function and run it:

@functools.cache
def eval_answer(prompt, ai_response):
  """Evaluate the generated answer against the prompt/question used."""
  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PROMPT.format(prompt=[prompt, document_file], response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval

# Evaluate first question's answer
text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

Response:

STEP 1: The response accurately and concisely answers the question about the model's performance on code tasks, as described in the document. STEP 2: The response follows instructions, is grounded, complete, and fluent. So I choose 5.

AnswerRating.VERY_GOOD

Great! Now let’s run the evaluation task in a loop and see how good the answers are. Each task is repeated so we can average the scores and reduce noise.

import collections
import itertools

# Number of times to repeat each task in order to reduce error and calculate an average.
# Increasing it will take longer but give better results, try 2 or 3 to start.
NUM_ITERATIONS = 2

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
  display(Markdown(f'## {question}'))
  for guidance, guide_prompt in guidance_options.items():

    for n in range(NUM_ITERATIONS):
      # Generate a response.
      answer = answer_question(question, guide_prompt)

      # Evaluate the response (note that the guidance prompt is not passed).
      written_eval, struct_eval = eval_answer(question, answer)
      print(f'{guidance}: {struct_eval}')

      # Save the numeric score.
      scores[guidance] += int(struct_eval.value)

      # Save the responses, in case you wish to inspect them.
      responses[(guidance, question)].append((answer, written_eval))

Response:

How does the model perform on code tasks?
Terse: AnswerRating.VERY_GOOD
Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD

How many layers does it have?
Terse: AnswerRating.VERY_BAD
Terse: AnswerRating.VERY_BAD
Moderate: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD

And finally, let’s average the scores for each prompt:

for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}: {avg_score:.2f} - {nearest.name}')

Response:

Terse: 5.00 - VERY_GOOD
Moderate: 4.50 - GOOD
Cited: 4.50 - GOOD

Pairwise Evaluation

Pairwise evaluation allows you to compare two outputs against each other. This is essential in ranking and sorting your prompts.

Here are Google’s pairwise QA quality prompt and the constrained output (A, B, or Same):

QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""


class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'

Define function and run:

@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=None):
  """Determine the better of two answers to the same prompt.

  The n argument is not used directly; it varies the functools.cache key so that
  repeated trials re-run instead of returning a cached result.
  """

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PAIRWISE_PROMPT.format(
          prompt=[prompt, document_file],
          baseline_model_response=response_a,
          response=response_b)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerComparison,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


# Test with first question and generate two responses: terse and cited versions
question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

Response:

Judge Feedback

STEP 1: Response A answers the question in a very limited and generic way. It is grounded in the document but incomplete in its response to the prompt. STEP 2: Response B answers the prompt in a complete and very informative way. It is grounded in the document and leaves the user feeling very informed about the response to the prompt. STEP 3: Response B is much better than Response A as it provides a complete response in a well-organized way. STEP 4: B STEP 5: Response B is much better than Response A as it provides a complete response in a well-organized way. Response A only gives a one-sentence generic answer.

AnswerComparison.B

Next, let’s define a comparator class using the @functools.total_ordering decorator, which lets you define custom comparison behaviour. By default you only need to implement __eq__ (==) and one other method; here __lt__ (<) was chosen, and the rest are filled in automatically. For more information, check out Python’s @functools.total_ordering documentation.

The comparator runs n_comparisons pairwise evaluations over a set of questions. Note that these comparisons only execute when an actual comparison happens later (e.g. inside sorted), and you can trace the code starting from __lt__. Each individual comparison and its result are printed so you can follow along:

@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons
    
  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return mean # return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )
    print(f'q[{question}], a[{self.prompt[:20]}...], b[{other.prompt[:20]}...]: {result}')

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    # final score for this comparison
    score = self._compare_all(other)
    print('comparison score:', score, '\n')

    return score < 0

Sorting works on any QAGuidancePrompt instances; the comparator is invoked by the sorted call below. Here we use three iterations for each question:

# recall the prompt guidances
print('Prompt guidance:')
print('- Terse:', terse_guidance)
print('- Moderate:', moderate_guidance)
print('- Cited:', cited_guidance)
print()

# recall the questions
print('Questions:', questions)
print()

# NUM_ITERATIONS may have been changed above, so pass 3 explicitly here
terse_prompt = QAGuidancePrompt(terse_guidance, questions, 3)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions, 3)
cited_prompt = QAGuidancePrompt(cited_guidance, questions, 3)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
print()

for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

Response:

Prompt guidance:
- Terse: Answer the following question in a single sentence, or as close to that as possible.
- Moderate: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
- Cited: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.

Questions: ['How does the model perform on code tasks?', 'How many layers does it have?']

q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
comparison score: -0.3333333333333333

q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.A
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.A
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
comparison score: 0.3333333333333333

q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.B
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.SAME
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.B
q[How many layers does it have?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Answer the following...], b[Provide a thorough, ...]: AnswerComparison.B
comparison score: -0.16666666666666666

q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.A
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How does the model perform on code tasks?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.A
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
q[How many layers does it have?], a[Answer the following...], b[Provide a brief answ...]: AnswerComparison.SAME
comparison score: 0.3333333333333333


#1: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.
---
#2: Answer the following question in a single sentence, or as close to that as possible.
---
#3: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.

Let’s take a look at the first set of intermediate results:

q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How does the model perform on code tasks?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.A
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
q[How many layers does it have?], a[Provide a brief answ...], b[Provide a thorough, ...]: AnswerComparison.B
comparison score: -0.3333333333333333
  • This set compares the answers generated by the moderate prompt (answer A) and the cited prompt (answer B) for each question. A preference for A scores 1 and a preference for B scores -1, so the scores here are [-1, -1, 1, 1, -1, -1] and the mean is -2/6 ≈ -0.333. In other words, the cited prompt was generally preferred, though not a clean sweep.

To summarize the pairwise scores:

Prompt A    Prompt B    Mean Score    Winner
Moderate    Cited       -0.333        Cited
Terse       Cited       -0.166        Cited
Terse       Moderate     0.333        Terse

This means the cited prompt is the best overall, followed by the terse prompt. The moderate prompt lost out all around.

Challenges

LLMs are not well suited to every task, so they are not always a good choice as an evaluator. For example, LLMs struggle to count the number of characters in a word. To work around such limitations, connect them to tools that are better at those tasks, and remember to provide all the information they need in the input context rather than relying on the model’s “internal knowledge”.

Another way to improve evaluation is to use a diverse set of LLMs (Gemini, Claude, ChatGPT, etc.) and to repeat trials to gather multiple “opinions”, which helps reduce error.

#GenAI #LLMs #PromptEngineering #KaggleSprint #LearningInPublic
