Generative AI

Benchmarking LLMs: A Deep Dive into the Effects of Prompt Sensitivity

📘Access the notebook.

In a previous post, I showed you how to run LangChain Benchmarks using local models from Hugging Face.

I recommend skimming that post first, so you have some context.


Here’s what this blog is about: The Nuances Involved in Benchmarking LLMs

I want to provide a hands-on, detailed examination of the nuances involved in benchmarking Large Language Models (LLMs) using LangChain Benchmarks.

My focus has several key aspects:

  1. Experimentation with Different Prompts and Setups: I’m conducting experiments by altering prompts and configurations to observe how these changes affect the LLM’s performance.
  2. Analyzing the Impact of Variations in Prompts: My observations on how different elements like the prefix, suffix, and the structure of the human message template influence the LLM’s responses are vital. I hope to shed light on the sensitivity of LLMs to input variations and how they can be gamed for specific tasks or outcomes.
  3. Practical Application of Findings: The insights I provide are not just theoretical; they have practical implications for those looking to employ LLMs in various contexts. Understanding how small changes in input can lead to significant differences in output is essential for developers and researchers working with these models.
  4. Contribution to the LLM Community: By sharing these insights and observations, I hope to contribute to the broader community of LLM users and developers. I hope this aids in building a deeper understanding of how these models can be effectively utilized and what factors need to be considered in their deployment.

%%capture
!pip install -qq langchain openai datasets langchain_benchmarks langsmith langchainhub
import os
import getpass

import nest_asyncio

# Allow nested event loops (needed to run the async benchmark code inside a notebook)
nest_asyncio.apply()

os.environ['LC_ALL'] = 'en_US.UTF-8'
os.environ['LANG'] = 'en_US.UTF-8'
os.environ['LC_CTYPE'] = 'en_US.UTF-8'
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key:")
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter LangChain API Key:")
from langchain_benchmarks import clone_public_dataset
from langchain_benchmarks import registry

from langchain_benchmarks.schema import ExtractionTask
from langchain_benchmarks.tool_usage.agents import apply_agent_executor_adapter


🤖 The Impact of Prompts on Agent Tool Usage

Experimenting with prompts was the most time-intensive aspect of my LangChain Benchmarks project.

What I discovered was striking: both the DeciLM and Mistral models exhibit dramatic variance in response to even minute alterations in the prompt. Surprisingly, something as small as an extra whitespace character could lead to a range of outcomes – incorrect responses, a hundredfold increase in response time, or errors thrown by the AgentExecutor. This sensitivity to prompt nuances opens a window for ‘prompt hacking,’ potentially skewing evaluation results to favor one model over another. Despite this, my focus remained on crafting prompts that elicited the best possible performance from each model.

However, I confess to spending an excessive amount of time experimenting with prompts to achieve rapid and accurate responses from the models. This highlights a significant challenge in evaluating LLMs: prompt sensitivity.

Initially, I crafted prompts from scratch until I realized I could modify sections of the standard prompt template using agent_kwargs. One notable oversight in the documentation was that the benchmark-specific task instructions were not included in the prompts, which I addressed by injecting instructions = self.task.instructions directly into the prompt pieces. A rough sketch of this pattern follows.
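To make that concrete, here is a minimal sketch of the pattern, assuming a structured-chat-style agent (the $JSON_BLOB format you’ll see in the prompts below comes from that agent type). The llm object and the prompt strings here are placeholders, not the verbatim prompts used in the runs.

# Sketch only: `llm` stands in for your locally loaded DeciLM or Mistral pipeline from the
# earlier post, and the prompt strings are placeholders rather than the exact prompts used here.
from langchain.agents import AgentType, initialize_agent
from langchain_benchmarks import registry

task = registry["Multiverse Math"]
env = task.create_environment()      # exposes the task's tools
instructions = task.instructions     # the benchmark-specific task instructions

llm = ...  # your HuggingFace-pipeline-wrapped DeciLM or Mistral model

agent_executor = initialize_agent(
    tools=env.tools,
    llm=llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    agent_kwargs={
        "prefix": instructions + "\nUse one of the following tools to take action:",
        "suffix": "Remember: " + instructions + " ALWAYS respond with the following format: ...",
        "human_message_template": "Use the correct tool to answer the following question: {input}\n{agent_scratchpad}",
    },
    handle_parsing_errors=True,
)

The resulting executor can then be wrapped with apply_agent_executor_adapter (imported above) before it is run against the benchmark dataset.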

Regarding Mistral, its model card offered little formatting guidance beyond indicating that prompts should start with <s>[INST] and conclude with [/INST]. For DeciLM, which underwent instruction tuning, I used its ### System / ### User / ### Assistant template, which you’ll see reflected in the prompts below.
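To make the two formats concrete, here is roughly how a single-turn prompt is shaped for each model. The instruction and question strings are placeholders; the DeciLM system message is the one that appears in the prompts later in this post.

# Illustrative only: the verbatim prompts used in each run are linked in the sections below.
instructions = "<Multiverse Math task instructions go here>"  # placeholder
question = "Add 2 and 3"                                      # one of the benchmark questions

# Mistral-7B-Instruct: wrap everything in <s>[INST] ... [/INST]
mistral_prompt = f"<s>[INST] {instructions}\n{question} [/INST]"

# DeciLM-7B-instruct: ### System / ### User / ### Assistant sections
decilm_prompt = (
    "### System:\nYou are an AI assistant that follows instruction extremely well. "
    "Help as much as you can.\n"
    f"### User:\n{instructions}\n{question}\n"
    "### Assistant:\n"
)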

So, I tested a variety of prompts, with and without these instructions and special tokens. The analysis here simply looks at the number of correct, incorrect, and errored-out results from each model on each prompt. I encourage you to drill down into the results and come to your own conclusions. I’m not claiming that one model is better than the other – that’s up to the community to decide. I’m just sharing my work.

Let’s take a look at the results.


Note: Traces for all runs are publicly available here.


Pull data from LangSmith

# @title Pull data from LangSmith

import pandas as pd
from langsmith.client import Client

def extract_steps_from_intermediate(steps):
    """
    Extract the 'tool' values from the first element of each step in a nested list structure.

    This function is designed to process a list of lists, where each inner list represents a step
    and is expected to contain at least one dictionary with a key 'tool'. It extracts the 'tool' value
    from the first dictionary of each step.

    Args:
    steps (list): A list of lists, where each inner list represents a step.

    Returns:
    list: A list of extracted 'tool' values from the first element of each step.
    """
    # Check if the input is a list; if not, return an empty list
    if not isinstance(steps, list):
        return []

    extracted_tools = []
    # Iterate through each step in the list
    for step in steps:
        # Check if the step is a non-empty list and contains the key 'tool' in its first element
        if isinstance(step, list) and step and "tool" in step[0]:
            extracted_tools.append(step[0]["tool"])

    return extracted_tools
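# Quick illustration of what this helper returns. The nested-list structure below is a
# simplified stand-in for the serialized intermediate steps stored by LangSmith, where
# each step is an [action, observation] pair and the action dict carries a 'tool' key.
example_steps = [
    [{"tool": "add", "tool_input": {"a": 2, "b": 3}}, "<observation>"],
    [{"tool": "negate", "tool_input": {"a": 5}}, "<observation>"],
]
# extract_steps_from_intermediate(example_steps) -> ['add', 'negate']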

def concatenate_test_results(projects, project_names):
    """
    Concatenates test results from specified projects into a single DataFrame.

    This function processes a list of projects, filters for specific project names provided,
    retrieves their test results, and concatenates these results into a single DataFrame.
    The function is designed to work with projects from the LangSmith API.

    Args:
    projects : list
        A list of projects obtained from the LangSmith API. Each project is an object
        which should have at least 'name' and 'extra' attributes.
    project_names : list of str
        A list of project names to filter by. Only projects with these names will have
        their test results included in the final DataFrame.

    Returns:
    pd.DataFrame
        A pandas DataFrame containing the concatenated test results from the specified projects.
        Each row in the DataFrame corresponds to a record-level information from a test project.
        Note: the data is fetched from the DB and results might not be immediately available upon
        evaluation run completion.

    """
    dfs = []
    for project in projects:
        if project.name in project_names:
            test_results = client.get_test_results(project_name=project.name)
            test_results["model"] = project.extra['tags'][0]
            dfs.append(test_results)
    df = pd.concat(dfs)
    df["actual_steps"] = df["outputs.intermediate_steps"].apply(extract_steps_from_intermediate)
    df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
    return df


Helper functions for analysis

# @title Helper functions for analysis
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

def count_responses_by_type(data):
    """
    Count the number of correct, incorrect, and errored responses for each model.

    Args:
    data (DataFrame): The dataset containing 'model' and 'feedback.correctness' columns.

    Returns:
    DataFrame: Counts of response types (Correct, Incorrect, Error) for each model.
    """
    # Categorizing responses based on 'feedback.correctness'
    data['Response Type'] = data['feedback.correctness'].apply(lambda x: 'Correct' if x == 1.0 else ('Incorrect' if x == 0.0 else 'Error'))

    # Counting the number of each type of response for each model
    return data.groupby(['model', 'Response Type']).size().unstack(fill_value=0)

def plot_response_counts(response_counts):
    """
    Plot the response counts as a stacked bar chart.

    Args:
    response_counts (DataFrame): A DataFrame containing counts of response types (Correct, Incorrect, Error) for each model.

    This function will generate a bar plot showing the count of each response type for each model.
    """
    # Reorder columns to ensure the stacking order is Correct, Incorrect, Error
    ordered_columns = ['Correct', 'Incorrect', 'Error']
    response_counts = response_counts[ordered_columns]

    ax = response_counts.plot(kind='bar', stacked=True, color={'Correct': 'green', 'Incorrect': 'red', 'Error': 'yellow'})

    # Add text labels for actual counts on top of the bars
    for p in ax.containers:
        ax.bar_label(p, label_type='center', fontsize=16)

    plt.title('Count of Response Types by Model')
    plt.xlabel('Model')
    plt.ylabel('Count')
    plt.show()

def summarize_execution_time_by_response_type(data):
    """
    Calculate summary statistics for execution time by model and response type, expressed in minutes.

    Args:
    data (DataFrame): The dataset containing 'execution_time', 'model', and 'feedback.correctness' columns.

    Returns:
    DataFrame: Summary statistics of execution time for each model by response type in minutes.
    """
    # Convert execution time from seconds to minutes
    data['Execution Time (minutes)'] = data['execution_time'] / 60

    # Categorizing responses based on 'feedback.correctness'
    data['Response Type'] = data['feedback.correctness'].apply(lambda x: 'Correct' if x == 1.0 else ('Incorrect' if x == 0.0 else 'Error'))

    # Ensure the order of 'Response Type' is Correct, Incorrect, Error
    response_type_order = {'Correct': 0, 'Incorrect': 1, 'Error': 2}
    data['Response Type Order'] = data['Response Type'].map(response_type_order)

    # Grouping the data by 'model' and 'Response Type' and calculating the summary statistics for execution time in minutes
    summary = data.groupby(['model', 'Response Type Order', 'Response Type'])['Execution Time (minutes)'].describe()

    # Dropping the 'Response Type Order' from the index to clean up the output
    summary = summary.droplevel('Response Type Order')

    return summary

def plot_execution_time_by_correctness_minutes(data):
    """
    Create box plots for execution time distribution by model, with distinct colors for each response type:
    Correct (green), Incorrect (red), and Error (yellow), expressed in minutes.

    Args:
    data (DataFrame): The dataset containing 'execution_time', 'model', and 'feedback.correctness' columns.

    This function creates box plots with the y-axis on a logarithmic scale for better readability.
    It distinguishes between correct, incorrect, and error responses for each model with distinct colors,
    and converts execution time from seconds to minutes.
    """
    # Convert execution time from seconds to minutes
    data['Execution Time (minutes)'] = data['execution_time'] / 60

    # Categorizing responses based on 'feedback.correctness'
    data['Response Type'] = data['feedback.correctness'].apply(lambda x: 'Correct' if x == 1.0 else ('Incorrect' if x == 0.0 else 'Error'))

    # Sorting data based on 'Response Type' and 'model' to ensure consistent plot order
    response_type_order = ['Correct', 'Incorrect', 'Error']
    model_order = ['DeciLM', 'Mistral']
    data['Response Type'] = pd.Categorical(data['Response Type'], categories=response_type_order, ordered=True)
    data['model'] = pd.Categorical(data['model'], categories=model_order, ordered=True)

    plt.figure(figsize=(12, 8))

    # Set font sizes
    plt.rcParams.update({'font.size': 14})

    sns.boxplot(x='model', y='Execution Time (minutes)', hue='Response Type', data=data,
                palette={'Correct': 'green', 'Incorrect': 'red', 'Error': 'yellow'})
    plt.title('Execution Time Distribution by Model and Response Type (Minutes)')
    plt.ylabel('Execution Time (minutes)')
    plt.xlabel('Model')
    plt.yscale('log')  # Using a logarithmic scale for better readability
    plt.legend(title='Response Type')
    plt.grid(True)
    plt.show()
client = Client()

projects = list(client.list_projects(reference_dataset_name="Multiverse Math"))
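Each benchmark run lives in its own LangSmith project. The analysis code relies on two conventions worth calling out: the first tag on a project records the model, and the trailing chunk of the project name (for example, 256b) identifies the prompt variant. A quick way to eyeball this:

# Peek at how the model and prompt variant are encoded in each project
for project in projects[:4]:
    print(project.name, "| model tag:", project.extra['tags'][0], "| prompt:", project.name.split('-')[-1])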


Run 256b

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:

Note that instructions in the prompt refers to the instructions for the Multiverse Math task, which is as follows:

You are requested to solve math questions in an alternate mathematical universe. The operations have been altered to yield different results than expected. Do not guess the answer or rely on your innate knowledge of math. Use the provided tools to answer the question. While associativity and commutativity apply, distributivity does not. Answer the question using the fewest possible tools. Only include the numeric response without any clarifications.

This prompt produced the closest results between the two models. I’ll refer to it as the reference prompt; any changes made to a prompt will be described relative to it.
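Both the task definition and the instructions above come straight from the langchain_benchmarks registry. Here is a small sketch of how to pull them (the API usage follows the library’s documented pattern at the time of writing, so double-check against the version you install):

# Pull the Multiverse Math task; `task.instructions` is the text quoted above and is
# what gets substituted wherever {instructions} appears in the prompts below.
from langchain_benchmarks import clone_public_dataset, registry

task = registry["Multiverse Math"]
print(task.instructions)

# Copy the public dataset into your own LangSmith account before running evaluations
clone_public_dataset(task.dataset_id, dataset_name=task.name)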

run_256b_project_names = ["DeciLM-Multiverse Math-2023-12-30-256b", "Mistral-Multiverse Math-2023-12-30-256b"]

run_256b_df = concatenate_test_results(projects, run_256b_project_names)
plot_response_counts(count_responses_by_type(run_256b_df))

Execution time in minutes.

summarize_execution_time_by_response_type(run_256b_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 8.0 | 18.466311 | 39.133868 | 2.226144 | 2.423764 | 4.575790 | 7.809745 | 115.115817 |
| DeciLM | Incorrect | 3.0 | 434.874290 | 151.770921 | 266.034112 | 372.328573 | 478.623034 | 519.294378 | 559.965723 |
| DeciLM | Error | 9.0 | 30.310472 | 42.600586 | 1.835423 | 2.461126 | 4.023099 | 56.086903 | 103.178992 |
| Mistral | Correct | 5.0 | 104.139193 | 57.004705 | 12.908443 | 101.369135 | 113.749601 | 123.815598 | 168.853191 |
| Mistral | Incorrect | 3.0 | 13.686286 | 10.855275 | 1.357130 | 9.625359 | 17.893588 | 19.850864 | 21.808140 |
| Mistral | Error | 12.0 | 6.523869 | 2.937811 | 2.725033 | 4.347828 | 6.247429 | 7.114050 | 12.191421 |
plot_execution_time_by_correctness_minutes(run_256b_df)


Run 0b56

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:


Differences compared to Prompt 256b

  1. Formatting Tags:
    • Prompt 256b: No special formatting tags.
    • Prompt 0b56: Includes formatting tags (<s>[INST] and [/INST]).
  2. Suffix Emphasis:
    • Prompt 256b: Emphasizes “ALWAYS respond with the following format”.
    • Prompt 0b56: Highlights “Remember you are following a specific set of instructions”.
  3. User Instruction:
    • Prompt 256b: Asks to “Use the correct tool to answer the following question”.
    • Prompt 0b56: Directs to “Use the tools correctly and answer the following question”.

The main differences are the inclusion of formatting tags in Prompt 0b56 and slight variations in the phrasing that emphasize specific instruction adherence and correct tool usage.

run_0b56_project_names = ["DeciLM-Multiverse Math-2024-01-04-0b56",
                          "Mistral-Multiverse Math-2024-01-04-0b56"]

run_0b56_df = concatenate_test_results(projects, run_0b56_project_names)
plot_response_counts(count_responses_by_type(run_0b56_df))
summarize_execution_time_by_response_type(run_0b56_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 7.0 | 5.089407 | 2.328123 | 2.460808 | 3.326639 | 4.164363 | 7.192264 | 7.962867 |
| DeciLM | Incorrect | 2.0 | 40.180234 | 42.563489 | 10.083303 | 25.131768 | 40.180234 | 55.228700 | 70.277166 |
| DeciLM | Error | 11.0 | 15.966558 | 22.411918 | 1.924887 | 2.648787 | 3.519491 | 18.854086 | 71.468078 |
| Mistral | Correct | 1.0 | 9.649047 | NaN | 9.649047 | 9.649047 | 9.649047 | 9.649047 | 9.649047 |
| Mistral | Incorrect | 10.0 | 47.159766 | 51.819498 | 2.833751 | 8.986680 | 23.942198 | 88.200453 | 132.290339 |
| Mistral | Error | 9.0 | 27.604038 | 29.627080 | 6.089147 | 6.918161 | 17.893105 | 25.513326 | 86.015022 |
plot_execution_time_by_correctness_minutes(run_0b56_df)


Run a0e6

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:


Differences compared to Prompt 256b

  1. Prefix:
    • Prompt a0e6: Adds “You must use these tools, each accompanied by a clear description and specific input requirements:” to the instructions.
    • Prompt 256b: Simply instructs to “Use one of the following tools to take action:”.
  2. Suffix:
    • Prompt a0e6: The format is “Observation:\nThought:\nAction:$JSON_BLOB”.
    • Prompt 256b: The format is “Action: \n$JSON_BLOB\n \nObservation:\nThought:\n”. The order of the Action, Observation, and Thought sections is reversed.
  3. Human Message Template:
    • Prompt a0e6: States “Use the tools correctly and answer the following question: {input}”.
    • Prompt 256b: Says “Use the correct tool to answer the following question: {input}”.

The key differences are in the additional emphasis on using tools with clear descriptions and specific input requirements in a0e6, the reversal of the order in the suffix section, and slightly different phrasing in the human message template.

run_a0e6_project_names = ["DeciLM-Multiverse Math-2024-01-04-a0e6",
                          "Mistral-Multiverse Math-2024-01-04-a0e6"]

run_a0e6_df = concatenate_test_results(projects, run_a0e6_project_names)

plot_response_counts(count_responses_by_type(run_a0e6_df))
summarize_execution_time_by_response_type(run_a0e6_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 2.0 | 2.077681 | 0.397620 | 1.796521 | 1.937101 | 2.077681 | 2.218261 | 2.358840 |
| DeciLM | Incorrect | 16.0 | 24.932950 | 41.046524 | 0.879467 | 1.903564 | 2.411579 | 24.581974 | 112.956007 |
| DeciLM | Error | 2.0 | 6.032237 | 4.963765 | 2.522325 | 4.277281 | 6.032237 | 7.787193 | 9.542149 |
| Mistral | Incorrect | 20.0 | 29.620962 | 42.332099 | 1.159620 | 5.161073 | 7.058044 | 29.775293 | 114.148989 |
plot_execution_time_by_correctness_minutes(run_a0e6_df)


Run b689

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:


Differences compared to Prompt 256b

  1. Prefix:
    • Prompt b689: Starts with <s>[INST] and specifies “You must use the following tools to take action:”.
    • Prompt 256b: No special formatting tags and simply states “Use one of the following tools to take action:”.
  2. Suffix:
    • Prompt b689: The format is “Note:: {instructions} Respond in the following format: \nObservation:\nThought:\nAction:$JSON_BLOB”.
    • Prompt 256b: The format is “Remember: {instructions} ALWAYS respond with the following format: \nAction: \n$JSON_BLOB\n \nObservation:\nThought:\n”, with a different ordering of the sections.
  3. Human Message Template:
    • Prompt b689: Instructs “Think step-by-step, correctly use your tools, double check that you used the tool correctly, and solve the following problem: {input}”.
    • Prompt 256b: Simply states “Use the correct tool to answer the following question: {input}”.

The main differences are in the inclusion of formatting tags and additional instructions in Prompt b689, a different order in the suffix, and more detailed guidance for tool usage in the human message template.

run_b689_project_names = ["DeciLM-Multiverse Math-2024-01-04-b689",
                          "Mistral-Multiverse Math-2024-01-04-b689"]

run_b689_df = concatenate_test_results(projects, run_b689_project_names)

plot_response_counts(count_responses_by_type(run_b689_df))
summarize_execution_time_by_response_type(run_b689_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 5.0 | 1.902141 | 0.308760 | 1.574332 | 1.639146 | 1.954637 | 2.002381 | 2.340207 |
| DeciLM | Incorrect | 12.0 | 39.221112 | 55.979417 | 0.803252 | 1.008403 | 2.049315 | 107.423659 | 131.648682 |
| DeciLM | Error | 3.0 | 37.471229 | 59.954504 | 1.491386 | 2.865573 | 4.239760 | 55.461150 | 106.682539 |
| Mistral | Incorrect | 20.0 | 3.838245 | 2.235439 | 1.152722 | 1.932962 | 3.124413 | 5.446164 | 8.676092 |
plot_execution_time_by_correctness_minutes(run_b689_df)

Run e286

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:


Differences compared to Prompt 256b

  1. Prefix:
    • Prompt e286: Begins with <s>[INST] followed by the instructions and “You must use the following tools to take action:”. It lacks the introductory line about the AI assistant’s role.
    • Prompt 256b: Starts with a detailed introduction “### System: You are an AI assistant that follows instruction extremely well. Help as much as you can.” before the instruction to use tools.
  2. Suffix:
    • Prompt e286: Has an empty suffix.
    • Prompt 256b: Includes a detailed suffix with a format for response: “Remember: {instructions} ALWAYS respond with the following format: \nAction: \n$JSON_BLOB\n \nObservation:\nThought:\n”.
  3. Human Message Template:
    • Prompt e286: Simplified to “Use the tools and solve the following problem: {input}\n{agent_scratchpad}\n[/INST]”.
    • Prompt 256b: More detailed, instructing to “Use the correct tool to answer the following question: {input}\n{agent_scratchpad}\n### Assistant:”.

The key differences are the inclusion of special formatting tags and a lack of introductory context in Prompt e286’s prefix, the absence of a suffix in e286, and a more streamlined human message template in e286 compared to the detailed structure in Prompt 256b.
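Piecing those differences together, the e286 prompt pieces look roughly like this. This is an approximation assembled from the comparison above, not a copy of the run’s exact configuration; instructions is the Multiverse Math task text shown earlier.

# Approximate reconstruction of the e286 prompt pieces, assembled from the differences
# listed above; treat the exact wording as indicative rather than verbatim.
instructions = task.instructions  # Multiverse Math instructions from the registry snippet earlier

e286_agent_kwargs = {
    "prefix": "<s>[INST] " + instructions + "You must use the following tools to take action:",
    "suffix": "",
    "human_message_template": "Use the tools and solve the following problem: {input}\n{agent_scratchpad}\n[/INST]",
}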

run_e286_project_names = ["DeciLM-Multiverse Math-2024-01-04-e286",
                          "Mistral-Multiverse Math-2024-01-04-e286"]

run_e286_df = concatenate_test_results(projects, run_e286_project_names)

plot_response_counts(count_responses_by_type(run_e286_df))
summarize_execution_time_by_response_type(run_e286_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 5.0 | 1.894948 | 0.252834 | 1.700578 | 1.714358 | 1.720306 | 2.126297 | 2.213203 |
| DeciLM | Incorrect | 9.0 | 12.072942 | 21.015992 | 0.633988 | 1.008691 | 1.787312 | 2.855528 | 50.367631 |
| DeciLM | Error | 6.0 | 3.317546 | 1.545352 | 1.708068 | 2.267118 | 3.037555 | 3.918754 | 5.899308 |
| Mistral | Correct | 2.0 | 15.606720 | 1.380953 | 14.630239 | 15.118479 | 15.606720 | 16.094961 | 16.583202 |
| Mistral | Incorrect | 6.0 | 89.359136 | 81.954253 | 10.777863 | 36.109852 | 70.186366 | 111.595963 | 234.854044 |
| Mistral | Error | 12.0 | 21.236296 | 30.901896 | 4.301468 | 7.096156 | 10.183361 | 19.965266 | 115.502796 |
plot_execution_time_by_correctness_minutes(run_e286_df)


Run 6695

Here is a public link so you can go and examine the results for yourself.

In this run, I used the following prompt:


Differences compared to Prompt 256b

  1. Prefix:
    • Prompt 6695: Includes <s>[INST] followed by the instructions and “Use the following tools to take action:”. It lacks the introduction about the AI assistant’s capabilities.
    • Prompt 256b: Starts with “### System: You are an AI assistant that follows instruction extremely well. Help as much as you can.” and then instructs to use tools.
  2. Suffix:
    • Prompt 6695: Has an empty suffix.
    • Prompt 256b: Contains a detailed suffix specifying the response format: “Remember: {instructions} ALWAYS respond with the following format: \nAction: \n$JSON_BLOB\n \nObservation:\nThought:\n”.
  3. Human Message Template:
    • Prompt 6695: Simplified to “Solve the following problem: {input}\n{agent_scratchpad}\n[/INST]”.
    • Prompt 256b: More detailed, stating “### User: Use the correct tool to answer the following question: {input}\n{agent_scratchpad}\n### Assistant:”.

The main differences are in the inclusion of special formatting tags and less context in Prompt 6695’s prefix, no suffix in 6695, and a more straightforward human message template compared to the more structured approach in Prompt 256b.

run_6695_project_names = ["DeciLM-Multiverse Math-2024-01-05-6695",
                          "Mistral-Multiverse Math-2024-01-05-6695"]

run_6695_df = concatenate_test_results(projects, run_6695_project_names)

plot_response_counts(count_responses_by_type(run_6695_df))
summarize_execution_time_by_response_type(run_6695_df)
| model | Response Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| DeciLM | Correct | 5.0 | 10.778231 | 18.934829 | 0.751525 | 1.828560 | 2.252673 | 4.496525 | 44.561874 |
| DeciLM | Incorrect | 13.0 | 14.963736 | 31.936988 | 0.792671 | 0.948596 | 2.266658 | 4.653235 | 105.577623 |
| DeciLM | Error | 2.0 | 4.185686 | 0.245290 | 4.012240 | 4.098963 | 4.185686 | 4.272409 | 4.359132 |
| Mistral | Correct | 3.0 | 7.062026 | 2.437602 | 5.188536 | 5.684061 | 6.179586 | 7.998771 | 9.817955 |
| Mistral | Incorrect | 5.0 | 458.961024 | 169.859725 | 180.006650 | 440.743363 | 488.545084 | 575.652366 | 609.857656 |
| Mistral | Error | 12.0 | 31.609221 | 59.900516 | 1.662959 | 3.549773 | 4.983022 | 15.078580 | 191.099418 |
plot_execution_time_by_correctness_minutes(run_6695_df)


🌎 Global evaluation

It would be interesting to see the performance of models across prompts and questions. That’s what this section is about.


Pull all results data from LangSmith

# @title Pull all results data from LangSmith
dfs = []
for project in projects:
    test_results = client.get_test_results(project_name=project.name)
    test_results["model"] = project.extra['tags'][0]
    test_results["prompt"] = project.name.split('-')[-1]
    dfs.append(test_results)

all_results = pd.concat(dfs)

slim_results = all_results[['input.question', 'prompt', 'model', 'feedback.correctness']].copy()


Performance across all prompts

Correctness Percentage is a metric used to evaluate the accuracy of responses. It is calculated as the ratio of the number of correct responses to the total number of responses that have been evaluated for correctness, expressed as a percentage.

In mathematical terms:

Correctness Percentage = (Number of Correct Responses / Total Number of Evaluated Responses) × 100

  1. Model: DeciLM
    • Correct Responses: 32
    • Total Evaluated: 87
    • Correctness Percentage: Approximately 36.78%
    • Incorrect Responses: 55
    • Incorrectness Percentage: Approximately 63.22%
  2. Model: Mistral
    • Correct Responses: 11
    • Total Evaluated: 75
    • Correctness Percentage: Approximately 14.67%
    • Incorrect Responses: 64
    • Incorrectness Percentage: Approximately 85.33%

These results indicate that the ‘DeciLM’ model has a higher rate of correctness compared to the ‘Mistral’ model. Specifically, ‘DeciLM’ answered about 36.78% of the questions correctly, while ‘Mistral’ had a much lower correctness percentage of around 14.67%.
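As a quick sanity check on the percentages quoted above:

# Verify the headline correctness percentages
decilm_correct, decilm_total = 32, 87
mistral_correct, mistral_total = 11, 75
print(round(decilm_correct / decilm_total * 100, 2))   # 36.78
print(round(mistral_correct / mistral_total * 100, 2)) # 14.67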


Correctness Analysis by Model

# @title Correctness Analysis by Model

# Grouping data by 'model' and calculating correctness statistics for each model
model_analysis = slim_results.groupby('model')['feedback.correctness'].agg(['sum', 'count', 'mean'])
model_analysis.columns = ['Correct Responses', 'Total Evaluated', 'Correctness Percentage']
model_analysis['Incorrect Responses'] = model_analysis['Total Evaluated'] - model_analysis['Correct Responses']
# Note: 'Correctness Percentage' above is the mean of feedback.correctness, i.e. a fraction
# between 0 and 1, while 'Incorrectness Percentage' is expressed on a 0-100 scale.
model_analysis['Incorrectness Percentage'] = 100 - (model_analysis['Correctness Percentage'] * 100)

model_analysis.reset_index(inplace=True)
model_analysis

| model | Correct Responses | Total Evaluated | Correctness Percentage | Incorrect Responses | Incorrectness Percentage |
|---|---|---|---|---|---|
| DeciLM | 32.0 | 87 | 0.367816 | 55.0 | 63.218391 |
| Mistral | 11.0 | 75 | 0.146667 | 64.0 | 85.333333 |


Performance by prompt


Correctness Analysis by both Prompt and Model

# @title Correctness Analysis by both Prompt and Model

# Specifying the order of the prompts
ordered_prompts = ['256b', '0b56', 'a0e6', 'b689', 'e286', '6695']
# Grouping data by 'prompt' and 'model' and calculating correctness statistics
prompt_model_analysis = slim_results.groupby(['prompt', 'model'])['feedback.correctness'].agg(['sum', 'count', 'mean'])
prompt_model_analysis.columns = ['Correct Responses', 'Total Evaluated', 'Correctness Percentage']
prompt_model_analysis['Incorrect Responses'] = prompt_model_analysis['Total Evaluated'] - prompt_model_analysis['Correct Responses']
prompt_model_analysis['Incorrectness Percentage'] = 100 - (prompt_model_analysis['Correctness Percentage'] * 100)

prompt_model_analysis.reset_index(inplace=True)

# Sort the DataFrame by the "prompt" column based on the specified order
prompt_model_analysis['prompt'] = pd.Categorical(prompt_model_analysis['prompt'], categories=ordered_prompts, ordered=True)
prompt_model_analysis = prompt_model_analysis.sort_values(by='prompt')

prompt_model_analysis
| prompt | model | Correct Responses | Total Evaluated | Correctness Percentage | Incorrect Responses | Incorrectness Percentage |
|---|---|---|---|---|---|---|
| 256b | DeciLM | 8.0 | 11 | 0.727273 | 3.0 | 27.272727 |
| 256b | Mistral | 5.0 | 8 | 0.625000 | 3.0 | 37.500000 |
| 0b56 | DeciLM | 7.0 | 9 | 0.777778 | 2.0 | 22.222222 |
| 0b56 | Mistral | 1.0 | 11 | 0.090909 | 10.0 | 90.909091 |
| a0e6 | DeciLM | 2.0 | 18 | 0.111111 | 16.0 | 88.888889 |
| a0e6 | Mistral | 0.0 | 20 | 0.000000 | 20.0 | 100.000000 |
| b689 | DeciLM | 5.0 | 17 | 0.294118 | 12.0 | 70.588235 |
| b689 | Mistral | 0.0 | 20 | 0.000000 | 20.0 | 100.000000 |
| e286 | DeciLM | 5.0 | 14 | 0.357143 | 9.0 | 64.285714 |
| e286 | Mistral | 2.0 | 8 | 0.250000 | 6.0 | 75.000000 |
| 6695 | DeciLM | 5.0 | 18 | 0.277778 | 13.0 | 72.222222 |
| 6695 | Mistral | 3.0 | 8 | 0.375000 | 5.0 | 62.500000 |

Correctness Percentage helps in understanding the effectiveness or accuracy of a system, model, or method in producing correct results.

  • Model Variability: The DeciLM model generally outperformed the Mistral model in most categories. This indicates a higher reliability or accuracy in the DeciLM model’s responses compared to those of the Mistral model.
  • Prompt-Specific Performance: The performance of both models varied considerably across different prompts. For instance, some prompts saw a high correctness percentage in one model and a much lower percentage in the other.
  • Overall Performance Trends: The DeciLM model consistently showed a higher correctness percentage across the prompts, indicating its overall superior performance in this dataset’s context. The Mistral model, while effective in certain prompts, generally lagged behind in accuracy.

Plot Correctness by both Prompt and Model

# @title Plot Correctness by both Prompt and Model
import numpy as np

# Split the per-prompt analysis by model (the 'Correctness Percentage' column holds fractions of 1)
deciLM_data = prompt_model_analysis[prompt_model_analysis['model'] == 'DeciLM']
mistral_data = prompt_model_analysis[prompt_model_analysis['model'] == 'Mistral']

# Filter and reorder the data based on the specified prompt order
deciLM_data_ordered = deciLM_data.set_index('prompt').loc[ordered_prompts].reset_index()
mistral_data_ordered = mistral_data.set_index('prompt').loc[ordered_prompts].reset_index()

# Setting up the figure
plt.figure(figsize=(15, 8))

# Number of categories (number of ordered prompts)
n_categories = len(ordered_prompts)

# Setting the positions of the bars on the x-axis
barWidth = 0.35
r1 = np.arange(n_categories)
r2 = [x + barWidth for x in r1]

# Creating bars (multiply by 100 so the values match the percentage axis label)
plt.bar(r1, deciLM_data_ordered['Correctness Percentage'] * 100, color='blue', width=barWidth, edgecolor='grey', label='DeciLM')
plt.bar(r2, mistral_data_ordered['Correctness Percentage'] * 100, color='orange', width=barWidth, edgecolor='grey', label='Mistral')

# Adding labels
plt.xlabel('Prompt', fontweight='bold')
plt.ylabel('Correctness Percentage (%)', fontweight='bold')
plt.xticks([r + barWidth/2 for r in range(n_categories)], ordered_prompts)

plt.title('Model Performance by Prompt')
plt.legend()
plt.show()


Question level analysis

Key Takeaways:

  1. Variability in Model Performance: The performance of each model varies significantly across different questions. For example, for “(1+2) + 5,” both models have low correctness percentages, while for “131,778 + 22,312?”, the DeciLM model shows a higher correctness percentage.
  2. Question Difficulty and Model Capability: Certain questions, like “-(1 + 1)”, show a 0% correctness rate for both models, suggesting these might be challenging questions or out of the model’s scope.
  3. Overall Correctness Trends: By comparing the number of correct and incorrect responses, we can gauge the overall tendency of each model to answer correctly. For instance, the DeciLM model shows a 50% correctness rate for “131,778 + 22,312?”, indicating that it’s more reliable for this specific type of question.

Creating a heatmap with questions as rows, prompts as columns, and count of correct answers as values

# @title Creating a heatmap with questions as rows, prompts as columns, and count of correct answers as values

# Map correctness feedback to labels (1 -> 'Yes', 0 -> 'No', missing -> 'Error')
slim_results['Correct'] = slim_results['feedback.correctness'].map({1: 'Yes', 0: 'No'}).fillna('Error')

# Grouping by 'input.question' only and counting correct answers per prompt
heatmap_data_question_only = slim_results.pivot_table(
    index='input.question',
    columns='prompt',
    values='Correct',
    aggfunc=lambda x: (x == 'Yes').sum(),
    fill_value=0
)[ordered_prompts]

# Creating the heatmap for questions only
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data_question_only, cmap='viridis', annot=True)

plt.title('Heatmap of Correct Answers Count by Question')
plt.xlabel('Prompt')
plt.ylabel('Question')
plt.show()


Pivot table counting correct responses by question, model and prompt

# @title Pivot table counting correct responses by question, model and prompt

slim_results['Correct'] = slim_results['feedback.correctness'].map({1: 'Yes', 0: 'No'}).fillna('Error')

pivot_table = slim_results.pivot_table(
    index=['input.question', 'model'],
    columns='prompt',
    values='Correct',
    aggfunc='first',
    fill_value='Error'
)[ordered_prompts].reset_index()

# Check if 'Total Correct' is already in the DataFrame
if 'Total Correct' in pivot_table.columns:
    # Exclude the 'Total Correct' column itself from the calculation
    pivot_table['Total Correct'] = (pivot_table.iloc[:, 2:-1] == 'Yes').sum(axis=1)
else:
    # If 'Total Correct' column is not present, include all prompt columns
    pivot_table['Total Correct'] = (pivot_table.iloc[:, 2:] == 'Yes').sum(axis=1)

# Adding the total number of correct answers per prompt
total_correct_by_prompt = (pivot_table.iloc[:, 2:-1] == 'Yes').sum().rename('Total Correct by Prompt')
final_table = pd.concat([pivot_table, pd.DataFrame([total_correct_by_prompt])], ignore_index=True)

final_table
| input.question | model | 256b | 0b56 | a0e6 | b689 | e286 | 6695 | Total Correct |
|---|---|---|---|---|---|---|---|---|
| (1+2) + 5 | DeciLM | No | Yes | No | No | No | No | 1.0 |
| (1+2) + 5 | Mistral | Yes | Error | No | No | Error | No | 1.0 |
| -(1 + 1) | DeciLM | Error | Error | No | No | Error | Error | 0.0 |
| -(1 + 1) | Mistral | Error | Error | No | No | Error | No | 0.0 |
| 131,778 + 22,312? | DeciLM | Error | Error | No | Yes | No | Yes | 2.0 |
| 131,778 + 22,312? | Mistral | Error | Error | No | No | Error | No | 0.0 |
| Add 2 and 3 | DeciLM | Yes | Yes | No | No | Yes | No | 3.0 |
| Add 2 and 3 | Mistral | Yes | Error | No | No | No | Error | 1.0 |
| Calculate 5 divided by 5 | DeciLM | Yes | Yes | Yes | No | No | Yes | 4.0 |
| Calculate 5 divided by 5 | Mistral | Yes | No | No | No | Yes | Yes | 3.0 |
| Evaluate 1 + 2 + 3 + 4 + 5 using only the add … | DeciLM | No | Error | No | No | No | No | 0.0 |
| Evaluate 1 + 2 + 3 + 4 + 5 using only the add … | Mistral | Error | Error | No | No | Error | Error | 0.0 |
| Evaluate the sum of the numbers 1 through 10 u… | DeciLM | No | Error | No | No | No | No | 0.0 |
| Evaluate the sum of the numbers 1 through 10 u… | Mistral | Error | No | No | No | No | Error | 0.0 |
| I ate 1 apple and 2 oranges every day for 7 da… | DeciLM | Error | Error | Error | No | Error | No | 0.0 |
| I ate 1 apple and 2 oranges every day for 7 da… | Mistral | Error | No | No | No | Error | Error | 0.0 |
| Subtract 3 from 2 | DeciLM | Yes | Yes | No | Yes | Yes | Yes | 5.0 |
| Subtract 3 from 2 | Mistral | Error | No | No | No | Error | Error | 0.0 |
| What is -5 if evaluated using the negate funct… | DeciLM | Error | Error | No | No | Error | Yes | 1.0 |
| What is -5 if evaluated using the negate funct… | Mistral | Error | Error | No | No | Error | Error | 0.0 |
| after calculating the sin of 1.5 radians, divi… | DeciLM | Yes | Error | Error | No | No | No | 1.0 |
| after calculating the sin of 1.5 radians, divi… | Mistral | Error | No | No | No | Error | No | 0.0 |
| calculate 101 to the power of 0.5 | DeciLM | Yes | Yes | No | Yes | Yes | Yes | 5.0 |
| calculate 101 to the power of 0.5 | Mistral | No | No | No | No | No | Error | 0.0 |
| convert 15 degrees to radians | DeciLM | Error | Error | No | Error | No | No | 0.0 |
| convert 15 degrees to radians | Mistral | Error | No | No | No | No | No | 0.0 |
| ecoli divides every 20 minutes. How many cells… | DeciLM | Error | No | No | No | No | No | 0.0 |
| ecoli divides every 20 minutes. How many cells… | Mistral | No | Error | No | No | Error | Error | 0.0 |
| evaluate negate(-131,778) | DeciLM | Error | Error | No | No | Error | No | 0.0 |
| evaluate negate(-131,778) | Mistral | Error | Error | No | No | Error | Error | 0.0 |
| how much is 131,778 divided by 2? | DeciLM | Yes | No | Yes | Yes | No | No | 3.0 |
| how much is 131,778 divided by 2? | Mistral | Error | No | No | No | No | Error | 0.0 |
| multiply the result of (log of 100 to base 10)… | DeciLM | Error | Error | No | Error | Error | Error | 0.0 |
| multiply the result of (log of 100 to base 10)… | Mistral | Yes | No | No | No | Error | Error | 1.0 |
| what is cos(pi)? | DeciLM | Error | Error | No | Error | Error | No | 0.0 |
| what is cos(pi)? | Mistral | No | No | No | No | No | Error | 0.0 |
| what is the result of 2 to the power of 3? | DeciLM | Yes | Yes | No | Yes | Yes | No | 4.0 |
| what is the result of 2 to the power of 3? | Mistral | Error | Yes | No | No | Error | Yes | 2.0 |
| what is the value of pi? | DeciLM | Yes | Yes | No | No | Yes | No | 3.0 |
| what is the value of pi? | Mistral | Yes | Error | No | No | Yes | Yes | 3.0 |
| Total Correct by Prompt | | 13 | 8 | 2 | 5 | 7 | 8 | |


The Real Deal with LLM Benchmarking

Alright, wrapping this up, let’s be clear about something: LangChain’s benchmarking framework is awesome. I seriously love it. But, and it’s a big but, we’ve got to talk about how little changes in prompts can turn the tables for LLMs in these benchmarks. Here’s the thing: I showed you how DeciLM seems to be outdoing Mistral in my tests.

But hold up – don’t just take my word for it.

I could be tweaking things to make it look that way, right? That’s the point here. We can all make these benchmarks dance to our tune if we’re clever with the prompts.

So, where does that leave us?

Can we really trust any benchmarking results out there? I mean, if a few words here and there can flip the script, what’s the point in comparing these models at all? This whole thing’s given me a bit of an existential crisis about LLM evals.

Also, keep in mind, the models I used are relatively small, like 7B parameters, and not specifically fine-tuned for function calling. Maybe the big guys, the larger models, show some different, more agentic behavior. DeciLM is looking good in my tests, but I’m curious how smaller, fine-tuned agents would do.

I didn’t set out to say one model’s better than the other.

How could I, when a couple of tweaks in the prompt can swing results all over the place? But I want to see what you find. Play around with your own prompts and share your results. Drop into the Deep Learning Daily Community Discord and let us all in on what you did.

Long story short, LLM evaluations are tricky. They’re like walking through a minefield, but it’s a field we’ve got to cross.

So let’s do it together, share our maps, and figure this out as a community.


Discover Deci’s High-Performance LLMs and GenAI Development Platform

In addition to DeciLM-7B, Deci offers a suite of fine-tunable, high-performance LLMs, available through our GenAI Development Platform. Designed to balance quality, speed, and cost-effectiveness, our models are complemented by flexible deployment options. Customers can access them through our platform’s API or opt for deployment on their own infrastructure, whether through a Virtual Private Cloud (VPC) or directly within their data centers.

If you’re interested in exploring our LLMs firsthand, we encourage you to sign up for a free trial of our API.

For those curious about our VPC and on-premises deployment options, we encourage you to book a 1:1 session with our experts.
