How to Evaluate Your LLM Applications?

UpTrain
7 min read · Oct 2, 2023


Why is Evaluating Large Language Models Tricky? 🤯

Large Language Models (LLMs) have emerged as a groundbreaking force and have revolutionized artificial intelligence with their sheer power and sophistication. While being game-changers, they introduce complexities that differentiate them from conventional machine learning and deep learning models, the most notable one being the absence of a Ground Truth against which to evaluate them.

In traditional ML/DL, ‘Ground Truths’ are an essential part of model building as they are used for measuring the quality of the model’s predictions or classifications. These evaluations are essential for deciding which of the numerous model experiments should be deployed to production, and they also help teams sample and annotate production data to identify cohorts of low model performance to improve upon.

LLMs, on the other hand, operate in realms where defining a clear Ground Truth is challenging. They are often tasked with producing human-like text where there’s no single “right” answer to do a word-to-word comparison against. There is no “correct” marketing copy or “correct” sales email to benchmark your application against.

Now what? 🤔

Well, one way is to involve humans and ask them to rate the LLM responses. This is essentially what most of us are doing when we manually look at the responses of two models or two prompts and determine which one looks better. Although quick to start with, this approach becomes time-consuming as well as highly subjective very quickly. The fact that RLHF already involves training an (LLM-based) reward model to score model responses points to a new direction worth exploring: using LLMs themselves as evaluators. Imagine ChatGPT first writing an email and then verifying that the email is good to go — sounds pretty self-consumed, right?

Well, both yes and no. If we just ask the LLM to grade whether the response is correct or not, it won’t work well (for obvious reasons). On the other hand, if we break down the evaluation into smaller, simpler, and specific tasks, say asking the LLM to check if the response is grounded in the context or if it follows a certain guideline, the approach does wonders and gives us reliable scores across multiple criteria, which a human developer (hopefully :P) can interpret and use to assess the correctness of their LLM application.
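
To make this concrete, here is a minimal sketch of such a narrow, criterion-specific check: an LLM judge that answers only one question, namely whether a response is grounded in the provided context. It assumes the OpenAI Python SDK; the model name, prompt wording, and the `check_groundedness` helper are illustrative choices, not a prescribed implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GROUNDEDNESS_PROMPT = """You are an evaluator. Given a context and a response,
check whether every factual claim in the response is supported by the context.
Answer with a single word, "grounded" or "ungrounded", followed by a one-line reason.

Context:
{context}

Response:
{response}
"""

def check_groundedness(context: str, response: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge a single, narrow question: is the response grounded in the context?"""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": GROUNDEDNESS_PROMPT.format(context=context, response=response)}],
    )
    verdict = completion.choices[0].message.content.strip()
    # "grounded"/"ungrounded" prefix -> boolean score, full text kept as explanation
    return {"grounded": verdict.lower().startswith("grounded"), "explanation": verdict}
```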

Dimensions of LLM Evaluations

Evaluating LLM-generated responses varies by application. For marketing messages, we consider creativity, brand tone, appropriateness, and conciseness. Customer support chatbots must be assessed for hallucinations, politeness, and response completeness. Code generation apps focus on syntax accuracy, complexity, and code quality. And beyond these, many use cases call for their own custom metrics.

We are seeing quite a few efforts to systematize these dimensions. FLASK [1] divides them into 4 categories, concerned with evaluating logical thinking, background knowledge, problem handling, and user alignment respectively. LLM-Eval [2] introduces a unified framework to evaluate multiple dimensions like correctness, engagement, fluency, etc. by leveraging the LLM’s inherent reasoning capabilities to rate responses. A similar set of evaluation dimensions can also be found in [3] and [4].

At UpTrain, we have been working with numerous developers building LLM-powered products, helping them evaluate their application quality, and have come up with a novel categorization that groups these dimensions by applicability and simplifies the selection process. We propose 4 key categories:

Checks for evaluating Task Understanding and Context Awareness

Example for Response Completeness, Hallucinations, and Context Retrieval Quality

This category deals with metrics that evaluate if your LLM + prompt configuration can comprehend the task at hand as well as fully utilize the provided context to provide an appropriate response. This category is further divided into 2 sub-categories:

  1. Response Appropriateness: This includes dimensions such as intent understanding (checking if the response correctly follows the intent of the user query), response completeness (checking if the response answers all aspects of the given user query), response relevancy (checking that the response doesn’t contain additional irrelevant information), structural integrity (ensuring that the response follows the required schema or JSON structure; see the sketch after this list), and semantic similarity between the query and the response. Performance on these dimensions varies widely depending on the LLM prompt, the chain configuration, as well as the reasoning capabilities of the underlying foundational model, making them great starting points to assess the performance of your LLM applications.
  2. Context Awareness and Grounding: One of the most critical requirements for production LLM applications is the ability to avoid hallucinations. As your LLM application produces responses in the wild, you don’t want it to make up facts; its responses should essentially be grounded in the provided context. Key dimensions for evaluation include factual accuracy (context alignment), retrieved-context quality (adequate information), and context utilization score (effectiveness in using context). These dimensions are crucial to evaluate for a RAG (Retrieval-Augmented-Generation) application.
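
As a concrete illustration of the structural-integrity dimension from point 1 above, here is a minimal, fully deterministic check: it verifies that a response parses as JSON and contains the expected keys with the expected types. The `required_keys` schema and helper name are hypothetical; a real application would likely use a full JSON Schema validator instead.

```python
import json

def check_structural_integrity(response: str, required_keys: dict) -> dict:
    """Deterministic check: does the response parse as JSON and match the expected shape?

    `required_keys` maps each expected top-level key to its expected Python type,
    e.g. {"title": str, "tags": list, "price": (int, float)}.
    """
    try:
        payload = json.loads(response)
    except json.JSONDecodeError as err:
        return {"valid": False, "error": f"not valid JSON: {err}"}

    if not isinstance(payload, dict):
        return {"valid": False, "error": "top-level JSON value is not an object"}

    problems = []
    for key, expected_type in required_keys.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for '{key}': got {type(payload[key]).__name__}")

    return {"valid": not problems, "error": "; ".join(problems) or None}


# Example usage with a hypothetical schema for a product-description generator:
print(check_structural_integrity('{"title": "Mug", "tags": ["kitchen"]}',
                                 {"title": str, "tags": list}))
# -> {'valid': True, 'error': None}
```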

Checks for evaluating Language Quality

Example for Toxicity, Tonality and Interestingness

Apart from the correctness of the (response) content, an LLM must present this information in a coherent manner and in the desired style. This category includes dimensions that help evaluate the quality of the response from a language perspective. We further divide this into two subcategories:

  1. Task Independent: Dimensions like grammar correctness, fluency, coherence, toxicity, fairness towards all sectors of society, etc. are universal, and how to evaluate them doesn’t depend upon the specific task. As LLMs are trained on billions of tokens of natural-sounding text, we typically see high scores for fluency, coherence, and grammar correctness, whereas appropriate finetuning of the LLM (via techniques like RLHF) helps align it on toxicity, fairness, and bias-related dimensions.
  2. Task Dependent: In many cases, the quality of the response depends on the task that the LLM application aims to perform. An informal response makes sense for a friendly and helpful sales assistant but not for a transactional log analyzer. Evaluations along dimensions like tonality (whether the response matches the desired brand tone), creativity, interestingness of the response, etc. require additional information about the given task or persona, and can be quite critical to ensure your application generates responses that resonate well with your end users.
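
A minimal sketch of how such task-dependent dimensions can be scored: a single rubric prompt is given the desired persona and asks an LLM judge to return 1-5 scores for tonality and interestingness as JSON. The OpenAI Python SDK, the model name, and the rubric wording are assumptions for illustration and would need tuning for a real brand voice.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC_PROMPT = """You are evaluating a response produced by an assistant whose persona is:
{persona}

Rate the response below on a 1-5 scale for each dimension and reply with JSON only,
e.g. {{"tonality": 4, "interestingness": 3}}.

- tonality: how well the response matches the persona's desired tone
- interestingness: how engaging the response is for the end user

Response:
{response}
"""

def score_language_quality(response: str, persona: str, model: str = "gpt-4o-mini") -> dict:
    """Score task-dependent language dimensions against a persona description."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[{"role": "user",
                   "content": RUBRIC_PROMPT.format(persona=persona, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)
```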

Evaluating Reasoning Capabilities

Table from Llama 2 paper: https://arxiv.org/abs/2307.09288

LLM applications often involve complex reasoning tasks: understanding user intent, extracting context-based knowledge, analyzing response options, and crafting concise, correct replies. Models with stronger reasoning capabilities outperform others here. Relevant dimensions include logical correctness (reaching the right conclusions), logical robustness (staying consistent under minor input changes), logical efficiency (taking the shortest solution path), and common sense understanding (grasping everyday concepts). Typically, starting with a great model and utilizing prompt techniques like Chain-of-Thought is the best one can do here.
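
One of these dimensions, logical robustness, lends itself to a simple automated proxy: generate answers for a query and for a semantically equivalent paraphrase of it, then measure how similar the two answers are. The sketch below assumes the OpenAI embeddings API for the similarity measure and an application-level `generate` callable; the cosine-similarity proxy and the helper names are illustrative rather than a standard metric.

```python
import math
from typing import Callable
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def _embed(text: str) -> list[float]:
    """Embed a piece of text with an (assumed) OpenAI embedding model."""
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def logical_robustness(query: str, paraphrased_query: str,
                       generate: Callable[[str], str]) -> float:
    """Rough robustness proxy: how similar are the answers to two equivalent queries?

    `generate` is your application's own function mapping a query to a response string.
    A score close to 1.0 suggests the application answers consistently.
    """
    answer_a = generate(query)
    answer_b = generate(paraphrased_query)
    return _cosine(_embed(answer_a), _embed(answer_b))
```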

Custom Evaluations

Many applications require customized metrics tailored to their specific needs. For instance, a customer support chatbot might need a custom metric to check if it is following a developer-defined guideline of not mentioning pricing-related information in the response. An LLM-based link-sharing app might require verifying trusted domains. An article summarization bot may want to assess the presence of specific article elements. In LLM-powered applications, as with any machine learning system, there’s no universal solution. It’s vital to define custom evaluation criteria that closely align with the business goals to accurately measure performance.
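
Such custom checks are often just small, deterministic Python functions. The sketch below covers two of the examples above: a guideline check that flags pricing-related language and a link check against a trusted-domain allowlist. The regex, keyword list, and domain allowlist are hypothetical placeholders to adapt to your own application.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns/allowlist; tailor these to your business rules.
PRICE_PATTERN = re.compile(r"(\$\s?\d|USD|per month|pricing|price)", re.IGNORECASE)
TRUSTED_DOMAINS = {"docs.uptrain.ai", "github.com"}

def follows_no_pricing_guideline(response: str) -> bool:
    """Guideline check: the support bot should never discuss pricing."""
    return PRICE_PATTERN.search(response) is None

def links_are_trusted(response: str) -> bool:
    """Link check: every URL in the response must point to an allow-listed domain."""
    urls = re.findall(r"https?://\S+", response)
    return all(urlparse(url).netloc.lower() in TRUSTED_DOMAINS for url in urls)
```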

UpTrain — Open-Source LLM Evaluations

In conclusion, evaluating LLM applications is a complex yet necessary task that involves both predefined dimensions and custom metrics to check various aspects. While traditional NLP metrics like ROUGE, BLEU, Perplexity, etc. are also useful indicators, the generative capabilities of LLMs limit their applicability. At UpTrain, we are building an open-source tool to run LLM evaluations. UpTrain supports a variety of dimensions such as language quality, retrieved-context quality, hallucinations, tonality, response completeness, etc., and also allows one to define custom evaluations via simple Python functions. Check out our repo here or book a call with me to understand which evaluations make the most sense for your application.

References

  1. FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [PDF]
  2. LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models [PDF]
  3. Automatic Evaluation and Moderation of Open-domain Dialogue Systems [PDF]
  4. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization [PDF]
