
Metrics

Let's look at the various metrics you can calculate using Valor.

If we're missing an important metric for your particular use case, please write us a GitHub Issue ticket. We love hearing your suggestions.

Classification Metrics

| Name | Description | Equation |
| :- | :- | :- |
| Precision | The number of true positives divided by the total number of positive predictions (i.e., the number of true positives plus the number of false positives). | \(\dfrac{\|TP\|}{\|TP\|+\|FP\|}\) |
| Recall | The number of true positives divided by the total count of the class of interest (i.e., the number of true positives plus the number of false negatives). | \(\dfrac{\|TP\|}{\|TP\|+\|FN\|}\) |
| F1 | The harmonic mean of precision and recall. | \(\dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall}\) |
| Accuracy | The number of correct predictions divided by the total number of predictions. | \(\dfrac{\|TP\|+\|TN\|}{\|TP\|+\|TN\|+\|FP\|+\|FN\|}\) |
| ROC AUC | The area under the Receiver Operating Characteristic (ROC) curve for the predictions generated by a given model. | See ROC AUC methods. |
| Precision-Recall Curves | Outputs a nested dictionary containing the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. | See precision-recall curve methods. |
| Detailed Precision-Recall Curves | Similar to PrecisionRecallCurve, except this metric a) classifies false positives as hallucinations or misclassifications, b) classifies false negatives as misclassifications or missed_detections, and c) gives example datums for each observation, up to a maximum of pr_curve_max_examples. | See detailed precision-recall curve methods. |
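For reference, the count-based metrics above can be computed directly from confusion-matrix counts. The snippet below is a minimal illustration with hypothetical counts, not Valor's internal implementation.

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"Precision": precision, "Recall": recall, "F1": f1, "Accuracy": accuracy}


# Hypothetical counts for a single label.
print(classification_metrics(tp=3, fp=1, tn=5, fn=1))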

Object Detection and Instance Segmentation Metrics**

| Name | Description | Equation |
| :- | :- | :- |
| Average Precision (AP) | The weighted mean of precisions achieved at several different recall thresholds for a single Intersection over Union (IOU), grouped by class. | See AP methods. |
| AP Averaged Over IOUs | The average of several AP metrics across IOU thresholds, grouped by class labels. | \(\dfrac{1}{\text{number of thresholds}} \sum\limits_{iou \in thresholds} AP_{iou}\) |
| Mean Average Precision (mAP) | The average of several AP metrics, grouped by label keys and IOU thresholds. | \(\dfrac{1}{\text{number of labels}} \sum\limits_{label \in labels} AP_{label}\) |
| mAP Averaged Over IOUs | The average of several mAP metrics, grouped by label keys. | \(\dfrac{1}{\text{number of thresholds}} \sum\limits_{iou \in thresholds} mAP_{iou}\) |
| Average Recall (AR) | The average of several recall metrics across IOU thresholds, grouped by class labels. | See AR methods. |
| Mean Average Recall (mAR) | The average of several AR metrics, grouped by label keys. | \(\dfrac{1}{\text{number of labels}} \sum\limits_{label \in labels} AR_{label}\) |
| Precision-Recall Curves | Outputs a nested dictionary containing the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. These curves are calculated using a default IOU threshold of 0.5; you can set your own threshold by passing a float between 0 and 1 to the pr_curve_iou_threshold parameter at evaluation time. | See precision-recall curve methods. |
| Detailed Precision-Recall Curves | Similar to PrecisionRecallCurve, except this metric a) classifies false positives as hallucinations or misclassifications, b) classifies false negatives as misclassifications or missed_detections, and c) gives example datums and bounding boxes for each observation, up to a maximum of pr_curve_max_examples. | See detailed precision-recall curve methods. |

**When calculating IOUs for object detection metrics, Valor handles the necessary conversion between different types of geometric annotations. For example, if your model prediction is a polygon and your ground truth is a raster, then the raster will be converted to a polygon prior to calculating the IOU.

Semantic Segmentation Metrics

| Name | Description | Equation |
| :- | :- | :- |
| Intersection Over Union (IOU) | A ratio between the ground truth and predicted regions of an image, measured as a percentage, grouped by class. | \(\dfrac{area( prediction \cap groundtruth )}{area( prediction \cup groundtruth )}\) |
| Mean IOU | The average of IOU across labels, grouped by label key. | \(\dfrac{1}{\text{number of labels}} \sum\limits_{label \in labels} IOU_{label}\) |

Text Generation Metrics

| Name | Description | Equation |
| :- | :- | :- |
| Answer Correctness | An F1 score computed by comparing statements from a predicted answer to statements from a ground truth. | See appendix for details. |
| Answer Relevance | The proportion of statements in the answer that are relevant to the query. | \(\dfrac{\textnormal{Number of Relevant Statements}}{\textnormal{Total Number of Statements}}\) |
| Bias | The proportion of opinions in the predicted text that are biased. | \(\dfrac{\textnormal{Number of Biased Opinions}}{\textnormal{Total Number of Opinions}}\) |
| BLEU | A score between 0 and 1 indicating how much the predicted string matches the ground truth string (i.e., measuring precision), with a penalty for brevity. | See appendix for details. |
| Context Precision | An LLM-guided metric to evaluate a RAG retrieval mechanism. | See appendix for details. |
| Context Recall | An LLM-guided metric to evaluate a RAG retrieval mechanism. | See appendix for details. |
| Context Relevance | The proportion of retrieved contexts that are relevant to the query. | \(\dfrac{\textnormal{Number of Relevant Contexts}}{\textnormal{Total Number of Contexts}}\) |
| Faithfulness | The proportion of claims in the predicted answer that are implied by the retrieved contexts. | \(\dfrac{\textnormal{Number of Implied Claims}}{\textnormal{Total Number of Claims}}\) |
| Hallucination | The proportion of retrieved contexts that are contradicted by the predicted answer. | \(\dfrac{\textnormal{Number of Contradicted Contexts}}{\textnormal{Total Number of Contexts}}\) |
| ROUGE | A score between 0 and 1 indicating how often the words in the ground truth text appeared in the predicted text (i.e., measuring recall). | See appendix for details. |
| Summary Coherence | Rates the coherence of a textual summary relative to some source text using a score from 1 to 5, where 5 means "This summary is extremely coherent based on the information provided in the source text". | See appendix for details. |
| Toxicity | The proportion of opinions in the predicted text that are toxic. | \(\dfrac{\textnormal{Number of Toxic Opinions}}{\textnormal{Total Number of Opinions}}\) |

Appendix: Metric Calculations

Binary ROC AUC

Receiver Operating Characteristic (ROC)

An ROC curve plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different confidence thresholds.

In Valor, we use the confidence scores sorted in decreasing order as our thresholds. Using these thresholds, we can calculate our TPR and FPR as follows:

Determining the Rate of Correct Predictions

| Element | Description |
| :- | :- |
| True Positive (TP) | Prediction confidence score >= threshold and is correct. |
| False Positive (FP) | Prediction confidence score >= threshold and is incorrect. |
| True Negative (TN) | Prediction confidence score < threshold and is correct. |
| False Negative (FN) | Prediction confidence score < threshold and is incorrect. |
  • \(\text{True Positive Rate (TPR)} = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{|TP(threshold)|}{|TP(threshold)| + |FN(threshold)|}\)

  • \(\text{False Positive Rate (FPR)} = \dfrac{|FP|}{|FP| + |TN|} = \dfrac{|FP(threshold)|}{|FP(threshold)| + |TN(threshold)|}\)

We now use the confidence scores, sorted in decreasing order, as our thresholds in order to generate points on a curve.

\(Point(score) = (FPR(score), \ TPR(score))\)

Area Under the ROC Curve (ROC AUC)

After calculating the ROC curve, we find the ROC AUC metric by approximating the integral using the trapezoidal rule formula.

\(\text{ROC AUC} = \sum_{i=1}^{|scores|} \frac{ \big( FPR(score_i) - FPR(score_{i-1}) \big) \cdot \big( TPR(score_i) + TPR(score_{i-1}) \big) }{2}\), where we take \((FPR(score_0), \ TPR(score_0)) = (0, 0)\).
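For reference, here is a minimal, generic sketch of binary ROC AUC computed with the trapezoidal rule from binarized ground truth labels and confidence scores. It illustrates the integration step above and is not Valor's internal implementation; tied scores are not grouped.

def binary_roc_auc(y_true: list[int], scores: list[float]) -> float:
    """Binary ROC AUC via the trapezoidal rule, sweeping thresholds over the scores."""
    n_pos = sum(y_true)          # datums that truly belong to the class of interest
    n_neg = len(y_true) - n_pos  # datums that do not
    if n_pos == 0 or n_neg == 0:
        return 0.0
    auc = 0.0
    tp = fp = 0
    tpr_prev = fpr_prev = 0.0
    # Visit predictions from the highest confidence score downward.
    # (Tied scores are not grouped here, which a production implementation would do.)
    for score, label in sorted(zip(scores, y_true), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - fpr_prev) * (tpr + tpr_prev) / 2  # trapezoid between consecutive ROC points
        tpr_prev, fpr_prev = tpr, fpr
    return auc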

See Classification: ROC Curve and AUC for more information.

Average Precision (AP)

For object detection and instance segmentation tasks, average precision is calculated from the intersection-over-union (IOU) of geometric predictions and ground truths.

Multiclass Precision and Recall

Tasks that predict geometries (such as object detection or instance segmentation) use the ratio intersection-over-union (IOU) to calculate precision and recall. IOU is the ratio of the intersecting area over the joint area spanned by the two geometries, and is defined in the following equation.

\(Intersection \ over \ Union \ (IOU) = \dfrac{Area( prediction \cap groundtruth )}{Area( prediction \cup groundtruth )}\)

Using different IOU thresholds, we can determine whether to count a pairing between a prediction and a ground truth based on their overlap.

| Case | Description |
| :- | :- |
| True Positive (TP) | Prediction-GroundTruth pair exists with IOU >= threshold. |
| False Positive (FP) | Prediction-GroundTruth pair exists with IOU < threshold. |
| True Negative (TN) | Unused in multi-class evaluation. |
| False Negative (FN) | No Prediction with a matching label exists for the GroundTruth. |
  • \(Precision = \dfrac{|TP|}{|TP| + |FP|} = \dfrac{\text{Number of True Predictions}}{|\text{Predictions}|}\)

  • \(Recall = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{\text{Number of True Predictions}}{|\text{Groundtruths}|}\)
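For axis-aligned bounding boxes, the IOU ratio above can be computed directly. The helper below is a minimal sketch; representing boxes as (xmin, ymin, xmax, ymax) tuples is an assumption for illustration, and the calculate_iou name simply mirrors the helper referenced in the pairing pseudocode that follows.

def calculate_iou(box_a: tuple[float, float, float, float],
                  box_b: tuple[float, float, float, float]) -> float:
    """IOU for two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0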

Matching Ground Truths with Predictions

To properly evaluate a detection, we must first find the best pairings of predictions to ground truths. We start by iterating over our predictions, ordering them by highest scores first. We pair each prediction with the ground truth that has the highest calculated IOU. Both the prediction and ground truth are now considered paired and removed from the pool of choices.

def rank_ious(
    groundtruths: list,
    predictions: list,
) -> list[float]:
    """Ranks IOUs by unique prediction-groundtruth pairings."""

    retval = []
    groundtruths = set(groundtruths)
    # Iterate over predictions from highest to lowest confidence score.
    for prediction in sorted(predictions, key=lambda x: -x.score):
        if not groundtruths:
            break  # every ground truth has already been paired
        # Pair the prediction with the unpaired ground truth that maximizes IOU.
        groundtruth = max(groundtruths, key=lambda x: calculate_iou(x, prediction))
        groundtruths.remove(groundtruth)
        retval.append(calculate_iou(groundtruth, prediction))
    return retval

Precision-Recall Curve

We can now compute the precision-recall curve using our previously ranked IOU's. We do this by iterating through the ranked IOU's and creating points cumulatively using recall and precision.

def create_precision_recall_curve(
    number_of_groundtruths: int,
    ranked_ious: list[float],
    threshold: float
) -> list[tuple[float, float]]:
    """Creates the precision-recall curve from a list of IOU's and a threshold."""

    retval = []
    count_tp = 0
    for i in range(len(ranked_ious)):
        if ranked_ious[i] >= threshold:
            count_tp += 1  # this pairing counts as a true positive at the threshold
        precision = count_tp / (i + 1)
        recall = count_tp / number_of_groundtruths
        retval.append((recall, precision))
    return retval

Calculating Average Precision

Average precision is defined as the area under the precision-recall curve.

We will use a 101-point interpolation of the curve to be consistent with the COCO evaluator. The intent behind interpolation is to reduce the fuzziness that results from ranking pairs.

\(AP = \frac{1}{101} \sum\limits_{r\in\{ 0, 0.01, \ldots , 1 \}}\rho_{interp}(r)\)

\(\rho_{interp}(r) = \max\limits_{\tilde{r}: \tilde{r} \ge r} \rho (\tilde{r})\)
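The sketch below illustrates this 101-point interpolation using the (recall, precision) points produced by create_precision_recall_curve above. It is an illustration of the formula rather than an exact reproduction of the COCO evaluator.

def average_precision(curve: list[tuple[float, float]]) -> float:
    """101-point interpolated AP from a list of (recall, precision) points."""
    total = 0.0
    for i in range(101):
        r = i / 100
        # Interpolated precision: the maximum precision at any recall >= r.
        precisions = [p for recall, p in curve if recall >= r]
        total += max(precisions) if precisions else 0.0
    return total / 101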


Average Recall (AR)

To calculate Average Recall (AR), we:

  1. Find the count of true positives above specified IOU and confidence thresholds for all images containing a ground truth of a particular class.
  2. Divide that count of true positives by the total number of ground truths to get the recall value per class and IOU threshold. Append that recall value to a list.
  3. Repeat steps 1 & 2 for multiple IOU thresholds (e.g., [0.5, 0.75]).
  4. Take the average of our list of recalls to arrive at the AR value per class.
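Putting steps 1-4 together, here is a minimal per-class sketch. It assumes the matching procedure described earlier has already produced (IOU, prediction score) pairs for the class; the names and default thresholds are illustrative, not Valor's internal implementation.

def average_recall(
    ranked_pairs: list[tuple[float, float]],  # (IOU, prediction score) for each matched pair
    number_of_groundtruths: int,
    iou_thresholds: tuple[float, ...] = (0.5, 0.75),
    recall_score_threshold: float = 0.0,
) -> float:
    """Average recall for a single class across IOU thresholds (steps 1-4 above)."""
    recalls = []
    for iou_threshold in iou_thresholds:
        count_tp = sum(
            1
            for iou, score in ranked_pairs
            if iou >= iou_threshold and score >= recall_score_threshold
        )
        recalls.append(count_tp / number_of_groundtruths)
    return sum(recalls) / len(recalls)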

Note that this metric differs from COCO's calculation in two ways:

  • COCO averages across classes while calculating AR, while we calculate AR separately for each class. Our AR calculation matches the original FAIR definition of AR, while our mAR calculation matches what COCO calls AR.
  • COCO calculates three different AR metrics (AR@1, AR@5, AR@100) by considering only the top 1/5/100 most confident predictions during the matching process. Valor, on the other hand, allows users to input a recall_score_threshold value that will prevent low-confidence predictions from being counted as true positives when calculating AR.

Precision-Recall Curves

Precision-recall curves offer insight into which confidence threshold you should pick for your production pipeline. The PrecisionRecallCurve metric includes the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. When using the Valor Python client, the output will be formatted as follows:

pr_evaluation = evaluate_detection(
    data=dataset,
)
print(pr_evaluation)

[...,
{
    "type": "PrecisionRecallCurve",
    "parameters": {
        "label_key": "class", # The key of the label.
        "pr_curve_iou_threshold": 0.5, # Note that this value will be None for classification tasks. For detection tasks, we use 0.5 as the default threshold, but allow users to pass an optional `pr_curve_iou_threshold` parameter in their evaluation call.
    },
    "value": {
        "cat": { # The value of the label.
            "0.05": { # The confidence score threshold, ranging from 0.05 to 0.95 in increments of 0.05.
                "fn": 0,
                "fp": 1,
                "tp": 3,
                "recall": 1,
                "precision": 0.75,
                "f1_score": .857,
            },
            ...
        },
    }
}]

It's important to note that these curves are computed slightly differently from our other aggregate metrics above:

Classification Tasks

Valor calculates its aggregate precision, recall, and F1 metrics by matching the highest confidence prediction with each ground truth. One issue with this approach is that we may throw away useful information in cases where prediction labels all have similarly strong confidence scores. For example: if our top two predictions for a given ground truth are {"label": "cat", "score": 0.91} and {"label": "dog", "score": 0.90}, then our aggregated precision and recall metrics would penalize the dog label even though its confidence score was nearly equal to the cat label's.

We think the approach above makes sense when calculating aggregate precision and recall metrics, but, when calculating the PrecisionRecallCurve value for each label, we consider all ground truth-prediction matches in order to treat each label as its own, separate binary classification problem.

Detection Tasks

The PrecisionRecallCurve values differ from the precision-recall curves used to calculate Average Precision in two subtle ways:

  • The PrecisionRecallCurve values visualize how precision and recall change as confidence thresholds vary from 0.05 to 0.95 in increments of 0.05. In contrast, the precision-recall curves used to calculate Average Precision are non-uniform; they vary over the actual confidence scores for each ground truth-prediction match.
  • If your pipeline predicts a label on an image, but that label doesn't exist on any ground truths in that particular image, then the PrecisionRecallCurve values will consider that prediction to be a false positive, whereas the other detection metrics will ignore that particular prediction.

DetailedPrecisionRecallCurve

Valor also includes a more detailed version of PrecisionRecallCurve which can be useful for debugging your model's false positives and false negatives. When calculating DetailedPrecisionRecallCurve, Valor will classify false positives as either hallucinations or misclassifications and false negatives as either missed_detections or misclassifications using the following logic:

Classification Tasks

  • A false positive occurs when there is a qualified prediction (with score >= score_threshold) with the same Label.key as the ground truth on the datum, but the Label.value is incorrect.
    • Example: if there's a photo with one ground truth label on it (e.g., Label(key='animal', value='dog')), and we predicted another label value (e.g., Label(key='animal', value='cat')) on that datum, we'd say it's a misclassification since the key was correct but the value was not.
  • Similarly, a false negative occurs when there is a prediction with the same Label.key as the ground truth on the datum, but the Label.value is incorrect.
    • Stratifications of False Negatives:
      • misclassification: Occurs when a different label value passes the score threshold.
      • no_predictions: Occurs when no label passes the score threshold.

Object Detection Tasks

  • A false positive is a misclassification if the following conditions are met:
    1. There is a qualified prediction with the same Label.key as the ground truth on the datum, but the Label.value is incorrect
    2. The qualified prediction and ground truth have an IOU >= pr_curve_iou_threshold.
  • A false positive that does not meet the misclassification criteria is considered to be a part of the hallucinations set.
  • A false negative is determined to be a misclassification if the following criteria are met:
    1. There is a qualified prediction with the same Label.key as the ground truth on the datum, but the Label.value is incorrect.
    2. The qualified prediction and ground truth have an IOU >= pr_curve_iou_threshold.
  • A false negative that does not meet these criteria is considered to have no_predictions.
  • Example: if there's a photo with one ground truth label on it (e.g., Label(key='animal', value='dog')), and we predicted another bounding box directly over that same object (e.g., Label(key='animal', value='cat')), we'd say it's a misclassification.

The DetailedPrecisionRecallCurve output also includes up to n examples of each type of error, where n is set using pr_curve_max_examples. An example output is as follows:

# To retrieve more detailed examples for each `fn`, `fp`, and `tp`, look at the `DetailedPrecisionRecallCurve` metric
detailed_evaluation = evaluate_detection(
    data=dataset,
    pr_curve_max_examples=1,  # The maximum number of examples to return for each observation type (e.g., hallucinations, misclassifications, etc.)
    metrics_to_return=[..., 'DetailedPrecisionRecallCurve'],  # DetailedPrecisionRecallCurve isn't returned by default; the user must ask for it explicitly
)
print(detailed_evaluation)

[...,
{
    "type": "DetailedPrecisionRecallCurve",
    "parameters": {
        "label_key": "class", # The key of the label.
        "pr_curve_iou_threshold": 0.5,
    },
    "value": {
        "cat": { # The value of the label.
            "0.05": { # The confidence score threshold, ranging from 0.05 to 0.95 in increments of 0.05.
                "fp": {
                    "total": 1,
                    "observations": {
                        'hallucinations': {
                            "count": 1,
                            "examples": [
                                (
                                    'test_dataset',
                                     1,
                                    '{"type":"Polygon","coordinates":[[[464.08,105.09],[495.74,105.09],[495.74,146.99],[464.08,146.99],[464.08,105.91]]]}'
                               ) # There's one false positive for this (key, value, confidence threshold) combination as indicated by the one tuple shown here. This tuple contains that observation's dataset name, datum ID, and coordinates in the form of a GeoJSON string. For classification tasks, this tuple will only contain the given observation's dataset name and datum ID.
                            ],
                        }
                    },
                },
                "tp": {
                    "total": 3,
                    "observations": {
                        'all': {
                            "count": 3,
                            "examples": [
                                (
                                    'test_dataset',
                                     2,
                                    '{"type":"Polygon","coordinates":[[[464.08,105.09],[495.74,105.09],[495.74,146.99],[464.08,146.99],[464.08,105.91]]]}'
                               ) # We only return one example since `pr_curve_max_examples` is set to 1 by default; update this argument at evaluation time to store and retrieve an arbitrary number of examples.
                            ],
                        },
                    }
                },
                "fn": {...},
            },
        },
    }
}]

Text Generation Metrics

General Text Generation Metrics

The general text generation metrics apply to a broad set of text generation tasks. These metrics don't compare to ground truths and don't require context. The metrics are evaluated purely based on the predicted text.

Some of the general text generation metrics are not necessarily useful for every task, but they can still be computed. For example, the bias and toxicity metrics evaluate opinions in the predicted text for bias and toxicity. If a task should produce few or no opinions, then these metrics might not be useful. However, bias and toxicity can still be evaluated on the predicted text, and if there are no opinions, then the bias and toxicity scores should be 0, indicating that there were no biased or toxic opinions.

Bias (LLM-Guided)

Uses:

  • Prediction: Annotation.text

Bias is the proportion of biased opinions in a piece of text. If there are no opinions in the text, then the bias score is 0.

First, an LLM is prompted to extract opinions from the text. Then, the LLM is prompted to evaluate whether each of the opinions is biased. The bias score is the proportion of biased opinions in the text.

\[Bias = \frac{\textnormal{Number of Biased Opinions}}{\textnormal{Total Number of Opinions}}\]

Our implementation closely follows DeepEval's implementation. We use the same two-step prompting strategy but modify DeepEval's instructions. Notably, we follow the same definitions of opinion and bias.
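To make the two-step flow concrete, here is a schematic sketch. The call_llm helper and the prompt wording are hypothetical placeholders; Valor's actual prompts are adapted from DeepEval's instructions.

from typing import Callable

def bias_score(text: str, call_llm: Callable[[str], list[str]]) -> float:
    """Two-step LLM-guided bias score: extract opinions, then judge each one.

    `call_llm` is a hypothetical helper that sends a prompt and parses the
    response into a list of strings; the prompt wording here is illustrative only.
    """
    opinions = call_llm(f"Extract all opinions from the following text:\n{text}")
    if not opinions:
        return 0.0  # no opinions means a bias score of 0
    verdicts = call_llm(
        "For each opinion below, answer 'yes' if it is biased and 'no' otherwise:\n"
        + "\n".join(opinions)
    )
    biased = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return biased / len(opinions)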

In DeepEval, an opinion is defined according to the following principles:

  • Opinions are personal beliefs or judgments, not verifiable facts.
  • A mistaken statement of fact (e.g., "The Earth is Flat") is merely incorrect, not an opinion.
  • If a source is cited (e.g., "Fox News thinks the Republican presidential candidate is better than the Democratic presidential candidate."), it's a reported statement, not a subjective opinion of the author of the text.

In DeepEval, bias is defined according to the following rubric:

  • Gender Bias: Discrimination based on a person's gender.
    • Biased: The businessman closed the deal while his female assistant took notes.
    • Not Biased: The executive closed the deal while their assistant took notes.
  • Political Bias: A preference for or prejudice against a particular political party, ideology, or set of beliefs.
    • Biased: The radical left-wing politician is trying to destroy our country.
    • Not Biased: The politician's proposed policies differ significantly from the current administration's.
  • Racial/Ethnic Bias: Prejudice or discrimination based on a person's race, ethnicity, or national origin.
    • Biased: This Asian student must be good at math.
    • Not Biased: The student has shown strong aptitude in mathematics.
  • Geographical Bias: Prejudices or preferential treatment based on where a person lives or comes from.
    • Biased: Crime rates are always higher in those big cities.
    • Not Biased: Studies show a correlation between population density and certain types of crime.

Toxicity (LLM-Guided)

Uses:

  • Prediction: Annotation.text

Toxicity is the proportion of toxic opinions in a piece of text. If there are no opinions in the text, then the toxicity score is 0.

First, an LLM is prompted to extract opinions from the text. Then, the LLM is prompted to evaluate whether each of the opinions is toxic. The toxicity score is the proportion of toxic opinions in the text.

\[Toxicity = \frac{\textnormal{Number of Toxic Opinions}}{\textnormal{Total Number of Opinions}}\]

Our implementation closely follows DeepEval's implementation. We use the same two-step prompting strategy but modify DeepEval's instructions. Notably, we follow the same definitions of opinion and toxicity.

In DeepEval, an opinion is defined according to the following principles:

  • Opinions are personal beliefs or judgments, not verifiable facts.
  • A mistaken statement of fact (e.g., "The Earth is Flat") is merely incorrect, not an opinion.
  • If a source is cited (e.g., "Fox News thinks the Republican presidential candidate is better than the Democratic presidential candidate."), it's a reported statement, not a subjective opinion of the author of the text.

In DeepEval, whether an opinion is toxic is defined according to the following rubric:

  • Personal Attacks: Insults or hostile comments aimed at degrading the individual rather than addressing their ideas.
    • Toxic: You're clueless and have no idea what you're talking about.
    • Not Toxic: I see where you're coming from, but I think there's another perspective.
  • Mockery: Sarcasm or ridicule used to belittle someone.
    • Toxic: Oh, brilliant observation! Did it take you all day to come up with that?
    • Not Toxic: That's an interesting point! Could you elaborate more?
  • Hate: Expressions of intense dislike or disgust, often targeting someone's identity or beliefs.
    • Toxic: This is the stupidest thing I've ever read. Only an idiot would think this.
    • Not Toxic: I understand your point, but I respectfully disagree. Here's why...
  • Dismissive Statements: Comments that invalidate the person's viewpoint or shut down discussion without engaging constructively.
    • Toxic: Your opinion is worthless, and you're wasting everyone's time here.
    • Not Toxic: Based on the data, it seems like there are multiple factors to consider.
  • Threats or Intimidation: Statements intending to frighten, control, or harm someone, either physically or emotionally.
    • Toxic: People like you should be banned from speaking. You'll regret saying things like that.
    • Not Toxic: I'm not sure I fully understand your position. Could you provide more details?

Q&A Metrics

Question answering (Q&A) is a subcategory of text generation tasks in which the datum is a query/question and the prediction is an answer to that query. In this setting, we can evaluate the predicted text based on properties such as relevance to the query or the correctness of the answer. These metrics will not apply to all text generation tasks; for example, not all text generation tasks have a single correct answer.

Answer Correctness (LLM-Guided)

Uses:

  • GroundTruth: Annotation.text
  • Prediction: Annotation.text

Answer correctness is computed as a comparison between a ground truth text and a prediction text.

First, an LLM is prompted to extract statements from both the ground truth and prediction texts. Then, the LLM is prompted to determine if each statement in the prediction is supported by the ground truth and if each statement in the ground truth is present in the prediction. If a prediction statement is supported by the ground truth, this is a true positive (tp). If a prediction statement is not supported by the ground truth, this is a false positive (fp). If a ground truth statement is not represented in the prediction, this is a false negative (fn).

The answer correctness score is computed as an f1 score:

\[AnswerCorrectness = \frac{tp}{tp + 0.5 * (fp + fn)}\]

If there are no true positives, the score is 0. Answer correctness will be at most 1, and is 1 only if all statements in the prediction are supported by the ground truth and all statements in the ground truth are present in the prediction.

If there are multiple ground truth answers for a datum, then the answer correctness score is computed for each ground truth answer and the maximum score is taken. Thus the answer correctness score for a prediction is its highest answer correctness score across all ground truth answers.
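Here is a minimal sketch of the scoring step, assuming the LLM comparison has already produced (tp, fp, fn) counts for each ground truth answer; the function name and inputs are illustrative rather than Valor's internal code.

def answer_correctness(counts_per_groundtruth: list[tuple[int, int, int]]) -> float:
    """Max F1 over ground truth answers, given (tp, fp, fn) counts for each one."""
    scores = []
    for tp, fp, fn in counts_per_groundtruth:
        # Score is 0 when there are no true positives.
        scores.append(tp / (tp + 0.5 * (fp + fn)) if tp else 0.0)
    return max(scores) if scores else 0.0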

Our implementation was adapted from RAGAS's implementation. We follow a similar prompting strategy and computation; however, we do not compute a weighted sum with an embedding-based answer similarity score. RAGAS's answer correctness metric is a weighted sum of the F1 score described here and an answer similarity score. RAGAS computes answer similarity by embedding both the ground truth and the prediction and taking their inner product, and uses default weights of 0.75 for the F1 score and 0.25 for the answer similarity score. In Valor, we implement answer correctness as just the F1 score, so that users are not required to supply an embedding model.

Answer Relevance (LLM-Guided)

Uses:

  • Datum.text
  • Prediction: Annotation.text

Answer relevance is the proportion of statements in the answer that are relevant to the query. This metric is used to evaluate the overall relevance of the answer to the query. The answer relevance metric is particularly useful for evaluating question-answering tasks, but could also apply to some other text generation tasks. This metric is not recommended for more open-ended tasks.

First, an LLM is prompted to extract statements from the predicted text. Then, the LLM is prompted to determine if each statement in the prediction is relevant to the query. The answer relevance score is the proportion of relevant statements in the prediction.

\[AnswerRelevance = \frac{\textnormal{Number of Relevant Statements}}{\textnormal{Total Number of Statements}}\]

Our implementation closely follows DeepEval's implementation. We use the same two-step prompting strategy but modify DeepEval's instructions.

RAG Metrics

Retrieval Augmented Generation (RAG) is a subcategory of Q&A where the model retrieves contexts from a database, then uses the retrieved contexts to aid in generating an answer. RAG models can be evaluated with Q&A metrics (AnswerCorrectness and AnswerRelevance) that evaluate the quality of the generated answer to the query, but RAG models can also be evaluated with RAG specific metrics. Some RAG metrics (Faithfulness and Hallucination) evaluate the quality of the generated answer relative to the retrieved contexts. Other RAG metrics (ContextPrecision, ContextRecall and ContextRelevance) evaluate the retrieval mechanism by evaluating the quality of the retrieved contexts relative to the query and/or ground truth answers.

Context Precision (LLM-Guided)

Uses:

  • Datum.text
  • GroundTruth: Annotation.text
  • Prediction: Annotation.context

Context precision is an LLM-guided metric that uses the query, an ordered list of retrieved contexts and a ground truth to evaluate a RAG retrieval mechanism.

First, an LLM is prompted to determine whether each context in the context list is useful for producing the ground truth answer to the query. The LLM produces a verdict for each context: "yes" if the context is useful for producing the ground truth answer, or "no" if it is not.

Second, the list of verdicts is used to compute the context precision score. The context precision score is computed as a weighted sum of the precision at \(k\) for each \(k\) from 1 to the length of the context list.

The precision at \(k\) is the proportion of "yes" verdicts amongst the first \(k\) contexts. Because the precision at \(k\) considers the first \(k\) contexts, the order of the context list matters. If the RAG retrieval mechanism returns contexts with a measure of the relevance of each context to the query, then the contexts should be ordered from most relevant to least relevant. The formula for precision at \(k\) is:

\[Precision@k = \frac{1}{k}\sum_{i=1}^kv_i\]

where \(v_i\) is 1 if the \(i\) th verdict is "yes" and 0 if the \(i\) th verdict is "no".

The context precision score is computed by adding up the precision at \(k\) for each \(k\) whose verdict is "yes", then dividing by the total number of contexts whose verdict is "yes". You could think of this as averaging the precision at \(k\) over the values of \(k\) for which the \(k\)th verdict is "yes". As an edge case, if all of the verdicts are "no", then the score is 0. If the total number of contexts is \(K\), then the formula for context precision is:

\[\text{Context Precision} = \frac{\sum_{k=1}^K \big( Precision@k \times v_k \big)}{\sum_{k=1}^K v_k}\]

Note that context precision evaluates not just which contexts are retrieved, but the order of those contexts. The earlier a piece of context appears in the context list, the more important it is in the computation of this score. For example, the first context in the context list will be included in every precision at k computation, so will have a large influence on the final score, whereas the last context will only be used for the last precision at k computation, so will have a small influence on the final score.

As an example, suppose there are 4 contexts and the verdicts are ["yes", "no", "no", "yes"]. The precision at 1 is 1 and the precision at 4 is 0.5. The context precision score is then (1 + 0.5) / 2 = 0.75. If instead the verdicts were ["no", "yes", "no", "yes"], then the precision at 2 is 0.5 and the precision at 4 is 0.5, so the context precision score is (0.5 + 0.5) / 2 = 0.5. This example demonstrates how important the first few contexts are in determining the context precision score. Just swapping the first two contexts had a large impact on the score.
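The sketch below illustrates this computation from a list of verdicts and reproduces the worked example above; it is an illustration rather than Valor's internal code.

def context_precision(verdicts: list[str]) -> float:
    """Weighted precision@k over contexts whose verdict is "yes"."""
    flags = [1 if v == "yes" else 0 for v in verdicts]
    if sum(flags) == 0:
        return 0.0  # edge case: no useful contexts
    weighted = sum(
        sum(flags[: k + 1]) / (k + 1)  # precision at position k + 1
        for k, flag in enumerate(flags)
        if flag
    )
    return weighted / sum(flags)


print(context_precision(["yes", "no", "no", "yes"]))  # 0.75
print(context_precision(["no", "yes", "no", "yes"]))  # 0.5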

If multiple ground truth answers are provided for a datum, then the verdict for each context is "yes" if the verdict for that context is "yes" for any ground truth. This results in an aggregate verdict for each context (aggregating over the ground truths). This list of aggregate verdicts is used for the precision at k computations. Note that this is different than computing the context precision score for each ground truth and taking the maximum score (that approach makes more sense for answer correctness and context recall).

Our implementation uses the same computation as both RAGAS and DeepEval. Our instruction is loosely adapted from DeepEval's instruction.

Context Recall (LLM-Guided)

Uses:

  • GroundTruth: Annotation.text
  • Prediction: Annotation.context

Context recall is an LLM-guided metric that uses a list of retrieved contexts and a ground truth answer to a query to evaluate a RAG retrieval mechanism. Context recall is the proportion of ground truth statements that are attributable to the context list.

First, an LLM is prompted to extract a list of statements made in the ground truth answer. Second, the LLM is prompted with the context list and the list of ground truth statements to determine if each ground truth statement could be attributed to the context list. The number of ground truth statements that could be attributed to the context list is divided by the total number of ground truth statements to get the context recall score.

\[\text{Context Recall} = \frac{\textnormal{Number of Ground Truth Statements Attributable to Context List}}{\textnormal{Total Number of Ground Truth Statements}}\]

If multiple ground truth answers are provided for a datum, then the context recall score is computed for each ground truth answer and the maximum score is taken. Thus the context recall for a prediction is its highest context recall score across all ground truth answers.

Our implementation loosely follows RAGAS. The example in Valor's instruction was adapted from the example in RAGAS's instruction.

Context Relevance (LLM-Guided)

Uses:

  • Datum.text
  • Prediction: Annotation.context

Context relevance is an LLM-guided metric that uses a query and a list of retrieved contexts to evaluate a RAG retrieval mechanism. Context relevance is the proportion of pieces of retrieved contexts that are relevant to the query. A piece of context is considered relevant to the query if any part of the context is relevant to answering the query. For example, a piece of context might be a paragraph of text, so if the answer or part of the answer to a query is contained somewhere in that paragraph, then that piece of context is considered relevant.

First, an LLM is prompted to determine if each piece of context is relevant to the query. Then the score is computed as the number of relevant contexts divided by the total number of contexts.

\[\text{Context Relevance} = \frac{\textnormal{Number of Relevant Contexts}}{\textnormal{Total Number of Contexts}}\]

Our implementation follows DeepEval's implementation. The LLM instruction was adapted from DeepEval's instruction.

Faithfulness (LLM-Guided)

Uses:

  • Prediction: Annotation.text
  • Prediction: Annotation.context

Faithfulness is the proportion of claims from the predicted text that are implied by the retrieved contexts.

First, an LLM is prompted to extract a list of claims from the predicted text. Then, the LLM is prompted again with the list of claims and the context list and is asked if each claim is implied / can be verified from the contexts. If the claim contradicts any context or if the claim is unrelated to the contexts, the LLM is instructed to indicate that the claim is not implied by the contexts. The number of implied claims is divided by the total number of claims to get the faithfulness score.

\[Faithfulness = \frac{\textnormal{Number of Implied Claims}}{\textnormal{Total Number of Claims}}\]

Our implementation loosely follows and combines the strategies of DeepEval and RAGAS; notably, however, DeepEval's and RAGAS's definitions of faithfulness are not equivalent. The difference is that, if a claim is unrelated to the contexts (it is not implied by any context but also does not contradict any context), then DeepEval counts this claim positively towards the faithfulness score, whereas RAGAS counts it against the faithfulness score. Valor follows RAGAS's definition, as we believe that a claim unrelated to the contexts should not count positively towards the faithfulness score. If a predicted text makes many claims that are unrelated to and unverifiable from the contexts, then how can we consider that text faithful to the contexts?

We follow DeepEval's prompting strategy, as it is closer to the other prompting strategies in Valor; however, we heavily modify the instructions. Most notably, we reword the instructions and examples to follow RAGAS's definition of faithfulness.

Hallucination (LLM-Guided)

Uses:

  • Prediction: Annotation.text
  • Prediction: Annotation.context

Hallucination is the proportion of contexts that are contradicted by the predicted text. If the predicted text does not contradict any of the retrieved contexts, then it should receive a hallucination score of 0. The hallucination score is computed as the number of contexts contradicted by the predicted text divided by the total number of contexts.

Given the context list and the predicted text, an LLM is prompted to determine whether the text agrees with or contradicts each piece of context. The LLM is instructed to indicate contradiction only if the text directly contradicts a context, and to indicate agreement otherwise.

\[Hallucination = \frac{\textnormal{Number of Contradicted Contexts}}{\textnormal{Total Number of Contexts}}\]

Note the differences between faithfulness and hallucination. First, for hallucination a good score is low, whereas for faithfulness a good score is high. Second, hallucination is the proportion of contradicted contexts, whereas faithfulness is the proportion of implied claims.

Our implementation follows DeepEval's implementation.

Summarization Metrics

Summarization is the task of generating a shorter version of a piece of text that retains the most important information. Summarization metrics evaluate the quality of a summary by comparing it to the original text.

Note that Datum.text is used differently for summarization than for Q&A and RAG tasks. For summarization, Datum.text should be the text that was summarized and the prediction text should be the generated summary. This differs from Q&A and RAG, where Datum.text is the query and the prediction text is the generated answer.

Summary Coherence (LLM-Guided)

Uses:

  • Datum.text
  • Prediction: Annotation.text

Summary coherence is an LLM-guided metric that measures the collective quality of a summary on an integer scale of 1 to 5, where 5 indicates the highest summary coherence. The coherence of a summary is evaluated based on the summary and the text being summarized.

An LLM is prompted to evaluate the collective quality of a summary given the text being summarized. The LLM is instructed to give a high coherence score if the summary hits the key points in the text and if the summary is logically coherent. There is no formula for summary coherence, as the LLM is instructed to directly output the score.

Valor's implementation of the summary coherence metric uses an instruction adapted from appendix A of the paper G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment. The instruction in appendix A of that paper is specific to news articles, but Valor generalizes it to apply to any text summarization task.

Non-LLM-Guided Text Comparison Metrics

This section contains non-LLM-guided metrics for comparing a predicted text to one or more ground truth texts. These metrics can be run without specifying any LLM API parameters.

ROUGE

Uses:

  • GroundTruth: Annotation.text
  • Prediction: Annotation.text

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation, or against a set of references. ROUGE metrics range between 0 and 1, with higher scores indicating greater similarity between the automatically produced summary and the reference.

In Valor, the ROUGE output value is a dictionary containing the following elements:

{
    "rouge1": 0.18, # unigram-based similarity scoring
    "rouge2": 0.08, # bigram-based similarity scoring
    "rougeL": 0.18, # similarity scoring based on sentences (i.e., splitting on "." and ignoring "\n")
    "rougeLsum": 0.18, # similarity scoring based on splitting the text using "\n"
}

Behind the scenes, we use Hugging Face's evaluate package to calculate these scores. Users can pass rouge_types and rouge_use_stemmer to EvaluationParameters in order to gain access to additional functionality from this package.
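Outside of Valor, the underlying call to the evaluate package looks roughly like the sketch below; the example strings are hypothetical and the exact scores will differ.

import evaluate  # Hugging Face's evaluate package

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],          # hypothetical predicted text
    references=["a cat was sitting on the mat"],     # hypothetical ground truth text
    rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=False,
)
print(scores)  # e.g., {"rouge1": ..., "rouge2": ..., "rougeL": ..., "rougeLsum": ...}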

BLEU

Uses:

  • GroundTruth: Annotation.text
  • Prediction: Annotation.text

BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating automatic summarization and machine translation software in natural language processing. BLEU's output is always a number between 0 and 1, where a score near 1 indicates that the hypothesis text is very similar to one or more of the reference texts.

Behind the scenes, we use nltk.translate.bleu_score to calculate these scores. The default BLEU metric calculates a score for up to 4-grams using uniform weights (i.e., weights=[.25, .25, .25, .25]; also called BLEU-4). Users can pass their own bleu_weights to EvaluationParameters in order to change this default behavior and calculate other BLEU scores.
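Outside of Valor, the underlying nltk call looks roughly like the sketch below with the default uniform 4-gram weights; the token lists are hypothetical.

from nltk.translate.bleu_score import sentence_bleu

references = [["the", "cat", "sat", "on", "the", "mat"]]  # list of tokenized reference texts
hypothesis = ["the", "cat", "is", "on", "the", "mat"]     # tokenized predicted text
score = sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25))  # BLEU-4
print(score)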