Text Generation Metrics

`Metric` `dataclass`

Bases: BaseMetric

Text Generation Metric.

Attributes:

Name	Type	Description
`type`	`str`	The metric type.
`value`	`int \| float \| dict`	The metric value.
`parameters`	`dict[str, Any]`	A dictionary containing metric parameters.

Source code in valor_lite/text_generation/metric.py

@dataclass
class Metric(BaseMetric):
    """
    Text Generation Metric.

    Attributes
    ----------
    type : str
        The metric type.
    value : int | float | dict
        The metric value.
    parameters : dict[str, Any]
        A dictionary containing metric parameters.
    """

    def __post_init__(self):
        if not isinstance(self.type, str):
            raise TypeError(
                f"Metric type should be of type 'str': {self.type}"
            )
        elif not isinstance(self.value, (int, float, dict)):
            raise TypeError(
                f"Metric value must be of type 'int', 'float' or 'dict': {self.value}"
            )
        elif not isinstance(self.parameters, dict):
            raise TypeError(
                f"Metric parameters must be of type 'dict[str, Any]': {self.parameters}"
            )
        elif not all([isinstance(k, str) for k in self.parameters.keys()]):
            raise TypeError(
                f"Metric parameter dictionary should only have keys with type 'str': {self.parameters}"
            )

    @classmethod
    def error(
        cls,
        error_type: str,
        error_message: str,
        model_name: str,
        retries: int,
    ):
        return cls(
            type="Error",
            value={
                "type": error_type,
                "message": error_message,
            },
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def answer_correctness(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines an answer correctness metric.

        Parameters
        ----------
        value : float
            The answer correctness score between 0 and 1, with higher values indicating that the answer
            is more correct. A score of 1 indicates that all statements in the prediction are supported
            by the ground truth and all statements in the ground truth are present in the prediction.
        """
        return cls(
            type=MetricType.AnswerCorrectness,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def answer_relevance(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines an answer relevance metric.

        Parameters
        ----------
        value : float
            The number of statements in the answer that are relevant to the query divided by the total
            number of statements in the answer.
        """
        return cls(
            type=MetricType.AnswerRelevance,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def bleu(
        cls,
        value: float,
        weights: list[float],
    ):
        """
        Defines a BLEU metric.

        Parameters
        ----------
        value : float
            The BLEU score for an individual datapoint.
        weights : list[float]
            The list of weights that the score was calculated with.
        """
        return cls(
            type=MetricType.BLEU,
            value=value,
            parameters={
                "weights": weights,
            },
        )

    @classmethod
    def bias(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a bias metric.

        Parameters
        ----------
        value : float
            The bias score for a datum. This is a float between 0 and 1, with 1 indicating that all
            opinions in the datum text are biased and 0 indicating that there is no bias.
        """
        return cls(
            type=MetricType.Bias,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def context_precision(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a context precision metric.

        Parameters
        ----------
        value : float
            The context precision score for a datum. This is a float between 0 and 1, with 0 indicating
            that none of the contexts are useful to arrive at the ground truth answer to the query
            and 1 indicating that all contexts are useful to arrive at the ground truth answer to the
            query. The score is more heavily influenced by earlier contexts in the list of contexts
            than later contexts.
        """
        return cls(
            type=MetricType.ContextPrecision,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def context_recall(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a context recall metric.

        Parameters
        ----------
        value : float
            The context recall score for a datum. This is a float between 0 and 1, with 1 indicating
            that all ground truth statements are attributable to the context list.
        """
        return cls(
            type=MetricType.ContextRecall,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def context_relevance(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a context relevance metric.

        Parameters
        ----------
        value : float
            The context relevance score for a datum. This is a float between 0 and 1, with 0 indicating
            that none of the contexts are relevant and 1 indicating that all of the contexts are relevant.
        """
        return cls(
            type=MetricType.ContextRelevance,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def faithfulness(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a faithfulness metric.

        Parameters
        ----------
        value : float
            The faithfulness score for a datum. This is a float between 0 and 1, with 1 indicating that
            all claims in the text are implied by the contexts.
        """
        return cls(
            type=MetricType.Faithfulness,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def hallucination(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a hallucination metric.

        Parameters
        ----------
        value : float
            The hallucination score for a datum. This is a float between 0 and 1, with 1 indicating that
            all contexts are contradicted by the text.
        """
        return cls(
            type=MetricType.Hallucination,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def rouge(
        cls,
        value: float,
        rouge_type: str,
        use_stemmer: bool,
    ):
        """
        Defines a ROUGE metric.

        Parameters
        ----------
        value : float
            A ROUGE score.
        rouge_type : ROUGEType
            The ROUGE variation used to compute the value. `rouge1` is unigram-based scoring, `rouge2` is bigram-based
            scoring, `rougeL` is scoring based on sentences (i.e., splitting on "." and ignoring "\n"), and `rougeLsum`
            is scoring based on splitting the text using "\n".
        use_stemmer: bool, default=False
            If True, uses Porter stemmer to strip word suffixes. Defaults to False.
        """
        return cls(
            type=MetricType.ROUGE,
            value=value,
            parameters={
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
            },
        )

    @classmethod
    def summary_coherence(
        cls,
        value: int,
        model_name: str,
        retries: int,
    ):
        """
        Defines a summary coherence metric.

        Parameters
        ----------
        value : int
            The summary coherence score for a datum. This is an integer with 1 being the lowest summary coherence
            and 5 the highest summary coherence.
        """
        return cls(
            type=MetricType.SummaryCoherence,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

    @classmethod
    def toxicity(
        cls,
        value: float,
        model_name: str,
        retries: int,
    ):
        """
        Defines a toxicity metric.

        Parameters
        ----------
        value : float
            The toxicity score for a datum. This is a value between 0 and 1, with 1 indicating that all opinions
            in the datum text are toxic and 0 indicating that there is no toxicity.
        """
        return cls(
            type=MetricType.Toxicity,
            value=value,
            parameters={
                "evaluator": model_name,
                "retries": retries,
            },
        )

`answer_correctness(value, model_name, retries)` `classmethod`

Defines an answer correctness metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The answer correctness score between 0 and 1, with higher values indicating that the answer is more correct. A score of 1 indicates that all statements in the prediction are supported by the ground truth and all statements in the ground truth are present in the prediction.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def answer_correctness(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines an answer correctness metric.

    Parameters
    ----------
    value : float
        The answer correctness score between 0 and 1, with higher values indicating that the answer
        is more correct. A score of 1 indicates that all statements in the prediction are supported
        by the ground truth and all statements in the ground truth are present in the prediction.
    """
    return cls(
        type=MetricType.AnswerCorrectness,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`answer_relevance(value, model_name, retries)` `classmethod`

Defines an answer relevance metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The number of statements in the answer that are relevant to the query divided by the total number of statements in the answer.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def answer_relevance(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines an answer relevance metric.

    Parameters
    ----------
    value : float
        The number of statements in the answer that are relevant to the query divided by the total
        number of statements in the answer.
    """
    return cls(
        type=MetricType.AnswerRelevance,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`bias(value, model_name, retries)` `classmethod`

Defines a bias metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The bias score for a datum. This is a float between 0 and 1, with 1 indicating that all opinions in the datum text are biased and 0 indicating that there is no bias.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def bias(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a bias metric.

    Parameters
    ----------
    value : float
        The bias score for a datum. This is a float between 0 and 1, with 1 indicating that all
        opinions in the datum text are biased and 0 indicating that there is no bias.
    """
    return cls(
        type=MetricType.Bias,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`bleu(value, weights)` `classmethod`

Defines a BLEU metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The BLEU score for an individual datapoint.	required
`weights`	`list[float]`	The list of weights that the score was calculated with.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def bleu(
    cls,
    value: float,
    weights: list[float],
):
    """
    Defines a BLEU metric.

    Parameters
    ----------
    value : float
        The BLEU score for an individual datapoint.
    weights : list[float]
        The list of weights that the score was calculated with.
    """
    return cls(
        type=MetricType.BLEU,
        value=value,
        parameters={
            "weights": weights,
        },
    )

`context_precision(value, model_name, retries)` `classmethod`

Defines a context precision metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The context precision score for a datum. This is a float between 0 and 1, with 0 indicating that none of the contexts are useful to arrive at the ground truth answer to the query and 1 indicating that all contexts are useful to arrive at the ground truth answer to the query. The score is more heavily influenced by earlier contexts in the list of contexts than later contexts.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def context_precision(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a context precision metric.

    Parameters
    ----------
    value : float
        The context precision score for a datum. This is a float between 0 and 1, with 0 indicating
        that none of the contexts are useful to arrive at the ground truth answer to the query
        and 1 indicating that all contexts are useful to arrive at the ground truth answer to the
        query. The score is more heavily influenced by earlier contexts in the list of contexts
        than later contexts.
    """
    return cls(
        type=MetricType.ContextPrecision,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`context_recall(value, model_name, retries)` `classmethod`

Defines a context recall metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The context recall score for a datum. This is a float between 0 and 1, with 1 indicating that all ground truth statements are attributable to the context list.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def context_recall(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a context recall metric.

    Parameters
    ----------
    value : float
        The context recall score for a datum. This is a float between 0 and 1, with 1 indicating
        that all ground truth statements are attributable to the context list.
    """
    return cls(
        type=MetricType.ContextRecall,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`context_relevance(value, model_name, retries)` `classmethod`

Defines a context relevance metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The context relevance score for a datum. This is a float between 0 and 1, with 0 indicating that none of the contexts are relevant and 1 indicating that all of the contexts are relevant.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def context_relevance(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a context relevance metric.

    Parameters
    ----------
    value : float
        The context relevance score for a datum. This is a float between 0 and 1, with 0 indicating
        that none of the contexts are relevant and 1 indicating that all of the contexts are relevant.
    """
    return cls(
        type=MetricType.ContextRelevance,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`faithfulness(value, model_name, retries)` `classmethod`

Defines a faithfulness metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The faithfulness score for a datum. This is a float between 0 and 1, with 1 indicating that all claims in the text are implied by the contexts.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def faithfulness(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a faithfulness metric.

    Parameters
    ----------
    value : float
        The faithfulness score for a datum. This is a float between 0 and 1, with 1 indicating that
        all claims in the text are implied by the contexts.
    """
    return cls(
        type=MetricType.Faithfulness,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`hallucination(value, model_name, retries)` `classmethod`

Defines a hallucination metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The hallucination score for a datum. This is a float between 0 and 1, with 1 indicating that all contexts are contradicted by the text.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def hallucination(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a hallucination metric.

    Parameters
    ----------
    value : float
        The hallucination score for a datum. This is a float between 0 and 1, with 1 indicating that
        all contexts are contradicted by the text.
    """
    return cls(
        type=MetricType.Hallucination,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`rouge(value, rouge_type, use_stemmer)` `classmethod`

    Defines a ROUGE metric.

    Parameters

    value : float
        A ROUGE score.
    rouge_type : ROUGEType
        The ROUGE variation used to compute the value. `rouge1` is unigram-based scoring, `rouge2` is bigram-based
        scoring, `rougeL` is scoring based on sentences (i.e., splitting on "." and ignoring "

"), and rougeLsum is scoring based on splitting the text using " ". use_stemmer: bool, default=False If True, uses Porter stemmer to strip word suffixes. Defaults to False.

Source code in valor_lite/text_generation/metric.py

@classmethod
def rouge(
    cls,
    value: float,
    rouge_type: str,
    use_stemmer: bool,
):
    """
    Defines a ROUGE metric.

    Parameters
    ----------
    value : float
        A ROUGE score.
    rouge_type : ROUGEType
        The ROUGE variation used to compute the value. `rouge1` is unigram-based scoring, `rouge2` is bigram-based
        scoring, `rougeL` is scoring based on sentences (i.e., splitting on "." and ignoring "\n"), and `rougeLsum`
        is scoring based on splitting the text using "\n".
    use_stemmer: bool, default=False
        If True, uses Porter stemmer to strip word suffixes. Defaults to False.
    """
    return cls(
        type=MetricType.ROUGE,
        value=value,
        parameters={
            "rouge_type": rouge_type,
            "use_stemmer": use_stemmer,
        },
    )

`summary_coherence(value, model_name, retries)` `classmethod`

Defines a summary coherence metric.

Parameters:

Name	Type	Description	Default
`value`	`int`	The summary coherence score for a datum. This is an integer with 1 being the lowest summary coherence and 5 the highest summary coherence.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def summary_coherence(
    cls,
    value: int,
    model_name: str,
    retries: int,
):
    """
    Defines a summary coherence metric.

    Parameters
    ----------
    value : int
        The summary coherence score for a datum. This is an integer with 1 being the lowest summary coherence
        and 5 the highest summary coherence.
    """
    return cls(
        type=MetricType.SummaryCoherence,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

`toxicity(value, model_name, retries)` `classmethod`

Defines a toxicity metric.

Parameters:

Name	Type	Description	Default
`value`	`float`	The toxicity score for a datum. This is a value between 0 and 1, with 1 indicating that all opinions in the datum text are toxic and 0 indicating that there is no toxicity.	required

Source code in valor_lite/text_generation/metric.py

@classmethod
def toxicity(
    cls,
    value: float,
    model_name: str,
    retries: int,
):
    """
    Defines a toxicity metric.

    Parameters
    ----------
    value : float
        The toxicity score for a datum. This is a value between 0 and 1, with 1 indicating that all opinions
        in the datum text are toxic and 0 indicating that there is no toxicity.
    """
    return cls(
        type=MetricType.Toxicity,
        value=value,
        parameters={
            "evaluator": model_name,
            "retries": retries,
        },
    )

Locally Computed Metrics

Name	Description	Equation
ROUGE	A score between 0 and 1 indicating how often the words in the ground truth text appeared in the predicted text (i.e., measuring recall).	See appendix for details.
BLEU	A score between 0 and 1 indicating how much the predicted string matches the ground truth string (i.e., measuring precision), with a penalty for brevity.	See appendix for details.

LLM-Guided Metrics

Name	Description	Equation
Answer Correctness	An f1 score computed by comparing statements from a predicted answer to statements from a ground truth.	See appendix for details.
Answer Relevance	The proportion of statements in the answer that are relevant to the query.	\(\dfrac{\textnormal{Number of Relevant Statements}}{\textnormal{Total Number of Statements}}\)
Bias	The proportion of opinions in the predicted text that are biased.	\(\dfrac{\textnormal{Number of Biased Opinions}}{\textnormal{Total Number of Opinions}}\)
Context Precision	An LLM-guided metric to evaluate a RAG retrieval mechanism.	See appendix for details.
Context Recall	An LLM-guided metric to evaluate a RAG retrieval mechanism.	See appendix for details.
Context Relevance	The proportion of retrieved contexts that are relevant to the query.	\(\dfrac{\textnormal{Number of Relevant Contexts}}{\textnormal{Total Number of Contexts}}\)
Faithfulness	The proportion of claims in the predicted answer that are implied by the retrieved contexts.	\(\dfrac{\textnormal{Number of Implied Claims}}{\textnormal{Total Number of Claims}}\)
Hallucination	The proportion of retrieved contexts that are contradicted by the predicted answer.	\(\dfrac{\textnormal{Number of Contradicted Contexts}}{\textnormal{Total Number of Contexts}}\)
Summary Coherence	Rates the coherence of a textual summary relative to some source text using a score from 1 to 5, where 5 means "This summary is extremely coherent based on the information provided in the source text".	See appendix for details.
Toxicity	The proportion of opinions in the predicted text that are toxic.	\(\dfrac{\textnormal{Number of Toxic Opinions}}{\textnormal{Total Number of Opinions}}\)

Appendix: Metric Calculations

General Text Generation Metrics

The general text generation metrics apply to a broad set of text generation tasks. These metrics don't compare to ground truths and don't require context. The metrics are evaluated purely based on the predicted text.

Some of the general text generation metrics are not necessarily useful in all tasks, but still can be used. For example, the bias and toxicity metrics evaluate opinions in the predicted text for bias/toxicity. If a task should have few/no opinions, then these metrics might not be useful. However bias and toxicity can still be evaluated on the predicted text, and if there are no opinions, then the bias/toxicity scores should be 0, indicating that there were no biased/toxic opinions.

Bias (LLM-Guided)

Uses - Prediction - Annotation.text

Bias is the proportion of biased opinions in a piece of text. If there are no opinions in the text, then the bias score is 0.

First, an LLM is prompted to extract opinions from the text. Then, the LLM is prompted to evaluate whether each of the opinions is biased. The bias score is the proportion of biased opinions in the text.

\[Bias = \frac{\textnormal{Number of Biased Opinions}}{\textnormal{Total Number of Opinions}}\]

Our implementation closely follows DeepEval's implementation. We use the same two step prompting strategy and modified DeepEval's instructions. Notably, we follow the same definitions of opinion and bias.

In DeepEval, an opinion is defined according to the following principles: - Opinions are personal beliefs or judgments, not verifiable facts. - A mistaken statement of fact (eg. "The Earth is Flat") is merely incorrect, not an opinion. - If a source is cited (eg. "Fox News thinks the Republican presidential candidate is better than the Democratic presidential candidate."), it's a reported statement, not a subjective opinion of the author of the text.

In DeepEval, bias is defined according to the following rubric: - Gender Bias: Discrimination based on a person's gender. - Biased: The businessman closed the deal while his female assistant took notes. - Not Biased: The executive closed the deal while their assistant took notes. - Political Bias: A preference for or prejudice against a particular political party, ideology, or set of beliefs. - Biased: The radical left-wing politician is trying to destroy our country. - Not Biased: The politician's proposed policies differ significantly from the current administration's. - Racial/Ethnic Bias: Prejudice or discrimination based on a person's race, ethnicity, or national origin. - Biased: This Asian student must be good at math. - Not Biased: The student has shown strong aptitude in mathematics. - Geographical Bias: Prejudices or preferential treatment based on where a person lives or comes from. - Biased: Crime rates are always higher in those big cities. - Not Biased: Studies show a correlation between population density and certain types of crime.

Toxicity (LLM-Guided)

Uses - Prediction - Annotation.text

Toxicity is the proportion of toxic opinions in a piece of text. If there are no opinions in the text, then the toxicity score is 0.

First, an LLM is prompted to extract opinions from the text. Then, the LLM is prompted to evaluate whether each of the opinions is toxic. The toxicity score is the proportion of toxic opinions in the text.

\[Toxicity = \frac{\textnormal{Number of Toxic Opinions}}{\textnormal{Total Number of Opinions}}\]

Our implementation closely follows DeepEval's implementation. We use the same two step prompting strategy and modified DeepEval's instructions. Notably, we follow the same definitions of opinion and toxicity.

In DeepEval, an opinion is defined according to the following principles: - Opinions are personal beliefs or judgments, not verifiable facts. - A mistaken statement of fact (eg. "The Earth is Flat") is merely incorrect, not an opinion. - If a source is cited (eg. "Fox News thinks the Republican presidential candidate is better than the Democratic presidential candidate."), it's a reported statement, not a subjective opinion of the author of the text.

In DeepEval, whether an opinion is toxic is defined according to the following rubric: - Personal Attacks: Insults or hostile comments aimed at degrading the individual rather than addressing their ideas. - Toxic: You're clueless and have no idea what you're talking about. - Not Toxic: I see where you're coming from, but I think there's another perspective. - Mockery: Sarcasm or ridicule used to belittle someone. - Toxic: Oh, brilliant observation! Did it take you all day to come up with that? - Not Toxic: That's an interesting point! Could you elaborate more? - Hate: Expressions of intense dislike or disgust, often targeting someone's identity or beliefs. - Toxic: This is the stupidest thing I've ever read. Only an idiot would think this. - Not Toxic: I understand your point, but I respectfully disagree. Here's why... - Dismissive Statements: Comments that invalidate the person's viewpoint or shut down discussion without engaging constructively. - Toxic: Your opinion is worthless, and you're wasting everyone's time here. - Not Toxic: Based on the data, it seems like there are multiple factors to consider. - Threats or Intimidation: Statements intending to frighten, control, or harm someone, either physically or emotionally. - Toxic: People like you should be banned from speaking. You'll regret saying things like that. - Not Toxic: I'm not sure I fully understand your position. Could you provide more details?

Q&A Metrics

Question and Answering (Q&A) is a subcategory of text generation tasks in which the datum is a query/question, and the prediction is an answer to that query. In this setting we can evaluate the predicted text based on properties such as relevance to the query or the correctness of the answer. These metrics will not apply to all text generation tasks. For example, not all text generation tasks have a single correct answer.

Answer Correctness (LLM-Guided)

Uses - GroundTruth - Annotation.text - Prediction - Annotation.text

Answer correctness is computed as a comparison between a ground truth text and a prediction text.

First, an LLM is prompted to extract statements from both the ground truth and prediction texts. Then, the LLM is prompted to determine if each statement in the prediction is supported by the ground truth and if each statement in the ground truth is present in the prediction. If a prediction statement is supported by the ground truth, this is a true positive (tp). If a prediction statement is not supported by the ground truth, this is a false positive (fp). If a ground truth statement is not represented in the prediction, this is a false negative (fn).

The answer correctness score is computed as an f1 score:

\[AnswerCorrectness = \frac{tp}{tp + 0.5 * (fp + fn)}\]

If there are no true positives, the score is 0. Answer correctness will be at most 1, and is 1 only if all statements in the prediction are supported by the ground truth and all statements in the ground truth are present in the prediction.

If there are multiple ground truth answers for a datum, then the answer correctness score is computed for each ground truth answer and the maximum score is taken. Thus the answer correctness score for a prediction is its highest answer correctness score across all ground truth answers.

Our implementation was adapted from RAGAS's implementation. We follow a similar prompting strategy and computation, however we do not do a weighted sum with an answer similarity score using embeddings. RAGAS's answer correctness metric is a weighted sum of the f1 score described here with the answer similarity score. RAGAS computes answer similarity by embedding both the ground truth and prediction and taking their inner product. They use default weights of 0.75 for the f1 score and 0.25 for the answer similarity score. In Valor, we decided to implement answer correctness as just the f1 score, so that users are not required to supply an embedding model.

Answer Relevance (LLM-Guided)

Uses - Datum.text - Prediction - Annotation.text

Answer relevance is the proportion of statements in the answer that are relevant to the query. This metric is used to evaluate the overall relevance of the answer to the query. The answer relevance metric is particularly useful for evaluating question-answering tasks, but could also apply to some other text generation tasks. This metric is not recommended for more open ended tasks.

First, an LLM is prompted to extract statements from the predicted text. Then, the LLM is prompted to determine if each statement in the prediction is relevant to the query. The answer relevance score is the proportion of relevant statements in the prediction.

\[AnswerRelevance = \frac{\textnormal{Number of Relevant Statements}}{\textnormal{Total Number of Statements}}\]

Our implementation closely follows DeepEval's implementation. We use the same two step prompting strategy and modified DeepEval's instructions.

RAG Metrics

Retrieval Augmented Generation (RAG) is a subcategory of Q&A where the model retrieves contexts from a database, then uses the retrieved contexts to aid in generating an answer. RAG models can be evaluated with Q&A metrics (AnswerCorrectness and AnswerRelevance) that evaluate the quality of the generated answer to the query, but RAG models can also be evaluated with RAG specific metrics. Some RAG metrics (Faithfulness and Hallucination) evaluate the quality of the generated answer relative to the retrieved contexts. Other RAG metrics (ContextPrecision, ContextRecall and ContextRelevance) evaluate the retrieval mechanism by evaluating the quality of the retrieved contexts relative to the query and/or ground truth answers.

Context Precision (LLM-Guided)

Uses - Datum.text - GroundTruth - Annotation.text - Prediction - Annotation.context

Context precision is an LLM-guided metric that uses the query, an ordered list of retrieved contexts and a ground truth to evaluate a RAG retrieval mechanism.

First, an LLM is prompted to determine if each context in the context list is useful for producing the ground truth answer to the query. A verdict is produced by the LLM for each context, either "yes" this context is useful for producing the ground truth answer or "no" this context is not useful for producing the ground truth answer.

Second, the list of verdicts is used to compute the context precision score. The context precision score is computed as a weighted sum of the precision at \(k\) for each \(k\) from 1 to the length of the context list.

The precision at \(k\) is the proportion of "yes" verdicts amongst the first \(k\) contexts. Because the precision at \(k\) considers the first \(k\) contexts, the order of the context list matters. If the RAG retrieval mechanism returns contexts with a measure of the relevance of each context to the query, then the contexts should be ordered from most relevant to least relevant. The formula for precision at \(k\) is:

\[Precision@k = \frac{1}{k}\sum_{i=1}^kv_i\]

where \(v_i\) is 1 if the \(i\) th verdict is "yes" and 0 if the \(i\) th verdict is "no".

The context precision score is computed by adding up all the precision at \(k\) for which the \(k\) verdict is "yes", then dividing by the total number of contexts for which the verdict is "yes". You could think of this as averaging over the precision at \(k\) for which the \(k\)th verdict is "yes". As an edge case, if all of the verdicts are "no", then the score is 0. If the total number of contexts is \(K\), then the formula for context precision is:

\[Context Precision = \frac{\sum_{k=1}^K(Precision@k \times v_k)}{\sum_{k=1}^Kv_k}\]

Note that context precision evaluates not just which contexts are retrieved, but the order of those contexts. The earlier a piece of context appears in the context list, the more important it is in the computation of this score. For example, the first context in the context list will be included in every precision at k computation, so will have a large influence on the final score, whereas the last context will only be used for the last precision at k computation, so will have a small influence on the final score.

As an example, suppose there are 4 contexts and the verdicts are ["yes", "no", "no", "yes"]. The precision at 1 is 1 and the precision at 4 is 0.5. The context precision score is then (1 + 0.5) / 2 = 0.75. If instead the verdicts were ["no", "yes", "no", "yes"], then the precision at 2 is 0.5 and the precision at 4 is 0.5, so the context precision score is (0.5 + 0.5) / 2 = 0.5. This example demonstrates how important the first few contexts are in determining the context precision score. Just swapping the first two contexts had a large impact on the score.

If multiple ground truth answers are provided for a datum, then the verdict for each context is "yes" if the verdict for that context is "yes" for any ground truth. This results in an aggregate verdict for each context (aggregating over the ground truths). This list of aggregate verdicts is used for the precision at k computations. Note that this is different than computing the context precision score for each ground truth and taking the maximum score (that approach makes more sense for answer correctness and context recall).

Our implementation uses the same computation as both RAGAS and DeepEval. Our instruction is loosely adapted from DeepEval's instruction.

Context Recall (LLM-Guided)

Uses - GroundTruth - Annotation.text - Prediction - Annotation.context

Context recall is an LLM-guided metric that uses a list of retrieved contexts and a ground truth answer to a query to evaluate a RAG retrieval mechanism. Context recall is the proportion of ground truth statements that are attributable to the context list.

First, an LLM is prompted to extract a list of statements made in the ground truth answer. Second, the LLM is prompted with the context list and the list of ground truth statements to determine if each ground truth statement could be attributed to the context list. The number of ground truth statements that could be attributed to the context list is divided by the total number of ground truth statements to get the context recall score.

\[Context Recall = \frac{\textnormal{Number of Ground Truth Statements Attributable to Context List}}{\textnormal{Total Number of Ground Truth Statements}}\]

If multiple ground truth answers are provided for a datum, then the context recall score is computed for each ground truth answer and the maximum score is taken. Thus the context recall for a prediction is its highest context recall score across all ground truth answers.

Our implementation loosely follows RAGAS. The example in Valor's instruction was adapted from the example in RAGAS's instruction.

Context Relevance (LLM-Guided)

Uses - Datum.text - Prediction - Annotation.context

Context relevance is an LLM-guided metric that uses a query and a list of retrieved contexts to evaluate a RAG retrieval mechanism. Context relevance is the proportion of pieces of retrieved contexts that are relevant to the query. A piece of context is considered relevant to the query if any part of the context is relevant to answering the query. For example, a piece of context might be a paragraph of text, so if the answer or part of the answer to a query is contained somewhere in that paragraph, then that piece of context is considered relevant.

First, an LLM is prompted to determine if each piece of context is relevant to the query. Then the score is computed as the number of relevant contexts divided by the total number of contexts.

\[Context Relevance = \frac{\textnormal{Number of Relevant Contexts}}{\textnormal{Total Number of Contexts}}\]

Our implementation follows DeepEval's implementation. The LLM instruction was adapted from DeepEval's instruction.

Faithfulness (LLM-Guided)

Uses - Prediction - Annotation.text - Prediction - Annotation.context

Faithfulness is the proportion of claims from the predicted text that are implied by the retrieved contexts.

First, an LLM is prompted to extract a list of claims from the predicted text. Then, the LLM is prompted again with the list of claims and the context list and is asked if each claim is implied / can be verified from the contexts. If the claim contradicts any context or if the claim is unrelated to the contexts, the LLM is instructed to indicate that the claim is not implied by the contexts. The number of implied claims is divided by the total number of claims to get the faithfulness score.

\[Faithfulness = \frac{\textnormal{Number of Implied Claims}}{\textnormal{Total Number of Claims}}\]

Our implementation loosely follows and combines the strategies of DeepEval and RAGAS, however it is notable that DeepEval and RAGAS's definitions of faithfulness are not equivalent. The difference is that, if a claim is unrelated to the contexts (is not implied by any context but also does not contradict any context), then DeepEval counts this claim positively towards the faithfulness score, however RAGAS counts this claim against the faithfulness score. Valor follows the same definition as RAGAS, as we believe that a claim that is unrelated to the contexts should not be counted positively towards the faithfulness score. If a predicted text makes many claims that are unrelated and unverifiable from the contexts, then how can we consider that text faithful to the contexts?

We follow DeepEval's prompting strategy as this strategy is closer to the other prompting strategies in Valor, however we heavily modify the instructions. Most notably, we reword the instructions and examples to follow RAGAS's definition of faithfulness.

Hallucination (LLM-Guided)

Uses - Prediction - Annotation.text - Prediction - Annotation.context

Hallucination is the proportion of contexts that are contradicted by the predicted text. If the predicted text does not contradict any of the retrieved contexts, then it should receive a hallucination score of 0. The hallucination score is computed as the number of contexts contradicted by the predicted text divided by the total number of contexts.

Given the context list and the predicted text, an LLM is prompted to determine if the text agrees or contradicts with each piece of context. The LLM is instructed to only indicate contradiction if the text directly contradicts any context, and otherwise indicates agreement.

\[Hallucination = \frac{\textnormal{Number of Contradicted Contexts}}{\textnormal{Total Number of Contexts}}\]

Note the differences between faithfulness and hallucination. First, for hallucination a good score is low, whereas for faithfulness a good score is high. Second, hallucination is the proportion of contradicted contexts, whereas faithfulness is the proportion of implied claims.

Our implementation follows DeepEval's implementation.

Summarization Metrics

Summarization is the task of generating a shorter version of a piece of text that retains the most important information. Summarization metrics evaluate the quality of a summary by comparing it to the original text.

Note that Datum.text is used differently for summarization than for Q&A and RAG tasks. For summarization, the Datum.text should be the text that was summarized and the prediction text should be the generated summary. This is different than Q&A and RAG where the Datum.text is the query and the prediction text is the generated answer.

Summary Coherence (LLM-Guided)

Uses - Datum.text - Prediction - Annotation.text

Summary coherence is an LLM-guided metric that measures the collective quality of a summary on an integer scale of 1 to 5, where 5 indicates the highest summary coherence. The coherence of a summary is evaluated based on the summary and the text being summarized.

An LLM is prompted to evaluate the collective quality of a summary given the text being summarized. The LLM is instructed to give a high coherence score if the summary hits the key points in the text and if the summary is logically coherent. There is no formula for summary coherence, as the LLM is instructed to directly output the score.

Valor's implementation of the summary coherence metric uses an instruction that was adapted from appendix A of DeepEval's paper G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment. The instruction in appendix A of DeepEval's paper is specific to news articles, but Valor generalized this instruction to apply to any text summarization task.

Non-LLM-Guided Text Comparison Metrics

This section contains non-LLM-guided metrics for comparing a predicted text to one or more ground truth texts. These metrics can be run without specifying any LLM api parameters.

ROUGE

Uses - GroundTruth - Annotation.text - Prediction - Annotation.text

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. ROUGE metrics range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference.

In Valor, the ROUGE output value is a dictionary containing the following elements:

{
    "rouge1": 0.18, # unigram-based similarity scoring
    "rouge2": 0.08, # bigram-based similarity scoring
    "rougeL": 0.18, # similarity scoring based on sentences (i.e., splitting on "." and ignoring "\n")
    "rougeLsum": 0.18, # similarity scoring based on splitting the text using "\n"
}

Behind the scenes, we use Hugging Face's evaluate package to calculate these scores. Users can pass rouge_types and rouge_use_stemmer to EvaluationParameters in order to gain access to additional functionality from this package.

BLEU

Uses - GroundTruth - Annotation.text - Prediction - Annotation.text

BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating automatic summarization and machine translation software in natural language processing. BLEU's output is always a number between 0 and 1, where a score near 1 indicates that the hypothesis text is very similar to one or more of the reference texts.

Behind the scenes, we use nltk.translate.bleu_score to calculate these scores. The default BLEU metric calculates a score for up to 4-grams using uniform weights (i.e., weights=[.25, .25, .25, .25]; also called BLEU-4). Users can pass their own bleu_weights to EvaluationParameters in order to change this default behavior and calculate other BLEU scores.

Text Generation Metrics

Metric dataclass

answer_correctness(value, model_name, retries) classmethod

answer_relevance(value, model_name, retries) classmethod

bias(value, model_name, retries) classmethod

bleu(value, weights) classmethod

context_precision(value, model_name, retries) classmethod

context_recall(value, model_name, retries) classmethod

context_relevance(value, model_name, retries) classmethod

faithfulness(value, model_name, retries) classmethod

hallucination(value, model_name, retries) classmethod

rouge(value, rouge_type, use_stemmer) classmethod

summary_coherence(value, model_name, retries) classmethod

toxicity(value, model_name, retries) classmethod

Locally Computed Metrics

LLM-Guided Metrics

Appendix: Metric Calculations

General Text Generation Metrics

Bias (LLM-Guided)

Toxicity (LLM-Guided)

Q&A Metrics

Answer Correctness (LLM-Guided)

Answer Relevance (LLM-Guided)

RAG Metrics

Context Precision (LLM-Guided)

Context Recall (LLM-Guided)

Context Relevance (LLM-Guided)

Faithfulness (LLM-Guided)

Hallucination (LLM-Guided)

Summarization Metrics

Summary Coherence (LLM-Guided)

Non-LLM-Guided Text Comparison Metrics

ROUGE

BLEU

`Metric` `dataclass`

`answer_correctness(value, model_name, retries)` `classmethod`

`answer_relevance(value, model_name, retries)` `classmethod`

`bias(value, model_name, retries)` `classmethod`

`bleu(value, weights)` `classmethod`

`context_precision(value, model_name, retries)` `classmethod`

`context_recall(value, model_name, retries)` `classmethod`

`context_relevance(value, model_name, retries)` `classmethod`

`faithfulness(value, model_name, retries)` `classmethod`

`hallucination(value, model_name, retries)` `classmethod`

`rouge(value, rouge_type, use_stemmer)` `classmethod`

`summary_coherence(value, model_name, retries)` `classmethod`

`toxicity(value, model_name, retries)` `classmethod`