EvaluationParameters

Classes

`valor.schemas.evaluation.EvaluationParameters` `dataclass`

Defines parameters for evaluation methods.

Attributes

task_type: TaskType
    The task type of a given evaluation.
label_map: Optional[List[List[List[str]]]]
    Optional mapping of individual labels to a grouper label. Useful when you need to evaluate performance using labels that differ across datasets and models.
metrics_to_return: List[MetricType], optional
    The list of metrics to compute, store, and return to the user.
llm_api_params: Dict[str, str | dict], optional
    A dictionary of parameters for the LLM API.
convert_annotations_to_type: AnnotationType | None = None
    The type to convert all annotations to.
iou_thresholds_to_compute: List[float], optional
    A list of floats describing which Intersection over Unions (IoUs) to use when calculating metrics (i.e., mAP).
iou_thresholds_to_return: List[float], optional
    A list of floats describing which Intersection over Union (IoUs) thresholds to calculate a metric for. Must be a subset of `iou_thresholds_to_compute`.
recall_score_threshold: float, default=0
    The confidence score threshold for use when determining whether to count a prediction as a true positive or not while calculating Average Recall.
pr_curve_iou_threshold: float, optional
    The IOU threshold to use when calculating precision-recall curves for object detection tasks. Defaults to 0.5.
pr_curve_max_examples: int
    The maximum number of datum examples to store when calculating PR curves.
bleu_weights: list[float], optional
    The weights to use when calculating BLEU scores.
rouge_types: list[ROUGEType]
    A list of rouge types to calculate. Options are ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], where `rouge1` is unigram-based scoring, `rouge2` is bigram-based scoring, `rougeL` is scoring based on sentences (i.e., splitting on "." and ignoring "

"), and rougeLsum is scoring based on splitting the text using " ". rouge_use_stemmer: bool If True, uses Porter stemmer to strip word suffixes.

Source code in valor/schemas/evaluation.py

@dataclass
class EvaluationParameters:
    """
    Defines parameters for evaluation methods.

    Attributes
    ----------
    task_type: TaskType
        The task type of a given evaluation.
    label_map: Optional[List[List[List[str]]]]
        Optional mapping of individual labels to a grouper label. Useful when you need to evaluate performance using labels that differ across datasets and models.
    metrics_to_return: List[MetricType], optional
        The list of metrics to compute, store, and return to the user.
    llm_api_params: Dict[str, str | dict], optional
        A dictionary of parameters for the LLM API.
    convert_annotations_to_type: AnnotationType | None = None
        The type to convert all annotations to.
    iou_thresholds_to_compute: List[float], optional
        A list of floats describing which Intersection over Unions (IoUs) to use when calculating metrics (i.e., mAP).
    iou_thresholds_to_return: List[float], optional
        A list of floats describing which Intersection over Union (IoUs) thresholds to calculate a metric for. Must be a subset of `iou_thresholds_to_compute`.
    recall_score_threshold: float, default=0
        The confidence score threshold for use when determining whether to count a prediction as a true positive or not while calculating Average Recall.
    pr_curve_iou_threshold: float, optional
        The IOU threshold to use when calculating precision-recall curves for object detection tasks. Defaults to 0.5.
    pr_curve_max_examples: int
        The maximum number of datum examples to store when calculating PR curves.
    bleu_weights: list[float], optional
        The weights to use when calculating BLEU scores.
    rouge_types: list[ROUGEType]
        A list of rouge types to calculate. Options are ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], where `rouge1` is unigram-based scoring, `rouge2` is bigram-based scoring, `rougeL` is scoring based on sentences (i.e., splitting on "." and ignoring "\n"), and `rougeLsum` is scoring based on splitting the text using "\n".
    rouge_use_stemmer: bool
        If True, uses Porter stemmer to strip word suffixes.
    """

    task_type: TaskType
    label_map: Optional[List[List[List[str]]]] = None
    metrics_to_return: Optional[List[MetricType]] = None
    llm_api_params: Optional[Dict[str, Union[str, dict]]] = None

    convert_annotations_to_type: Optional[AnnotationType] = None
    iou_thresholds_to_compute: Optional[List[float]] = None
    iou_thresholds_to_return: Optional[List[float]] = None
    recall_score_threshold: float = 0
    pr_curve_iou_threshold: float = 0.5
    pr_curve_max_examples: int = 1
    bleu_weights: Optional[List[float]] = None
    rouge_types: Optional[List[ROUGEType]] = None
    rouge_use_stemmer: Optional[bool] = None

`valor.schemas.evaluation.EvaluationRequest` `dataclass`

An evaluation request.

Defines important attributes of the API's EvaluationRequest.

Attributes:

Name	Type	Description
`dataset_names`	`List[str]`	The list of datasets we want to evaluate by name.
`model_names`	`List[str]`	The list of models we want to evaluate by name.
`filters`	`dict`	The filter object used to define what the model(s) is evaluating against.
`parameters`	`EvaluationParameters`	Any parameters that are used to modify an evaluation method.

Source code in valor/schemas/evaluation.py

@dataclass
class EvaluationRequest:
    """
    An evaluation request.

    Defines important attributes of the API's `EvaluationRequest`.

    Attributes
    ----------
    dataset_names : List[str]
        The list of datasets we want to evaluate by name.
    model_names : List[str]
        The list of models we want to evaluate by name.
    filters : dict
        The filter object used to define what the model(s) is evaluating against.
    parameters : EvaluationParameters
        Any parameters that are used to modify an evaluation method.
    """

    dataset_names: Union[str, List[str]]
    model_names: Union[str, List[str]]
    parameters: EvaluationParameters
    filters: Filter = field(default_factory=Filter)

    def __post_init__(self):
        if isinstance(self.filters, dict):
            self.filters = Filter(**self.filters)
        elif self.filters is None:
            self.filters = Filter()

        if isinstance(self.parameters, dict):
            self.parameters = EvaluationParameters(**self.parameters)

    def to_dict(self) -> dict:
        """
        Converts the request into a JSON-compatible dictionary.
        """
        return {
            "dataset_names": self.dataset_names,
            "model_names": self.model_names,
            "parameters": asdict(self.parameters),
            "filters": self.filters.to_dict(),
        }

Functions

`valor.schemas.evaluation.EvaluationRequest.to_dict()`

Converts the request into a JSON-compatible dictionary.

Source code in valor/schemas/evaluation.py

def to_dict(self) -> dict:
    """
    Converts the request into a JSON-compatible dictionary.
    """
    return {
        "dataset_names": self.dataset_names,
        "model_names": self.model_names,
        "parameters": asdict(self.parameters),
        "filters": self.filters.to_dict(),
    }

EvaluationParameters

Classes

valor.schemas.evaluation.EvaluationParameters dataclass

valor.schemas.evaluation.EvaluationRequest dataclass

Functions

valor.schemas.evaluation.EvaluationRequest.to_dict()

`valor.schemas.evaluation.EvaluationParameters` `dataclass`

`valor.schemas.evaluation.EvaluationRequest` `dataclass`

`valor.schemas.evaluation.EvaluationRequest.to_dict()`