Evaluation

Evaluation is a form of testing that helps you validate your LLM’s responses and ensure they meet your quality bar.

Genkit supports third-party evaluation tools through plugins, paired with powerful observability features that provide insight into the runtime state of your LLM-powered applications. Genkit tooling helps you automatically extract data, including inputs, outputs, and information from intermediate steps, so that you can evaluate the end-to-end quality of LLM responses and understand the performance of your system’s building blocks.

Types of evaluation

Genkit supports two types of evaluation:

Inference-based evaluation: This type of evaluation runs against a collection of pre-determined inputs, assessing the corresponding outputs for quality.

This is the most common evaluation type, suitable for most use cases. This approach tests a system’s actual output for each evaluation run.

You can perform the quality assessment manually, by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.

Raw evaluation: This type of evaluation directly assesses the quality of inputs without any inference. This approach is typically used with automated, metric-based evaluation. All fields required for evaluation (e.g., input, context, output, and reference) must be present in the input dataset. This is useful when you have data from an external source (e.g., collected from your production traces) and want an objective measurement of its quality.

For more information, see the Advanced use section of this page.
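
As a rough illustration of the raw-evaluation requirement above, the following sketch checks a dataset for missing fields before it is handed to an evaluator. It is plain Dart and does not use any Genkit API. The names requiredFields and findMissingFields are hypothetical, the JSON array-of-objects shape is an assumption, and treating all four fields as required is also an assumption: which fields you actually need depends on the evaluators you run.

```dart
import 'dart:convert';

// Field names come from the description above. Treating all four as
// required is an assumption made for this sketch.
const requiredFields = ['input', 'context', 'output', 'reference'];

/// Maps each row index that has problems to the list of fields it lacks.
Map<int, List<String>> findMissingFields(String datasetJson) {
  final rows = jsonDecode(datasetJson) as List<dynamic>;
  final problems = <int, List<String>>{};
  for (var i = 0; i < rows.length; i++) {
    final row = rows[i] as Map<String, dynamic>;
    // A field counts as missing when it is absent or explicitly null.
    final missing = requiredFields.where((f) => row[f] == null).toList();
    if (missing.isNotEmpty) problems[i] = missing;
  }
  return problems;
}

void main() {
  // Hypothetical dataset: the second row lacks context and reference.
  const dataset = '''
[
  {
    "input": "Who is man's best friend?",
    "context": ["Dog is man's best friend"],
    "output": "The dog.",
    "reference": "Dogs"
  },
  {
    "input": "From which animals did dogs evolve?",
    "output": "Wolves."
  }
]
''';
  print(findMissingFields(dataset)); // prints {1: [context, reference]}
}
```

A check like this can catch incomplete rows in externally collected data before any metric is computed on them.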

This section explains how to perform inference-based evaluation using Genkit.

Genkit for Dart supports the full evaluation framework. You can run evaluations with or without automated metrics. If you want automated metrics in Dart, you can implement them as custom evaluators.

Quick start

Setup

Use an existing Genkit app or create a new one by following our Get started guide.
Add the following code to define a simple RAG application to evaluate. For this guide, we simulate retrieval by providing a hardcoded list of documents.

import 'package:genkit/genkit.dart';
import 'package:genkit_google_genai/genkit_google_genai.dart';

final ai = Genkit(plugins: [googleAI()]);

// A simple question-answering flow
final qaFlow = ai.defineFlow(
  name: 'qaFlow',
  inputSchema: .string(),
  outputSchema: .string(),
  fn: (query, context) async {
    final facts = [
      "Dog is man's best friend",
      'Dogs have evolved and were domesticated from wolves',
    ];

    final prompt = '''
Answer this question with the given context:
Question: $query
Context:
${facts.map((f) => "- $f").join('\n')}
''';

    final response = await ai.generate(
      model: googleAI.gemini('gemini-flash-latest'),
      prompt: prompt,
    );
    return response.text ?? '';
  },
);

Start your Genkit application.

genkit start -- dart run bin/evals.dart

Create a dataset

Create a dataset that defines the examples to use when evaluating the flow.

Go to the Dev UI at http://localhost:4000 and click the Datasets button to open the Datasets page.
Click the Create Dataset button to open the create dataset dialog.
  a. Provide a datasetId for your new dataset. This guide uses myFactsQaDataset.
  b. Select the Flow dataset type.
  c. Leave the validation target field empty and click Save.
Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:
  a. Click the Add example button to open the example editor panel.
  b. Only the input field is required. Enter "Who is man's best friend?" in the input field, and click Save.
  c. Repeat steps (a) and (b) to add more examples:
- "Can I give milk to my cats?"
- "From which animals did dogs evolve?"

Run evaluation and view results

To start evaluating the flow, click the Run new evaluation button on your dataset page.

Select the Flow radio button to evaluate a flow.
Select qaFlow as the target flow to evaluate.
Select myFactsQaDataset as the target dataset to use for evaluation.
(Optional) If you have defined custom evaluators, you can select them here. Otherwise, you can run the evaluation without metrics to inspect the outputs manually.
Click Run evaluation to start the evaluation. Once it completes, click the link to go to the Evaluation details page and view the results.

Core concepts

Terminology

Evaluation: A process that assesses system performance.
Bulk inference: Running inference on multiple inputs simultaneously.
Metric: A criterion on which an inference is scored. In Dart, metrics are implemented as custom evaluators.
Dataset: A collection of examples to use for inference-based evaluation.
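
To make the idea of bulk inference concrete, here is a minimal sketch in plain Dart that runs one asynchronous function over several inputs at once. The function fakeFlow is a hypothetical stand-in for a flow such as qaFlow and does not call a model, and runBulk is likewise a name invented for this example. How Genkit schedules calls internally is not described on this page, so treat this only as an illustration of the concept.

```dart
/// Hypothetical stand-in for a flow; returns a canned answer.
Future<String> fakeFlow(String query) async {
  return 'answer to: $query';
}

/// Starts one call per input and waits for all of them to finish.
/// Future.wait returns results in the same order as the inputs.
Future<List<String>> runBulk(List<String> inputs) {
  return Future.wait(inputs.map(fakeFlow));
}

Future<void> main() async {
  final outputs = await runBulk([
    "Who is man's best friend?",
    'From which animals did dogs evolve?',
  ]);
  outputs.forEach(print);
}
```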

Custom evaluators

You can extend Genkit to support custom evaluation by defining your own evaluator functions. An evaluator can use an LLM as a judge, perform programmatic (heuristic) checks, or call external APIs to assess the quality of a response.

You define a custom evaluator using the ai.defineEvaluator method.

Here’s an example of a custom evaluator:

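import 'package:genkit/genkit.dart';

// Assumes `ai` is an existing Genkit instance, such as the one created in Setup.
ai.defineEvaluator(
  name: 'custom',
  description: 'Custom evaluator',
  fn: (input, context) async {
    // Return one EvalFnResponse for each test case in the dataset.
    return [
      ...input.dataset.map(
        (d) => EvalFnResponse(
          testCaseId: d.testCaseId!,
          evaluation: EvalFnResponseEvaluation.score(
            // Placeholder result: every test case is marked as passing.
            // Replace this with your own scoring logic.
            Score(
              score: ScoreScore.bool(true),
              status: EvalStatusEnum.PASS,
              details: {'reasoning': 'something, something, something....'},
            ),
          ),
        ),
      ),
    ];
  },
);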
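
The evaluator above always returns a passing score. As one possible programmatic (heuristic) check, the sketch below computes how many expected keywords appear in an output. It is plain Dart with no Genkit dependency. The names keywordCoverage and passThreshold, and the threshold value itself, are hypothetical choices for illustration.

```dart
/// Fraction of the given keywords found in the output, from 0.0 to 1.0.
/// Matching is case-insensitive substring matching.
double keywordCoverage(String output, List<String> keywords) {
  if (keywords.isEmpty) return 1.0;
  final text = output.toLowerCase();
  final hits = keywords.where((k) => text.contains(k.toLowerCase())).length;
  return hits / keywords.length;
}

/// Hypothetical pass threshold chosen for this example.
const passThreshold = 0.5;

void main() {
  final score = keywordCoverage(
    'Dogs evolved from wolves.',
    ['wolves', 'domesticated'],
  );
  final passed = score >= passThreshold;
  print('score=$score passed=$passed'); // prints score=0.5 passed=true
}
```

Inside the fn callback of a custom evaluator, a function like this could supply the score and status in place of the hardcoded values.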

Advanced use

Evaluation using the CLI

The Genkit CLI provides three main evaluation commands: eval:flow, eval:extractData, and eval:run.

Refer to the Node.js or Go sections for more details on using these commands, as the CLI usage is consistent across languages.