Evaluation
Evaluation is a form of testing that helps you validate your LLM’s responses and ensure they meet your quality bar.
Genkit supports third-party evaluation tools through plugins, paired with powerful observability features that provide insight into the runtime state of your LLM-powered applications. Genkit tooling helps you automatically extract data including inputs, outputs, and information from intermediate steps to evaluate the end-to-end quality of LLM responses as well as understand the performance of your system’s building blocks.
Types of evaluation
Genkit supports two types of evaluation:
- Inference-based evaluation: This type of evaluation runs against a collection of pre-determined inputs, assessing the corresponding outputs for quality.
  This is the most common evaluation type, suitable for most use cases. This approach tests a system’s actual output for each evaluation run.
  You can perform the quality assessment manually, by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.
- Raw evaluation: This type of evaluation directly assesses the quality of inputs without any inference. This approach is typically used with automated evaluation using metrics. All required fields for evaluation (e.g., input, context, output, and reference) must be present in the input dataset. This is useful when you have data coming from an external source (e.g., collected from your production traces) and you want an objective measurement of the quality of the collected data.
  For more information, see the Advanced use section of this page.
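To make the raw-evaluation requirement concrete, here is a minimal plain-Dart sketch that checks whether collected examples carry the fields named above before they are used as a dataset. The field names come from the text; the sample values and the `missingFields` helper are hypothetical, and which fields are truly required may depend on the evaluators you run.

```dart
import 'dart:convert';

// Field names follow the list above. The helper and the sample data are
// hypothetical and only illustrate the shape of a raw-evaluation example.
const requiredFields = ['input', 'context', 'output', 'reference'];

/// Returns the required fields that are absent from [example].
List<String> missingFields(Map<String, Object?> example) =>
    requiredFields.where((f) => !example.containsKey(f)).toList();

final rawDataset = <Map<String, Object?>>[
  {
    'input': "Who is man's best friend?",
    'context': ["Dog is man's best friend"],
    'output': "The dog is widely called man's best friend.",
    'reference': 'Dog',
  },
  {
    // Incomplete on purpose: no context or reference was collected.
    'input': 'From which animals did dogs evolve?',
    'output': 'Dogs evolved from wolves.',
  },
];

void main() {
  for (final example in rawDataset) {
    final missing = missingFields(example);
    print(missing.isEmpty
        ? 'ok: ${example['input']}'
        : 'missing $missing: ${example['input']}');
  }
  // Serialize only the complete examples, e.g. to save them as a dataset file.
  final complete = rawDataset.where((e) => missingFields(e).isEmpty).toList();
  print(const JsonEncoder.withIndent('  ').convert(complete));
}
```

Filtering before evaluation keeps incomplete production records from silently skewing metric results.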
This section explains how to perform inference-based evaluation using Genkit.
Genkit for Dart supports the full evaluation framework. You can run evaluations with or without automated metrics. If you want automated metrics in Dart, you can implement them as custom evaluators.
Quick start
- Use an existing Genkit app or create a new one by following our Get started guide.
- Add the following code to define a simple RAG application to evaluate. For this guide, we simulate retrieval by providing a hardcoded list of documents.
import 'package:genkit/genkit.dart';
import 'package:genkit_google_genai/genkit_google_genai.dart';

final ai = Genkit(plugins: [googleAI()]);

// A simple question-answering flow
final qaFlow = ai.defineFlow(
  name: 'qaFlow',
  inputSchema: .string(),
  outputSchema: .string(),
  fn: (query, context) async {
    final facts = [
      "Dog is man's best friend",
      'Dogs have evolved and were domesticated from wolves',
    ];

    final prompt = '''
Answer this question with the given context:
Question: $query
Context:
${facts.map((f) => "- $f").join('\n')}
''';

    final response = await ai.generate(
      model: googleAI.gemini('gemini-2.5-flash'),
      prompt: prompt,
    );
    return response.text ?? '';
  },
);
- Start your Genkit application.
genkit start -- dart run bin/evals.dart
Create a dataset
Create a dataset to define the examples we want to use for evaluating our flow.
- Go to the Dev UI at http://localhost:4000 and click the Datasets button to open the Datasets page.
- Click on the Create Dataset button to open the create dataset dialog.
  a. Provide a datasetId for your new dataset. This guide uses myFactsQaDataset.
  b. Select Flow dataset type.
  c. Leave the validation target field empty and click Save.
- Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:
  a. Click the Add example button to open the example editor panel.
  b. Only the input field is required. Enter "Who is man's best friend?" in the input field, and click Save.
  c. Repeat steps (a) and (b) to add more examples:
     "Can I give milk to my cats?"
     "From which animals did dogs evolve?"
Run evaluation and view results
To start evaluating the flow, click the Run new evaluation button on your dataset page.
- Select the Flow radio button to evaluate a flow.
- Select qaFlow as the target flow to evaluate.
- Select myFactsQaDataset as the target dataset to use for evaluation.
- (Optional) If you have defined custom evaluators, you can select them here. Otherwise, you can run the evaluation without metrics to inspect the outputs manually.
- Click Run evaluation to start evaluation. Once complete, click the link to go to the Evaluation details page to view the results.
Core concepts
Terminology
- Evaluation: A process that assesses system performance.
- Bulk inference: Running inference on multiple inputs simultaneously.
- Metric: A criterion on which an inference is scored. In Dart, metrics are implemented as custom evaluators.
- Dataset: A collection of examples to use for inference-based evaluation.
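As a rough illustration of the bulk inference term above, the following plain-Dart sketch runs a stand-in inference function over several inputs concurrently with `Future.wait`. Both `fakeInfer` and `bulkInfer` are hypothetical names, not Genkit APIs, and this page does not say how Genkit schedules the calls internally.

```dart
// Hypothetical stand-in for a model or flow call; not a Genkit API.
Future<String> fakeInfer(String input) async {
  await Future<void>.delayed(const Duration(milliseconds: 10));
  return 'answer to: $input';
}

/// Runs [infer] on every input concurrently and pairs each input with
/// its output. Future.wait returns results in the same order as inputs.
Future<Map<String, String>> bulkInfer(
  List<String> inputs,
  Future<String> Function(String) infer,
) async {
  final outputs = await Future.wait(inputs.map(infer));
  return Map.fromIterables(inputs, outputs);
}

Future<void> main() async {
  final results = await bulkInfer(
    ["Who is man's best friend?", 'From which animals did dogs evolve?'],
    fakeInfer,
  );
  results.forEach((input, output) => print('$input -> $output'));
}
```

Because the result is keyed by input text, duplicate inputs collapse into one entry; a list of pairs would be a better fit if the dataset can repeat questions.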
Custom evaluators
You can extend Genkit to support custom evaluation by defining your own evaluator functions. An evaluator can use an LLM as a judge, perform programmatic (heuristic) checks, or call external APIs to assess the quality of a response.
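As one possible shape for a programmatic (heuristic) check, the sketch below tests whether a response mentions a set of expected keywords. It is plain Dart with no Genkit dependency; `HeuristicResult` and `containsAllKeywords` are hypothetical names, and an evaluator function would still need to map the result onto a score and status.

```dart
/// Result of a heuristic check; a hypothetical type, not part of Genkit.
class HeuristicResult {
  const HeuristicResult(this.passed, this.reasoning);
  final bool passed;
  final String reasoning;
}

/// Passes when [output] mentions every keyword, ignoring case.
HeuristicResult containsAllKeywords(String output, List<String> keywords) {
  final lower = output.toLowerCase();
  final missing =
      keywords.where((k) => !lower.contains(k.toLowerCase())).toList();
  return missing.isEmpty
      ? const HeuristicResult(true, 'all keywords present')
      : HeuristicResult(false, 'missing keywords: ${missing.join(', ')}');
}

void main() {
  final result = containsAllKeywords('Dogs evolved from wolves.', ['wolves']);
  print('${result.passed}: ${result.reasoning}');
}
```

Keeping the check as a pure function makes it easy to unit test separately from any evaluator wiring.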
You define a custom evaluator using the ai.defineEvaluator method.
Here’s an example of a custom evaluator:
import 'package:genkit/genkit.dart';

ai.defineEvaluator(
  name: 'custom',
  description: 'Custom evaluator',
  fn: (input, context) async {
    return [
      ...input.dataset.map(
        (d) => EvalFnResponse(
          testCaseId: d.testCaseId!,
          evaluation: EvalFnResponseEvaluation.score(
            Score(
              score: ScoreScore.bool(true),
              status: EvalStatusEnum.PASS,
              details: {'reasoning': 'something, something, something....'},
            ),
          ),
        ),
      ),
    ];
  },
);
Advanced use
Evaluation using the CLI
The Genkit CLI provides three main evaluation commands: eval:flow, eval:extractData, and eval:run.
Refer to the Node.js or Go sections for more details on using these commands, as the CLI usage is consistent across languages.