Retrieval-augmented generation (RAG)
Genkit provides abstractions that help you build retrieval-augmented generation (RAG) flows, as well as plugins that provide integrations with related tools.
What is RAG?
Retrieval-augmented generation is a technique used to incorporate external sources of information into an LLM’s responses. It’s important to be able to do so because, while LLMs are typically trained on a broad body of material, practical use of LLMs often requires specific domain knowledge (for example, you might want to use an LLM to answer customers’ questions about your company’s products).
One solution is to fine-tune the model using more specific data. However, this can be expensive both in terms of compute cost and in terms of the effort needed to prepare adequate training data.
In contrast, RAG works by incorporating external data sources into a prompt at the time it’s passed to the model. For example, you could imagine the prompt, “What is Bart’s relationship to Lisa?” might be expanded (“augmented”) by prepending some relevant information, resulting in the prompt, “Homer and Marge’s children are named Bart, Lisa, and Maggie. What is Bart’s relationship to Lisa?”
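As a toy illustration (plain Python, not a Genkit API), augmentation can be as simple as prepending the retrieved facts to the user’s question before sending the combined text to the model:

```python
# Toy illustration of augmentation: prepend retrieved context to the user's question.
retrieved_fact = "Homer and Marge's children are named Bart, Lisa, and Maggie."
user_question = "What is Bart's relationship to Lisa?"
augmented_prompt = f'{retrieved_fact}\n\n{user_question}'
```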
This approach has several advantages:
- It can be more cost-effective because you don’t have to retrain the model.
- You can continuously update your data source and the LLM can immediately make use of the updated information.
- You now have the potential to cite references in your LLM’s responses.
On the other hand, using RAG naturally means longer prompts, and some LLM API services charge for each input token you send. Ultimately, you must evaluate the cost tradeoffs for your applications.
For Python, Genkit’s RAG support centers on the shared Document model and embedders: use ai.embed and ai.embed_many with plugin-registered embedder names to produce vectors. Combine those embeddings with your ingestion, storage, and search code, then pass retrieved documents into ai.generate(..., docs=...) to ground answers.
Indexers and retrievers
RAG is a broad area, and many different techniques are used to achieve the best-quality results. Conceptually, most pipelines involve:
- Indexing: add documents to a store (often chunk → embed → upsert vectors).
- Embedding: turn text (or other content) into vectors via a Genkit embedder action.
- Retrieval: fetch relevant chunks for a query (your DB client or search layer).
- Generation: call the model with the user question and retrieved context.
Genkit provides the Document model and embedder actions for turning content into vectors you can use throughout that pipeline.
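For example, wrapping a chunk of text in a Document with metadata might look like the following sketch (the metadata keys are arbitrary illustrations, not a required schema):

```python
from genkit import Document

# Metadata keys here are arbitrary examples; use whatever your search layer needs.
chunk = Document.from_text(
    'Our sorbet is dairy-free and made in a nut-free kitchen.',
    metadata={'source': 'menu-faq', 'chunk_id': 12},
)
```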
Ingestion
Typical steps: chunk source text, compute embeddings with ai.embed or ai.embed_many, then write text + vector + metadata to your database. The following is illustrative only—the storage calls are placeholders for your client library.
```python
from genkit import Document, Genkit

ai = Genkit(plugins=[...])  # e.g. VertexAI / GoogleAI for embedders


async def ingest_chunks(chunks: list[Document], embedder: str) -> None:
    for doc in chunks:
        embeddings = await ai.embed(embedder=embedder, content=doc)
        vector = embeddings[0].embedding
        # await my_vector_db.upsert(text=doc.text, embedding=vector, metadata=doc.metadata)
        _ = vector  # replace with real persistence
```

Embedders
Use plugin-registered embedder names (for example vertexai/text-embedding-005) with ai.embed / ai.embed_many. Keep the same model for indexing and query embedding when using vector similarity search.
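A minimal batch-embedding sketch, assuming ai.embed_many mirrors ai.embed’s keyword arguments and returns one embedding per input (check your plugin’s docs for the exact shape):

```python
# Assumption: embed_many takes the same keywords as embed and returns one embedding per input.
texts = ['Our sorbet is dairy-free.', 'The fruit tart contains dairy.']
docs = [Document.from_text(t) for t in texts]
embeddings = await ai.embed_many(embedder='vertexai/text-embedding-005', content=docs)
vectors = [e.embedding for e in embeddings]
```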
Generation with docs
Implement search with your datastore (vector search, hybrid search, etc.), producing a list[Document]. Pass that list as docs to ai.generate so the model can use the context.
```python
from genkit import Document, Genkit
from genkit.plugins.google_genai import VertexAI

ai = Genkit(
    plugins=[VertexAI(location='us-central1')],
    model='vertexai/gemini-2.5-flash',
)


async def fetch_context(user_query: str, limit: int = 3) -> list[Document]:
    """Replace with your vector store: embed the query, search, map hits to Documents."""
    embeddings = await ai.embed(
        embedder='vertexai/text-embedding-005',
        content=user_query,
    )
    _query_vector = embeddings[0].embedding
    # rows = await my_vector_db.search(_query_vector, limit=limit)
    # return [Document.from_text(r.text, metadata=r.meta) for r in rows]
    return [
        Document.from_text('Example: seasonal dessert — fruit tart (contains dairy).'),
        Document.from_text('Example: allergen note — sorbet is dairy-free.'),
    ][:limit]


@ai.flow()
async def qa_flow(query: str) -> str:
    docs = await fetch_context(query)
    response = await ai.generate(
        prompt=query,
        docs=docs,
    )
    return response.text
```

Run the flow
```python
result = await qa_flow('Recommend a dessert while avoiding dairy and nuts')
print(result)
```

Composing search and prompts
Any async function that returns list[Document] can supply context: call databases, APIs, or custom ranking, then pass the result to generate(docs=...). Advanced patterns (reranking, prompt expansion) are ordinary Python code composed around embed and generate.
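For instance, a second context source can be merged with the vector-store results before generation. In this sketch, fetch_faq_docs is a hypothetical placeholder and fetch_context is the function defined earlier:

```python
async def fetch_faq_docs(user_query: str) -> list[Document]:
    # Hypothetical second source: keyword search, an internal API, a CMS, etc.
    return [Document.from_text('FAQ: our kitchen also handles tree nuts.')]


@ai.flow()
async def qa_with_composed_context(query: str) -> str:
    vector_docs = await fetch_context(query)  # vector-store results (defined above)
    faq_docs = await fetch_faq_docs(query)    # additional context
    docs = (vector_docs + faq_docs)[:5]       # naive merge and cap; swap in real reranking here
    response = await ai.generate(prompt=query, docs=docs)
    return response.text
```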
Next steps
- Learn about tool calling to give your RAG system access to external APIs and functions
- Explore multi-agent systems for coordinating multiple AI agents with RAG capabilities
- Check out the vector database plugins for production-ready RAG implementations
- See the evaluation guide for testing and improving your RAG system’s performance