Building a Production-Ready RAG (Retrieval-Augmented Generation) System with .NET and Azure AI Search

Introduction

Retrieval‑augmented generation (RAG) is a pattern that combines information retrieval with large language model (LLM) inference. Instead of asking an LLM to answer questions from memory, a RAG system first retrieves relevant facts from your own content, then feeds those facts into a generative model to produce an answer. Microsoft’s Azure AI Search and Azure OpenAI services make it straightforward to build a RAG solution on .NET. This article explores the architecture of a production‑ready RAG pipeline, explains how Azure AI Search supports hybrid search and vector search, and demonstrates how to assemble a complete solution in C#. Whether you are working on a chat bot, knowledge assistant or enterprise search experience, the guidelines here will help you deliver reliable, explainable, cost‑effective AI.

What is Retrieval‑Augmented Generation?

A RAG system augments an LLM with a dedicated retrieval step that injects fresh context into the model’s prompt. Microsoft describes RAG as a pattern in which an application “augments a chat model with an information retrieval system that incorporates enterprise content”. The typical flow is:

  1. A user submits a question via a chat interface or API.
  2. The question is sent to an information retrieval layer that searches a corpus of company documents. In Azure AI Search this may involve keyword search, semantic ranking, vector similarity or hybrid queries. The system returns the most relevant passages along with associated metadata.
  3. The retrieved passages are combined with the original question to form a prompt for the LLM. The prompt instructs the model to answer by using the provided facts and to cite sources.
  4. The LLM generates a response that is delivered to the user. Optionally a post‑processing step runs safety and factuality checks before returning the answer.

RAG offers several advantages: it grounds answers in authoritative data, improves factual accuracy, reduces hallucinations and allows you to keep sensitive information behind your firewall. However, building a robust system requires careful attention to data quality, indexing, retrieval and prompt design. The sections below walk through these concerns in detail.

Core Components of a Production‑Ready RAG System

At a high level a production RAG pipeline comprises three stages: ingestion, retrieval and generation. The ingestion stage prepares your data for retrieval by cleaning, chunking and indexing it. The retrieval stage executes fast queries over the index to find the most relevant chunks for a given question. Finally, the generation stage composes a prompt from the question and retrieved chunks and calls the LLM.
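To make the query-time path concrete, a minimal orchestration is sketched below. RetrieveAsync and GenerateAnswerAsync are implemented later in this article; the index name and the topK value are placeholders.

using System.Threading.Tasks;

// Minimal query-time orchestration: retrieve relevant chunks, then generate an answer.
public async Task<string> AnswerAsync(string question)
{
    var passages = await RetrieveAsync("rag-chunks", question, topK: 5);
    return await GenerateAnswerAsync(question, passages);
}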

Ingestion: Preparing and Indexing Content

The ingestion process is critical to retrieval quality. Microsoft’s advanced RAG guidance notes that content quality is paramount: you should standardise text, handle special characters, and remove irrelevant content before indexing. Key tasks include:

  • Text extraction and cleaning. Extract text from PDFs, Word documents or databases, then normalise encodings, fix broken sentence boundaries and remove headers or footers. Track document IDs and metadata to maintain an audit trail.
  • Chunking strategy. The search index stores content as “chunks”. Choose a chunk size that balances recall and context: too small and the LLM receives disconnected fragments; too large and retrieval accuracy suffers. Microsoft recommends experimenting with overlapping windows, sliding windows, and hierarchical or specialised indexes to improve coverage; a minimal chunking sketch follows this list.
  • Metadata and tagging. Enrich each chunk with metadata such as source document name, section heading, page number and access rights. This allows retrieval to filter results and instructs the LLM on how to cite sources.
  • Vectorisation. Compute embeddings for each chunk using a model like text-embedding-ada-002 from Azure OpenAI, or rely on Azure AI Search’s integrated vectorisation (more on this later). Embeddings capture semantic meaning beyond simple keyword matching and power vector search.
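As a concrete illustration of the chunking strategy above, here is a minimal sliding-window chunker. The DocumentChunk record is a simple type consumed by the ingestion code later in this article, and the 1,000-character window with 200-character overlap is only a starting point; tune both against your own corpus.

using System;
using System.Collections.Generic;

// A simple chunk record consumed by the ingestion routine later in this article.
public sealed record DocumentChunk(string Id, string Content, string SourceFile, int PageNumber);

// Minimal sliding-window chunker; window and overlap sizes are illustrative.
public static IEnumerable<DocumentChunk> ChunkText(
    string text, string sourceFile, int pageNumber,
    int chunkSize = 1000, int overlap = 200)
{
    for (int start = 0, index = 0; start < text.Length; start += chunkSize - overlap, index++)
    {
        int length = Math.Min(chunkSize, text.Length - start);

        // Azure AI Search document keys must be URL-safe; sanitise the id if needed.
        yield return new DocumentChunk(
            Id: $"{sourceFile}-p{pageNumber}-c{index}",
            Content: text.Substring(start, length),
            SourceFile: sourceFile,
            PageNumber: pageNumber);

        if (start + length >= text.Length)
        {
            break;
        }
    }
}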

Azure AI Search provides indexers that can connect to Blob Storage, SQL, Cosmos DB or SharePoint, extract documents and apply vectorisation during indexing. When creating the index, mark fields with the retrievable attribute so that the retrieval layer can return them later. If you enable semantic ranker and vector search, the index will store both full‑text and vector representations, enabling hybrid queries.

Retrieval: Searching with Azure AI Search

The retrieval stage finds the most relevant chunks for a given question. Azure AI Search supports multiple search modalities that can be combined for maximum recall and precision:

  • Keyword search. Basic full‑text search that matches words in the query. It’s fast and effective but cannot capture semantic similarity.
  • Semantic search. A re‑ranking service that uses deep learning to improve the ordering of full‑text results. It provides a more contextual understanding of language and supports summarisation.
  • Vector search. Searches over numeric embeddings rather than raw text. It finds passages that are conceptually similar to the query, even if they do not share keywords. Azure AI Search supports multilingual and multimodal vector search and allows hybrid search combining vector and keyword queries.
  • Hybrid search. Combines keyword and vector similarity scores in one query. Microsoft notes that hybrid queries with semantic ranker produce the most relevant results because they maximise recall and precision.

To maximise retrieval quality, Microsoft recommends using hybrid search with vector similarity, semantic ranking and scoring profiles. You can control the number of returned results (K) and the weighting between keyword and vector scores. Filtering results by metadata (e.g., tenant ID or access level) ensures compliance and relevance.
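For example, filters and scoring profiles are applied through SearchOptions. In the sketch below the tenantId and accessLevel fields and the boost-fresh scoring profile are assumptions for illustration; they are not part of the index defined later in this article.

using Azure.Search.Documents;

// Illustrative only: assumes filterable tenantId/accessLevel fields and a
// scoring profile named "boost-fresh" exist on your index.
public static SearchOptions BuildFilteredOptions(string tenantId, int maxAccessLevel) =>
    new SearchOptions
    {
        Size = 5,                        // K: number of results to return
        // SearchFilter.Create escapes interpolated values safely for OData filters.
        Filter = SearchFilter.Create($"tenantId eq {tenantId} and accessLevel le {maxAccessLevel}"),
        ScoringProfile = "boost-fresh"   // server-side relevance boosting
    };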

How Vector Search Works

Vector search in Azure AI Search uses nearest neighbour algorithms to retrieve the k most similar vectors to a query. The service can store vectors in an index using integrated vectorisation (which automatically generates embeddings during indexing) or accept externally generated vectors. When issuing a query, you can provide a raw vector (if you compute embeddings client‑side) or supply a text query and let integrated vectorisation compute its embedding on the fly. Vector search is available on all service tiers at no extra cost and can be combined with filters and semantic ranking.
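In the .NET SDK these two modes map onto two query types (assuming a recent Azure.Search.Documents release): a VectorizedQuery carries an embedding you computed client-side, while a VectorizableTextQuery carries raw text for the service to embed at query time, which requires a vectorizer to be configured on the index.

using System;
using Azure.Search.Documents.Models;

// Placeholder embedding; in practice this comes from your embedding model.
ReadOnlyMemory<float> queryEmbedding = new float[1536];

// 1) Client-side embeddings: pass the raw vector yourself.
var byVector = new VectorizedQuery(queryEmbedding)
{
    KNearestNeighborsCount = 5,
    Fields = { "embedding" }
};

// 2) Integrated vectorisation: pass text and let the service embed it at query time
//    (requires a vectorizer on the index; available in newer SDK versions).
var byText = new VectorizableTextQuery("How do I reset my password?")
{
    KNearestNeighborsCount = 5,
    Fields = { "embedding" }
};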

Generation: Composing Prompts and Calling the LLM

Once you have the top‑K passages, you assemble a prompt for your chosen LLM. A typical prompt includes:

  • A system message defining the assistant’s behaviour, tone and constraints.
  • One or more context messages containing retrieved passages with citation markers (e.g., [doc1][doc2]). Instruct the model to use these citations when forming the answer.
  • The user’s original question.

For Azure OpenAI, you call the ChatCompletion API with these messages. Set parameters such as temperature, max tokens and stop sequences to balance creativity and factual accuracy. It’s good practice to instruct the model to respond only with information in the provided context and to admit when the answer is unknown. After generation, a post‑processing step may run safety filters, summarise the answer or reformat citations.

Implementing a RAG Pipeline in .NET

With the fundamentals established, let’s explore how to implement a RAG system using C#. We will use the following Azure SDKs:

  • Azure.Search.Documents for indexing and querying the Azure AI Search index.
  • Azure.AI.OpenAI for generating embeddings and chat completions.
  • Azure.Identity for authentication via DefaultAzureCredential (the samples below use AzureKeyCredential for brevity).

Assuming you have already provisioned an Azure AI Search service and an Azure OpenAI resource, the high‑level steps are:

  1. Create or update a search index with fields for text content, embeddings, metadata and citations.
  2. Write an ingestion routine to read documents, chunk them, compute embeddings (if not using integrated vectorisation) and upload them to the index.
  3. Implement a retrieval routine that accepts a user query, computes its embedding, executes a hybrid search and returns the top K results.
  4. Compose a prompt using the retrieved passages and call the ChatCompletion API.
  5. Return the answer to the user along with citations.

1. Defining the Search Index

Below is a sample C# method that builds the schema of our search index. Each document stores the original text, its embedding vector and metadata. Note that the vector field’s VectorSearchDimensions must match the length of your embedding vectors (for example, 1536 for text-embedding-ada-002). The example targets Azure.Search.Documents 11.5 or later; several type names differ in the earlier 11.4 preview releases.

using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Targets Azure.Search.Documents 11.5 or later; type names differ in the 11.4 previews.
public static SearchIndex BuildIndex(string indexName)
{
    return new SearchIndex(indexName)
    {
        Fields =
        {
            new SearchField("id", SearchFieldDataType.String)
                { IsKey = true, IsFilterable = true },
            new SearchField("content", SearchFieldDataType.String)
                { IsSearchable = true },
            new SearchField("embedding", SearchFieldDataType.Collection(SearchFieldDataType.Single))
                {
                    IsSearchable = true,
                    VectorSearchDimensions = 1536,               // must match your embedding model
                    VectorSearchProfileName = "default-profile",
                    IsHidden = true                              // vectors rarely need to be returned
                },
            new SearchField("source", SearchFieldDataType.String)
                { IsFilterable = true },
            new SearchField("page", SearchFieldDataType.Int32)
                { IsFilterable = true }
        },
        VectorSearch = new VectorSearch
        {
            Profiles = { new VectorSearchProfile("default-profile", "default-hnsw") },
            Algorithms = { new HnswAlgorithmConfiguration("default-hnsw") }
        },
        SemanticSearch = new SemanticSearch
        {
            Configurations =
            {
                new SemanticConfiguration("default-semantic", new SemanticPrioritizedFields
                {
                    TitleField = new SemanticField("source"),
                    ContentFields = { new SemanticField("content") }
                })
            }
        }
    };
}

You would then create this index using SearchIndexClient and the CreateOrUpdateIndexAsync method.
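For completeness, a minimal sketch of that call might look like the following; the endpoint, key and index name are placeholders.

using System;
using Azure;
using Azure.Search.Documents.Indexes;

var indexClient = new SearchIndexClient(
    new Uri("https://<your-search-service>.search.windows.net"),
    new AzureKeyCredential("<api-key>"));

// Creates the index if it does not exist, or updates the definition in place if it does.
await indexClient.CreateOrUpdateIndexAsync(BuildIndex("rag-chunks"));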

2. Ingesting Data

This step reads documents, splits them into chunks, computes embeddings and uploads them to the index. If you choose integrated vectorisation, you can skip computing embeddings client‑side and instead enable vectorisation on your indexer. Otherwise use the Azure OpenAI embedding API via OpenAIClient. Here’s a simplified ingestion routine that consumes the DocumentChunk record introduced in the chunking example earlier:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public async Task IngestDocumentsAsync(string indexName, IEnumerable<DocumentChunk> chunks)
{
    var searchClient = new SearchClient(new Uri("https://<your-search-service>.search.windows.net"),
                                        indexName,
                                        new AzureKeyCredential("<api-key>"));
    var openAiClient = new OpenAIClient(new Uri("https://<your-openai-endpoint>"),
                                        new AzureKeyCredential("<openai-key>"));

    var batch = new List<SearchDocument>();

    foreach (var chunk in chunks)
    {
        // Compute the embedding with Azure OpenAI (skip this call if you rely on
        // integrated vectorisation). "text-embedding-ada-002" is the deployment name.
        // Targets Azure.AI.OpenAI 1.0.0; the method signature differs in early previews.
        var embeddingResp = await openAiClient.GetEmbeddingsAsync(
            new EmbeddingsOptions("text-embedding-ada-002", new[] { chunk.Content }));

        float[] embedding = embeddingResp.Value.Data[0].Embedding.ToArray();

        batch.Add(new SearchDocument
        {
            ["id"] = chunk.Id,
            ["content"] = chunk.Content,
            ["embedding"] = embedding,
            ["source"] = chunk.SourceFile,
            ["page"] = chunk.PageNumber
        });

        // Upload in batches of 100 to keep requests small and efficient
        if (batch.Count == 100)
        {
            await searchClient.UploadDocumentsAsync(batch);
            batch.Clear();
        }
    }

    if (batch.Count > 0)
    {
        await searchClient.UploadDocumentsAsync(batch);
    }
}

This code uses the search client to upload documents with pre‑computed embeddings. If you use integrated vectorisation instead, you attach an embedding skill (for example, the Azure OpenAI embedding skill) to your indexer’s skillset and define a matching vectorizer on the index; you then index only the raw text and the service populates the vector field for you.

3. Retrieving Relevant Passages

When a user asks a question, you need to search the index and get the top‑K passages. In a hybrid query you supply both a search text and a vector. The following example shows how to call Azure AI Search using a hybrid search with semantic ranking. For brevity, error handling is omitted:

public async Task<IEnumerable<SearchResult<SearchDocument>>> RetrieveAsync(string indexName, string question, int topK)
{
    var searchClient = new SearchClient(new Uri("https://<your-search-service>.search.windows.net"),
                                        indexName,
                                        new AzureKeyCredential("<api-key>"));
    var openAiClient = new OpenAIClient(new Uri("https://<your-openai-endpoint>"),
                                        new AzureKeyCredential("<openai-key>"));

    // Compute an embedding for the question (client-side vectorisation)
    var embedResp = await openAiClient.GetEmbeddingsAsync(
        new EmbeddingsOptions("text-embedding-ada-002", new[] { question }));
    float[] queryEmbedding = embedResp.Value.Data[0].Embedding.ToArray();

    var options = new SearchOptions
    {
        Size = topK,
        QueryType = SearchQueryType.Semantic,
        SemanticSearch = new SemanticSearchOptions
        {
            SemanticConfigurationName = "default-semantic"
        },
        VectorSearch = new VectorSearchOptions
        {
            Queries =
            {
                new VectorizedQuery(queryEmbedding)
                {
                    KNearestNeighborsCount = topK,
                    Fields = { "embedding" }
                }
            }
        },
        Select = { "content", "source", "page" }
    };

    // Execute the hybrid search: the question text plus vector similarity
    var response = await searchClient.SearchAsync<SearchDocument>(question, options);
    return response.Value.GetResults();
}

The SearchOptions object enables the semantic ranker (QueryType = SearchQueryType.Semantic together with a semantic configuration), which can significantly improve result ordering, and selects only the fields we need. The VectorSearch queries use our computed embedding to perform a vector similarity search; combined with the query text this yields a hybrid search. If your index defines a vectorizer (integrated vectorisation), you can omit the explicit embedding step and supply a VectorizableTextQuery containing the question text so that the service vectorises the query for you.

4. Constructing the Prompt and Generating an Answer

After retrieving the top results, build a prompt with citations. For each result, extract the content and wrap it in a numbered citation marker such as [1]. Then pass the prompt to the ChatCompletion API:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents.Models;

public async Task<string> GenerateAnswerAsync(string question, IEnumerable<SearchResult<SearchDocument>> results)
{
    var openAiClient = new OpenAIClient(new Uri("https://<your-openai-endpoint>"),
                                        new AzureKeyCredential("<openai-key>"));

    // Build the context block with numbered citations
    var sb = new StringBuilder();
    int idx = 1;
    foreach (var res in results)
    {
        sb.AppendLine($"[{idx}] {res.Document["content"]}");
        idx++;
    }
    string context = sb.ToString();

    // "gpt-35-turbo" is the chat model deployment name.
    // Targets Azure.AI.OpenAI 1.0.0; message type names differ in the earlier previews.
    var chatOptions = new ChatCompletionsOptions(
        "gpt-35-turbo",
        new ChatRequestMessage[]
        {
            new ChatRequestSystemMessage(
                "You are a helpful assistant that answers questions using only the provided sources. " +
                "Cite sources using [number] notation and say \"I don't know\" if the sources do not contain the answer."),
            new ChatRequestSystemMessage($"Sources:\n{context}"),
            new ChatRequestUserMessage(question)
        })
    {
        Temperature = 0.0f,
        MaxTokens = 512,
        PresencePenalty = 0.0f,
        FrequencyPenalty = 0.0f
    };

    var completions = await openAiClient.GetChatCompletionsAsync(chatOptions);
    return completions.Value.Choices[0].Message.Content;
}

By separating the context and question into different messages, you prevent the model from confusing retrieved passages with the user question. Use a low temperature (e.g., 0.0–0.2) to ensure deterministic, factual answers. The system message above also instructs the model to answer only with information present in the provided context and to say “I don’t know” when the answer is not there; keep these guardrails in any prompt you ship.

Best Practices for Building a RAG System

A production RAG system must perform well under real workloads, provide secure and reliable responses, and be maintainable over time. Here are key recommendations derived from Microsoft’s advanced RAG guidance and the Azure AI Search documentation:

Content Processing and Indexing

  • Clean your data. Quality content yields quality answers. Standardise text, remove noise, fix encoding issues, and track versions.
  • Optimise chunking. Experiment with chunk size and overlap. Use hierarchical or specialised indexes if your content has natural sections (e.g., chapters, product categories).
  • Enrich with metadata. Include fields such as document type, author, publication date and access control. Use filters to restrict queries to relevant subsets of data.
  • Leverage integrated vectorisation. Using Azure AI Search’s vectorisation can simplify your ingestion pipeline and reduce code complexity.

Retrieval Tuning

  • Use hybrid search with semantic ranker. Combining keyword, vector and semantic ranking yields better recall and precision.
  • Tune search parameters. Adjust the number of returned results (k), the weighting of vector versus keyword matches, and scoring profiles. Add synonyms or custom analyzers to handle domain‑specific language.
  • Filter by metadata. Ensure that queries respect access levels (e.g., per tenant) and other constraints (e.g., only return documents of a certain type).
  • Implement query rewriting. Preprocess user questions to remove noise, disambiguate, and generate subqueries if necessary. Microsoft’s advanced RAG guidance suggests techniques like step‑back prompting and hypothetical document embeddings (HyDE). A minimal rewriting sketch follows this list.
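The sketch below shows one way to implement that query-rewriting step, assuming the Azure.AI.OpenAI 1.0.0 SDK and a chat deployment named gpt-35-turbo; the rewritten text is then passed to the retrieval routine in place of the raw question.

using System.Threading.Tasks;
using Azure.AI.OpenAI;

// Minimal query rewriting: ask the chat model to turn a conversational question
// into a concise search query before retrieval.
public async Task<string> RewriteQueryAsync(OpenAIClient openAiClient, string question)
{
    var options = new ChatCompletionsOptions(
        "gpt-35-turbo",
        new ChatRequestMessage[]
        {
            new ChatRequestSystemMessage(
                "Rewrite the user's question as a short search query. Return only the query."),
            new ChatRequestUserMessage(question)
        })
    {
        Temperature = 0.0f,
        MaxTokens = 64
    };

    var response = await openAiClient.GetChatCompletionsAsync(options);
    return response.Value.Choices[0].Message.Content.Trim();
}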

Prompt Engineering and Generation

  • Separate context and question. Use separate messages for retrieved passages and user prompts to avoid conflating them.
  • Instruct explicitly. Tell the model to only use provided information, cite sources, and avoid fabricating answers.
  • Control temperature and max tokens. Lower temperatures yield more deterministic outputs; restrict max tokens to control cost and latency.
  • Post‑process results. Apply safety filters, summarisation or formatting as necessary. Use heuristics to remove irrelevant sentences from the answer or to rewrite citations.

Operational Considerations

  • Logging and observability. Track retrieval queries, retrieved documents, prompt sizes, completion latency and token usage. This helps you debug and optimise your pipeline; a small logging sketch follows this list.
  • Evaluation and feedback. Build tools to evaluate answer quality (for example, by computing precision/recall or using human ratings). Use this feedback to adjust chunking, ranking parameters and prompts.
  • Security and compliance. Ensure that only authorised data is retrieved and passed to the LLM. Use managed identities for Azure services and restrict access keys. Consider using encryption at rest and in transit.
  • Scalability. Azure AI Search scales by adjusting partitions and replicas. For high traffic, add replicas; for large indexes or heavy vector search, add partitions. Cache frequent queries and responses to reduce costs.
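As a small illustration of the logging recommendation above, the sketch below records the metrics most useful for debugging relevance and cost; the method name and the exact fields are assumptions to adapt to your own telemetry pipeline.

using Azure.AI.OpenAI;
using Microsoft.Extensions.Logging;

// Hypothetical observability hook: call after each completion to track retrieval
// volume, token usage and latency over time.
public static void LogRagMetrics(
    ILogger logger, string question, int retrievedCount,
    CompletionsUsage usage, long latencyMs)
{
    logger.LogInformation(
        "RAG query answered: questionLength={QuestionLength} retrieved={Retrieved} " +
        "promptTokens={PromptTokens} completionTokens={CompletionTokens} latencyMs={LatencyMs}",
        question.Length, retrievedCount, usage.PromptTokens, usage.CompletionTokens, latencyMs);
}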

Putting It All Together: System Architecture

Below is a typical architecture for a production‑ready RAG solution on Azure. Adapt it to your needs:

  1. Content Source. Documents stored in Azure Blob Storage, SQL Database, Cosmos DB or SharePoint.
  2. Ingestion Pipeline. A .NET worker or Azure Data Factory pipeline extracts, cleans and chunks documents, computes embeddings (or uses integrated vectorisation) and uploads them to an Azure AI Search index.
  3. Azure AI Search. Hosts the hybrid index with both text and vector fields and applies semantic ranking and filters.
  4. API Layer (Azure Functions or ASP.NET Core). Receives user questions, calls the retrieval function, builds a prompt and invokes the Azure OpenAI ChatCompletion API.
  5. LLM and Post‑processing. The OpenAI model generates a response; the API layer post‑processes and returns the answer with citations.
  6. Observability. Application Insights or custom telemetry collects metrics on search queries, retrieval latencies, token usage and user satisfaction to drive iterative improvement.

Conclusion

Retrieval‑augmented generation enables powerful AI experiences by grounding large language models in your own data. Microsoft’s Azure AI Search and Azure OpenAI services provide a solid foundation for building such systems on .NET. By carefully preparing your content, choosing appropriate chunking strategies, leveraging hybrid search with semantic ranker, and following best practices for prompt engineering and security, you can deliver high‑quality, trustworthy answers to your users. A production‑ready RAG system doesn’t end at deployment; it requires ongoing evaluation and optimisation to maintain relevance and safety. Armed with the guidance in this article and the sample code above, you are ready to build and scale your own RAG solution with .NET and Azure AI Search.
