Building a RAG Pipeline: Step-by-Step Enterprise Guide
Retrieval-augmented generation (RAG) has become the standard architecture for enterprise AI applications that need to answer questions based on proprietary data. Unlike fine-tuning (which bakes knowledge into model weights), RAG retrieves relevant information at query time and feeds it to the language model as context. This means your AI always works with current data, respects access controls, and provides traceable, citation-backed answers.
This guide covers every stage of building a production RAG pipeline: from data ingestion and chunking to embedding, retrieval, re-ranking, generation, and evaluation. Whether you are building from scratch or using a platform like Skopx that provides RAG infrastructure out of the box, understanding these components will help you make better design decisions and debug issues faster.
Why RAG Over Fine-Tuning?
The choice between RAG and fine-tuning is not theoretical. It has practical implications for cost, freshness, accuracy, and maintainability.
Fine-tuning modifies the model's weights using your data. The knowledge becomes "baked in." This works well when the knowledge is stable (medical terminology, legal statutes) and the volume is manageable. But fine-tuning is expensive ($500 to $50,000 per training run for enterprise-scale datasets), slow (hours to days), and produces models that become stale the moment your data changes.
RAG keeps the base model unchanged and retrieves relevant context at query time. New documents are available for search within minutes of being indexed. Access controls can be enforced per query. Every answer can cite its sources. And the same base model serves all users, regardless of their department or data access level.
For enterprise applications where data changes frequently, where multiple user groups need different data access, and where auditability matters, RAG is the clear winner.
RAG Pipeline Architecture Overview
A production RAG pipeline consists of two main flows: the indexing flow (offline, processes documents into searchable embeddings) and the query flow (online, handles user questions in real time).
Indexing Flow
The indexing flow runs continuously or on a schedule. It ingests documents from source systems, processes them into chunks, generates embeddings for each chunk, and stores them in a vector database alongside metadata.
Query Flow
The query flow runs in real time. It receives a user question, transforms it into an effective retrieval query, searches the vector database for relevant chunks, re-ranks results for precision, constructs a prompt with the retrieved context, sends it to the language model, and returns the generated answer with citations.
Step 1: Data Ingestion
Connecting to Source Systems
Enterprise data lives in dozens of systems. Your RAG pipeline needs connectors for document stores (Google Drive, SharePoint, Confluence, Notion), communication tools (Slack, Microsoft Teams, email), databases (PostgreSQL, MySQL, Snowflake, BigQuery), code repositories (GitHub, GitLab), CRM systems (Salesforce, HubSpot), and project management tools (Jira, Asana, Linear).
Each connector must handle authentication (OAuth, API keys, service accounts), incremental sync (process only new or modified documents since the last sync), rate limiting (respect API quotas to avoid being throttled), and error handling (retry transient failures, skip permanently inaccessible documents, log issues).
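The retry part of connector error handling can be sketched as follows. This is a minimal illustration, not any particular connector's implementation: `TransientError`, `call_with_retry`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: let the caller log and skip the document
            # Double the wait on each attempt and add jitter so that
            # parallel workers do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Any exception other than `TransientError` propagates immediately, which matches the idea of skipping permanently inaccessible documents instead of retrying them.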
Building connectors from scratch is one of the most time-consuming parts of a RAG implementation. Skopx provides over 200 pre-built connectors that handle authentication, sync scheduling, and error recovery, allowing teams to focus on the retrieval and generation layers.
Document Processing
Raw documents come in many formats: PDF, DOCX, HTML, Markdown, plain text, spreadsheets, presentations, and more. Each format requires a specialized parser that extracts text content while preserving structure (headings, tables, lists, code blocks).
For PDFs, use layout-aware parsing that distinguishes between headers, body text, captions, and tables. Simple text extraction loses structural information that is critical for accurate chunking and retrieval.
For HTML and web content, strip navigation, advertisements, and boilerplate while preserving the main content. Use readability heuristics or DOM analysis to identify the primary content area.
For spreadsheets and structured data, convert rows into natural language descriptions that embedding models can process effectively. A row in a sales database might become "Deal with Acme Corp for $150,000, currently in negotiation stage, owned by Sarah Chen, last activity on May 15."
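As a small sketch of the row-to-sentence conversion described above, assuming a row arrives as a dictionary (the field names `account`, `amount`, `stage`, `owner`, and `last_activity` are hypothetical):

```python
def describe_deal(row: dict) -> str:
    """Render one CRM-style row as a sentence an embedding model can use."""
    return (
        f"Deal with {row['account']} for ${row['amount']:,.0f}, "
        f"currently in {row['stage']} stage, owned by {row['owner']}, "
        f"last activity on {row['last_activity']}."
    )


row = {
    "account": "Acme Corp",
    "amount": 150000,
    "stage": "negotiation",
    "owner": "Sarah Chen",
    "last_activity": "May 15",
}
print(describe_deal(row))
```

In practice each table or source type would likely need its own template, since the useful phrasing depends on what the columns mean.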
Step 2: Chunking Strategies
Chunking is the most underrated component of a RAG pipeline. The quality of your chunks directly determines the quality of your retrieval, which in turn determines the quality of your answers. Get chunking wrong and no amount of prompt engineering will fix it.
Fixed-Size Chunking
The simplest approach splits documents into chunks of a fixed token count (typically 256 to 512 tokens) with overlap (typically 50 to 100 tokens). This is easy to implement and works reasonably well for homogeneous documents like articles or reports.
The limitation is that fixed-size chunks often split in the middle of a concept. A paragraph explaining a policy might be cut between the rule and its exception, leading to retrieval of incomplete information.
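A minimal sketch of fixed-size chunking with overlap, assuming the document has already been tokenized into a list (the function name `chunk_fixed` and its defaults are illustrative):

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

Note that the split points depend only on token position, which is exactly why a rule and its exception can end up in different chunks.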
Recursive Chunking
Recursive chunking splits documents hierarchically: first by major sections (H1 headings), then by subsections (H2, H3), then by paragraphs, then by sentences. At each level, if the chunk exceeds the target size, it is split at the next level down.
This preserves semantic coherence because splits happen at natural document boundaries. A section about "Refund Policy" stays together as long as it fits within the size limit. Only when it exceeds the limit is it split into its component paragraphs.
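The recursive idea can be sketched in a few lines. This simplified version is an assumption-laden illustration: it measures size in whitespace-separated words rather than model tokens, uses blank lines, newlines, and sentence ends as the hierarchy instead of real heading levels, and does not merge small neighbors back together.

```python
def chunk_recursive(
    text: str,
    max_words: int = 500,
    separators: tuple[str, ...] = ("\n\n", "\n", ". "),
) -> list[str]:
    """Keep text whole if it fits; otherwise split at the coarsest boundary and recurse."""
    text = text.strip()
    if not text:
        return []
    # A piece that fits stays together. If no finer separator is left,
    # the piece is returned as-is even when it is still too large.
    if len(text.split()) <= max_words or not separators:
        return [text]
    head, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(head):
        chunks.extend(chunk_recursive(part, max_words, rest))
    return chunks
```

A production splitter would typically also merge adjacent small pieces up to the target size and add overlap between chunks.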
Semantic Chunking
The most sophisticated approach uses embedding similarity to detect topic boundaries within documents. As you scan through a document, compute the embedding similarity between consecutive sentences. When similarity drops sharply, that indicates a topic shift and a good split point.
Semantic chunking produces the most coherent chunks but is more computationally expensive and harder to debug. It works best for long, unstructured documents like meeting transcripts or research papers where structural cues (headings) are absent.
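The boundary detection described above can be sketched as follows. The `embed` callable stands in for whatever embedding model you use, and the fixed `threshold` is a simplifying assumption: a real implementation might instead look for drops relative to a rolling average.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_semantic(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Start a new chunk whenever consecutive sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # similarity dropped: treat as a topic shift
        else:
            chunks[-1].append(sentence)
    return chunks
```

Because every sentence is embedded, the cost scales with sentence count rather than chunk count, which is the extra computational expense mentioned above.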
Practical Recommendation
For most enterprise RAG pipelines, recursive chunking with a target size of 400 to 600 tokens and 10 to 15% overlap provides the best balance of quality and simplicity. Add parent-document metadata to each chunk (document title, section heading, source URL) so the generation model can provide accurate citations.
Step 3: Embedding Models
The embedding model converts text chunks into high-dimensional vectors that capture semantic meaning. Two chunks about "employee onboarding" will have similar vectors even if they use completely different words.
Choosing an Embedding Model
Key factors in model selection include dimensionality (768, 1024, or 1536 dimensions are common; higher-dimensional vectors can capture more nuance but cost more storage and search time), maximum input length (some models truncate at 512 tokens, others handle 8,192), multilingual support (critical for global organizations), and domain adaptation (general models vs. models fine-tuned for legal, medical, or technical content).
For most enterprise use cases, a general-purpose model with 1024 to 1536 dimensions and support for 8,192 input tokens provides strong performance out of the box. Fine-tuning on domain-specific data can improve results by 5 to 15% for specialized use cases.
Embedding Pipeline Design
Batch your embedding calls for efficiency. Processing one chunk at a time wastes API round-trips. Most embedding APIs support batches of 64 to 256 chunks per call.
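A minimal batching sketch, assuming your embedding client exposes some function that takes a list of texts and returns one vector per text (`embed_batch` below is a placeholder for that call, not a real API):

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 128) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 128,
) -> list[list[float]]:
    """Embed every chunk using one call per batch instead of one call per chunk."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

The output order matches the input order, so vectors can be zipped back onto their chunks afterwards.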
Store embeddings alongside rich metadata in your vector database. For each chunk, store the vector, the original text, the source document ID, the chunk position within the document, the section heading, the document creation and modification dates, and access control tags.
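One way to picture the stored record is as a small data class. The class and field names here are hypothetical; your vector database will have its own schema or payload format.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class ChunkRecord:
    """One indexed chunk: the vector plus the metadata needed for filtering and citations."""
    vector: list[float]
    text: str
    document_id: str
    position: int            # index of the chunk within its document
    section_heading: str
    created_at: datetime
    modified_at: datetime
    access_tags: list[str] = field(default_factory=list)


record = ChunkRecord(
    vector=[0.1, 0.2],
    text="Refunds are issued within 30 days.",
    document_id="doc-42",
    position=3,
    section_heading="Refund Policy",
    created_at=datetime(2024, 1, 1),
    modified_at=datetime(2024, 5, 15),
    access_tags=["support"],
)
payload = asdict(record)  # plain dict, ready to serialize
```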
Step 4: Vector Storage and Indexing
Vector Database Selection
Your vector database is the backbone of the retrieval layer. Key requirements include approximate nearest neighbor (ANN) search with sub-100ms latency at scale, metadata filtering (to enforce access controls and scope queries), hybrid search (combining vector similarity with keyword matching), horizontal scaling (as your document corpus grows), and real-time index updates (new documents should be searchable within minutes).
Index Configuration
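Configure your vector index for the appropriate distance metric (cosine similarity is standard for text embeddings, and is equivalent to dot product when vectors are normalized), the right index type (HNSW provides a strong latency-accuracy tradeoff for most workloads), and the correct parameters. For HNSW, M sets how many links each node keeps in the graph (higher values improve recall but increase memory use), ef_construction sets how many candidates are considered while building the index (higher values improve index quality but slow down builds), and a query-time search-width parameter (often called ef or ef_search) trades latency for recall.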
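When tuning index parameters, it can help to compare the approximate results against an exact search on a small sample. The sketch below is a brute-force cosine baseline in plain Python; `exact_top_k` and `ann_recall` are hypothetical helper names, and the approach is only practical for small samples.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k vectors with the highest cosine similarity to the query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(v))), chunk_id)
        for chunk_id, v in vectors.items()
    )
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


def ann_recall(ann_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the exact neighbors that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
```

If recall against the exact baseline is low, raising the build-time or query-time HNSW parameters is the usual first adjustment.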
Metadata Indexing
In addition to vector search, index metadata fields for filtering. Create indexes on source_system, document_type, created_date, modified_date, and access_control_tags. These enable scoped queries like "search only engineering documents from the last 90 days."
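The scoped query in the example above boils down to a metadata predicate. Here is a sketch in plain Python, assuming each chunk's metadata is a dictionary with `document_type` and `modified_date` keys; a real vector database would express the same condition in its own filter syntax.

```python
from datetime import datetime, timedelta


def in_scope(meta: dict, document_type: str, max_age_days: int, now: datetime) -> bool:
    """True if the chunk has the requested type and was modified within the time window."""
    cutoff = now - timedelta(days=max_age_days)
    return meta["document_type"] == document_type and meta["modified_date"] >= cutoff


now = datetime(2024, 6, 30)
candidates = [
    {"id": "c1", "document_type": "engineering", "modified_date": datetime(2024, 5, 15)},
    {"id": "c2", "document_type": "engineering", "modified_date": datetime(2024, 1, 1)},
    {"id": "c3", "document_type": "sales", "modified_date": datetime(2024, 6, 1)},
]
recent_engineering = [c["id"] for c in candidates if in_scope(c, "engineering", 90, now)]
```

Passing `now` explicitly keeps the function deterministic and easy to test.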
Step 5: Query Processing
Query Transformation
The user's raw question is rarely the optimal retrieval query. A query transformation pipeline improves retrieval quality through several techniques.
Query expansion adds related terms. "Revenue forecasting" becomes "revenue forecasting, financial projections, sales predictions, demand planning." This improves recall by matching chunks that discuss the same concept using different terminology.
Hypothetical document embedding (HyDE) generates a hypothetical answer to the question, then uses that answer's embedding for retrieval. This is effective because the hypothetical answer is more semantically similar to real document chunks than the short question is.
Query decomposition breaks complex questions into sub-questions. "How does our refund policy compare to competitors and what are the financial implications?" becomes three queries: "What is our refund policy?", "What are our competitors' refund policies?", and "What are the financial implications of refund policies?" Results from all sub-queries are combined.
Hybrid Retrieval
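Pure vector search struggles with exact-match queries. A search for error code "ERR-4502" will not work well with embeddings because the model has little semantic signal for an arbitrary identifier and may return chunks about other, similar-looking codes. Keyword search (BM25) handles these queries well because it matches the literal token.
Implement hybrid retrieval that runs both vector search and keyword search in parallel, then merges results. The standard approach is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) across the result lists, where k is a smoothing constant (60 is a common default). Because RRF uses ranks rather than raw scores, it needs no score normalization between the two methods. A weighted variant multiplies each list's contribution by a configurable weight. Start with equal weights and adjust based on evaluation results.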
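Reciprocal rank fusion is short enough to sketch in full. In this illustration each input ranking is a list of document ids, best first; the optional `weights` argument is an extension beyond plain RRF, and `k = 60` is a commonly used default, not a requirement.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists: each document scores sum(weight / (k + rank)) over the lists it appears in."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear in both lists ("a" and "c" here) rise above those found by only one method, which is the behavior hybrid retrieval relies on.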
Step 6: Re-Ranking
The initial retrieval step prioritizes recall (finding all potentially relevant chunks). Re-ranking prioritizes precision (putting the most relevant chunks at the top).
A cross-encoder re-ranker takes each (query, chunk) pair and produces a relevance score. Unlike embedding similarity (which computes query and chunk embeddings independently), the cross-encoder processes both together, enabling deeper relevance assessment.
Re-ranking typically operates on the top 20 to 50 results from the retrieval step and outputs the top 5 to 10 most relevant chunks. This dramatically improves the quality of context sent to the generation model.
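The re-ranking step itself is a score-and-sort over the candidate set. In the sketch below, `score` stands in for a cross-encoder call that takes the query and chunk together; the word-overlap scorer is only a toy placeholder so the example runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair jointly and keep the top_n highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [chunk for _, chunk in scored[:top_n]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: number of words shared by query and chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


candidates = ["shipping times", "refund policy details", "policy overview"]
best = rerank("refund policy", candidates, toy_overlap_score, top_n=2)
```

Because the scorer runs once per candidate, limiting the input to the top 20 to 50 retrieval results keeps latency bounded.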
Step 7: Prompt Construction and Generation
Context Assembly
Assemble the prompt from four components: a system instruction that defines the agent's behavior, the user's question, the retrieved and re-ranked context chunks (with source citations), and any conversation history for multi-turn interactions.
Order matters. Place the most relevant chunks closest to the question. Models tend to attend most reliably to content near the beginning and end of the context window, and content buried in the middle of a long context is more likely to be overlooked. Include source metadata (document title, URL) with each chunk so the model can generate proper citations.
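A minimal sketch of context assembly, assuming chunks arrive sorted most relevant first as dictionaries with `title`, `url`, and `text` keys (all hypothetical names). It reverses the list so the best chunk lands directly above the question.

```python
from typing import Sequence


def build_prompt(
    system: str,
    question: str,
    chunks: list[dict],
    history: Sequence[str] = (),
) -> str:
    """Assemble system instruction, history, numbered context, and question into one prompt."""
    ordered = list(reversed(chunks))  # most relevant chunk ends up nearest the question
    context = "\n\n".join(
        f"[{i}] {c['title']} ({c['url']})\n{c['text']}"
        for i, c in enumerate(ordered, start=1)
    )
    parts = [system]
    if history:
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append("Context:\n" + context)
    parts.append("Question: " + question)
    return "\n\n".join(parts)


chunks = [
    {"title": "Refund Policy", "url": "https://example.com/refunds", "text": "Refunds within 30 days."},
    {"title": "Shipping", "url": "https://example.com/shipping", "text": "Ships in 2 days."},
]
prompt = build_prompt("Cite sources by number.", "What is the refund window?", chunks)
```

The bracketed numbers give the model a compact handle for citations, which the application can map back to titles and URLs when rendering the answer.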
Generation Configuration
Set temperature between 0 and 0.3 for factual question answering. Higher temperatures introduce unnecessary variation. Set max output tokens based on the expected answer length (500 to 1000 tokens for most enterprise queries). Include explicit instructions to cite sources and to acknowledge when the provided context is insufficient to answer the question.
Skopx's RAG pipeline handles context assembly and citation generation automatically, optimizing chunk ordering and prompt construction based on the query type.
Step 8: Evaluation and Continuous Improvement
Automated Evaluation Metrics
Measure your RAG pipeline along four dimensions.
Retrieval quality: For a set of test questions with known relevant documents, measure recall at k (what percentage of relevant documents appear in the top k results) and mean reciprocal rank (how high the first relevant result ranks).
Answer quality: Use LLM-as-judge evaluation to score generated answers for faithfulness (does the answer accurately reflect the retrieved context?), relevance (does the answer address the user's question?), and completeness (does the answer cover all aspects of the question?).
Latency: Measure end-to-end response time and per-component latency. Identify bottlenecks in the pipeline.
Cost: Track token usage per query, cost per query, and cost trends over time.
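The two retrieval-quality metrics above are simple to compute once you have a labeled test set. A sketch, assuming each test case pairs the retrieved ids (best first) with the set of known relevant ids:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 when none is found)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Tracking both is useful because recall at k says whether the right material is available to the generator at all, while mean reciprocal rank says how quickly it shows up.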
Human Evaluation
Automated metrics are necessary but not sufficient. Run weekly human evaluation sessions where domain experts review 50 to 100 randomly sampled queries, rate the answers, and identify systematic issues. This catches quality problems that automated metrics miss.
Iteration Cycle
Based on evaluation results, prioritize improvements. Common areas for iteration include chunking parameters (adjust size and overlap), retrieval configuration (tune hybrid search weights), re-ranker training (fine-tune on your domain data), prompt engineering (refine system instructions), and data quality (improve ingestion for poorly performing source types).
Common Enterprise RAG Challenges
Challenge 1: Stale Data
Documents change but the index does not reflect updates. Solution: implement change detection in your connectors and trigger re-indexing when documents are modified. For critical data sources, use real-time sync. For others, schedule hourly or daily re-indexing.
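One simple form of change detection is to store a content hash with each indexed document and compare it on every sync. This is a sketch under that assumption; many source systems also expose modification timestamps or change feeds that can serve the same purpose.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose content differs from the indexed version."""
    return sorted(
        doc_id
        for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
```

Documents that disappear from the source are not covered here; removing their chunks from the index would need a separate comparison of id sets.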
Challenge 2: Cross-Document Reasoning
Users ask questions that require synthesizing information from multiple documents, such as "Summarize all customer feedback about Feature X from the last quarter." The retrieval layer returns fragments from many documents, and the generation model must weave them into a coherent summary. Solution: increase the number of retrieved chunks for synthesis queries and use a model with a large context window.
Challenge 3: Handling Tables and Structured Data
Tables in PDFs and documents are notoriously difficult to chunk and retrieve correctly. Solution: use table-aware parsing that extracts tables as structured data, generate natural language descriptions of each table, and store both the structured data and the description. At retrieval time, the description enables semantic search, and the structured data enables precise answers.
Challenge 4: Multi-Tenancy and Access Control
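In enterprise environments, different users should see different data. Solution: tag every chunk with access control metadata during ingestion and filter at retrieval time based on the requesting user's permissions. Because restricted chunks never reach the prompt, the model cannot reveal content the user is not allowed to see. Apply the filter inside the retrieval layer rather than after generation, and keep the tags in sync when permissions change in the source system.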
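The retrieval-time filter can be sketched as a set intersection between the user's groups and each chunk's tags. The `access_tags` key and the group names are hypothetical, and in production this condition would normally be pushed down into the vector database query instead of being applied in application code.

```python
def visible_chunks(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks that share at least one access tag with the user's groups."""
    # Deny by default: a chunk with no tags matches no group and is dropped.
    return [c for c in chunks if user_groups & set(c.get("access_tags", []))]


chunks = [
    {"id": "c1", "access_tags": ["engineering"]},
    {"id": "c2", "access_tags": ["finance", "executives"]},
    {"id": "c3", "access_tags": []},
]
allowed = [c["id"] for c in visible_chunks(chunks, {"engineering", "all-staff"})]
```

Treating untagged chunks as invisible is a deliberate choice in this sketch, so that an ingestion bug hides data instead of exposing it.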
Conclusion
Building a production RAG pipeline requires attention to every layer of the stack: ingestion, chunking, embedding, retrieval, re-ranking, generation, and evaluation. Each layer has meaningful design decisions that affect the overall quality of your AI application.
The good news is that the architecture is well-understood, the tooling has matured significantly, and platforms like Skopx can shortcut much of the infrastructure work. Whether you build from scratch or leverage existing platforms, the principles in this guide will help you deliver a RAG pipeline that provides accurate, traceable, and secure answers from your enterprise data.
Start with a focused use case, build the evaluation framework early, and iterate aggressively based on real user feedback. The best RAG pipelines are not designed on a whiteboard. They are refined through hundreds of iterations driven by real-world queries and measurable quality metrics.
Alexis Kelly
The Skopx engineering and product team