RAG for Business: The Complete Guide to AI-Powered Customer Support
Author: Jason McDonald, Founder | Reading time: 45 minutes | Last updated: January 2026
Introduction: Why Chatbots Fail (And How to Fix Them)
We analyzed 10,000 support tickets from B2B SaaS companies. Here's what we found: 62% of customer questions were already answered in the documentation.
Not "sort of" answered. Not "partially" covered. Explicitly documented with step-by-step instructions, screenshots, and code examples.
Yet customers still opened tickets. And your support team still spent 8 hours a day copy-pasting from your docs into Intercom.
Why?
Because your chatbot is fundamentally broken. It's a decision tree masquerading as intelligence. When a customer asks "How do I integrate your API with Salesforce?", your bot can either:
- Show them a canned response you wrote 18 months ago (which is now outdated)
- Escalate to a human (defeating the entire purpose)
- Hallucinate an answer that sounds confident but is completely wrong
This is the hallucination problem, and it's why 73% of companies that implemented chatbots in 2022-2023 saw zero reduction in support ticket volume.
But there's a better way. It's called Retrieval-Augmented Generation (RAG), and it's the difference between a chatbot that irritates your customers and one that actually resolves issues.
This guide will teach you:
- How RAG works (and why it prevents hallucination)
- The architecture of embeddings and vector databases
- How to build a production RAG system from scratch
- Real implementation patterns we use at PipeCrush
By the end, you'll understand why RAG isn't just "a better chatbot"—it's the foundation for turning your support function into an always-on solutions engineering team.
Let's start with the fundamentals.
Part 1: Understanding RAG
What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation is a hybrid approach that combines the search capabilities of traditional information retrieval with the language understanding of large language models (LLMs).
Here's the problem RAG solves:
Pure LLMs (like ChatGPT without plugins) are trained on massive datasets, but they:
- Don't know about your product, your docs, or your procedures
- Can't access real-time information
- "Hallucinate" plausible-sounding but incorrect answers when they don't know something
Traditional search (keyword matching) is deterministic and grounded in your actual content, but:
- Requires exact keyword matches
- Can't understand intent or context
- Returns documents, not conversational answers
RAG combines both approaches:
- Retrieval Phase: When a user asks a question, the system searches your knowledge base using semantic similarity (not just keywords) and retrieves the most relevant chunks of information
- Augmentation Phase: Those retrieved chunks are injected into the LLM's context as "ground truth"
- Generation Phase: The LLM generates a natural language answer based on the retrieved information, not its training data
The result: Conversational, context-aware answers that are grounded in your actual documentation. No hallucination.
The RAG Pipeline
Here's what happens when a customer asks "How do I reset my password?":
Step 1: Query Processing
User input: "How do I reset my password?"
Step 2: Embedding
# Convert the question to a vector
query_embedding = embedding_model.encode("How do I reset my password?")
# Returns: [0.234, -0.123, 0.456, ..., 0.789] # 1536-dimensional vector
Step 3: Retrieval
-- Find similar documents using vector similarity
SELECT content, metadata, (1 - (embedding <=> $query_embedding)) as similarity
FROM knowledge_base
WHERE 1 - (embedding <=> $query_embedding) > 0.7
ORDER BY embedding <=> $query_embedding
LIMIT 5;
Returns:
1. "Password Reset Procedure" (similarity: 0.94)
2. "Account Security Settings" (similarity: 0.82)
3. "Two-Factor Authentication Setup" (similarity: 0.76)
Step 4: Augmentation
System Prompt: You are a support assistant for PipeCrush.
Context (retrieved from knowledge base):
---
[Chunk 1: Password Reset Procedure]
To reset your password:
1. Navigate to Settings > Security
2. Click "Change Password"
3. Enter current password
4. Enter new password (min 12 characters)
5. Click "Update Password"
---
User Question: How do I reset my password?
Step 5: Generation
LLM Response:
To reset your password in PipeCrush:
1. Go to Settings > Security
2. Click "Change Password"
3. Enter your current password
4. Create a new password (minimum 12 characters required)
5. Click "Update Password"
Your password will be updated immediately and you'll remain logged in on your current device.
Notice the difference: The LLM didn't know the password reset procedure from its training data. It generated the answer based on the retrieved documentation.
Latency Considerations
The entire pipeline typically takes:
- Embedding generation: 20-50ms
- Vector search: 10-30ms
- LLM generation: 500-2000ms (depends on response length)
Total: ~1-2 seconds for most queries. Fast enough for real-time chat.
Key Terminology
Before we go deeper, let's define the core concepts:
Embeddings: Numerical representations of text. "How do I reset my password?" becomes a vector like [0.234, -0.123, 0.456, ...]. Similar meanings = similar vectors.
Vector Similarity: The mathematical measure of how "close" two embeddings are. Common methods:
- Cosine similarity (most popular)
- Euclidean distance
- Dot product
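To make the most popular of these concrete, here's a minimal cosine similarity implementation (the function name is ours, not from any particular library):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1;
// vectors pointing the same direction score 1, orthogonal vectors score 0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note the relationship to the SQL you'll see later: pgvector's `<=>` operator returns cosine *distance*, which is why those queries compute `1 - (embedding <=> ...)` to get a similarity score.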
Chunking: Breaking documents into smaller pieces. A 5,000-word guide might become 50 chunks of ~200 words each. Why? Because:
- Embeddings work better on focused content
- You don't want to retrieve the entire guide when only one section is relevant
- LLM context windows are limited (you can't paste your entire knowledge base)
Context Window: The amount of text an LLM can process at once. GPT-4 has a 128K token context window (~96,000 words), but in practice, you'll use 4-8K tokens for most RAG queries.
Now that we understand what RAG is, let's dive into how it works—starting with embeddings.
Part 2: Embeddings Deep Dive
How Text Becomes Vectors
The concept of representing words as numbers isn't new. In 2013, researchers at Google published Word2Vec, which learned to represent words as vectors such that:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
This was revolutionary. It meant you could do math on language.
But Word2Vec had a problem: it only understood individual words. "Bank" had the same vector whether you meant a river bank or a financial institution.
Enter sentence embeddings (2018-2019):
Models like BERT and Sentence-BERT learned to create vectors for entire sentences, capturing:
- Context (is "bank" near "river" or "money"?)
- Intent (is this a question, statement, or command?)
- Semantic meaning (these two sentences mean the same thing even though they use different words)
Example:
"How do I reset my password?"
"What's the process for changing my login credentials?"
These have different words but similar meaning. Good sentence embeddings produce similar vectors for both.
Document embeddings extend this further:
You can embed entire paragraphs or documents. This is what powers RAG: your knowledge base articles become vectors, and when a user asks a question, you find the most similar vectors.
Embedding Models Comparison
Here are the most popular embedding models for RAG systems as of 2026:
OpenAI text-embedding-3-small
- Dimensions: 1536
- Cost: $0.02 per 1M tokens
- Speed: ~50ms per query
- Quality: Excellent for general-purpose RAG
- Max input: 8,191 tokens
OpenAI text-embedding-3-large
- Dimensions: 3072
- Cost: $0.13 per 1M tokens
- Quality: Best-in-class accuracy
- Trade-off: 6.5x more expensive, larger storage
Cohere embed-english-v3.0
- Dimensions: 1024
- Cost: $0.10 per 1M tokens
- Specialty: Strong English retrieval (a separate embed-multilingual-v3.0 variant covers 100+ languages)
- Feature: Built-in compression for reduced storage
Open-source: all-MiniLM-L6-v2
- Dimensions: 384
- Cost: Free (self-hosted)
- Speed: Very fast
- Trade-off: Lower accuracy than commercial models
Our recommendation for most SaaS companies:
Start with OpenAI text-embedding-3-small. Here's why:
- Cost-effective: At $0.02/1M tokens, even processing 10,000 documents costs ~$2
- Fast: 50ms embedding time doesn't bottleneck your pipeline
- High quality: Good enough for 95% of use cases
- Easy integration: OpenAI SDK is mature and well-documented
You can always upgrade to text-embedding-3-large later if you need higher accuracy (we'll cover evaluation metrics in Part 7).
Choosing the Right Model
The decision matrix:
| Priority | Model Choice | Reason |
|---|---|---|
| Cost optimization | all-MiniLM-L6-v2 (self-hosted) | Free inference, lower storage (384 dims) |
| Highest accuracy | text-embedding-3-large | Best retrieval quality, worth the cost for high-value use cases |
| Multilingual | Cohere embed-multilingual | Trained on 100+ languages |
| Speed | text-embedding-3-small | Best balance of speed, cost, and quality |
Don't overthink this.
The difference between text-embedding-3-small and text-embedding-3-large matters when you're building semantic search for legal contracts or medical records. For B2B SaaS support documentation, the "small" model is perfect.
Now let's talk about where you store these embeddings.
Part 3: Vector Database Selection
What is a Vector Database?
Traditional databases store data in rows and tables. You query with SQL:
SELECT * FROM articles WHERE title LIKE '%password%';
This works great for exact matches. But it can't answer "show me articles similar to this question."
Vector databases are specialized for similarity search. You query with a vector:
SELECT * FROM articles
ORDER BY embedding <=> '[0.234, -0.123, ...]'
LIMIT 10;
The <=> operator is the cosine distance operator (in PostgreSQL's pgvector extension). It returns documents sorted by similarity.
Under the hood, vector databases use specialized indexes like HNSW (Hierarchical Navigable Small World) to make similarity search fast—even with millions of vectors.
Without these indexes, finding the most similar vector would require comparing your query to every single vector in the database. With 1 million documents, that's 1 million cosine distance calculations per query. Unacceptable.
HNSW reduces this to a few hundred comparisons while maintaining 95%+ accuracy. This is why vector databases exist.
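For contrast, here's what that naive exhaustive scan looks like as a sketch (helper names are ours; an HNSW index exists precisely so you never have to run this loop over millions of rows):

```typescript
// Brute-force nearest-neighbor search: O(n) distance computations per
// query. This is the work an HNSW index avoids by navigating a graph
// of "neighborhoods" instead of visiting every vector.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function bruteForceTopK(
  query: number[],
  docs: { id: string; embedding: number[] }[],
  k: number,
): { id: string; distance: number }[] {
  return docs
    .map(d => ({ id: d.id, distance: cosineDistance(query, d.embedding) }))
    .sort((a, b) => a.distance - b.distance)
    .slice(0, k);
}
```

The exhaustive version is still useful as a correctness baseline: run it on a sample and compare against your index's results to measure recall.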
Vector Database Comparison
Let's compare the major options:
Pinecone (Managed cloud service)
- Pros: Zero DevOps, scales automatically, excellent DX
- Cons: Expensive at scale ($70/mo minimum, $300+/mo for production), vendor lock-in
- Best for: Teams who want to move fast and don't want to manage infrastructure
Weaviate (Open-source, hybrid search)
- Pros: Built-in hybrid search (vector + keyword), self-hostable, GraphQL API
- Cons: More complex setup, requires K8s for production
- Best for: Teams who need hybrid search and have DevOps capacity
Qdrant (Open-source, Rust-based)
- Pros: Extremely fast, efficient filtering, good documentation
- Cons: Smaller ecosystem, self-hosting required
- Best for: Performance-critical applications, teams comfortable with self-hosting
pgvector (PostgreSQL extension)
- Pros: Runs inside your existing PostgreSQL database, zero new infrastructure, ACID transactions
- Cons: Slower than specialized vector DBs at massive scale (10M+ vectors)
- Best for: Most B2B SaaS companies (you already have PostgreSQL)
Chroma (Open-source, embedded)
- Pros: Embeds in your application, great for local development
- Cons: Not production-ready for multi-user applications
- Best for: Prototyping, personal projects
Our Recommendation: pgvector
At PipeCrush, we use pgvector with NeonDB (PostgreSQL). Here's why:
1. You already have PostgreSQL
You're already running Postgres for your users, organizations, and transactions. Why add a second database?
With pgvector, your vector search runs in the same database as your operational data. This means:
- Joins: Combine vector search with traditional SQL filters
- Transactions: Consistent data across your entire schema
- One backup: Don't manage backups for two databases
2. Good enough performance
pgvector with HNSW indexes handles 1-10 million vectors with sub-50ms query times. Most B2B SaaS companies have 10,000-100,000 knowledge base chunks. You won't hit scaling limits.
3. Cost
- Pinecone starter plan: $70/month
- Qdrant Cloud: $25/month minimum
- pgvector on NeonDB: included in your existing database (we pay ~$50/mo for our entire production DB)
4. Simplicity
Install pgvector:
CREATE EXTENSION vector;
Create a table:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536)
);
Create an index:
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
Done. No new infrastructure. No Kubernetes. No vendor-specific API.
When you might need a specialized vector DB:
- You have 10M+ vectors
- You need sub-10ms query latency
- You're doing real-time vector search at massive scale (think: Google-scale semantic search)
For customer support RAG? pgvector is perfect.
Part 4: Knowledge Base Architecture
Now that you understand embeddings and vector databases, let's talk about how to structure your actual knowledge base.
Document Ingestion
Your RAG system needs to ingest content from multiple sources:
Structured documentation:
- Markdown files (from your /docs site)
- Confluence/Notion pages
- Google Docs
- PDF guides
Product content:
- API reference docs
- Changelog
- Release notes
Support content:
- Resolved tickets (anonymized)
- FAQ articles
- How-to guides
The ingestion pipeline looks like this:
async function ingestDocument(source: DocumentSource) {
// 1. Extract text content
const rawText = await extractText(source);
// 2. Clean and normalize
const cleanText = normalizeWhitespace(rawText);
// 3. Extract metadata
const metadata = {
source: source.url,
title: source.title,
lastUpdated: source.modifiedAt,
category: source.category,
};
// 4. Chunk the document
const chunks = await chunkDocument(cleanText, {
maxChunkSize: 500,
overlap: 50,
});
// 5. Generate embeddings for each chunk
for (const chunk of chunks) {
const embedding = await generateEmbedding(chunk.text);
// 6. Store in vector database
await db.documents.create({
data: {
content: chunk.text,
embedding: embedding,
metadata: metadata,
chunkIndex: chunk.index,
},
});
}
}
Supported formats:
| Format | Library | Notes |
|---|---|---|
| Markdown | marked, remark | Preserve headers for metadata |
| HTML | cheerio, htmlparser2 | Strip navigation, keep main content |
| PDF | pdf-parse, pdfjs | Text extraction only (images require OCR) |
| DOCX | mammoth, docx-parser | Good for internal docs |
| Plain text | Built-in | Simplest case |
Metadata handling:
Every chunk should store metadata for filtering and attribution:
interface ChunkMetadata {
sourceUrl: string; // Link back to original doc
title: string; // Document title
section?: string; // H2/H3 heading this chunk falls under
category: string; // "API", "Guides", "Troubleshooting"
lastUpdated: Date; // For staleness detection
accessLevel?: string; // "public", "customer", "internal"
}
Why? Because when you retrieve a chunk, you want to:
- Show the user where this information came from (source attribution)
- Filter results by category ("only search API docs")
- Exclude stale content ("ignore docs older than 6 months")
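As a sketch of that filtering logic in application code (field names follow the ChunkMetadata interface above; in production you'd usually push these conditions into the SQL WHERE clause instead):

```typescript
// Post-filter retrieved chunks by category and staleness before they
// reach the LLM. The 6-month staleness cutoff mirrors the example above.
interface RetrievedChunk {
  content: string;
  metadata: { category: string; lastUpdated: Date; accessLevel?: string };
}

function filterChunks(
  chunks: RetrievedChunk[],
  opts: { category?: string; maxAgeMonths?: number },
): RetrievedChunk[] {
  const cutoff = new Date();
  if (opts.maxAgeMonths) cutoff.setMonth(cutoff.getMonth() - opts.maxAgeMonths);
  return chunks.filter(c =>
    (!opts.category || c.metadata.category === opts.category) &&
    (!opts.maxAgeMonths || c.metadata.lastUpdated >= cutoff)
  );
}
```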
Chunking Strategies
This is where most RAG implementations fail. Chunking is the most important decision you'll make.
Too small: You lose context. A chunk about "Step 3: Click Submit" is useless without Steps 1 and 2.
Too large: You get retrieval noise. You ask "How do I reset my password?" and retrieve a 5,000-word security guide that mentions passwords 47 times.
Here are the main approaches:
Fixed-size chunking
Split every document into N-character chunks:
function fixedSizeChunk(text: string, size: number, overlap: number) {
const chunks = [];
for (let i = 0; i < text.length; i += size - overlap) {
chunks.push(text.slice(i, i + size));
}
return chunks;
}
Pros: Simple, predictable
Cons: Breaks mid-sentence, mid-paragraph, mid-thought
Semantic chunking
Use NLP to split at natural boundaries (paragraphs, sections):
function semanticChunk(text: string) {
  const sections = text.split(/\n#{2,}\s/); // Split on markdown headers
  // flatMap so the result is a flat list of chunks, not a list of lists
  return sections.flatMap(section => {
    const paragraphs = section.split(/\n\n/);
    // Group paragraphs until you hit the size limit
    return groupParagraphs(paragraphs, { maxSize: 500 });
  });
}
Pros: Preserves meaning, natural boundaries
Cons: Variable chunk sizes, more complex
Recursive character splitting (Our recommendation)
Try to split on sentence boundaries, fall back to character splitting if needed:
function recursiveChunk(text: string, maxSize: number): string[] {
  if (text.length <= maxSize) return [text];
  // Try to split on double newlines (paragraph boundaries) first
  const paragraphs = text.split(/\n\n/);
  if (paragraphs.length > 1) {
    // Recurse: any paragraph still over the limit gets split further
    return paragraphs.flatMap(p => recursiveChunk(p, maxSize));
  }
  // Single oversized paragraph: fall back to sentence splitting
  const sentences = text.split(/(?<=\.)\s+/);
  return groupSentences(sentences, maxSize);
}
Pros: Best balance of coherence and size control
Cons: Requires sentence tokenization
Our configuration at PipeCrush:
{
maxChunkSize: 500, // Characters (not tokens)
overlap: 50, // 10% overlap to preserve context
respectBoundaries: true, // Don't break mid-sentence
splitOn: ['\n\n', '. '], // Prefer paragraph then sentence splits
}
This gives us chunks of roughly 80-100 words, which is the sweet spot for GPT-4 retrieval.
The Chunking Paradox
Here's the paradox: The best chunk size depends on the question.
Example document:
# API Authentication
Our API uses JWT tokens. To authenticate:
1. POST to /auth/login with email and password
2. Receive a JWT token in the response
3. Include the token in the Authorization header: `Bearer <token>`
Tokens expire after 24 hours. To refresh, call /auth/refresh.
Question 1: "How do I authenticate with the API?" → Best answer: The entire document (all context matters)
Question 2: "How long do API tokens last?" → Best answer: Just the sentence "Tokens expire after 24 hours"
The solution: Retrieve multiple chunk sizes in parallel
// Create embeddings at multiple granularities
await createChunks(document, { size: 200, name: 'small' });
await createChunks(document, { size: 500, name: 'medium' });
await createChunks(document, { size: 1000, name: 'large' });
// At query time, search all sizes and pick the best match
const results = await Promise.all([
search(query, { chunkSize: 'small', limit: 3 }),
search(query, { chunkSize: 'medium', limit: 3 }),
search(query, { chunkSize: 'large', limit: 2 }),
]);
// Re-rank by similarity score
return results.flat().sort((a, b) => b.similarity - a.similarity).slice(0, 5);
This is called hierarchical retrieval, and it's how production RAG systems handle the chunking paradox.
Keeping Knowledge Fresh
Your documentation changes. Your product evolves. Your RAG system needs to stay in sync.
Update pipeline:
// Webhook from your docs platform (e.g., Notion, Confluence)
app.post('/webhooks/docs/updated', async (req) => {
const { documentId, url } = req.body;
// 1. Delete old chunks for this document
await db.documents.deleteMany({
where: { metadata: { sourceUrl: url } },
});
// 2. Re-ingest the updated document
await ingestDocument({ url, documentId });
// 3. Invalidate any cached responses mentioning this doc
await cache.invalidate({ sourceUrl: url });
});
Versioning:
For critical docs (API references, legal terms), keep historical versions:
interface Document {
id: string;
content: string;
embedding: number[];
version: number;
validFrom: Date;
validUntil: Date | null;
}
// When querying, filter by date
const results = await db.documents.findMany({
where: {
AND: [
{ validFrom: { lte: new Date() } },
{ OR: [
{ validUntil: null },
{ validUntil: { gte: new Date() } },
]},
],
},
});
This way, if a customer asks "How did authentication work in v1.2?", you can retrieve the historical docs.
Incremental updates:
Don't re-embed your entire knowledge base every time one doc changes. Use a job queue:
// When a doc is updated
await queue.add('embed-document', { documentId });
// Worker processes one doc at a time
worker.process('embed-document', async (job) => {
const { documentId } = job.data;
await ingestDocument(documentId);
});
At PipeCrush, we re-embed ~500 KB of docs per day. This costs $0.01/day in embedding fees. Not worth optimizing further.
Part 5: The Retrieval Layer
You've embedded your knowledge base. Now let's talk about retrieving the right chunks when a user asks a question.
Semantic Search
The core of RAG is semantic similarity search:
async function semanticSearch(query: string, limit: number = 5) {
// 1. Embed the user's query
const embeddingResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: query,
});
// pgvector accepts the embedding serialized as '[x, y, ...]'
const queryEmbedding = JSON.stringify(embeddingResponse.data[0].embedding);
// 2. Find similar documents using vector distance
const results = await db.$queryRaw`
SELECT
id,
content,
metadata,
1 - (embedding <=> ${queryEmbedding}::vector) as similarity
FROM documents
WHERE 1 - (embedding <=> ${queryEmbedding}::vector) > 0.7
ORDER BY embedding <=> ${queryEmbedding}::vector
LIMIT ${limit}
`;
return results;
}
Similarity threshold tuning:
The WHERE similarity > 0.7 filter is critical. Too low (0.5) and you get irrelevant results. Too high (0.9) and you get nothing.
Our tuning process:
- Create a test set: 50 common questions with known correct answers
- Run retrieval at different thresholds:
- 0.6: 48/50 correct answers retrieved, but 12 false positives
- 0.7: 47/50 correct, 3 false positives ← sweet spot
- 0.8: 42/50 correct, 0 false positives ← too strict
- Monitor in production: Track when users say "that didn't answer my question"
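The tuning loop above is easy to automate with a small evaluation harness. A sketch, assuming a hand-labeled test set where each question has one known-correct chunk (the shapes here are ours):

```typescript
// For each labeled question, check whether the known-correct chunk
// survives the similarity threshold, and count false positives
// (retrieved chunks that are not the labeled answer).
interface EvalCase {
  retrieved: { id: string; similarity: number }[];
  correctId: string;
}

function evaluateThreshold(cases: EvalCase[], threshold: number) {
  let hits = 0;
  let falsePositives = 0;
  for (const c of cases) {
    const kept = c.retrieved.filter(r => r.similarity > threshold);
    if (kept.some(r => r.id === c.correctId)) hits++;
    falsePositives += kept.filter(r => r.id !== c.correctId).length;
  }
  return { hits, falsePositives };
}
```

Run it across a grid of thresholds (0.6, 0.65, 0.7, ...) and pick the point where hits stay high while false positives drop off.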
Result ranking:
pgvector returns results sorted by distance. But you might want to re-rank based on:
- Recency: Newer docs are more likely to be accurate
- Authority: Official docs > community forum posts
- User feedback: Docs with high "helpful" votes rank higher
function rerank(results: SearchResult[]) {
return results.map(result => ({
...result,
score: (
result.similarity * 0.7 + // 70% weight on similarity
result.recencyScore * 0.2 + // 20% weight on recency
result.authorityScore * 0.1 // 10% weight on authority
),
})).sort((a, b) => b.score - a.score);
}
Hybrid Search
Semantic search is powerful, but it has a weakness: exact term matching.
Example:
User query: "What's the error code for invalid API key?"
Best answer: "Error 401: Invalid API credentials"
Problem: "401" and "invalid API key" are semantically similar, but semantic search might miss the exact error code if it's ranking by meaning alone.
Solution: Hybrid search (semantic + keyword)
Combine vector similarity with traditional full-text search:
async function hybridSearch(query: string) {
  // Embed the query first (same as in semanticSearch)
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const queryEmbedding = JSON.stringify(embeddingResponse.data[0].embedding);
  // Semantic search
const semanticResults = await db.$queryRaw`
SELECT *, 1 - (embedding <=> ${queryEmbedding}::vector) as semantic_score
FROM documents
ORDER BY embedding <=> ${queryEmbedding}::vector
LIMIT 20
`;
// Keyword search (PostgreSQL full-text search)
const keywordResults = await db.$queryRaw`
SELECT *, ts_rank(search_vector, plainto_tsquery(${query})) as keyword_score
FROM documents
WHERE search_vector @@ plainto_tsquery(${query})
ORDER BY keyword_score DESC
LIMIT 20
`;
// Combine results using Reciprocal Rank Fusion
return reciprocalRankFusion(semanticResults, keywordResults);
}
BM25 algorithm (for keyword scoring):
BM25 is the industry standard for keyword scoring: like TF-IDF, but with term-frequency saturation and document-length normalization. PostgreSQL's ts_rank uses its own frequency-based ranking rather than true BM25, but it works well in a hybrid setup.
Fusion strategies:
How do you combine semantic and keyword results?
1. Weighted sum:
score = (semantic_score * 0.6) + (keyword_score * 0.4)
2. Reciprocal Rank Fusion (RRF):
// For each document, sum the reciprocal ranks from both searches
function RRF(semanticRank: number, keywordRank: number, k: number = 60) {
return (1 / (k + semanticRank)) + (1 / (k + keywordRank));
}
RRF is more robust because it doesn't require normalizing scores across different search methods.
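Applied to full result lists, RRF looks like this (a sketch that fuses any number of ranked lists; k = 60 is the conventional default):

```typescript
// Reciprocal Rank Fusion: each document's fused score is the sum of
// 1 / (k + rank) over every result list it appears in. Documents that
// rank well in BOTH semantic and keyword search float to the top.
function reciprocalRankFusion(
  lists: { id: string }[][],
  k: number = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      // rank is 0-based here, so rank + 1 is the 1-based position
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because only ranks matter, you never have to normalize a cosine similarity (0-1) against a ts_rank score (unbounded), which is the usual failure mode of weighted sums.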
When to use hybrid search:
- Your docs contain technical terms, error codes, or product-specific jargon
- Users frequently ask questions with exact phrases ("how to reset password" vs "password reset procedure")
- You're noticing semantic search missing obvious keyword matches
At PipeCrush, we use hybrid search for all support queries. It added ~15% accuracy with minimal complexity.
Advanced Retrieval
Beyond basic similarity search, here are techniques that move you from "good" to "production-grade":
Re-ranking with cross-encoders:
A cross-encoder is a model that takes both the query and document as input and outputs a relevance score. It's more accurate than comparing embeddings but 10x slower.
The trick: Use fast vector search to get 20 candidates, then re-rank with a cross-encoder:
// 1. Fast retrieval (20 results)
const candidates = await vectorSearch(query, { limit: 20 });
// 2. Slow but accurate re-ranking (top 5)
const reranked = await crossEncoder.rank(query, candidates);
return reranked.slice(0, 5);
This gives you the best of both worlds: fast initial retrieval + accurate final ranking.
Multi-query retrieval:
Users phrase questions in unpredictable ways. Generate multiple variations of the query:
// Original: "How do I reset my password?"
const variations = await llm.generate({
prompt: `Generate 3 variations of this question with the same meaning:
"${query}"`,
});
// Variations:
// - "What's the process for changing my password?"
// - "How can I update my login credentials?"
// - "Steps to reset account password"
// Search with all variations
const results = await Promise.all(
variations.map(v => vectorSearch(v, { limit: 3 }))
);
// Deduplicate and merge
return deduplicateResults(results.flat());
This catches edge cases where your original query phrasing doesn't match your docs.
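The `deduplicateResults` helper referenced above can be as simple as keeping the best-scoring copy of each chunk (a sketch; the result shape is ours):

```typescript
// Merge results from multiple query variations: the same chunk is often
// retrieved by several variations, so keep only its highest-similarity
// occurrence, then re-sort the merged list.
function deduplicateResults<T extends { id: string; similarity: number }>(
  results: T[],
): T[] {
  const best = new Map<string, T>();
  for (const r of results) {
    const existing = best.get(r.id);
    if (!existing || r.similarity > existing.similarity) best.set(r.id, r);
  }
  return Array.from(best.values()).sort((a, b) => b.similarity - a.similarity);
}
```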
Parent-child retrieval:
Store small chunks for retrieval accuracy, but return the parent context for LLM consumption:
interface Chunk {
id: string;
content: string; // Small chunk (200 chars)
parentContent: string; // Full section (1000 chars)
embedding: number[];
}
// Search using small chunks
const chunks = await vectorSearch(query);
// But pass the parent content to the LLM
const context = chunks.map(c => c.parentContent).join('\n\n');
This solves the chunking paradox: search with precision, answer with context.
Part 6: LLM Integration
You've retrieved the relevant chunks. Now you need to turn them into a conversational answer.
Prompt Engineering for RAG
The RAG prompt has three parts:
1. System prompt (who is the AI?)
You are a helpful support assistant for PipeCrush, a unified revenue platform for B2B SaaS companies.
Your role is to answer customer questions using the provided documentation.
Rules:
- Only answer based on the context provided below
- If the context doesn't contain the answer, say "I don't have that information in our docs"
- Always cite which section of the docs you're referencing
- Be concise but thorough
- Use markdown formatting for code examples
2. Context (the retrieved chunks)
Context from our documentation:
---
[Document 1: Password Reset Procedure]
To reset your password:
1. Navigate to Settings > Security
2. Click "Change Password"
3. Enter current password and new password
4. Click "Update Password"
---
[Document 2: Two-Factor Authentication]
If you have 2FA enabled, you'll need to enter your authentication code after changing your password.
---
3. User question (the actual query)
User Question: How do I reset my password?
Answer:
Full prompt template:
const prompt = `You are a helpful support assistant for ${company}.
Your role is to answer customer questions using the provided documentation.
Rules:
- Only answer based on the context provided below
- If the context doesn't contain the answer, say "I don't have that information in our docs. Let me connect you with a human."
- Always cite which section of the docs you're referencing
- Be concise but thorough
- Use markdown formatting for code examples
Context from our documentation:
---
${retrievedChunks.map((chunk, i) => `[Document ${i + 1}: ${chunk.metadata.title}]\n${chunk.content}`).join('\n\n---\n\n')}
---
User Question: ${userQuery}
Answer:`;
Citation requirements:
Always make the LLM cite its sources. This:
- Builds user trust ("oh, this is from the API docs")
- Lets users read the full article if they want more detail
- Helps you debug when the bot gives wrong answers (which doc did it pull from?)
Enforce citations in the system prompt:
After your answer, include a "Sources" section with links to the relevant documentation.
Example:
To reset your password, go to Settings > Security and click "Change Password".
Sources:
- Password Reset Guide (article #127 in knowledge base)
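On the application side, you can also build the sources section yourself from chunk metadata rather than trusting the LLM to format links correctly. A sketch, assuming the metadata fields defined earlier:

```typescript
// Build a deduplicated "Sources" footer from the chunks that were
// actually injected into the prompt, so citations always match what
// the model saw.
function formatSources(
  chunks: { metadata: { title: string; sourceUrl: string } }[],
): string {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const c of chunks) {
    if (seen.has(c.metadata.sourceUrl)) continue;
    seen.add(c.metadata.sourceUrl);
    lines.push(`- ${c.metadata.title} (${c.metadata.sourceUrl})`);
  }
  return lines.length ? `Sources:\n${lines.join("\n")}` : "";
}
```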
Managing the Context Window
GPT-4 Turbo has a 128K token context window. Sounds like a lot. But in practice:
- System prompt: ~200 tokens
- Retrieved context (top 5 chunks, budgeted generously): ~3,000 tokens
- Conversation history (last 10 turns): ~2,000 tokens
- Response: ~500 tokens
Total: ~5,700 tokens per query
This is manageable. But you need to handle edge cases:
1. Too many retrieved chunks:
If you retrieve 20 chunks and each is 500 words, you're at 10K words = ~13K tokens. That's 10% of your context window.
Solution: Limit retrieved chunks to the top 5 most relevant.
2. Conversation history:
Multi-turn conversations accumulate tokens fast. After 20 turns, you might have 10K tokens of history.
Solution: Context compression (see below).
3. Long user queries:
Some users paste error logs (5,000 characters) into the chat.
Solution: Truncate or summarize:
if (userQuery.length > 1000) {
userQuery = await llm.summarize(userQuery, { maxLength: 500 });
}
Context compression:
For long conversations, use an LLM to summarize the history:
async function compressHistory(messages: Message[]) {
if (getTokenCount(messages) < 2000) return messages;
// Summarize older messages
const summary = await llm.generate({
prompt: `Summarize this conversation history in 200 words:
${messages.slice(0, -5).map(m => `${m.role}: ${m.content}`).join('\n')}`,
});
// Keep last 5 messages + summary
return [
{ role: 'system', content: `Previous conversation: ${summary}` },
...messages.slice(-5),
];
}
This keeps your context window lean while preserving conversation context.
Prioritization strategies:
If you have 10 retrieved chunks but can only fit 5, which do you keep?
function prioritizeChunks(chunks: Chunk[], maxTokens: number) {
// Sort by a composite score
const scored = chunks.map(chunk => ({
...chunk,
score: (
chunk.similarity * 0.7 + // How relevant?
chunk.recency * 0.2 + // How recent?
chunk.userFeedback * 0.1 // How helpful (historically)?
),
})).sort((a, b) => b.score - a.score);
// Take chunks until we hit token limit
let tokens = 0;
const selected = [];
for (const chunk of scored) {
const chunkTokens = estimateTokens(chunk.content);
if (tokens + chunkTokens > maxTokens) break;
tokens += chunkTokens;
selected.push(chunk);
}
return selected;
}
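The `estimateTokens` helper above doesn't need a real tokenizer for budgeting purposes. A common rule of thumb for English text is ~4 characters per token (a sketch; use a real tokenizer like tiktoken when you need exact counts):

```typescript
// Rough token estimate: ~4 characters per token for English prose.
// Good enough for context-window budgeting; deliberately rounds up so
// we err on the side of under-filling the window.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```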
Response Generation
With your prompt ready, it's time to call the LLM:
async function generateResponse(query: string, context: string[]) {
const prompt = buildPrompt(query, context);
const response = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: prompt },
],
temperature: 0.3, // Low temp = more deterministic
max_tokens: 500, // Limit response length
stream: true, // Streaming for better UX
});
return response;
}
Streaming responses:
Users hate waiting 3 seconds for an answer. Stream the response token-by-token:
const stream = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: messages,
stream: true,
});
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || '';
// Send token to frontend via WebSocket
ws.send(JSON.stringify({ type: 'token', content: token }));
}
This gives the illusion of faster response time (users see the first words in 200ms even if the full response takes 2 seconds).
Fallback handling:
What if retrieval returns no relevant chunks?
if (retrievedChunks.length === 0 || retrievedChunks[0].similarity < 0.65) {
return {
response: "I don't have enough information in our documentation to answer that question confidently. Let me connect you with a human who can help.",
escalate: true,
suggestedDocs: await getPopularDocs(query), // Show related articles
};
}
Never let the LLM hallucinate. If you don't have the answer, say so.
Confidence scoring:
Ask the LLM to rate its own confidence:
const prompt = `${ragPrompt}
After your answer, rate your confidence on a scale of 1-10 based on how well the context supports your answer.
Answer:`;
// Parse the confidence score from the response
const confidence = extractConfidence(response); // "Confidence: 8/10"
if (confidence < 6) {
// Flag for human review
await flagForReview(query, response, confidence);
}
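The extractConfidence helper is referenced above but never defined. A minimal parser, assuming the model ends its answer with a line like "Confidence: 8/10" as the prompt requests, might look like this:

```typescript
// Pull a trailing "Confidence: N/10" (or "Confidence: N") out of the
// model's response. Returns 0 when no score is found, so unscored
// responses get flagged for review rather than silently trusted.
function extractConfidence(response: string): number {
  const match = response.match(/confidence[:\s]*(\d{1,2})(?:\s*\/\s*10)?/i);
  if (!match) return 0;
  const score = parseInt(match[1], 10);
  return Math.min(Math.max(score, 0), 10);
}
```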
Multi-Turn Conversations
RAG isn't just for one-off questions. Users have multi-turn conversations:
User: "How do I integrate with Salesforce?"
Bot: "You can integrate using our Salesforce connector..."
User: "What about API keys?"
Bot: [needs to understand "What about API keys for the Salesforce integration?"]
The second question contains an implicit reference: "API keys" only makes sense in the context of the Salesforce integration. You need to resolve that reference before retrieval.
Conversation memory:
Store the last N turns and pass them to the LLM:
const conversationHistory = [
{ role: 'user', content: 'How do I integrate with Salesforce?' },
{ role: 'assistant', content: 'You can integrate using our Salesforce connector...' },
{ role: 'user', content: 'What about API keys?' },
];
// LLM uses history to understand "API keys" refers to Salesforce
const response = await generateResponse(conversationHistory);
Reference resolution:
For better retrieval, use the LLM to rewrite the query with full context:
const rewrittenQuery = await llm.generate({
prompt: `Rewrite the user's latest question to be standalone, incorporating context from the conversation history.
Conversation:
${conversationHistory.slice(0, -1).map(m => `${m.role}: ${m.content}`).join('\n')}
Latest question: ${latestQuery}
Standalone question:`,
});
// "What about API keys?" becomes "What API keys are needed for Salesforce integration?"
const chunks = await vectorSearch(rewrittenQuery);
This dramatically improves retrieval accuracy for multi-turn conversations.
Topic tracking:
Detect when the user switches topics:
const currentTopic = await detectTopic(conversationHistory);
const newTopic = await detectTopic([latestMessage]);
if (currentTopic !== newTopic) {
// User switched topics, clear the context window
conversationHistory = [latestMessage];
}
This prevents earlier conversation topics from polluting retrieval.
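The detectTopic helper is left undefined above. One cheap approximation, if you want to avoid an extra LLM call, is to compare the embedding of the latest message against the centroid of the history's embeddings; low similarity suggests a topic switch. A sketch — isTopicShift is a hypothetical helper, not part of the earlier code, and the 0.5 threshold is a starting guess to tune:

```typescript
// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Element-wise average of a set of vectors
function centroid(vectors: number[][]): number[] {
  const dim = vectors[0].length;
  const out = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) out[i] += v[i] / vectors.length;
  }
  return out;
}

// True when the latest message's embedding sits far from the
// conversation so far, i.e. the user likely switched topics.
function isTopicShift(
  historyEmbeddings: number[][],
  latestEmbedding: number[],
  threshold = 0.5,
): boolean {
  if (historyEmbeddings.length === 0) return false;
  return cosineSimilarity(centroid(historyEmbeddings), latestEmbedding) < threshold;
}
```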
Part 7: Building a Support Chatbot
Let's put it all together. Here's how to build a production RAG-powered support chatbot from scratch.
Architecture Overview
┌─────────────────┐
│ User Question │
└────────┬────────┘
│
▼
┌─────────────────────┐
│ Query Rewriting │ (Multi-turn context resolution)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Embedding Model │ (text-embedding-3-small)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Vector Search │ (pgvector + hybrid search)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Re-ranking │ (Recency, authority, similarity)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Context Building │ (Top 5 chunks + metadata)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ LLM (GPT-4 Turbo) │ (RAG prompt + streaming)
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Response + Sources │
└─────────────────────┘
Step-by-Step Implementation
Step 1: Knowledge Base Setup
Ingest your documentation:
# Install dependencies
npm install @langchain/openai @langchain/community pgvector
# Create database schema
npx prisma migrate dev --name add_rag_support
// prisma/schema.prisma
model KnowledgeChunk {
id String @id @default(cuid())
content String
embedding Unsupported("vector(1536)")
metadata Json
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
// Prisma's schema DSL doesn't support HNSW indexes; create it in a raw migration:
// CREATE INDEX ON "KnowledgeChunk" USING hnsw (embedding vector_cosine_ops);
}
Run the ingestion script:
// scripts/ingest-docs.ts
import { embedDocuments } from './lib/rag/embeddings';
const docs = await fetchAllDocs(); // From Notion, Confluence, etc.
for (const doc of docs) {
await embedDocuments(doc);
}
console.log(`Ingested ${docs.length} documents`);
Step 2: Embedding Pipeline
// lib/rag/embeddings.ts
import OpenAI from 'openai';
const openai = new OpenAI();
export async function embedText(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
});
return response.data[0].embedding;
}
export async function embedDocuments(doc: Document) {
const chunks = chunkDocument(doc.content);
for (const chunk of chunks) {
const embedding = await embedText(chunk.text);
// Prisma can't write Unsupported("vector") columns through create(),
// so insert with raw SQL and pass the embedding as a pgvector literal.
// Note: @default(cuid()) is applied by the Prisma client, not the
// database, so supply an id explicitly here.
await prisma.$executeRaw`
INSERT INTO "KnowledgeChunk" (id, content, embedding, metadata)
VALUES (
${crypto.randomUUID()},
${chunk.text},
${`[${embedding.join(',')}]`}::vector,
${JSON.stringify({ sourceUrl: doc.url, title: doc.title, section: chunk.section, category: doc.category })}::jsonb
)
`;
}
}
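The chunkDocument function is used above but defined elsewhere. For completeness, here's a minimal sketch that packs paragraphs into roughly 500-character chunks with a small overlap (the section field is left blank here; in practice you'd carry it over from your document's headings):

```typescript
// Minimal paragraph-aware chunker: packs paragraphs into chunks of
// roughly `maxChars` characters, repeating the tail of the previous
// chunk as overlap so context isn't cut off at chunk boundaries.
function chunkDocument(content: string, maxChars = 500, overlapChars = 50) {
  const paragraphs = content.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: { text: string; section: string }[] = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && current.length + para.length > maxChars) {
      chunks.push({ text: current, section: '' });
      // Carry the tail of the previous chunk forward as overlap
      current = current.slice(-overlapChars);
    }
    current = current ? `${current}\n\n${para}` : para;
  }
  if (current) chunks.push({ text: current, section: '' });
  return chunks;
}
```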
Step 3: Retrieval API
// app/api/chat/retrieve/route.ts
import { NextRequest } from 'next/server';
import { embedText } from '@/lib/rag/embeddings';
import { prisma } from '@/lib/db';
export async function POST(req: NextRequest) {
const { query } = await req.json();
// 1. Embed the query
const queryEmbedding = await embedText(query);
const vector = `[${queryEmbedding.join(',')}]`; // pgvector literal, e.g. "[0.12,-0.03,...]"
// 2. Vector search
const chunks = await prisma.$queryRaw`
SELECT
content,
metadata,
1 - (embedding <=> ${vector}::vector) as similarity
FROM "KnowledgeChunk"
WHERE 1 - (embedding <=> ${vector}::vector) > 0.7
ORDER BY embedding <=> ${vector}::vector
LIMIT 5
`;
return Response.json({ chunks });
}
Step 4: Chat Interface
// app/api/chat/route.ts
import { NextRequest } from 'next/server';
import OpenAI from 'openai';
const openai = new OpenAI();
export async function POST(req: NextRequest) {
const { messages } = await req.json();
const latestQuery = messages[messages.length - 1].content;
// 1. Retrieve relevant chunks
const chunks = await retrieveChunks(latestQuery);
// 2. Build RAG prompt
const systemPrompt = buildSystemPrompt(chunks);
// 3. Generate response
const stream = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [
{ role: 'system', content: systemPrompt },
...messages,
],
stream: true,
});
// 4. Stream response to frontend
return new Response(
new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || '';
controller.enqueue(new TextEncoder().encode(token));
}
controller.close();
},
}),
);
}
Frontend:
// components/ChatWidget.tsx
'use client';
import { useState } from 'react';
export function ChatWidget() {
const [messages, setMessages] = useState<{ role: string; content: string }[]>([]);
const [input, setInput] = useState('');
const sendMessage = async () => {
if (!input.trim()) return;
const newMessages = [...messages, { role: 'user', content: input }];
setMessages(newMessages);
setInput(''); // clear the box as soon as the message is sent
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages: newMessages }),
});
const reader = response.body!.getReader(); // body is non-null for a successful streaming response
let assistantMessage = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
const token = new TextDecoder().decode(value);
assistantMessage += token;
setMessages([...newMessages, { role: 'assistant', content: assistantMessage }]);
}
};
return (
<div className="chat-widget">
<div className="messages">
{messages.map((msg, i) => (
<div key={i} className={msg.role}>
{msg.content}
</div>
))}
</div>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => e.key === 'Enter' && sendMessage()}
/>
</div>
);
}
Step 5: Analytics
Track what's working:
// lib/analytics/rag.ts
// startTime should be captured (Date.now()) before retrieval starts
export async function logQuery(query: string, response: string, chunks: Chunk[], startTime: number) {
await prisma.chatAnalytics.create({
data: {
query,
response,
chunksRetrieved: chunks.length,
topSimilarity: chunks[0]?.similarity,
responseTime: Date.now() - startTime,
},
});
}
// Track user feedback
export async function logFeedback(queryId: string, helpful: boolean) {
await prisma.chatAnalytics.update({
where: { id: queryId },
data: { helpful },
});
}
Dashboard queries:
-- Questions with low similarity scores (need better docs)
SELECT query, top_similarity
FROM chat_analytics
WHERE top_similarity < 0.7
ORDER BY created_at DESC
LIMIT 50;
-- Queries marked as not helpful (need prompt tuning)
SELECT query, response
FROM chat_analytics
WHERE helpful = false
ORDER BY created_at DESC
LIMIT 50;
-- Most common questions (prioritize for docs)
SELECT query, COUNT(*) as count
FROM chat_analytics
GROUP BY query
ORDER BY count DESC
LIMIT 20;
Performance Optimization
1. Caching:
Cache embeddings for common queries:
const cache = new Map<string, number[]>();
export async function embedText(text: string): Promise<number[]> {
if (cache.has(text)) {
return cache.get(text)!;
}
const embedding = await generateEmbedding(text);
cache.set(text, embedding);
return embedding;
}
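One caveat: the Map above grows without bound. A minimal LRU wrapper keeps the cache from eating memory — a sketch; in production a library or a shared cache like Redis may serve you better:

```typescript
// Minimal LRU cache built on Map's insertion order: re-inserting a key
// moves it to the end, so the first key is always the least recently used.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      this.map.delete(key); // bump to most-recently-used
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first key in insertion order)
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
  }
}
```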
2. Batch processing:
Embed multiple documents in parallel:
const chunks = await Promise.all(
documents.map(doc => chunkDocument(doc))
);
const embeddings = await Promise.all(
chunks.flat().map(chunk => embedText(chunk.text))
);
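Promise.all with one request per chunk works, but it hammers your rate limits. The embeddings endpoint also accepts an array of inputs, so you can embed many chunks per request. A sketch — the EmbeddingClient interface just mirrors the slice of the OpenAI SDK we use, and the batch size of 100 is an arbitrary choice, not an API limit:

```typescript
// Minimal shape of the OpenAI client this sketch needs
interface EmbeddingClient {
  embeddings: {
    create(args: { model: string; input: string[] }): Promise<{ data: { embedding: number[] }[] }>;
  };
}

// Split an array into consecutive batches of `size`
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch of texts instead of one call per text
async function embedTexts(client: EmbeddingClient, texts: string[]): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (const batch of toBatches(texts, 100)) {
    const response = await client.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch, // the endpoint accepts string[] as well as a single string
    });
    embeddings.push(...response.data.map(d => d.embedding));
  }
  return embeddings;
}
```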
3. Latency reduction:
Parallel retrieval and LLM calls:
// Don't wait for retrieval before calling LLM
const [chunks] = await Promise.all([
retrieveChunks(query),
// Start LLM streaming immediately with a generic prompt
warmupLLM(),
]);
// Then inject retrieved context
const response = await generateResponse(query, chunks);
Overlapping the two calls can shave a couple hundred milliseconds off total latency.
Measuring Success
Track these metrics:
| Metric | Target | How to Measure |
|---|---|---|
| Retrieval Accuracy | >85% | Manual eval: Does top chunk answer the question? |
| Response Accuracy | >90% | Manual eval: Is the generated answer correct? |
| Resolution Rate | >70% | % of conversations that don't escalate to human |
| User Satisfaction | >4.0/5.0 | Thumbs up/down on bot responses |
| Latency | <2s | Time from query to first token |
Evaluation harness:
// scripts/evaluate.ts
const testQueries = [
{ query: 'How do I reset my password?', expectedDoc: 'password-reset-guide' },
{ query: 'What are your API rate limits?', expectedDoc: 'api-reference' },
// ... 50 more
];
let correctRetrievals = 0;
for (const test of testQueries) {
const chunks = await retrieveChunks(test.query);
if (chunks[0]?.metadata.slug === test.expectedDoc) {
correctRetrievals++;
}
}
console.log(`Retrieval accuracy: ${correctRetrievals / testQueries.length * 100}%`);
Run this weekly to catch regressions.
Part 8: Beyond Support
RAG isn't just for customer support. Here are other high-value applications:
Sales Enablement
Use case: Sales reps need instant answers during calls.
Implementation:
// Slack bot that responds to @sales-assistant
app.message(async ({ message, say }) => {
const query = message.text.replace('@sales-assistant', '').trim();
// Search sales playbooks, competitor intel, pricing docs
const chunks = await retrieveChunks(query, {
categories: ['sales-playbook', 'competitive-intel', 'pricing'],
});
const response = await generateResponse(query, chunks);
await say({
text: response,
thread_ts: message.ts, // Reply in thread
});
});
Knowledge sources:
- Competitive battle cards
- Pricing strategy docs
- Product positioning guides
- Case studies and ROI calculators
ROI: Sales reps spend 15% of their time searching for answers. RAG reduces this to <5%.
Internal Knowledge Management
Use case: Employees need to find company policies, procedures, and tribal knowledge.
Implementation:
// Internal wiki search
const internalKB = await buildKnowledgeBase({
sources: [
{ type: 'confluence', space: 'ENG' },
{ type: 'notion', database: 'company-policies' },
{ type: 'slack', channels: ['#engineering', '#product'] },
],
});
// Make it searchable
app.get('/search', async (req, res) => {
const { query } = req.query;
const results = await internalKB.search(query);
res.json(results);
});
Access control:
// Filter results based on user permissions
const chunks = await retrieveChunks(query, {
where: {
OR: [
{ accessLevel: 'public' },
{ accessLevel: 'employee' },
{ teams: { has: user.team } },
],
},
});
ROI: New employees get up to speed 3x faster with instant access to tribal knowledge.
Documentation Search
Use case: Developers need to search API docs, SDKs, and code examples.
Implementation:
// API documentation RAG
const apiDocs = await buildKnowledgeBase({
sources: [
{ type: 'openapi', spec: './api-spec.yaml' },
{ type: 'markdown', path: './docs/api' },
{ type: 'github', repo: 'your-org/sdk-examples' },
],
});
// Code-aware chunking
function chunkCode(code: string) {
// Split on function boundaries, not arbitrary characters
const functions = extractFunctions(code);
return functions.map(fn => ({
text: fn.code,
metadata: { type: 'function', name: fn.name, language: 'typescript' },
}));
}
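The extractFunctions helper above is hand-waved. A rough regex-based sketch for TypeScript source is below; a real implementation should lean on the TypeScript compiler API or a parser like ts-morph, since this version misses arrow functions, class methods, and nested declarations:

```typescript
// Naive function extractor: finds top-level `function name(...)` (or
// `export function` / `async function`) declarations and slices the
// source between consecutive declarations.
function extractFunctions(code: string): { name: string; code: string }[] {
  const pattern = /^(?:export\s+)?(?:async\s+)?function\s+(\w+)/gm;
  const matches = [...code.matchAll(pattern)];
  return matches.map((match, i) => {
    const start = match.index!;
    const end = i + 1 < matches.length ? matches[i + 1].index! : code.length;
    return { name: match[1], code: code.slice(start, end).trim() };
  });
}
```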
Code search example:
User: "How do I paginate API results in Python?"
Retrieved chunk:
# Pagination example
params = { 'page': 1, 'per_page': 50 }
while True:
    response = api.get('/users', params=params)
    users = response.json()
    if not users:
        break
    process_users(users)
    params['page'] += 1
Answer: "Here's how to paginate API results in Python..."
ROI: Developers find answers 5x faster than browsing docs manually.
Part 9: The PipeCrush Implementation
Let's talk about how we actually built this at PipeCrush.
Our Architecture
Tech stack:
- Vector DB: pgvector on NeonDB (PostgreSQL)
- Embeddings: OpenAI text-embedding-3-small
- LLM: GPT-4 Turbo (streaming)
- Framework: Next.js with React
- Hosting: Vercel (frontend) + NeonDB (database)
Why these choices:
1. pgvector: We already had Postgres for our operational data. Adding pgvector meant zero new infrastructure.
2. OpenAI embeddings: At $0.02/1M tokens, the cost is negligible. We process ~5M tokens/month = $0.10/month.
3. GPT-4 Turbo: Best quality for RAG. We tried GPT-3.5, but it struggled with complex technical questions. GPT-4 is worth the 10x cost.
4. Next.js: Our entire app is Next.js. Keeping the RAG system in the same codebase simplifies deployment.
Database schema:
model KnowledgeChunk {
id String @id @default(cuid())
content String
embedding Unsupported("vector(1536)")
metadata Json
customerId String // Multi-tenant: each customer has their own KB
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([customerId])
// Prisma's schema DSL doesn't support HNSW indexes; create it in a raw migration:
// CREATE INDEX ON "KnowledgeChunk" USING hnsw (embedding vector_cosine_ops);
}
model ChatMessage {
id String @id @default(cuid())
role String // 'user' | 'assistant'
content String
metadata Json // Retrieved chunks, confidence score
threadId String
customerId String
createdAt DateTime @default(now())
@@index([threadId])
@@index([customerId])
}
Multi-tenancy:
Each customer gets their own knowledge base. When ingesting docs:
await prisma.knowledgeChunk.create({
data: {
content: chunk.text,
embedding: embedding,
customerId: user.customerId, // Isolated per customer
metadata: { ... },
},
});
When retrieving:
const vector = `[${queryEmbedding.join(',')}]`; // pgvector literal
const chunks = await prisma.$queryRaw`
SELECT content, metadata, 1 - (embedding <=> ${vector}::vector) as similarity
FROM "KnowledgeChunk"
WHERE "customerId" = ${user.customerId} -- Critical: filter by customer
AND 1 - (embedding <=> ${vector}::vector) > 0.7
ORDER BY embedding <=> ${vector}::vector
LIMIT 5
`;
This ensures customer A never sees customer B's documents.
Real Results
We launched RAG-powered support in November 2025. Here's what happened:
Ticket reduction:
- Before RAG: 87 tickets/week
- After RAG (4 weeks): 58 tickets/week
- 33% reduction in support volume
Response accuracy:
- Manual eval (100 queries): 91% accuracy
- User feedback (thumbs up/down): 4.2/5.0 average
Time saved:
- Average ticket resolution time: 12 minutes
- Tickets prevented per week: 29
- Support team time saved: 5.8 hours/week
Customer satisfaction:
- Before: Users waited 2-6 hours for support
- After: Instant answers for 33% of questions
- NPS improved from 42 to 53
Cost:
- OpenAI embeddings: $0.10/month
- OpenAI LLM calls: ~$45/month (at current usage)
- Total RAG cost: $45.10/month
ROI:
- Support engineer salary: ~$8,000/month
- Time saved: 5.8 hours/week = 23.2 hours/month = 14% of an FTE
- Value: $1,120/month for a $45/month investment
That's a 25x ROI. And it scales: as we grow, the bot handles more queries without additional support headcount.
Dogfooding
We use our own RAG chatbot internally:
Engineering:
- Searches our internal docs, architecture decision records (ADRs), and runbooks
- Answers questions like "How does our webhook retry logic work?"
Sales:
- Searches competitive intel, pricing guidelines, and case studies
- Answers questions like "What's our win rate against Intercom?"
Onboarding:
- New employees ask the bot about benefits, PTO policy, expense reports
- Reduces onboarding burden on HR and managers
Continuous improvement:
We track all internal queries and use them to:
- Identify gaps in our documentation
- Test new retrieval strategies
- Improve prompt templates
Every week, we review the "low confidence" responses and either:
- Improve the docs (add missing information)
- Improve the prompt (clarify instructions)
- Improve chunking (adjust chunk size or overlap)
This feedback loop is why our accuracy keeps improving.
Conclusion: The Future of AI Support
We're at an inflection point.
Traditional support = reactive. Customer has a problem, opens a ticket, waits for a human.
RAG-powered support = proactive. The AI detects the customer's struggle (stuck on a page for 2 minutes, error in console logs) and offers help before they ask.
Here's where this is headed:
1. Embedded support
The chatbot won't live in a separate widget. It'll be embedded in your product:
- User hovers over a confusing button → tooltip appears with context from your docs
- User gets an error → bot auto-suggests the fix based on error code + user context
- User opens a complex form → bot walks them through each field
2. Personalized knowledge
Current RAG: Same answers for everyone.
Future RAG: Answers personalized to your role, usage patterns, and history:
User A (Admin): "How do I add users?"
→ "Go to Settings > Team > Invite Users. You can bulk upload via CSV."
User B (Regular user): "How do I add users?"
→ "You'll need admin permissions. Would you like me to notify your workspace admin?"
The same question, different answers based on who's asking.
3. Multi-modal RAG
Current RAG: Text only.
Future RAG: Searches images, videos, diagrams, code:
User: "How do I set up OAuth?"
→ Returns: Text explanation + video tutorial + code snippet + architecture diagram
4. Agentic workflows
Current RAG: Answers questions.
Future RAG: Takes actions:
User: "Why isn't my campaign sending?"
Bot: "I checked your campaign. The issue is your email domain isn't verified. Would you like me to start the verification process?"
User: "Yes"
Bot: "I've sent a verification email to your domain admin. I'll notify you when it's verified."
The bot doesn't just tell you what's wrong—it fixes it.
Summary of Key Points
RAG solves the hallucination problem by grounding LLM responses in your actual documentation.
The RAG pipeline: Query → Embed → Retrieve → Augment → Generate
Core components:
- Embeddings: Text becomes vectors (use OpenAI text-embedding-3-small)
- Vector database: Store and search embeddings (use pgvector for most SaaS companies)
- Chunking: Break docs into 200-500 word chunks with 10% overlap
- Retrieval: Combine semantic search + keyword search (hybrid search)
- Prompt engineering: System prompt + retrieved context + user query
- Multi-turn: Use conversation history to resolve references
Production considerations:
- Chunk size is the most important decision (test with your data)
- Hybrid search adds 15% accuracy over pure semantic search
- Stream LLM responses for better UX
- Track retrieval accuracy with manual evals
- Re-rank results by recency and authority
- Always cite sources in bot responses
ROI: We reduced support tickets by 33% at a cost of $45/month.
Implementation Roadmap
Week 1: Foundation
- Set up pgvector in your PostgreSQL database
- Create knowledge base schema
- Write document ingestion script
Week 2: Ingestion
- Ingest your docs (start with 20-50 articles)
- Experiment with chunk sizes (test 200, 500, 1000)
- Verify embeddings are stored correctly
Week 3: Retrieval
- Build vector search API
- Implement hybrid search (semantic + keyword)
- Test retrieval accuracy on 20 common questions
Week 4: Generation
- Write RAG prompt template
- Integrate OpenAI GPT-4 Turbo
- Implement streaming responses
Week 5: Polish
- Add conversation memory (multi-turn)
- Implement confidence scoring
- Add source citations
Week 6: Launch
- Deploy to production
- Monitor accuracy and user feedback
- Iterate based on "not helpful" responses
Total time: 6 weeks for one engineer to go from zero to production RAG.
Getting Started with PipeCrush
We've built all of this into PipeCrush so you don't have to.
Our AI chatbot includes:
- Pre-built RAG pipeline (just upload your docs)
- Hybrid search out of the box
- Multi-turn conversation memory
- Analytics dashboard (see which questions are asked most)
- Seamless integration with your existing support automation and knowledge base
You can train the bot on:
- Your product documentation
- Help center articles
- API references
- Internal wikis
- Previous support tickets
It plugs directly into your CRM to access customer context, your customer management system for ticket history, and your unified inbox for seamless escalation to humans when needed.
Pricing: Starting at $49/month (includes unlimited knowledge base docs, 1,000 bot conversations/month, and full analytics).
Start your 14-day free trial →
Related Resources
Hub Articles:
- The Modern Revenue Stack: How to Unify Sales, Marketing, and Support
- Cold Email Infrastructure: The Engineering Guide to Deliverability
Product Pages:
- AI Sales Chatbot: Qualify Leads and Book Meetings
- AI Support Chatbot: Resolve Tickets Automatically
- Knowledge Base: Train Your AI Chatbot
- CRM: Unified Customer Data
FAQ Section
Q: How accurate is RAG compared to fine-tuning an LLM?
RAG is more accurate for domain-specific knowledge because it retrieves exact information from your docs. Fine-tuning teaches the LLM patterns but doesn't guarantee factual accuracy. Plus, RAG is easier to update—just add new docs instead of retraining the model.
Q: What's the biggest challenge in implementing RAG?
Chunking strategy. If your chunks are too small, you lose context. Too large, and retrieval becomes noisy. Start with 200-500 word chunks and 10% overlap, then tune based on your evaluation metrics.
Q: Can RAG handle multi-language support?
Yes. Use multilingual embedding models like Cohere embed-multilingual or OpenAI's text-embedding-3-large. Store docs in multiple languages and filter retrieval by user language preference.
Q: How do I prevent the bot from hallucinating?
Two strategies: (1) Use a strict system prompt that says "Only answer based on the provided context. If you don't know, say so." (2) Implement confidence scoring and flag low-confidence responses for human review.
Q: What's the cost of running RAG at scale?
For a typical B2B SaaS company with 10,000 knowledge base chunks and 1,000 queries/month: Embedding costs ~$2/month, LLM costs ~$50/month. Total: $52/month. This scales linearly with query volume.
Q: How does RAG handle outdated documentation?
Store a lastUpdated timestamp with each chunk. When retrieving, either filter out docs older than N months or use recency as a ranking signal. Also, set up webhooks to re-ingest docs when they change.
Q: Can I use RAG with GPT-3.5 to save costs?
Yes, but expect lower accuracy. GPT-3.5 struggles with complex technical questions and is more prone to hallucination. For production support chatbots, GPT-4 Turbo is worth the 10x cost difference.
