1. Indexing

Indexing prepares the knowledge base for efficient retrieval by processing and storing documents in a structured format.

Load

  • Purpose: Ingest raw data from various sources and normalize it for further processing.
  • Common Data Sources:
    • Structured: Databases, APIs
    • Semi-structured: JSON, XML, CSV
    • Unstructured: PDFs, DOCX, HTML, scraped web pages
  • Best Practices:
    • Extract metadata (e.g., author, timestamps) to enhance retrieval.
    • Normalize text (remove special characters, fix encoding issues).
    • Detect language and preprocess accordingly (stemming, lemmatization).
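The normalization step above can be sketched in Python using only the standard library. `normalize_text` and the sample document dict are illustrative names, not part of any specific framework:

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize ingested text: repair Unicode compatibility forms,
    strip control/format characters (keeping newlines and tabs),
    and collapse runs of spaces."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. no-break space -> space
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    text = re.sub(r"[ \t]+", " ", text)        # collapse whitespace runs
    return text.strip()

# Attach metadata at load time so it travels with the document downstream.
doc = {
    "text": normalize_text("Caf\u00e9\u200b  menu"),
    "metadata": {"source": "menu.html", "ingested_at": "2024-01-01"},
}
```

Doing this once at load time keeps every later stage (splitting, embedding) working on clean, consistent text.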

Split

  • Purpose: Break documents into manageable chunks for improved retrieval accuracy.
  • Splitting Strategies:
    • Fixed-size chunking (e.g., 512 tokens per chunk).
    • Overlapping chunks (ensure context continuity across chunk boundaries).
    • Semantic chunking (splitting based on topic shifts or sentence boundaries).
  • Best Practices:
    • Balance chunk size: chunks that are too large dilute retrieval relevance, while chunks that are too small lose surrounding context.
    • Use metadata tags to track chunk origin and relationships.
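A minimal sketch of fixed-size chunking with overlap, combining the strategies and metadata advice above. It counts characters for simplicity; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    """Split text into fixed-size chunks with overlap so context
    carries across chunk boundaries. Each chunk records its origin
    offset as metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"id": i, "start": start, "text": piece})
    return chunks
```

With `chunk_size=512` and `overlap=64`, each chunk repeats the last 64 characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.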

Store

  • Purpose: Convert text into vector embeddings and store them in a vector database for efficient similarity search.
  • Key Components:
    • Embedding Model: OpenAI, Cohere, Sentence-BERT, etc.
    • Vector Store: FAISS, Pinecone, Weaviate, ChromaDB, Qdrant.
  • Best Practices:
    • Use dense embeddings (transformer-based models) for semantic search.
    • Enable hybrid search (combine vector-based and keyword-based search).
    • Perform periodic re-indexing to keep data up to date.
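To make the store step concrete, here is a toy in-memory vector store using cosine similarity over NumPy arrays. It stands in for a production vector database such as FAISS or Qdrant, and the class and method names are illustrative, not any library's API:

```python
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: cosine-similarity search over unit-normalized
    embeddings. A real deployment would delegate this to FAISS,
    Pinecone, Weaviate, ChromaDB, or Qdrant."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim))
        self.payloads: list[dict] = []

    def add(self, vector: np.ndarray, payload: dict) -> None:
        v = vector / np.linalg.norm(vector)   # normalize once at index time
        self.vectors = np.vstack([self.vectors, v])
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[float, dict]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q             # cosine similarity via dot product
        top = np.argsort(scores)[::-1][:k]
        return [(float(scores[i]), self.payloads[i]) for i in top]
```

Normalizing vectors at index time lets cosine similarity reduce to a dot product, which is the same trick FAISS users apply with an inner-product index.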

2. Retrieval and Generation

This stage retrieves relevant information and generates context-aware responses.

Retrieve

  • Purpose: Fetch the most relevant documents from the indexed knowledge base.
  • Retrieval Strategies:
    1. Dense Retrieval (Vector Search)
      • Uses embeddings and similarity search (e.g., cosine similarity, dot product).
      • Works well for semantic understanding of queries.
    2. Sparse Retrieval (Keyword Search)
      • Uses traditional NLP-based methods (e.g., BM25, TF-IDF).
      • Effective for precise keyword-based matching.
    3. Hybrid Retrieval (Dense + Sparse)
      • Combines keyword-based and vector search for more robust results.
      • Example: Re-rank BM25 results using embeddings.
    4. Cross-encoder Re-ranking
      • Uses an additional transformer model to re-rank retrieved documents.
      • Ensures higher relevance before passing results to the LLM.
  • Best Practices:
    • Use query expansion (synonyms, paraphrasing) to improve recall.
    • Implement multi-stage retrieval (fast initial retrieval → fine-grained re-ranking).
    • Optimize for latency vs. accuracy (re-ranking adds computation time).
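The hybrid strategy above is often implemented with Reciprocal Rank Fusion (RRF), which merges ranked lists (say, one from BM25 and one from vector search) without requiring their scores to be comparable. A minimal sketch; the `k=60` default is a common convention, not tied to any particular library:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs using RRF:
    score(d) = sum over lists of 1 / (k + rank of d in that list).
    Documents ranked highly by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder re-ranker would then rescore only the top few fused results, keeping the expensive transformer pass off the full candidate set.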

Generate

  • Purpose: Generate a context-aware response by combining retrieved information with the user query.
  • Steps:
    1. Construct a prompt that integrates retrieved documents.
    2. Pass the prompt to the LLM for response generation.
    3. Apply post-processing (hallucination detection, formatting).
  • Best Practices:
    • Use prompt engineering (e.g., retrieval-augmented templates).
    • Implement response validation (checking for factual consistency).
    • Provide citations (linking generated content back to sources).
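Step 1 (prompt construction) can be sketched as follows. The template wording and the `[n]` citation convention are illustrative assumptions to adapt per model, not a fixed standard:

```python
def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble a retrieval-augmented prompt: number each retrieved
    chunk so the model can cite its sources, then append the query."""
    context = "\n\n".join(
        f"[{i}] (source: {doc['source']})\n{doc['text']}"
        for i, doc in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Instructing the model to admit when the context is insufficient is a cheap first line of defense against hallucination; the numbered sources make the citation step in post-processing straightforward.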

Commonly used Embedding Models

Model                  | Organization  | Dimensions | Context Length | Link
-----------------------|---------------|------------|----------------|------------------------
text-embedding-3-large | OpenAI        | 3072       | 8191           | OpenAI Documentation
text-embedding-3-small | OpenAI        | 1536       | 8191           | OpenAI Documentation
text-embedding-ada-002 | OpenAI        | 1536       | 8191           | OpenAI Documentation
sentence-transformers  | Hugging Face  | Varies     | Varies         | GitHub
BERT                   | Google        | 768/1024   | 512            | GitHub
E5                     | Microsoft     | 1024       | 512            | GitHub
GTE                    | Alibaba       | 768        | 512            | GitHub
BGE                    | BAAI          | 768/1024   | 512            | GitHub
INSTRUCTOR             | Hugging Face  | 768        | 512            | GitHub
SGPT                   | Hugging Face  | 768/1024   | 512            | GitHub
MPNet                  | Microsoft     | 768        | 512            | Hugging Face
SBERT                  | UKP Lab       | 768        | 512            | GitHub
SimCSE                 | Princeton NLP | 768        | 512            | GitHub
CLIP                   | OpenAI        | 512        | 77             | GitHub
Claude Embeddings      | Anthropic     | 1024/12288 | 8K/200K        | Anthropic Documentation
Cohere Embeddings      | Cohere        | 1024/4096  | 512/8192       | Cohere Documentation