1. Indexing

Indexing prepares the knowledge base for efficient retrieval by processing and storing documents in a structured format.

Load

  • Purpose: Ingest raw data from various sources and normalize it for further processing.
  • Common Data Sources:
    • Structured: Databases, APIs
    • Semi-structured: JSON, XML, CSV
    • Unstructured: PDFs, DOCX, HTML, scraped web pages
  • Best Practices:
    • Extract metadata (e.g., author, timestamps) to enhance retrieval.
    • Normalize text (remove special characters, fix encoding issues).
    • Detect language and preprocess accordingly (stemming, lemmatization).
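The normalization step above can be sketched in Python using only the standard library. `normalize_text` and the sample document dict are illustrative names, not part of any specific framework:

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize ingested text: repair Unicode compatibility forms,
    strip control/format characters (keeping newlines and tabs),
    and collapse runs of spaces."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. no-break space -> space
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    text = re.sub(r"[ \t]+", " ", text)        # collapse whitespace runs
    return text.strip()

# Attach metadata at load time so it travels with the document downstream.
doc = {
    "text": normalize_text("Caf\u00e9\u200b  menu"),
    "metadata": {"source": "menu.html", "ingested_at": "2024-01-01"},
}
```

Doing this once at load time keeps every later stage (splitting, embedding) working on clean, consistent text.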

Split

  • Purpose: Break documents into manageable chunks for improved retrieval accuracy.
  • Splitting Strategies:
    • Fixed-size chunking (e.g., 512 tokens per chunk).
    • Overlapping chunks (ensure context continuity across chunk boundaries).
    • Semantic chunking (splitting based on topic shifts or sentence boundaries).
  • Best Practices:
    • Balance chunk size: chunks that are too large dilute retrieval relevance, while chunks that are too small lose surrounding context.
    • Use metadata tags to track chunk origin and relationships.
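A minimal sketch of fixed-size chunking with overlap, combining the strategies and metadata advice above. It counts characters for simplicity; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    """Split text into fixed-size chunks with overlap so context
    carries across chunk boundaries. Each chunk records its origin
    offset as metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"id": i, "start": start, "text": piece})
    return chunks
```

With `chunk_size=512` and `overlap=64`, each chunk repeats the last 64 characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.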

Store

  • Purpose: Convert text into vector embeddings and store them in a vector database for efficient similarity search.
  • Key Components:
    • Embedding Model: OpenAI, Cohere, Sentence-BERT, etc.
    • Vector Store: FAISS, Pinecone, Weaviate, ChromaDB, Qdrant.
  • Best Practices:
    • Use dense embeddings (transformer-based models) for semantic search.
    • Enable hybrid search (combine vector-based and keyword-based search).
    • Perform periodic re-indexing to keep data up to date.
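To make the store step concrete, here is a toy in-memory vector store using cosine similarity over NumPy arrays. It stands in for a production vector database such as FAISS or Qdrant, and the class and method names are illustrative, not any library's API:

```python
import numpy as np

class InMemoryVectorStore:
    """Toy vector store: cosine-similarity search over unit-normalized
    embeddings. A real deployment would delegate this to FAISS,
    Pinecone, Weaviate, ChromaDB, or Qdrant."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim))
        self.payloads: list[dict] = []

    def add(self, vector: np.ndarray, payload: dict) -> None:
        v = vector / np.linalg.norm(vector)   # normalize once at index time
        self.vectors = np.vstack([self.vectors, v])
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[float, dict]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q             # cosine similarity via dot product
        top = np.argsort(scores)[::-1][:k]
        return [(float(scores[i]), self.payloads[i]) for i in top]
```

Normalizing vectors at index time lets cosine similarity reduce to a dot product, which is the same trick FAISS users apply with an inner-product index.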

2. Retrieval and Generation

This stage retrieves relevant information and generates context-aware responses.

Retrieve

  • Purpose: Fetch the most relevant documents from the indexed knowledge base.
  • Retrieval Strategies:
    1. Dense Retrieval (Vector Search)
      • Uses embeddings and similarity search (e.g., cosine similarity, dot product).
      • Works well for semantic understanding of queries.
    2. Sparse Retrieval (Keyword Search)
      • Uses traditional NLP-based methods (e.g., BM25, TF-IDF).
      • Effective for precise keyword-based matching.
    3. Hybrid Retrieval (Dense + Sparse)
      • Combines keyword-based and vector search for more robust results.
      • Example: Re-rank BM25 results using embeddings.
    4. Cross-encoder Re-ranking
      • Uses an additional transformer model to re-rank retrieved documents.
      • Ensures higher relevance before passing results to the LLM.
  • Best Practices:
    • Use query expansion (synonyms, paraphrasing) to improve recall.
    • Implement multi-stage retrieval (fast initial retrieval → fine-grained re-ranking).
    • Optimize for latency vs. accuracy (re-ranking adds computation time).
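The hybrid strategy above is often implemented with Reciprocal Rank Fusion (RRF), which merges ranked lists (say, one from BM25 and one from vector search) without requiring their scores to be comparable. A minimal sketch; the `k=60` default is a common convention, not tied to any particular library:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs using RRF:
    score(d) = sum over lists of 1 / (k + rank of d in that list).
    Documents ranked highly by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder re-ranker would then rescore only the top few fused results, keeping the expensive transformer pass off the full candidate set.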

Generate

  • Purpose: Generate a context-aware response by combining retrieved information with the user query.
  • Steps:
    1. Construct a prompt that integrates retrieved documents.
    2. Pass the prompt to the LLM for response generation.
    3. Apply post-processing (hallucination detection, formatting).
  • Best Practices:
    • Use prompt engineering (e.g., retrieval-augmented templates).
    • Implement response validation (checking for factual consistency).
    • Provide citations (linking generated content back to sources).
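Step 1 (prompt construction) can be sketched as follows. The template wording and the `[n]` citation convention are illustrative assumptions to adapt per model, not a fixed standard:

```python
def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble a retrieval-augmented prompt: number each retrieved
    chunk so the model can cite its sources, then append the query."""
    context = "\n\n".join(
        f"[{i}] (source: {doc['source']})\n{doc['text']}"
        for i, doc in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Instructing the model to admit when the context is insufficient is a cheap first line of defense against hallucination; the numbered sources make the citation step in post-processing straightforward.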

Commonly used Embedding Models

Model                  | Organization  | Dimensions | Context Length | Link
-----------------------|---------------|------------|----------------|------------------------
text-embedding-3-large | OpenAI        | 3072       | 8191           | OpenAI Documentation
text-embedding-3-small | OpenAI        | 1536       | 8191           | OpenAI Documentation
text-embedding-ada-002 | OpenAI        | 1536       | 8191           | OpenAI Documentation
sentence-transformers  | Hugging Face  | Varies     | Varies         | GitHub
BERT                   | Google        | 768/1024   | 512            | GitHub
E5                     | Microsoft     | 1024       | 512            | GitHub
GTE                    | Alibaba       | 768        | 512            | GitHub
BGE                    | BAAI          | 768/1024   | 512            | GitHub
INSTRUCTOR             | Hugging Face  | 768        | 512            | GitHub
SGPT                   | Hugging Face  | 768/1024   | 512            | GitHub
MPNet                  | Microsoft     | 768        | 512            | Hugging Face
SBERT                  | UKP Lab       | 768        | 512            | GitHub
SimCSE                 | Princeton NLP | 768        | 512            | GitHub
CLIP                   | OpenAI        | 512        | 77             | GitHub
Claude Embeddings      | Anthropic     | 1024/12288 | 8K/200K        | Anthropic Documentation
Cohere Embeddings      | Cohere        | 1024/4096  | 512/8192       | Cohere Documentation