Retrieval Strategies

Dense Retrieval

Dense retrieval uses continuous, fixed-dimensional vectors (embeddings) to represent documents and queries. These dense vectors capture semantic meaning and contextual information.

Key Features:

  • Represents text as dense vectors (typically 768-4096 dimensions)
  • Captures semantic relationships well
  • Uses neural models like BERT, Sentence-BERT, or OpenAI’s text-embedding models
  • Excels at understanding synonyms and related concepts

Advantages:

  • Strong semantic understanding
  • Handles paraphrasing effectively
  • Good at capturing document meaning
  • Works well for concept-based queries

Disadvantages:

  • May miss exact keyword matches
  • Computationally intensive
  • Depends on a trained embedding model (adapting to a new domain requires labeled fine-tuning data)
  • Can struggle with rare terms or highly specialized vocabulary

Example Use Case: When a user searches for “heart problems,” dense retrieval can return documents about “cardiac issues” or “coronary artery disease” even if they don’t contain the exact search terms.
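
As a minimal sketch of how dense retrieval looks in code (the sentence-transformers package and the model name here are illustrative assumptions, not prescribed by this guide):

```python
# Dense retrieval sketch. Assumes the sentence-transformers package;
# "all-MiniLM-L6-v2" is an illustrative model choice, not a requirement.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The patient was treated for coronary artery disease.",
    "Our new laptop features a 14-inch display.",
    "Cardiac issues often require lifestyle changes.",
]

# Encode documents and query; L2-normalized vectors make the dot
# product below equal to cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("heart problems", normalize_embeddings=True)

scores = doc_vecs @ query_vec
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Note that the query "heart problems" shares no keywords with the medical documents; they still rank highest because the embeddings encode the semantic relationship.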

Sparse Retrieval

Sparse retrieval represents documents using high-dimensional, sparse vectors where most values are zero. Traditional methods like TF-IDF and BM25 fall into this category.

Key Features:

  • Relies on lexical matching (exact or stemmed word matches)
  • Creates sparse, high-dimensional vectors (dimensionality equal to the vocabulary size)
  • Uses traditional models like BM25, TF-IDF
  • Focuses on term frequency and importance

Advantages:

  • Excellent at exact matching
  • No training required
  • Handles rare words well
  • Computationally efficient

Disadvantages:

  • Misses semantic relationships
  • Cannot understand synonyms
  • Sensitive to vocabulary mismatch
  • Struggles with polysemy (words with multiple meanings)

Example Use Case: When searching for specific technical terms, product names, or error codes, sparse retrieval ensures exact matches are prioritized.
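
Below is a compact, self-contained BM25 scorer written from the standard Okapi formula; the toy corpus and the default k1/b values are purely illustrative:

```python
# Self-contained BM25 (Okapi) scorer written from the standard formula;
# k1 and b are the usual defaults, and the toy corpus is illustrative.
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["error code E404 not found",
        "semantic search with embeddings",
        "troubleshooting error E404 on startup"]
print(bm25_scores("error E404".split(), [d.split() for d in docs]))
```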

Hybrid Retrieval

Hybrid retrieval combines sparse and dense methods to leverage the strengths of both approaches, aiming for both semantic understanding and lexical precision.

Key Features:

  • Combines signals from both sparse and dense retrievers
  • Can use methods like linear combination, late fusion, or learned fusion
  • May employ different weighting schemes for different query types
  • Often includes a classification step to determine query intent

Advantages:

  • Better overall performance than either method alone
  • More robust to different query types
  • Balances precision and recall
  • Handles both semantic and exact matching needs

Disadvantages:

  • More complex to implement and maintain
  • Requires tuning of combination weights
  • Potentially higher computational cost
  • May need additional training for optimal fusion

Example Use Case: Enterprise search systems often employ hybrid approaches to handle both exact product number searches and conceptual queries about product features or use cases.
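
A simple linear-combination fusion sketch, assuming scores from a sparse and a dense retriever over the same candidate set have already been computed (the score values below are placeholders):

```python
# Hybrid fusion via a weighted linear combination of min-max normalized
# scores. The values are placeholders; in practice they would come from
# a BM25 retriever and an embedding model over the same candidates.
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def hybrid_scores(sparse_scores, dense_scores, alpha=0.5):
    # alpha = 1.0 -> purely dense; alpha = 0.0 -> purely sparse.
    return alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse_scores)

sparse = [12.1, 0.0, 7.3]    # e.g., BM25 scores
dense = [0.42, 0.77, 0.51]   # e.g., cosine similarities
print(hybrid_scores(sparse, dense, alpha=0.6))
```

The normalization step matters because BM25 scores and cosine similarities live on different scales; alpha is exactly the combination weight that needs tuning, as noted above.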

Reranking

Reranking is a two-stage process where an initial retrieval step returns a candidate set of documents, which are then reordered using a more sophisticated (and often more computationally expensive) model.

Key Features:

  • Two-stage pipeline: retrieval → reranking
  • Initial retrieval is optimized for recall
  • Reranker focuses on precision and relevance
  • Often uses cross-attention between query and document

Advantages:

  • Balances efficiency and effectiveness
  • Allows use of complex models only on a small candidate set
  • Can incorporate additional signals (click data, freshness, etc.)
  • Often achieves state-of-the-art performance

Disadvantages:

  • Adds system complexity
  • Introduces additional latency
  • Limited by initial retriever’s recall
  • May require separate training for retriever and reranker

Example Use Case: In legal search, a first-pass retrieval might return hundreds of potentially relevant cases, which a reranker then prioritizes based on detailed analysis of legal arguments and precedents.
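
A sketch of the two-stage pipeline, assuming the sentence-transformers package; the cross-encoder model name is illustrative, and first_pass() is a hypothetical stand-in for any sparse, dense, or hybrid retriever:

```python
# Two-stage pipeline sketch: a cheap first-pass retriever returns
# candidates, then a cross-encoder rescores them jointly with the query.
from sentence_transformers import CrossEncoder

def first_pass(query, corpus, k=100):
    # Placeholder: swap in BM25, dense, or hybrid retrieval here.
    return corpus[:k]

corpus = [
    "ruling on termination of an employment contract",
    "patent dispute precedent in district court",
    "case law on breach of commercial contracts",
]
query = "breach of employment contract"

candidates = first_pass(query, corpus)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# The cross-encoder attends over each (query, document) pair together,
# which is more accurate but far too slow to run over a whole corpus.
scores = reranker.predict([(query, doc) for doc in candidates])

for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```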

Distance Metrics

Manhattan Distance (L1 Norm)

Manhattan distance (also called taxicab or L1 distance) measures the sum of absolute differences between vector dimensions.

Mathematical Definition:

L1(x, y) = Σ|x_i - y_i|

Key Features:

  • Sensitive to differences in all dimensions
  • Less influenced by outliers than Euclidean distance
  • Distance increases linearly with difference in each dimension
  • Named after the grid-like street layout of Manhattan

Advantages:

  • Simple to calculate and interpret
  • Works well for sparse, high-dimensional data
  • Less sensitive to outliers than L2
  • Good for feature selection

Disadvantages:

  • Not invariant to scale
  • Distance is dependent on coordinate system
  • May not capture semantic relationships well
  • Not ideal for dense embeddings in NLP

Example Use Case: Manhattan distance is often used in geographic applications, logistics, and some recommendation systems where differences in each feature should be weighted equally.
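
For concreteness, the definition above translates directly to code:

```python
# Manhattan (L1) distance computed directly from the definition above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(np.sum(np.abs(x - y)))          # |1-4| + |2-0| + |3-3| = 5.0
print(np.linalg.norm(x - y, ord=1))   # same result via the norm helper
```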

Euclidean Distance (L2 Norm)

Euclidean distance measures the straight-line distance between two points in Euclidean space.

Mathematical Definition:

L2(x, y) = √(Σ(x_i - y_i)²)

Key Features:

  • Represents physical distance in space
  • Sensitive to magnitude differences
  • Squares differences, increasing impact of large differences
  • Most intuitive distance metric for humans

Advantages:

  • Intuitive geometric interpretation
  • Works well for low-dimensional data
  • Suitable when features are on similar scales
  • Standard metric for many clustering algorithms

Disadvantages:

  • Sensitive to outliers
  • Not scale-invariant
  • Performance degrades in high dimensions (curse of dimensionality)
  • May overemphasize large differences in specific dimensions

Example Use Case: Image similarity, especially when comparing visual features or when working with dimensionality-reduced data (like PCA components).
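
A short example, including a case where L1 and L2 disagree about which point is farther:

```python
# Euclidean (L2) distance, plus a demonstration that squaring lets a
# single large per-dimension difference dominate: under L2, b is
# farther from the origin than a, while under L1 it is closer.
import numpy as np

origin = np.zeros(4)
a = np.array([2.0, 2.0, 2.0, 2.0])  # moderate difference in every dimension
b = np.array([5.0, 0.0, 0.0, 0.0])  # one large difference

print(np.linalg.norm(a - origin))          # 4.0
print(np.linalg.norm(b - origin))          # 5.0  -> farther under L2
print(np.linalg.norm(a - origin, ord=1))   # 8.0  -> farther under L1
print(np.linalg.norm(b - origin, ord=1))   # 5.0
```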

Cosine Similarity (and Distance)

Cosine similarity measures the cosine of the angle between two vectors, focusing on direction rather than magnitude.

Mathematical Definition:

Cosine Similarity(x, y) = (x·y)/(||x||·||y||)
Cosine Distance = 1 - Cosine Similarity

Key Features:

  • Measures angle between vectors (orientation)
  • Ranges from -1 (opposite) to 1 (same direction)
  • Independent of vector magnitude
  • Most common for NLP embeddings

Advantages:

  • Invariant to scaling
  • Works well for high-dimensional data
  • Ideal for text similarity where document length shouldn’t matter
  • Standard for most embedding models

Disadvantages:

  • Doesn’t account for magnitude differences
  • Not a true distance metric (doesn’t satisfy triangle inequality)
  • May not be suitable when absolute values matter
  • Less intuitive interpretation than Euclidean distance

Example Use Case: Document similarity, semantic search, and most neural text embeddings use cosine similarity as their primary comparison metric.
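
A small worked example, including the scale-invariance property noted above:

```python
# Cosine similarity from the definition, with a check that scaling a
# vector (e.g., a longer document with proportionally higher term
# weights) leaves the similarity unchanged.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])

print(cosine_similarity(x, y))        # ~0.9926
print(cosine_similarity(x, 10 * y))   # identical: magnitude is ignored
print(1 - cosine_similarity(x, y))    # cosine distance
```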

Dot Product

The dot product is the sum of the products of corresponding vector components, measuring both direction and magnitude alignment.

Mathematical Definition:

Dot Product(x, y) = Σ(x_i · y_i)

Key Features:

  • Affected by both angle and magnitude
  • Unbounded range (depends on vector magnitudes)
  • Can be negative, zero, or positive
  • Computationally efficient

Advantages:

  • Very fast to compute
  • Used in many machine learning algorithms
  • Good when both direction and magnitude matter
  • Easily optimized in hardware

Disadvantages:

  • Not normalized (affected by vector length)
  • Results not comparable across different vector pairs
  • No fixed range makes interpretation challenging
  • Only meaningful when vectors have comparable magnitudes

Example Use Case: Often used in recommendation systems, attention mechanisms in transformers, and as an intermediate calculation in many algorithms before normalization.
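
A short example contrasting the dot product with cosine similarity on the same vectors:

```python
# Dot product: the same angle but a larger magnitude yields a larger
# score, unlike cosine similarity.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])

print(np.dot(x, y))        # 20.0
print(np.dot(x, 10 * y))   # 200.0 -- grows with magnitude
# On L2-normalized vectors the dot product equals cosine similarity,
# which is why many vector stores keep embeddings unit-length.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(np.dot(xn, yn))      # ~0.9926
```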

Choosing the Right Approach

The optimal retrieval strategy and similarity function depend on your specific use case:

Use Case                 | Recommended Retrieval  | Recommended Similarity
-------------------------|------------------------|-----------------------
General semantic search  | Dense or Hybrid        | Cosine Similarity
Technical documentation  | Hybrid with reranking  | Cosine + BM25
Product search           | Hybrid                 | Depends on embeddings
Question answering       | Dense with reranking   | Cosine Similarity
Code search              | Sparse + Dense hybrid  | Manhattan or Cosine
Legal document search    | Hybrid with reranking  | Multiple metrics

Remember that the best approach is often determined through experimentation and evaluation on your specific data and user queries.