Retrieval Strategies

Dense Retrieval

Dense retrieval uses continuous, fixed-dimensional vectors (embeddings) to represent documents and queries. These dense vectors capture semantic meaning and contextual information.

Key Features:

  • Represents text as dense vectors (typically 768-4096 dimensions)
  • Captures semantic relationships well
  • Uses neural models like BERT, Sentence-BERT, or OpenAI’s text-embedding models
  • Excels at understanding synonyms and related concepts

Advantages:

  • Strong semantic understanding
  • Handles paraphrasing effectively
  • Good at capturing document meaning
  • Works well for concept-based queries

Disadvantages:

  • May miss exact keyword matches
  • Computationally intensive
  • Depends on a trained embedding model (adapting to a new domain requires labeled fine-tuning data)
  • Can struggle with rare terms or highly specialized vocabulary

Example Use Case: When a user searches for “heart problems,” dense retrieval can return documents about “cardiac issues” or “coronary artery disease” even if they don’t contain the exact search terms.
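
As a minimal sketch of how dense retrieval looks in code (the sentence-transformers package and the model name here are illustrative assumptions, not prescribed by this guide):

```python
# Dense retrieval sketch. Assumes the sentence-transformers package;
# "all-MiniLM-L6-v2" is an illustrative model choice, not a requirement.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The patient was treated for coronary artery disease.",
    "Our new laptop features a 14-inch display.",
    "Cardiac issues often require lifestyle changes.",
]

# Encode documents and query; L2-normalized vectors make the dot
# product below equal to cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("heart problems", normalize_embeddings=True)

scores = doc_vecs @ query_vec
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Note that the query "heart problems" shares no keywords with the medical documents; they still rank highest because the embeddings encode the semantic relationship.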

Sparse Retrieval

Sparse retrieval represents documents using high-dimensional, sparse vectors where most values are zero. Traditional methods like TF-IDF and BM25 fall into this category.

Key Features:

  • Relies on lexical matching (exact or stemmed word matches)
  • Creates sparse, high-dimensional vectors (dimensionality equal to the vocabulary size)
  • Uses traditional models like BM25, TF-IDF
  • Focuses on term frequency and importance

Advantages:

  • Excellent at exact matching
  • No training required
  • Handles rare words well
  • Computationally efficient

Disadvantages:

  • Misses semantic relationships
  • Cannot understand synonyms
  • Sensitive to vocabulary mismatch
  • Struggles with polysemy (words with multiple meanings)

Example Use Case: When searching for specific technical terms, product names, or error codes, sparse retrieval ensures exact matches are prioritized.
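
Below is a compact, self-contained BM25 scorer written from the standard Okapi formula; the toy corpus and the default k1/b values are purely illustrative:

```python
# Self-contained BM25 (Okapi) scorer written from the standard formula;
# k1 and b are the usual defaults, and the toy corpus is illustrative.
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["error code E404 not found",
        "semantic search with embeddings",
        "troubleshooting error E404 on startup"]
print(bm25_scores("error E404".split(), [d.split() for d in docs]))
```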

Hybrid Retrieval

Hybrid retrieval combines sparse and dense methods to leverage the strengths of both approaches, aiming for both semantic understanding and lexical precision.

Key Features:

  • Combines signals from both sparse and dense retrievers
  • Can use methods like linear combination, late fusion, or learned fusion
  • May employ different weighting schemes for different query types
  • Often includes a classification step to determine query intent

Advantages:

  • Better overall performance than either method alone
  • More robust to different query types
  • Balances precision and recall
  • Handles both semantic and exact matching needs

Disadvantages:

  • More complex to implement and maintain
  • Requires tuning of combination weights
  • Potentially higher computational cost
  • May need additional training for optimal fusion

Example Use Case: Enterprise search systems often employ hybrid approaches to handle both exact product number searches and conceptual queries about product features or use cases.
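
A simple linear-combination fusion sketch, assuming scores from a sparse and a dense retriever over the same candidate set have already been computed (the score values below are placeholders):

```python
# Hybrid fusion via a weighted linear combination of min-max normalized
# scores. The values are placeholders; in practice they would come from
# a BM25 retriever and an embedding model over the same candidates.
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def hybrid_scores(sparse_scores, dense_scores, alpha=0.5):
    # alpha = 1.0 -> purely dense; alpha = 0.0 -> purely sparse.
    return alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse_scores)

sparse = [12.1, 0.0, 7.3]    # e.g., BM25 scores
dense = [0.42, 0.77, 0.51]   # e.g., cosine similarities
print(hybrid_scores(sparse, dense, alpha=0.6))
```

The normalization step matters because BM25 scores and cosine similarities live on different scales; alpha is exactly the combination weight that needs tuning, as noted above.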

Reranking

Reranking is a two-stage process where an initial retrieval step returns a candidate set of documents, which are then reordered using a more sophisticated (and often more computationally expensive) model.

Key Features:

  • Two-stage pipeline: retrieval → reranking
  • Initial retrieval is optimized for recall
  • Reranker focuses on precision and relevance
  • Often uses cross-attention between query and document

Advantages:

  • Balances efficiency and effectiveness
  • Allows use of complex models only on a small candidate set
  • Can incorporate additional signals (click data, freshness, etc.)
  • Often achieves state-of-the-art performance

Disadvantages:

  • Adds system complexity
  • Introduces additional latency
  • Limited by initial retriever’s recall
  • May require separate training for retriever and reranker

Example Use Case: In legal search, a first-pass retrieval might return hundreds of potentially relevant cases, which a reranker then prioritizes based on detailed analysis of legal arguments and precedents.
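
A sketch of the two-stage pipeline, assuming the sentence-transformers package; the cross-encoder model name is illustrative, and first_pass() is a hypothetical stand-in for any sparse, dense, or hybrid retriever:

```python
# Two-stage pipeline sketch: a cheap first-pass retriever returns
# candidates, then a cross-encoder rescores them jointly with the query.
from sentence_transformers import CrossEncoder

def first_pass(query, corpus, k=100):
    # Placeholder: swap in BM25, dense, or hybrid retrieval here.
    return corpus[:k]

corpus = [
    "ruling on termination of an employment contract",
    "patent dispute precedent in district court",
    "case law on breach of commercial contracts",
]
query = "breach of employment contract"

candidates = first_pass(query, corpus)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# The cross-encoder attends over each (query, document) pair together,
# which is more accurate but far too slow to run over a whole corpus.
scores = reranker.predict([(query, doc) for doc in candidates])

for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```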

Distance Metrics

Manhattan Distance (L1 Norm)

Manhattan distance (also called taxicab or L1 distance) measures the sum of absolute differences between vector dimensions.

Mathematical Definition:

L1(x, y) = Σ|x_i - y_i|

Key Features:

  • Sensitive to differences in all dimensions
  • Less influenced by outliers than Euclidean distance
  • Distance increases linearly with difference in each dimension
  • Named after the grid-like street layout of Manhattan

Advantages:

  • Simple to calculate and interpret
  • Works well for sparse, high-dimensional data
  • Less sensitive to outliers than L2
  • Good for feature selection

Disadvantages:

  • Not invariant to scale
  • Distance is dependent on coordinate system
  • May not capture semantic relationships well
  • Not ideal for dense embeddings in NLP

Example Use Case: Manhattan distance is often used in geographic applications, logistics, and some recommendation systems where differences in each feature should be weighted equally.
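
For concreteness, the definition above translates directly to code:

```python
# Manhattan (L1) distance computed directly from the definition above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(np.sum(np.abs(x - y)))          # |1-4| + |2-0| + |3-3| = 5.0
print(np.linalg.norm(x - y, ord=1))   # same result via the norm helper
```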

Euclidean Distance (L2 Norm)

Euclidean distance measures the straight-line distance between two points in Euclidean space.

Mathematical Definition:

L2(x, y) = √(Σ(x_i - y_i)²)

Key Features:

  • Represents physical distance in space
  • Sensitive to magnitude differences
  • Squares differences, increasing impact of large differences
  • Most intuitive distance metric for humans

Advantages:

  • Intuitive geometric interpretation
  • Works well for low-dimensional data
  • Suitable when features are on similar scales
  • Standard metric for many clustering algorithms

Disadvantages:

  • Sensitive to outliers
  • Not scale-invariant
  • Performance degrades in high dimensions (curse of dimensionality)
  • May overemphasize large differences in specific dimensions

Example Use Case: Image similarity, especially when comparing visual features or when working with dimensionality-reduced data (like PCA components).
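
A short example, including a case where L1 and L2 disagree about which point is farther:

```python
# Euclidean (L2) distance, plus a demonstration that squaring lets a
# single large per-dimension difference dominate: under L2, b is
# farther from the origin than a, while under L1 it is closer.
import numpy as np

origin = np.zeros(4)
a = np.array([2.0, 2.0, 2.0, 2.0])  # moderate difference in every dimension
b = np.array([5.0, 0.0, 0.0, 0.0])  # one large difference

print(np.linalg.norm(a - origin))          # 4.0
print(np.linalg.norm(b - origin))          # 5.0  -> farther under L2
print(np.linalg.norm(a - origin, ord=1))   # 8.0  -> farther under L1
print(np.linalg.norm(b - origin, ord=1))   # 5.0
```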

Cosine Similarity (and Distance)

Cosine similarity measures the cosine of the angle between two vectors, focusing on direction rather than magnitude.

Mathematical Definition:

Cosine Similarity(x, y) = (x·y)/(||x||·||y||)
Cosine Distance = 1 - Cosine Similarity

Key Features:

  • Measures angle between vectors (orientation)
  • Ranges from -1 (opposite) to 1 (same direction)
  • Independent of vector magnitude
  • Most common for NLP embeddings

Advantages:

  • Invariant to scaling
  • Works well for high-dimensional data
  • Ideal for text similarity where document length shouldn’t matter
  • Standard for most embedding models

Disadvantages:

  • Doesn’t account for magnitude differences
  • Not a true distance metric (doesn’t satisfy triangle inequality)
  • May not be suitable when absolute values matter
  • Less intuitive interpretation than Euclidean distance

Example Use Case: Document similarity, semantic search, and most neural text embeddings use cosine similarity as their primary comparison metric.
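
A small worked example, including the scale-invariance property noted above:

```python
# Cosine similarity from the definition, with a check that scaling a
# vector (e.g., a longer document with proportionally higher term
# weights) leaves the similarity unchanged.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])

print(cosine_similarity(x, y))        # ~0.9926
print(cosine_similarity(x, 10 * y))   # identical: magnitude is ignored
print(1 - cosine_similarity(x, y))    # cosine distance
```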

Dot Product

The dot product is the sum of the products of corresponding vector components, measuring both direction and magnitude alignment.

Mathematical Definition:

Dot Product(x, y) = Σ(x_i · y_i)

Key Features:

  • Affected by both angle and magnitude
  • Unbounded range (depends on vector magnitudes)
  • Can be negative, zero, or positive
  • Computationally efficient

Advantages:

  • Very fast to compute
  • Used in many machine learning algorithms
  • Good when both direction and magnitude matter
  • Easily optimized in hardware

Disadvantages:

  • Not normalized (affected by vector length)
  • Results not comparable across different vector pairs
  • No fixed range makes interpretation challenging
  • Only meaningful when vectors have comparable magnitudes

Example Use Case: Often used in recommendation systems, attention mechanisms in transformers, and as an intermediate calculation in many algorithms before normalization.
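
A short example contrasting the dot product with cosine similarity on the same vectors:

```python
# Dot product: the same angle but a larger magnitude yields a larger
# score, unlike cosine similarity.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])

print(np.dot(x, y))        # 20.0
print(np.dot(x, 10 * y))   # 200.0 -- grows with magnitude
# On L2-normalized vectors the dot product equals cosine similarity,
# which is why many vector stores keep embeddings unit-length.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(np.dot(xn, yn))      # ~0.9926
```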

Choosing the Right Approach

The optimal retrieval strategy and similarity function depend on your specific use case:

Use Case                 | Recommended Retrieval  | Recommended Similarity
-------------------------|------------------------|-----------------------
General semantic search  | Dense or Hybrid        | Cosine Similarity
Technical documentation  | Hybrid with reranking  | Cosine + BM25
Product search           | Hybrid                 | Depends on embeddings
Question answering       | Dense with reranking   | Cosine Similarity
Code search              | Sparse + Dense hybrid  | Manhattan or Cosine
Legal document search    | Hybrid with reranking  | Multiple metrics

Remember that the best approach is often determined through experimentation and evaluation on your specific data and user queries.