
Semantic Search: How Meaning-Based Retrieval Works

Published
4 min read

Semantic search is the retrieval mechanism in RAG. Getting it right is the difference between an AI assistant that finds the correct information and one that retrieves garbage and hands it to the LLM.

The Fundamental Problem with Keyword Search

BM25, the algorithm behind most traditional search engines, works by scoring documents based on how often query terms appear in them, adjusted for document length and collection-wide term frequency. It is well-engineered and fast. It also completely misses conceptual similarity.

Consider a legal document retrieval system. A lawyer searches for "breach of confidentiality obligations". The relevant case law uses the phrase "violation of non-disclosure terms". BM25 scores them as completely unrelated. The lawyer does not find the relevant precedent.

Semantic search resolves this because "breach" and "violation" live nearby in embedding space, as do "confidentiality obligations" and "non-disclosure terms".
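The mismatch is easy to make concrete. A crude term-overlap scorer (which BM25 refines with frequency and length weighting, but does not fundamentally change) sees almost nothing shared between the two phrasings:

```python
def term_overlap(query, doc):
    """Count shared terms: a minimal stand-in for keyword scoring."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# The lawyer's query and the relevant case law share only the stopword "of":
print(term_overlap("breach of confidentiality obligations",
                   "violation of non-disclosure terms"))  # prints 1
```

An embedding model, by contrast, places both phrases close together because it was trained on text where those terms appear in interchangeable contexts.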

The Complete Semantic Search Architecture

Indexing Pipeline (one-time or periodic):

  1. Load documents from your source (PDFs, databases, web pages)

  2. Split into chunks (discussed in the chunking video)

  3. Embed each chunk using an embedding model

  4. Store the chunk text and its embedding in a vector database

  5. Optionally store metadata (document title, date, author) alongside each chunk
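The five indexing steps can be sketched end to end. This is a minimal illustration, not a production implementation: the hash-based `embed` below is a toy stand-in for a real embedding model (an API call or a sentence-transformers model), the fixed-size word split stands in for a real chunking strategy, and the in-memory list stands in for a vector database.

```python
import hashlib
import math

def embed(text, dim=8):
    """Toy embedder: hash each word into a bucket, then L2-normalize.
    A real pipeline would call an actual embedding model here."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def index_documents(docs, chunk_size=50):
    """Steps 1-5: load, chunk, embed, and store text + embedding + metadata."""
    store = []
    for doc in docs:
        words = doc["text"].split()
        for i in range(0, len(words), chunk_size):  # step 2: split into chunks
            chunk = " ".join(words[i:i + chunk_size])
            store.append({
                "text": chunk,                       # step 4: chunk text
                "embedding": embed(chunk),           # step 3: embed chunk
                "metadata": {"title": doc["title"]}, # step 5: metadata
            })
    return store
```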

Query Pipeline (per request):

  1. Receive user query

  2. Embed the query using the same embedding model

  3. Execute approximate nearest neighbor search in the vector database

  4. Retrieve top-k chunks by cosine similarity

  5. Return chunks to the RAG pipeline for context assembly
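The query-side steps can be sketched the same way. The toy `embed` here is an assumed stand-in, deliberately identical to the one used at indexing; the linear scan over the store is what a vector database replaces with an ANN index. Because the toy vectors are unit-normalized, the dot product equals cosine similarity.

```python
import hashlib
import math

def embed(text, dim=8):
    """Toy embedder: must be the same model used when indexing."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def search(query, store, k=3):
    """Steps 2-5: embed the query, score every stored chunk by cosine
    similarity, return the top-k chunk texts."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, item["embedding"])), item)
        for item in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item["text"] for _, item in scored[:k]]
```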

The critical constraint: indexing and querying must use the same embedding model. Embeddings from different models live in different vector spaces and are not comparable.

Approximate Nearest Neighbor Search

Finding the truly closest vectors in a database of one million embeddings would require computing the distance to every stored embedding. At 1536 dimensions per vector, that is on the order of 1.5 billion floating-point operations per query.

ANN algorithms find the approximately nearest neighbors much faster by sacrificing a small amount of accuracy. The most common algorithms are:

HNSW (Hierarchical Navigable Small World): Builds a multi-layer graph structure. Search navigates from coarse to fine layers. Excellent recall with fast query times.

IVF (Inverted File Index): Partitions the vector space into clusters. Search examines only the most relevant clusters. Faster to build than HNSW, slightly lower recall.

In practice, you do not implement these yourself. Vector databases like Pinecone, Weaviate, Qdrant, and Chroma handle indexing internally.

Hybrid Search Implementation

Hybrid search combines dense (semantic) and sparse (keyword) retrieval:

Dense score: cosine similarity between query embedding and document embedding

Sparse score: BM25 or TF-IDF score based on keyword overlap

Final score: alpha * dense_score + (1 - alpha) * sparse_score

The alpha parameter controls the balance: alpha = 1.0 is pure semantic, alpha = 0.0 is pure keyword. Typical production values fall around 0.7 to 0.8, favoring the semantic signal.
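A minimal sketch of the blend, assuming both retrievers have already scored the same candidate list. One detail the formula glosses over: BM25 scores are unbounded while cosine sits in [-1, 1], so the two must be put on a comparable scale first; min-max normalization is one common choice.

```python
def minmax(scores):
    """Rescale a score list to [0, 1] so dense and sparse are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(dense_scores, sparse_scores, alpha=0.7):
    """final = alpha * dense + (1 - alpha) * sparse, over parallel score
    lists for the same candidates; returns candidate indices best-first."""
    d = minmax(dense_scores)
    s = minmax(sparse_scores)
    final = [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
    return sorted(range(len(final)), key=lambda i: final[i], reverse=True)
```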

Reciprocal Rank Fusion (RRF) is another combination method that merges ranked lists rather than scores, making it more robust to score scale differences.
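RRF itself fits in a few lines. Each ranked list contributes 1 / (k + rank) per document; because only ranks are used, raw score scales never need reconciling. The constant k = 60 comes from the original RRF paper and is a common default.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of document IDs.
    A document's fused score is the sum of 1 / (k + rank) over every list
    it appears in (rank starts at 1). Returns IDs best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```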

Failure Analysis

Failure Type 1 — Wrong embedding model: Using a general-purpose model for a specialized domain. Medical, legal, financial, and coding domains all benefit from domain-specific embedding models.

Failure Type 2 — Query-document style mismatch: Users ask questions in conversational language. Documents are written in formal technical language. The embedding distance is larger than it should be. Fix: HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer to the query, embed that, and search with the answer embedding instead of the question embedding.
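The HyDE flow is simple enough to sketch. Both helpers here are assumptions injected for illustration: `generate` is any LLM call that drafts a plausible answer, and `embed_and_search` embeds a text with the corpus's embedding model and returns the top-k chunks. The key move is that the search runs on the hypothetical answer's embedding, which sits stylistically closer to the formally written documents than the conversational question does.

```python
def hyde_retrieve(query, generate, embed_and_search, k=5):
    """HyDE sketch: search with the embedding of a generated answer
    rather than the embedding of the question itself."""
    hypothetical = generate(
        "Write a short passage that plausibly answers: " + query
    )
    return embed_and_search(hypothetical, k)
```

The hypothetical answer may be factually wrong; that is fine, since only its embedding is used to locate real documents.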

Failure Type 3 — Chunk granularity mismatch: If your chunks are too large, a chunk containing the relevant sentence also contains many irrelevant sentences. The embedding represents the average of all that content, diluting the signal. If chunks are too small, important context is lost.

Failure Type 4 — Reranking neglected: Top-k retrieval by cosine similarity is a coarse filter. A reranker model (cross-encoder) reads both the query and each candidate document together and produces a more accurate relevance score. Adding reranking consistently improves retrieval quality at moderate latency cost.
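The two-stage pattern can be sketched with the scorer injected as a parameter. In a real system `score_fn` would be a cross-encoder (for example, a sentence-transformers CrossEncoder's predict method); here it is passed in so the sketch stays self-contained.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Second-stage reranking: re-score the coarse top-k candidates with a
    model that reads query and document together, then keep the best."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc),
                    reverse=True)
    return ranked[:top_n]
```

The usual shape is retrieve a generous top-k (say 20 to 50) by cosine similarity, then rerank down to the handful of chunks that actually enter the prompt.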

When Semantic Search is Unnecessary

If your documents are small and your queries are exact lookups — serial numbers, error codes, proper nouns — keyword search is faster and more accurate. Semantic search is for the cases where users express their intent in natural language and the relevant documents use different vocabulary.