
Vectors and Embeddings: The Mathematical Foundation of AI Systems

Before you can build a RAG system, design a semantic search engine, or understand why vector databases exist, you need to understand embeddings. This is the concept that makes everything else work.

Why Text Must Become Numbers

Neural networks operate on numbers. Every layer is a matrix multiplication. There is no mechanism for processing the string "customer complaint" directly — it must be represented numerically first.

The naive approach is one-hot encoding: assign each word a unique index and represent it as a vector that is all zeros except for a single 1 at that index (a sentence becomes a sequence of such vectors). The problem is that this representation carries no semantic information. The vectors for "happy" and "joyful" are orthogonal, exactly as unrelated as "happy" and "carburetor", so you cannot compute any meaningful similarity between them.
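
A tiny sketch (with a made-up toy vocabulary) makes the problem concrete: the dot product between any two distinct one-hot vectors is zero, so every pair of different words looks equally unrelated.

```python
import numpy as np

# Toy vocabulary; the index assignments are arbitrary
vocab = {"happy": 0, "joyful": 1, "carburetor": 2}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# Any two distinct one-hot vectors are orthogonal, so "happy" looks exactly
# as unrelated to "joyful" as it does to "carburetor".
print(one_hot("happy") @ one_hot("joyful"))      # 0.0
print(one_hot("happy") @ one_hot("carburetor"))  # 0.0
```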

Embeddings solve this by representing text as a dense vector in a high-dimensional space, where position encodes meaning.

How Embedding Models Work

Embedding models are neural networks trained on large text corpora. The training objective teaches the model to produce similar vectors for texts that appear in similar contexts and dissimilar vectors for texts that do not.

The training process is not supervised in the traditional sense — there are no human-labeled similarity scores. Instead, techniques like contrastive learning or masked language modeling create a self-supervised signal that results in semantically meaningful vector spaces.

Popular embedding models:

  • text-embedding-ada-002 (OpenAI) — 1536 dimensions, widely used

  • text-embedding-3-large (OpenAI) — higher quality, more expensive

  • BGE models — open source, competitive quality

  • E5 models — strong for retrieval tasks

  • Sentence Transformers — flexible open source family
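
As a concrete illustration, here is a minimal sketch using the open-source Sentence Transformers family from the list above; the model name all-MiniLM-L6-v2 is just one common choice, and the exact scores will vary from model to model.

```python
from sentence_transformers import SentenceTransformer

# Load a small, widely used open-source embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The customer filed a complaint about late delivery.",
    "A buyer reported that their order arrived late.",
    "The recipe calls for two cups of flour.",
]

# encode() returns one dense vector per input sentence
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this model

# With normalized vectors, the dot product equals the cosine similarity
print(embeddings[0] @ embeddings[1])  # high: same meaning, different words
print(embeddings[0] @ embeddings[2])  # low: unrelated topics
```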

The Geometry of Meaning

The vector space produced by embedding models has remarkable geometric properties.

Semantic similarity: Words and phrases with similar meanings cluster together. Medical terms cluster. Legal terms cluster. Programming terms cluster.

Analogy structure: Relationships between concepts are encoded as vectors. The vector from "Paris" to "France" is approximately the same as the vector from "Tokyo" to "Japan". Geographic capital relationships are represented as a consistent geometric transformation.
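
You can probe this structure yourself. The sketch below uses gensim's pretrained GloVe word vectors, where the effect was first popularized; the model name and the exact neighbors are illustrative, and the analogy holds only approximately.

```python
import gensim.downloader as api

# Download small pretrained GloVe word vectors (lowercased vocabulary)
wv = api.load("glove-wiki-gigaword-100")

# "paris" is to "france" as "tokyo" is to ... ?
# Vector arithmetic: france - paris + tokyo should land near "japan".
result = wv.most_similar(positive=["france", "tokyo"], negative=["paris"], topn=3)
print(result)  # typically ranks "japan" first
```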

Compositionality: You can combine embeddings. "Not happy" should produce an embedding that is directionally opposite to "happy" in the relevant semantic dimensions.

Cosine Similarity in Detail

Given two vectors A and B, cosine similarity is:

cosine_similarity(A, B) = (A dot B) / (|A| * |B|)

The dot product measures how much the vectors point in the same direction. Dividing by the magnitudes normalizes the result to remove the effect of vector length.
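
The formula translates directly into a few lines of NumPy. This is a generic sketch, not tied to any particular embedding model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
c = np.array([-1.0, 0.0, 1.0])  # different direction

print(cosine_similarity(a, b))  # 1.0: direction matters, length does not
print(cosine_similarity(a, c))  # much lower: the vectors point differently
```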

Why cosine rather than Euclidean distance? Cosine similarity is rotation-invariant and handles the high dimensionality of embedding spaces better. Two texts that say the same thing using a different number of words will have vectors of different magnitudes but similar directions; cosine similarity captures this correctly.

Failure Modes

Out-of-domain text: A general embedding model trained on web content will produce poor embeddings for highly specialized text. A sentence from a patent filing may not land near semantically similar sentences from other legal documents if the model has never seen that style of writing.

Short text: Very short inputs — a word or two — produce embeddings with high variance. There is not enough context for the model to produce a stable representation.

Multilingual issues: If your query is in Tamil and your documents are in English, a multilingual embedding model is required. Standard English-only models will not capture cross-language semantic similarity.

Context sensitivity: Standard word embedding models assign one fixed vector per word regardless of context. "Lead" (the metal) and "lead" (to guide) get the same embedding. Contextual models like BERT compute token-level embeddings that depend on the full sentence, resolving this ambiguity.
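
A quick way to see the difference, sketched here with Hugging Face Transformers and bert-base-uncased (an illustrative choice; any contextual encoder behaves similarly):

```python
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_metal = token_vector("the pipe was made of lead", "lead")
v_guide = token_vector("she will lead the team today", "lead")

# The same surface word gets different vectors in different contexts
print(torch.nn.functional.cosine_similarity(v_metal, v_guide, dim=0).item())
```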

Practical Considerations

Embedding generation has a cost — both money and time. You pay per token for API-based embedding services.

Cache your embeddings. If the same text will be embedded multiple times, store the result and reuse it. We cover this in the caching video.
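
A minimal caching sketch is shown below; the SHA-256 keys and the in-memory dict are illustrative, and a real system would persist the cache to disk or a key-value store. Note that it also batches the cache misses, which leads into the next point.

```python
import hashlib

# In-memory cache; in production you would persist this (disk, Redis, etc.)
_cache = {}

def _key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_with_cache(texts, embed_batch):
    """embed_batch is any function mapping a list of strings to a list of
    vectors (hypothetical; plug in your embedding client here)."""
    missing = [t for t in texts if _key(t) not in _cache]
    if missing:
        for text, vector in zip(missing, embed_batch(missing)):
            _cache[_key(text)] = vector
    return [_cache[_key(t)] for t in texts]
```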

Batch your embedding requests. APIs like OpenAI accept arrays of strings in a single call. Calling the API once with 100 strings is far faster than 100 individual calls.
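
For example, with the OpenAI Python client (the model name is one option among several; check current availability and limits on batch size):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = [f"document chunk {i}" for i in range(100)]

# One request embeds the whole batch instead of 100 separate round trips
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [item.embedding for item in response.data]
print(len(vectors))  # 100 vectors, one per input string
```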

Choose embedding model and vector database dimensions together. If you switch embedding models, you must re-embed all your documents because the vector space changes completely.

The next concept builds directly on this: semantic search uses embeddings to find relevant documents by meaning rather than by keyword matching.