
From Matmul to Meaning

Over the past few months, I've built two LLMs from scratch, namely GPT-rs and llama2.rs. At this point, I feel reasonably comfortable with the transformer architecture and how it works.

One of the core operations within the layers of a neural net is matrix multiplication, or matmul as it's often called. This matmul operation happens trillions of times within large language models (LLMs) like ChatGPT and is one of the main reasons why ChatGPT is able to provide coherent (usually!) responses.

But I often find myself asking, how and why does multiplying matrices trillions of times lead to ChatGPT giving us a coherent response to our prompts?

It's one thing to implement matmul; it's another to deeply understand why it works.

Let's start at the beginning and build our way to the intuition.

Vector Multiplication

The most fundamental component of matrix multiplication is the vector. A vector is a quantity that has both a direction and a magnitude. We use bracket notation to represent a vector in the form [x, y], where x and y are coordinates in space.

Throughout this post, I write vectors as [x, y] for readability, but they formally represent column vectors \begin{bmatrix} x \\ y \end{bmatrix}.

For example, we can draw the vector v_1 at [0,1] on a coordinate plane:

[Figure: the vector v_1 = [0,1] drawn on a 2D coordinate plane]

The vector starts at the origin point of [0,0] and goes to [0,1]. Since it only has two dimensions (x and y), we can easily draw it on a 2-dimensional coordinate plane. However, in machine learning, we often use vectors with hundreds or thousands of dimensions. As humans, we can't visualize this many dimensions but computers can easily work with them.

Why do we use so many dimensions? Because language is complex! Every dimension can represent a "feature" of a word. So a word might need dimensions for:

  • Sentiment (positive/negative)
  • Formality (casual/formal)
  • Tense (past/present/future)
  • Category (animal/object/action)
  • And hundreds more nuanced aspects of meaning

This is one of the key parameters that engineers tune when building models. Too many dimensions and the model becomes slow and prone to overfitting. Too few dimensions and it can't capture language's complexity. Modern LLMs typically use anywhere from a few hundred to several thousand dimensions per token.

The direction of the vector is the direction the arrowhead points, usually written as the angle the vector makes with the positive x-axis (or as a compass direction). Since our vector points straight up, it has a direction of 90 degrees relative to the x-axis.

The magnitude of the vector is the numerical size or length of the vector. Our vector has a magnitude of 1 since it starts at [0,0] and ends at [0,1].

I can add another vector, v_2, here at [1,0] shown in orange:

[Figure: the basis vectors v_1 = [0,1] and v_2 = [1,0] on the coordinate plane]

These two vectors, [0,1] and [1,0], are also called basis vectors. So if we want to combine these vectors, we take the weighted sum of the vectors, which we can represent using the equation:

a v_1 + b v_2

Where a and b are weights, or [a, b] is a weight vector (we'll see an example of this below). If we randomly set these weights to a = 0.5, b = 0.8 and plug them into our equation, we get:

a v_1 + b v_2 = 0.5 \cdot [0,1] + 0.8 \cdot [1,0]

We can now multiply each element in the v_1 vector by 0.5:

a \cdot v_1 = 0.5 \cdot [0, 1] = [0.5 \cdot 0, 0.5 \cdot 1] = [0, 0.5]

So we took our original [0,1] basis vector, multiplied it by the first weight in our weight vector, and got [0, 0.5]. We just shrunk our vector or, more precisely, reduced its magnitude by 50%! This is the heart of linear algebra: it's about stretching, scaling and rotating vectors in some dimensional space.

We can also do the same thing for the v_2 vector, [1,0], and get:

b \cdot v_2 = 0.8 \cdot [1, 0] = [0.8 \cdot 1, 0.8 \cdot 0] = [0.8, 0]

This time, we only shrunk our vector in the x-direction by 20%.

Lastly, we can add these two vectors by adding the x position of each vector and then the y position of each vector like this:

\begin{align} a v_1 + b v_2 &= [0, 0.5] + [0.8, 0]\\ &= [0 + 0.8, 0.5 + 0]\\ &= [0.8, 0.5] \end{align}

This creates a new vector that we can plot alongside the original two basis vectors:

[Figure: the weighted sum [0.8, 0.5] plotted alongside the basis vectors v_1 and v_2]

So we've now taken the weighted sum of the two input vectors, v_1 and v_2, and created a new vector which is the linear combination of the two input vectors.
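If you want to check this arithmetic in code, here's a quick sketch. I'm using Python with NumPy purely as a convenient calculator; the post doesn't depend on it.

```python
import numpy as np

v1 = np.array([0.0, 1.0])   # y-axis basis vector
v2 = np.array([1.0, 0.0])   # x-axis basis vector
a, b = 0.5, 0.8             # our weights

# The weighted sum (linear combination) of the two basis vectors
result = a * v1 + b * v2
print(result)  # [0.8 0.5]
```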

But what does this actually mean? How do we move from vectors to matrices?

From Vectors to Matrices

A matrix describes what happens to the basis vectors under a transformation: its columns tell us where each basis vector ends up. For example, if we take our basis vectors from above, [0,1] and [1,0], we can transform them with:

M = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}

This tells us that:

  • The x-axis basis vector [1,0] is mapped to [2,1].
  • The y-axis basis vector [0,1] is mapped to [1,3].

Visually, the transformation looks like:

[Figure: the basis vectors before and after the transformation M]

An important point here: we're not moving individual vectors one at a time. The entire coordinate grid gets stretched and rotated so that the new "x-axis" now points toward [2,1] and the new "y-axis" points toward [1,3]. Because the space itself is stretching and rotating, every vector drawn in that space stretches and rotates along with it.

What if we wanted to multiply this matrix by an input vector? Starting with our matrix and input vector:

M = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}

We take each row of the matrix, multiply it element by element with the input vector, and add up the results. Here's how it goes:

\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 2\cdot2 + 1\cdot3 \\ 1\cdot2 + 3\cdot3 \end{bmatrix} = \begin{bmatrix} 7 \\ 11 \end{bmatrix}

Or, visually with color codes showing the steps:

[Figure: the matrix–vector multiplication with each row–column pairing color-coded]

For each row of the matrix, we take the dot product of that row with the input vector. So a matrix–vector multiplication is just two dot products, one per row of the matrix, producing a new vector.

Let's look at this visually:

[Figure: the input vector, the scaled matrix columns, and their sum plotted together]

There are a few vectors here so let's go through them:

  1. The solid blue vector is the input vector [2,3]. We can graph this in our coordinate plane starting from the origin [0,0] and extending to [2,3].
  2. The solid green vector is the 1st column of the matrix M, which is \begin{bmatrix} 2 \\ 1 \end{bmatrix}.
  3. The dashed green vector is the 1st element of the input vector multiplied by the 1st column of the matrix: 2 \cdot [2,1] = [4,2].
  4. The solid orange vector is the 2nd column of the matrix M, which is \begin{bmatrix} 1 \\ 3 \end{bmatrix}.
  5. The dashed orange vector is the 2nd element of the input vector multiplied by the 2nd column of the matrix: 3 \cdot [1,3] = [3,9].
  6. The solid red vector is the resulting vector when we add the dashed green vector and the dashed orange vector: [4,2] + [3,9] = [7,11].

But wait! Why do we just have one result vector? I thought the matrix was a combination of vectors, so shouldn't we have two vectors? This is an excellent question and gets at the heart of matrix multiplication.

When you multiply a matrix by a vector, you get a vector. Why?

You get a vector because the matrix acts on the vector. That means each column of the matrix gets scaled by the corresponding element of the vector and then all of those scaled columns are added up to produce the weighted sum.

Think of it this way:

  • The matrix M has two "basis directions" (its columns)
  • The matrix says to the input vector [2, 3] "give me 2 units of the first direction and 3 units of the second direction"
  • You add those scaled directions together to get the result vector

By taking the dot product of each matrix row with the input vector, we're stretching and rotating the input vector, mapping it into a new space.
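Here's a small sketch (again Python/NumPy, just for illustration) showing both views of the same matrix–vector product: the row-by-row dot products and the weighted sum of columns.

```python
import numpy as np

M = np.array([[2, 1],
              [1, 3]])
v = np.array([2, 3])

# View 1: one dot product per row of the matrix
row_view = np.array([np.dot(M[0], v), np.dot(M[1], v)])

# View 2: scale each column of M by the matching element of v, then add
col_view = v[0] * M[:, 0] + v[1] * M[:, 1]

print(row_view)  # [ 7 11]
print(col_view)  # [ 7 11]
print(M @ v)     # [ 7 11]  (NumPy's built-in matmul agrees)
```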

The cool thing is that this scales and works with matrices that have as many dimensions as you want!

Let's look at an example in the next section.

Multiplying Matrices

We learned how to multiply a matrix by a vector above. Now let's see what happens when we multiply two matrices together.

When you perform matrix multiplication, you're essentially doing a lot of dot products between rows and columns.

Let's start with a concrete example:

A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}

To compute A \times B, we take each row of A and dot it with each column of B:

Top-left entry: Row 1 of A · Column 1 of B

[1, 2] \cdot [5, 7] = 1 \cdot 5 + 2 \cdot 7 = 5 + 14 = 19

Top-right entry: Row 1 of A · Column 2 of B

[1, 2] \cdot [6, 8] = 1 \cdot 6 + 2 \cdot 8 = 6 + 16 = 22

Bottom-left entry: Row 2 of A · Column 1 of B

[3, 4] \cdot [5, 7] = 3 \cdot 5 + 4 \cdot 7 = 15 + 28 = 43

Bottom-right entry: Row 2 of A · Column 2 of B

[3, 4] \cdot [6, 8] = 3 \cdot 6 + 4 \cdot 8 = 18 + 32 = 50

Putting it all together:

A \times B = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
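As a quick sanity check, here's the same multiplication in a couple of lines of NumPy (purely illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A @ B)
# [[19 22]
#  [43 50]]
```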

The General Pattern

For any two matrices, the pattern is the same:

A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}

Then:

A \times B = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}

Each entry in the output matrix is a dot product between:

  • a row of the first matrix, and
  • a column of the second matrix.

So, if A is, say, a 2×3 matrix and B is a 3×2 matrix, the (i, j)-th entry of A \times B is:

(A \cdot B)_{ij} = \text{dot}(A_{i,*}, B_{*,j})
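To make the general pattern concrete, here's a minimal, unoptimized matmul in plain Python that fills each output entry with the dot product of a row of A and a column of B. Real libraries use heavily optimized kernels, so treat this as a sketch of the definition, not how it's computed in practice.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    assert len(A[0]) == inner, "inner dimensions must match"
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):              # row i of A
        for j in range(cols):          # column j of B
            # (i, j) entry = dot(row i of A, column j of B)
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(inner))
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```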

The dot product measures how "aligned" two vectors are:

  • High dot product → vectors point in similar directions (high similarity)
  • Dot product near zero → vectors are (nearly) perpendicular (no similarity)
  • Negative dot product → vectors point in opposite directions

This is why we can use the dot product to measure similarity between word vectors in language models. When we multiply matrices in a neural network, we're essentially computing how aligned different feature directions are, which helps the model understand relationships between concepts.
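Here's a toy illustration of dot products as similarity. The 2D vectors and their "feature" labels are completely made up for this example:

```python
import numpy as np

# Hypothetical 2D word vectors: (pet-ness, animal-ness)
dog      = np.array([0.9, 0.8])
cat      = np.array([0.8, 0.9])
keyboard = np.array([-0.1, 0.05])

print(np.dot(dog, cat))       # 1.44  -> large: similar directions
print(np.dot(dog, keyboard))  # -0.05 -> near zero: unrelated
```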

Now we tie it all together.

From Matrices to Meaning

What if these vectors don't represent directions in space, but instead represent meaning in some abstract space?

In a large language model like ChatGPT, every word is represented as a vector, typically with hundreds or thousands of dimensions instead of just the 2 we saw above. When the model sees the word "king", it's not just storing the letters k-i-n-g; it's storing a vector like [0.23, -0.45, 0.87, ...] that captures the meaning of "king" based on how it's used in language. The model learns this meaning as it's trained on books, articles and huge swaths of the internet, and at the end of that training, "king" ends up represented by a vector like the one above.

But how does it capture the meaning of words like "king"? I thought you'd never ask.

Learning Meaning

Imagine we have a simple 2D "word space" (picture the 2D coordinate space from the earlier sections) where:

  • The first dimension represents royalty (0 = common, 1 = royal)
  • The second dimension represents femininity (0 = masculine, 1 = feminine)

So our mini vocabulary looks like this:

Word  | Vector (royalty, femininity)
king  | [0.9, 0.1]
queen | [0.9, 0.9]
man   | [0.1, 0.1]
woman | [0.1, 0.9]

Now, if we do the simple vector arithmetic:

\text{king} - \text{man} + \text{woman} = [0.9, 0.1] - [0.1, 0.1] + [0.1, 0.9] = [0.9, 0.9]

That result [0.9, 0.9] corresponds to queen.

So what just happened?

  1. We started with "king" (royal + masculine)
  2. Subtracted "man" (removing masculinity)
  3. Added "woman" (adding femininity)
  4. Ended up with a royal + feminine concept = queen

This simple arithmetic works because each dimension encodes a semantic direction: “royalty” and “femininity.”
You can think of this as operating in a tiny semantic universe where vector directions represent concepts.
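You can reproduce this tiny semantic universe in a few lines of code. This is just a sketch using the toy numbers from the table above:

```python
import numpy as np

vocab = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
}

target = vocab["king"] - vocab["man"] + vocab["woman"]   # [0.9, 0.9]

# Find the vocabulary word whose vector is closest to the result
closest = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - target))
print(target, closest)  # [0.9 0.9] queen
```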

Scaling up

Inside ChatGPT (and all transformers), the same principle holds but across thousands of dimensions and billions of parameters.

Every word (or token) starts as a vector like:

\text{king} = [0.23, -0.45, 0.87, ...]

The model then applies a matrix multiplication:

h = x \cdot W

where

  • x = input vector (e.g., "king")
  • W = learned weight matrix
  • h = transformed vector (new representation)

Let's see a concrete example with small numbers. Suppose we have a 3-dimensional word vector and a 3×3 weight matrix:

x = [0.9, 0.1, 0.2], \quad W = \begin{bmatrix} 0.5 & 0.1 & 0.3 \\ 0.2 & 0.8 & 0.1 \\ 0.3 & 0.4 & 0.6 \end{bmatrix}

The transformed vector h is:

h = x \cdot W = [0.9, 0.1, 0.2] \cdot \begin{bmatrix} 0.5 & 0.1 & 0.3 \\ 0.2 & 0.8 & 0.1 \\ 0.3 & 0.4 & 0.6 \end{bmatrix}

Computing each element:

h_1 = 0.9 \cdot 0.5 + 0.1 \cdot 0.2 + 0.2 \cdot 0.3 = 0.45 + 0.02 + 0.06 = 0.53
h_2 = 0.9 \cdot 0.1 + 0.1 \cdot 0.8 + 0.2 \cdot 0.4 = 0.09 + 0.08 + 0.08 = 0.25
h_3 = 0.9 \cdot 0.3 + 0.1 \cdot 0.1 + 0.2 \cdot 0.6 = 0.27 + 0.01 + 0.12 = 0.40

So our transformed vector is:

h = [0.53, 0.25, 0.40]

The original vector [0.9, 0.1, 0.2] (high first dimension, low second) has been transformed into [0.53, 0.25, 0.40] (more balanced across dimensions). The weight matrix learned to redistribute the information across different semantic directions.
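Here's the same calculation in NumPy, using the made-up numbers above:

```python
import numpy as np

x = np.array([0.9, 0.1, 0.2])        # toy "word" vector
W = np.array([[0.5, 0.1, 0.3],
              [0.2, 0.8, 0.1],
              [0.3, 0.4, 0.6]])      # toy weight matrix

h = x @ W                            # row vector times matrix
print(h)  # [0.53 0.25 0.4 ]
```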

Each column of W represents a direction that the model has learned to recognize — things like "royalty," "gender," "plurality," "formality," or even more abstract ones like "causality" or "emotion."

Over billions of training examples, the model adjusts W so that words used in similar contexts get pulled closer together (high dot product → high alignment), and unrelated words get pushed apart (low or negative dot product → orthogonal or opposing directions).

That's why "dog" and "cat" sit near each other in vector space, while "dog" and "keyboard" are far apart.
It's the same geometry as the "king − man + woman = queen" example, just happening in thousands of dimensions and at massive scale.

But Why Matrix Multiplication?

By this point, hopefully you have a good understanding of how matrix multiplication is used in LLMs to understand words. Now, I think we can finally answer the question that I posed at the beginning of this blog: why does matrix multiplication work? Why not just add vectors together? Why not multiply them element-wise? Why not use some completely different operation?

There are three reasons why.

It Preserves Relationships

Matrix multiplication preserves relationships between vectors while transforming them. Remember our "king - man + woman = queen" example? That worked because the relationship between king and queen is the same as the relationship between man and woman.

Matrix multiplication can apply the same transformation to different vectors. If we have a matrix M that represents "changing gender," it will transform:

  • "king" → "queen"
  • "man" → "woman"
  • "prince" → "princess"

All with the same operation! That's the beauty of vector spaces. Let's see why element-wise operations can't pull this off in general.

Suppose we want to transform masculine words to feminine:

king  = [0.9, 0.1]  (royal, masculine)
queen = [0.9, 0.9]  (royal, feminine)

man   = [0.1, 0.1]  (common, masculine)
woman = [0.1, 0.9]  (common, feminine)

If we tried element-wise multiplication with some vector [a, b]:

  • To get from king to queen: [0.9, 0.1] * [a, b] = [0.9, 0.9]
  • This means a = 1 and b = 9

But then for man → woman:

  • [0.1, 0.1] * [1, 9] = [0.1, 0.9] ✓ This works!

Great! Except... does this generalize? What if we also want to shift "prince" → "princess"?

Let's say that prince = [0.7, 0.1] (somewhat royal, masculine).

Using our [1, 9]:

  • [0.7, 0.1] * [1, 9] = [0.7, 0.9]

That looks right too, but only by coincidence: every masculine word in our toy vocabulary starts with a femininity of exactly 0.1, so multiplying by 9 happens to land on 0.9. If a word's femininity started at 0.2 instead, multiplying by 9 would shoot it up to 1.8. Element-wise multiplication is ratio-based, not additive: the multiplier is tied to one particular starting value, so there is no single element-wise vector that applies the same "make it feminine" shift to every word.

Matrix multiplication doesn't have this problem, because a single matrix encodes one rule that is applied identically to every input. For example, here's a matrix that keeps the first dimension as-is and adds 0.8 times the first dimension to the second:

M = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0.8 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0.8 & 1 \end{bmatrix}

Now watch:

  • King: \begin{bmatrix} 1 & 0 \\ 0.8 & 1 \end{bmatrix} \begin{bmatrix} 0.9 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 0.82 \end{bmatrix}
  • Man: \begin{bmatrix} 1 & 0 \\ 0.8 & 1 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.18 \end{bmatrix}
  • Prince: \begin{bmatrix} 1 & 0 \\ 0.8 & 1 \end{bmatrix} \begin{bmatrix} 0.7 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.7 \\ 0.66 \end{bmatrix}

The same rule applies to every input: royalty is left untouched and femininity gets bumped in a consistent way. This is the power of linear transformations.
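Here's the same check in code, applying the single matrix M to all three vectors (a sketch with the toy vectors from above; the vectors are treated as columns, so we compute M @ v):

```python
import numpy as np

M = np.array([[1.0, 0.0],
              [0.8, 1.0]])

king   = np.array([0.9, 0.1])
man    = np.array([0.1, 0.1])
prince = np.array([0.7, 0.1])

# One matrix, one rule, applied identically to every input
for name, v in [("king", king), ("man", man), ("prince", prince)]:
    print(name, M @ v)
# king   [0.9  0.82]
# man    [0.1  0.18]
# prince [0.7  0.66]
```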

It's Differentiable

I know we haven't covered neural networks but we need to jump forward a tiny bit. Neural networks learn through backpropagation - adjusting weights based on errors. The core backpropagation algorithm uses something called the Chain Rule (from calculus) to figure out how changing each weight in the neural network affects the final loss.

This is only possible because matrix multiplication is differentiable, meaning we can calculate exactly how changing each weight affects the output. Each entry of the weight matrix gets updated based on how sensitive the final loss is to that particular weight.

Element-wise operations are differentiable too, but they're not as expressive, as we saw in the first point above. More exotic operations tend to be non-differentiable at certain points, numerically unstable, or expensive to compute.
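Here's a tiny illustration of what differentiability buys us. The squared-error loss and the target vector are invented for this sketch; the point is just that for h = x · W there's a simple closed-form gradient for every weight, and we can confirm it numerically:

```python
import numpy as np

x = np.array([0.9, 0.1, 0.2])
W = np.array([[0.5, 0.1, 0.3],
              [0.2, 0.8, 0.1],
              [0.3, 0.4, 0.6]])
target = np.array([0.6, 0.2, 0.5])   # made-up desired output

def loss(W):
    h = x @ W
    return np.sum((h - target) ** 2)

# Analytic gradient: dL/dW[i, j] = 2 * (h[j] - target[j]) * x[i]
grad = np.outer(x, 2 * (x @ W - target))

# Numerical check on one weight via a finite difference
eps = 1e-6
W_bumped = W.copy()
W_bumped[0, 0] += eps
numeric = (loss(W_bumped) - loss(W)) / eps

print(grad[0, 0], numeric)  # the two values should nearly match
```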

It Enables Meaningful Linear Combinations

This is arguably the most important property. Matrix multiplication lets us say:

"The output should be a weighted combination of inputs, where the weights are learned from data."

For language, this is exactly what we need:

  • The meaning of "bank" is a weighted combination of "river" and "account" based on context (👀 Attention)
  • The next word in a sentence is a weighted combination of all previous words
  • The answer to a question is a weighted combination of facts in the model's training data

Addition can't do this (it treats everything equally). Element-wise multiplication can't do this (it can't mix dimensions).

That leaves matrix multiplication, which gives us learnable, flexible weighted combinations.
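To make "learnable weighted combination" concrete, here's a toy sketch. Everything in it (the two "sense" vectors for "bank", the context vector, the scoring) is invented for illustration and is far simpler than real attention, but the shape of the idea is the same: scores come from dot products, scores become weights, and the output is a weighted combination.

```python
import numpy as np

# Two toy "sense" vectors for the word "bank"
river_sense   = np.array([1.0, 0.0])
account_sense = np.array([0.0, 1.0])
senses = np.stack([river_sense, account_sense])

# A toy context vector, e.g. from a sentence about money
context = np.array([0.2, 0.9])

# Dot-product scores -> softmax -> mixing weights
scores  = senses @ context
weights = np.exp(scores) / np.sum(np.exp(scores))

# The output is a weighted combination of the sense vectors
mixed = weights @ senses
print(weights, mixed)  # weights ~[0.33, 0.67] -> mostly the "account" sense
```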

Wrapping Up

Matrix multiplication works because understanding language becomes a problem of placing and moving vectors in a space where relationships are linear, comparisons are dot products, and decisions are weighted sums. Matmul is the only operation that's minimal, trainable, and massively parallel enough to do all three at the scale required for intelligence.

In a way, transformers are like giant Rubik's cubes: rotating pieces into place, comparing to see if patterns align, shuffling when they don't, and repeating until something coherent emerges. Each twist of the cube is a matmul. Each alignment check is a dot product. Each shuffle is a weighted combination.

I wanted to keep this blog focused on matrix multiplication because it's the fundamental building block, but there are other critical pieces that make ChatGPT work. Without activation functions like ReLU or GELU between matmuls, the entire network would collapse into a single linear transformation - no matter how many layers you stack. The nonlinearities are what give neural networks their power to learn complex patterns.

Maybe I'll tackle activation functions next...

Until then!

Evis