Neural Networks
You cannot reason about inference cost without a mechanical picture of what a neural network does at runtime. This section builds that picture from the smallest unit up. If you already know what a matmul is and why ReLU exists, skim to the cost lens at the end — that framing is what the rest of the book leans on.
The node: a tiny program
The fundamental unit of a neural network is a node. A node is a very small program: it takes some input numbers, multiplies each by a weight, adds them up, adds a bias, and returns the result.
"Node" is the friendly name: here are the official ones
A single node is the artificial neuron, the basic unit of every neural network. Its original form — a single neuron that sums weighted inputs and applies a threshold — is the perceptron (Frank Rosenblatt, 1958), the term you'll meet in textbooks and papers. Modern neurons generalize the perceptron (smooth activations instead of a hard threshold), but it's the same idea. We say "node" because it's the simplest way to picture it; reach for neuron / perceptron when you read the literature.
- Weight — a learned number that says how much this input matters. Set during training, frozen during inference.
- Bias — a learned number added at the end, letting the node shift its output up or down independent of the input.
A worked example. Give the node three concrete inputs and its learned weights and bias:
x = [ 1.0, 2.0, 3.0 ] ← the inputs
w = [ 0.5, -1.0, 0.25] ← the learned weights (one per input)
b = 2.0 ← the learned bias
output = (1.0 * 0.5) + (2.0 * −1.0) + (3.0 * 0.25) + 2.0
= 0.5 + −2.0 + 0.75 + 2.0
= 1.25
That single number 1.25 is the node's output — what it hands to the next layer. A node is
almost useless alone; the power comes from stacking thousands of them and learning all the
weights.
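The worked example above can be sketched in a few lines of plain Python. The function name `node` is chosen here for illustration; the inputs, weights, and bias are the ones from the example.

```python
def node(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias: the whole job of one node."""
    if len(inputs) != len(weights):
        raise ValueError("a node needs exactly one weight per input")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


x = [1.0, 2.0, 3.0]     # the inputs
w = [0.5, -1.0, 0.25]   # the learned weights (one per input)
b = 2.0                 # the learned bias

output = node(x, w, b)
print(output)  # 1.25
```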
Layers and the network
A layer is a group of nodes that all read the same inputs but have their own weights, and all compute in parallel. Nodes within a layer don't talk to each other; the "network" — the wiring — happens between layers, where one layer's outputs become the next layer's inputs.
Networks behind LLMs have dozens to hundreds of layers, in three roles:
- Input layer — accepts and processes the raw input.
- Hidden layers — every layer in between, each transforming the representation a little more.
- Output layer — produces the final prediction.
Each hidden layer emits a vector called a hidden state — the network's internal, intermediate representation of the data at that depth.
- Hidden state — the vector flowing between layers. Not human-readable; it's the model's working representation. Its length is the dimensionality (often d_model, e.g. 4096).
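As a rough sketch of how layers wire together, the snippet below runs two tiny layers back to back. All weights and biases are hypothetical values picked so the arithmetic is easy to follow, and the helper name `layer` is an assumption of this example. Activation functions, covered later in this section, are left out to keep the focus on the wiring.

```python
def layer(inputs, node_weights, node_biases):
    """Run every node in the layer on the same inputs; return one output per node."""
    return [
        sum(x * w for x, w in zip(inputs, weights)) + bias
        for weights, bias in zip(node_weights, node_biases)
    ]


x = [1.0, 2.0, 3.0]

# Hypothetical layer 1: four nodes, each with three weights (one per input).
weights_1 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [1.0, 1.0, 1.0]]
biases_1 = [0.0, 0.0, 0.0, 1.0]

# Hypothetical layer 2: two nodes, each with four weights (one per layer-1 output).
weights_2 = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, -1.0]]
biases_2 = [0.0, 0.5]

hidden = layer(x, weights_1, biases_1)       # hidden state, dimensionality 4
output = layer(hidden, weights_2, biases_2)  # final output, dimensionality 2

print(hidden)  # [0.5, 1.0, 1.5, 7.0]
print(output)  # [10.0, -6.5]
```

The nodes inside `layer` never read each other's results; the only connection is that `hidden`, the output of layer 1, becomes the input of layer 2.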
Why text representations get bigger and image ones get smaller
Text inference increases dimensionality — a token becomes a vector of thousands of numbers to capture meaning. Image models do the opposite: they reduce a million-pixel image down to a compact latent of a few thousand numbers. Same goal — a representation that's the right size to compute on — approached from opposite directions. We return to this in Image & Video Generation.
Encoders and decoders
Two jobs show up everywhere:
- Encoder — turns an input (text, image, audio) into an internal representation enriched with meaning.
- Decoder — turns an internal representation into an output (text, an image).
Modern LLMs are decoder-only. Encoder-only models (the BERT family of text-embedding models) are rarer today. Encoder-decoder models persist in other modalities — Whisper encodes audio, then decodes text tokens.
Composability
Neural networks are LEGO. You can fuse several into one model, or chain them into a pipeline. An image generator is literally three networks (text encoder → denoiser → VAE) bolted together. Keep this in mind — "a model" is often several models wearing a trench coat.
The most important operation: matmul
The single operation that dominates inference is the matrix multiplication, or matmul. A matmul takes a vector (a list of numbers) and a matrix (a grid of numbers) and produces a new vector.
The simplest neural-network layer, a linear layer (a.k.a. dense or fully-connected layer), is exactly one matmul plus a bias:
\[ y = xW + b \]
where \(x\) is the input vector, \(W\) is the weight matrix, \(b\) is the bias vector, and \(y\) is the output.
Here's the mechanical version. Say \(x\) has 3 numbers and we want \(y\) to have 2. Then \(W\) is a \(3 \times 2\) grid and each output is a weighted sum of all inputs:
x = [x1, x2, x3]
    | w11 w12 |
W = | w21 w22 |     y1 = x1*w11 + x2*w21 + x3*w31 + b1
    | w31 w32 |     y2 = x1*w12 + x2*w22 + x3*w32 + b2
y = [y1, y2]
With real numbers, take the same x = [1, 2, 3] from the node example:
              | 0.5   0.0 |
x = [1, 2, 3] | 1.0  -1.0 |   b = [0.5, 0.5]
              | 0.0   2.0 |
                ↑col1 ↑col2
y1 = 1*0.5 + 2*1.0 + 3*0.0 + 0.5 = 3.0
y2 = 1*0.0 + 2*(−1.0) + 3*2.0 + 0.5 = 4.5
y = [3.0, 4.5]
Three numbers went in, two came out — the shape of \(W\) did the resizing; the values did the mixing.
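The same calculation can be sketched as a small function. This is a plain-Python illustration with no libraries, so every multiply and add stays visible; the name `linear` and the choice to store \(W\) as a list of rows are assumptions of this sketch.

```python
def linear(x, W, b):
    """Compute y = xW + b for a row vector x and a matrix W stored as a list of rows."""
    n_in, n_out = len(W), len(W[0])
    if len(x) != n_in or len(b) != n_out:
        raise ValueError("need len(x) == rows of W and len(b) == columns of W")
    return [
        sum(x[i] * W[i][j] for i in range(n_in)) + b[j]
        for j in range(n_out)
    ]


x = [1.0, 2.0, 3.0]
W = [
    [0.5, 0.0],
    [1.0, -1.0],
    [0.0, 2.0],
]
b = [0.5, 0.5]

y = linear(x, W, b)
print(y)  # [3.0, 4.5]
```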
A matmul is a layer of nodes — that's the whole connection
Look at the columns of \(W\). Column 1 [0.5, 1.0, 0.0] is one node's weights; column 2
[0.0, −1.0, 2.0] is another node's weights. Computing y1 is exactly running node 1;
computing y2 is running node 2. A linear layer with 2 outputs is literally 2 nodes stacked
side by side, and the matmul runs them all at once. So everything from the node section scales
up by stacking columns — that's all a layer is.
- The shape of \(W\) (here \(3 \times 2\)) sets how many numbers go in (rows) and come out (columns).
- The values inside \(W\) are weights, learned in training, frozen at inference.
- The weights of one linear layer are a small slice of a model's total weights — a real LLM has hundreds of these.
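To make the column-equals-node claim concrete, the sketch below pulls each column out of \(W\) and runs it as a standalone node. The helper name `node` is hypothetical, and the numbers are the ones from the example above.

```python
def node(inputs, weights, bias):
    """One node: weighted sum of the inputs plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


x = [1.0, 2.0, 3.0]
W = [
    [0.5, 0.0],
    [1.0, -1.0],
    [0.0, 2.0],
]
b = [0.5, 0.5]

# The shape of W sets the sizes: rows = numbers in, columns = numbers out.
n_in, n_out = len(W), len(W[0])

# zip(*W) transposes the rows into columns; each column is one node's weights.
columns = [list(col) for col in zip(*W)]

# Running each column as its own node reproduces the layer's output.
y = [node(x, col, bias) for col, bias in zip(columns, b)]

print((n_in, n_out))  # (3, 2)
print(columns)        # [[0.5, 1.0, 0.0], [0.0, -1.0, 2.0]]
print(y)              # [3.0, 4.5]
```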
This is why GPUs
A matmul is thousands of multiply-then-add operations, and every output element can be computed independently of the others. That is the one thing GPUs do extravagantly well: thousands of arithmetic units running the same operation in lockstep. The entire field of inference hardware exists to feed matmuls. Hold that thought for Bottlenecks.
Why depth needs non-linearity
Here's a subtle trap that explains a core design choice. Matmuls are composable: multiplying a vector by \(W_1\) then by \(W_2\) is the same as multiplying it once by the single matrix \(W_3 = W_1 W_2\).
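# two linear layers, back to back
y = x @ W1 + b1
z = y @ W2 + b2
# substitute y: z = (x @ W1 + b1) @ W2 + b2 = x @ (W1 @ W2) + (b1 @ W2 + b2)
# matrix multiplication is associative, so...
W3 = W1 @ W2        # precompute one matrix
b3 = b1 @ W2 + b2   # ...and one combined bias
z = x @ W3 + b3     # ...the two layers collapse into one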
This is a disaster for deep networks. If every layer is just a matmul, a 100-layer network collapses into a single equivalent layer. All that depth — gone. The network can only ever represent linear functions (straight lines and flat planes), which can't model anything interesting.
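The collapse is easy to check numerically. The sketch below uses plain Python and small hypothetical matrices (the first layer reuses the numbers from the matmul example; the second layer's `W2` and `b2` are made up). The helper names `vecmat`, `matmul`, and `add` are assumptions of this example.

```python
def vecmat(x, W):
    """Row vector times matrix: returns x @ W as a list."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


def matmul(A, B):
    """Matrix times matrix, computed one row of A at a time."""
    return [vecmat(row, B) for row in A]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


x = [1.0, 2.0, 3.0]
W1 = [[0.5, 0.0], [1.0, -1.0], [0.0, 2.0]]
b1 = [0.5, 0.5]
W2 = [[2.0, 1.0], [-1.0, 0.5]]  # hypothetical second layer
b2 = [1.0, -1.0]

# Two linear layers applied one after the other.
y = add(vecmat(x, W1), b1)
z_two_layers = add(vecmat(y, W2), b2)

# The same function as a single precomputed layer.
W3 = matmul(W1, W2)
b3 = add(vecmat(b1, W2), b2)
z_one_layer = add(vecmat(x, W3), b3)

print(z_two_layers)  # [2.5, 4.25]
print(z_one_layer)   # [2.5, 4.25]
```

The values were picked to be exactly representable in floating point, so the two results match exactly; with arbitrary weights they would agree only up to rounding error.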
The fix is to put a non-linear function between layers so they can't be merged. That function is the activation function.
- Activation function — a non-linear function applied element-wise to a layer's output. It (1) breaks linearity so layers don't collapse, and (2) is differentiable (or nearly so) so the network can be trained by gradient descent.
The classic is ReLU (Rectified Linear Unit), which is comically simple: keep positives, zero out negatives.
\[ \mathrm{ReLU}(x) = \max(0, x) \]
Negatives become zero; positives pass through. That single kink is enough non-linearity to stop the collapse. Modern LLMs use smoother cousins such as SiLU (also known as Swish) and the gated variant SwiGLU, but the pattern is the same: squash negatives toward zero, mostly preserve positives, stay (mostly) differentiable, run fast.
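A minimal sketch of two of these activations, applied element-wise in plain Python. SiLU is written using its standard definition, x · sigmoid(x); the function names and the sample values are choices made for this example.

```python
import math


def relu(x):
    """Keep positives, zero out negatives."""
    return max(0.0, x)


def silu(x):
    """SiLU: x * sigmoid(x). Fine for moderate x; very negative x would overflow exp."""
    return x / (1.0 + math.exp(-x))


values = [-2.0, -0.5, 0.0, 0.5, 2.0]

relu_out = [relu(v) for v in values]
silu_out = [silu(v) for v in values]

print(relu_out)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print([round(v, 3) for v in silu_out])

# Non-linearity in one line: applying ReLU to a sum differs from summing ReLUs,
# which is exactly what stops adjacent layers from merging.
print(relu(-1.0 + 1.0), relu(-1.0) + relu(1.0))  # 0.0 1.0
```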
The takeaway, in one line
Linear layers do the work; activations make depth meaningful. Stack (matmul → activation)
many times and you get a function expressive enough to predict the next token. Remove the
activations and your 70-billion-parameter model is algebraically one matrix.
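Stacking (matmul → activation) can be sketched as a short loop. This is a toy illustration under stated assumptions: two layers, ReLU after every layer, and hypothetical weights; the names `linear`, `relu_vec`, and `forward` are invented for this example and do not describe any particular model.

```python
def linear(x, W, b):
    """y = xW + b for a row vector x and a matrix W stored as a list of rows."""
    return [
        sum(x[i] * W[i][j] for i in range(len(W))) + b[j]
        for j in range(len(W[0]))
    ]


def relu_vec(v):
    """ReLU applied element-wise to a vector."""
    return [max(0.0, a) for a in v]


def forward(x, layers):
    """Apply matmul + bias, then ReLU, once per (W, b) pair in order."""
    hidden = x
    for W, b in layers:
        hidden = relu_vec(linear(hidden, W, b))
    return hidden


# Hypothetical two-layer network: 3 inputs -> 2 hidden values -> 2 outputs.
layers = [
    ([[0.5, 0.0], [1.0, -1.0], [0.0, 2.0]], [0.5, -5.0]),
    ([[2.0, 1.0], [-1.0, 0.5]], [1.0, -1.0]),
]

result = forward([1.0, 2.0, 3.0], layers)
print(result)  # [7.0, 2.0]
```

In this run the first layer produces `[3.0, -1.0]` before the activation; ReLU zeroes the negative entry, so the second layer sees `[3.0, 0.0]`. That zeroing is what a single merged matrix cannot reproduce.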
The cost lens: read every layer as a matmul
This is the framing the rest of the book uses, so internalize it:
A neural network, at inference time, is a long chain of matmuls separated by cheap element-wise functions.
That single sentence has two consequences you'll use constantly:
- The weights are the bulk of the bytes. Each matmul has a weight matrix sitting in GPU memory. To run the matmul you must read those weights. For a big model that's reading tens of gigabytes — every forward pass. This is the seed of the memory-bandwidth bottleneck.
- The activation functions, norms, and biases are a rounding error. Compared to the matmuls, these small per-element operations cost almost nothing in compute and memory. When you optimize inference, you optimize matmuls and data movement; you essentially ignore the rest.
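Both consequences can be put into back-of-the-envelope numbers. The sketch below assumes 16-bit weights (2 bytes each), counts one multiply and one add per weight, and uses a hypothetical 4096 × 4096 layer and a hypothetical 7-billion-parameter model; the function names are invented for this example.

```python
def linear_layer_cost(d_in, d_out, bytes_per_weight=2):
    """Rough cost of one linear layer applied to a single input vector."""
    weights = d_in * d_out
    return {
        "weights": weights,
        "weight_bytes": weights * bytes_per_weight,  # must be read to run the matmul
        "bias_bytes": d_out * bytes_per_weight,
        "matmul_flops": 2 * weights,                 # one multiply + one add per weight
        "activation_ops": d_out,                     # one element-wise op per output
    }


def model_weight_bytes(n_params, bytes_per_weight=2):
    """Bytes of weights that one forward pass has to read."""
    return n_params * bytes_per_weight


cost = linear_layer_cost(4096, 4096)
print(cost["weights"])         # 16777216
print(cost["weight_bytes"])    # 33554432 (32 MiB for one layer)
print(cost["matmul_flops"] // cost["activation_ops"])  # 8192

total = model_weight_bytes(7_000_000_000)
print(total / 1e9)  # 14.0 (GB of weights per forward pass)
```

Under these assumptions the matmul does 8,192 arithmetic operations for every element-wise activation, and the bias is 4,096 times smaller than the weight matrix, which is why the non-matmul parts are treated as negligible.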
a transformer's forward pass, abstracted
┌──────────────────────────────────────────────────┐
│ matmul → act → matmul → act → matmul → act ...   │
│   ▲                                              │
│   └─ reading a weight matrix from memory each    │
│      time; this read is what you fight to reduce │
└──────────────────────────────────────────────────┘
With this lens in hand, the next section traces a real token through a real transformer — and you will be able to point at each step and say "matmul, big; activation, free."