Below, all components of a decoder-only transformer are described step by step and in detail, covering the entire process from the moment text enters the model to the output of the final layer.
Embedding Layer
Neural networks cannot operate directly on raw words. For processing, words are converted into tokens (numbers). For example, the phrase “I like reading comics” before entering the model might be transformed into [10, 6, 63, 71], where each word is assigned a unique number.
In a transformer, the first layer that receives these tokens is the embedding layer. This layer maps each token into an n-dimensional space. Suppose the dimensionality is 4. Then token "10" becomes a vector of 4 values.
Why is this done? The token vector is not random — the embedding layer has trainable weights. Therefore, each token’s vector becomes a learned numerical representation — a digital fingerprint of a specific word in the sentence.
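A minimal NumPy sketch of the lookup, using the toy sizes from the example (the 100-entry vocabulary and random weights are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 4   # toy sizes; vocab_size is an assumption

# The embedding layer is a trainable (vocab_size, d_model) matrix;
# looking a token up means selecting its row.
embedding_weights = rng.normal(size=(vocab_size, d_model))

tokens = np.array([10, 6, 63, 71])   # "I like reading comics"
vectors = embedding_weights[tokens]  # shape (4, 4): one vector per token
```

During training, gradients flow into the selected rows of `embedding_weights`, which is how each token's vector becomes a learned representation.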
Positional Encoding
One of the key differences between transformer architectures and recurrent neural networks (including LSTM) is that transformers rely exclusively on linear tensor operations. All mathematical operations are reduced to matrix multiplications applied to the entire sequence simultaneously, whereas recurrent layers process tokens sequentially.
On one hand, this parallelism accelerates computation and ensures more stable training with reduced risk of vanishing gradients in long sequences. On the other hand, transformers initially have no notion of word order. For example, the sentences:
- I like reading comics.
- Reading comics is what I like.
would initially be perceived identically by a transformer. This is unacceptable. Positional encoding exists to provide the model with information about word order.
Assume the sequence “I like reading comics”. After the embedding layer, each token is represented as a 4-dimensional vector:
[0.1, 1.3, 4.1, 5.1] — "i"
[0.2, 1.8, 3.9, 2.2] — "like"
[1.1, 3.3, 4.4, 5.5] — "reading"
[0.2, 4.3, 2.7, 6.6] — "comics"
Positional encoding is computed as:
position = [0, 1, 2, ..., max_len - 1]^T
div_term = exp(arange(0, d_model, 2) * (-log(10000.0) / d_model))
pe[:, 0::2] = sin(position * div_term)
pe[:, 1::2] = cos(position * div_term)
Let max_len = 4, so:
position = [0, 1, 2, 3]
For d_model = 4:
arange(0, d_model, 2) = [0, 2]
-log(10000) / 4 ≈ -2.30
div_term = [1.0, 0.010]
Multiplying:
position * div_term =
[[0.0000, 0.0000],
[1.0000, 0.0100],
[2.0000, 0.0200],
[3.0000, 0.0300]]
Applying sin to even indices and cos to odd:
Sin:
[[0.0000, 0.0000],
[0.8415, 0.0100],
[0.9093, 0.0200],
[0.1411, 0.0300]]
Cos:
[[ 1.0000, 1.0000],
[ 0.5403, 0.9999],
[-0.4161, 0.9998],
[-0.9900, 0.9996]]
Resulting positional encoding matrix:
[[ 0.0000, 1.0000, 0.0000, 1.0000],
[ 0.8415, 0.5403, 0.0100, 0.9999],
[ 0.9093, -0.4161, 0.0200, 0.9998],
[ 0.1411, -0.9900, 0.0300, 0.9996]]
These encodings are added to the embeddings elementwise.
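The worked computation above can be reproduced in a few lines of NumPy (a sketch of the standard sinusoidal scheme, not a full module):

```python
import numpy as np

max_len, d_model = 4, 4

position = np.arange(max_len)[:, None]  # column vector [0, 1, 2, 3]^T
div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(position * div_term)  # sin on even dimensions
pe[:, 1::2] = np.cos(position * div_term)  # cos on odd dimensions
```

`pe` matches the resulting positional encoding matrix shown above; the model simply computes `embeddings + pe`.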
Self-Attention Layer
The encoded tensor is then passed to the self-attention layer — the core component of transformers.
Two mask types are important:
1. Square Subsequent Mask (Causal Mask)
This mask prevents information leakage from future tokens during generation. It assigns -∞ to future positions so that after Softmax they become zero probability. This ensures autoregressive behavior: the model can only attend to previous tokens.
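A NumPy sketch of the causal mask and its effect after Softmax, using a toy 4-token sequence with all attention scores set to zero for clarity:

```python
import numpy as np

seq_len = 4
# Strictly upper triangle is -inf: position i may attend only to j <= i.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((seq_len, seq_len)) + mask    # attention scores + mask
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # Softmax, row by row
# Future positions now receive exactly zero attention weight.
```

Row 0 attends only to itself (`[1, 0, 0, 0]`), row 1 splits attention over the first two positions, and so on.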
2. Key Padding Mask
Sequences in a batch must have equal length. Shorter sequences are padded with zeros:
[1, 7, 10, 4, 5]
[2, 8, 0, 0, 0]
[4, 1, 3, 4, 0]
[2, 3, 4, 0, 0]
Padding tokens carry no semantic meaning and must not influence attention computations. The key padding mask ensures they are ignored.
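A sketch of building the key padding mask from the batch above (NumPy; the convention that `True` marks positions to ignore follows common practice, e.g. PyTorch's `key_padding_mask`):

```python
import numpy as np

PAD = 0  # padding token id
batch = np.array([
    [1, 7, 10, 4, 5],
    [2, 8,  0, 0, 0],
    [4, 1,  3, 4, 0],
    [2, 3,  4, 0, 0],
])

# True marks padded key positions; attention adds -inf to their scores,
# so they receive zero weight after Softmax.
key_padding_mask = (batch == PAD)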
Cross-Attention
Decoder-only transformers do not include cross-attention because they lack an encoder. The decoder attends only to itself.
Feed Forward Network (MLP)
Attention is almost entirely linear; aside from the Softmax over the scores, its output is just a weighted sum of the value vectors:
Softmax(QKᵀ / √d_k) V
This limits expressive power. The Feed Forward Network introduces an explicit non-linearity. Structure:
Linear → ReLU → Linear
The first linear layer expands dimensionality:
d_model = 4
d_ff = 16
Each token becomes a 16-dimensional vector. ReLU sets negative values to zero, breaking linear dependencies and enabling non-linear generalization. The second linear layer projects back from 16 to 4 dimensions. This entire sequence (Self-Attention + MLP) forms one decoder block. In GPT-style models, this block is repeated dozens of times (e.g., 96 times).
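A sketch of the feed-forward block with the sizes above (NumPy, random weights; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16   # sizes from the example

# Weights of the two linear layers
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def feed_forward(x):
    h = x @ W1               # expand: (..., 4) -> (..., 16)
    h = np.maximum(h, 0.0)   # ReLU zeroes out negative values
    return h @ W2            # project back: (..., 16) -> (..., 4)

tokens = rng.normal(size=(4, d_model))  # 4 token vectors
out = feed_forward(tokens)              # same shape as the input
```

The FFN is applied to each token position independently, which is why only `d_model` and `d_ff` appear in the weight shapes.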
Final Linear Layer
Assume the decoder output tensor has shape [128, 512]: 128 tokens, each with a 512-dimensional hidden state. The final linear layer projects this to:
[128, vocab_size]
For each token position, the model outputs logits over the entire vocabulary.
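As a NumPy sketch, with the sizes from the example and an assumed vocabulary of 5,000 tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 128, 512, 5000  # vocab_size is illustrative

hidden = rng.normal(size=(seq_len, d_model))    # decoder output
W_out = rng.normal(size=(d_model, vocab_size))  # final projection weights

logits = hidden @ W_out  # shape (128, 5000): one score per vocabulary entry
```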
Training Process
Models such as GPT-4 and Llama 3 are decoder-only transformers.
They are trained using self-supervised learning via next-token prediction.
Example:
Input:
"I love reading"
Target:
"love reading books"
The context window size equals the number of input tokens.
Batch Formation
Example batch:
[
["I love reading", "love reading books"],
["She enjoys painting", "enjoys painting landscapes"],
["We explore nature", "explore nature daily"],
["They play music", "play music together"]
]
Batch size = 4.
### Tokenization
LLMs use subword tokenization such as BPE (Byte Pair Encoding), enabling handling of unseen words.
### Model Output
Output tensor shape:
[batch, sequence_length, vocab_size]
For input:
"I love reading"
Output shape (assuming a vocabulary of 5,000 tokens):
[1, 3, 5000]
Each position contains logits:
love    | [-1.2, 4.5, -2.3, 0.7, ...]
reading | [-3.0, -2.5, 5.1, 0.3, ...]
books   | [-2.8, -1.7, -0.4, 6.2, ...]
After Softmax:
love    | [0.002, 0.731, 0.001, ...]
reading | [0.0001, 0.0004, 0.998, ...]
books   | [0.0001, 0.0004, 0.85, ...]
### Cross Entropy Loss
Assume true token indices:
love → 2
reading → 3
books → 4
Cross-entropy uses only the probability of the correct token:
−log(0.001) ≈ 6.91
−log(0.998) ≈ 0.002
−log(0.85) ≈ 0.163
Final loss:
(6.91 + 0.002 + 0.163) / 3 ≈ 2.36
Lower loss indicates better model performance.
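The loss computation above, as a short sketch:

```python
import numpy as np

# Softmax probabilities assigned to the correct token at each position
# (taken from the worked example above)
p_correct = np.array([0.001, 0.998, 0.85])

# Cross-entropy: mean negative log-probability of the correct tokens
loss = -np.log(p_correct).mean()   # ≈ 2.36
```

In practice, frameworks compute this directly from the logits for numerical stability rather than from Softmax probabilities.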
Backpropagation computes gradients and updates weights accordingly. Successful training is characterized by decreasing loss across epochs.

