SPUNCH Blog

Neural Network Quantization
A practical deep dive into neural network quantization for production LLMs. This article explains why quantization is used primarily for inference, how float32 weights are converted to int8, the difference between dynamic and static quantization, which layers can be quantized, and the real trade-off between model size, speed, and output quality — illustrated with a transformer paraphrasing model reduced from 800MB to 133MB.
Anar Lavrenov
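The float32-to-int8 conversion mentioned in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration of symmetric per-tensor quantization (not the article's actual code): a toy weight matrix stands in for a real layer, and we map its float range onto the int8 range [-127, 127] via a single scale factor.

```python
import numpy as np

# Hypothetical float32 weight matrix standing in for a real layer's weights.
weights = np.array([[-1.2, 0.5], [0.03, 2.4]], dtype=np.float32)

# Symmetric per-tensor quantization: one scale maps the float range to int8.
scale = float(np.abs(weights).max()) / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure the reconstruction error quantization introduces.
deq = q.astype(np.float32) * scale

print(q.dtype, deq.dtype)                  # int8 float32
print(float(np.abs(weights - deq).max()))  # rounding error, at most ~scale/2
```

Storing `q` plus the single `scale` in place of the float32 tensor is what shrinks the model roughly 4x, at the cost of the small rounding error printed above.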

The Architecture Behind GPT-4 and Llama 3
A step-by-step, beginner-friendly walkthrough of how decoder-only Transformers (GPT-style models) work and how they're trained, covering embeddings, positional encoding, self-attention with masks, MLP blocks, the final projection to vocabulary logits, and how softmax + cross-entropy drive learning.
Anar Lavrenov
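The masked self-attention step named in the abstract can be illustrated with a toy example. This is a hedged sketch, not the article's code: random query/key matrices of assumed sizes stand in for a real model's projections, and a causal mask blocks each position from attending to future tokens before the softmax.

```python
import numpy as np

# Toy sequence: T tokens, head dimension d (arbitrary illustrative sizes).
T, d = 4, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((T, d)).astype(np.float32)  # queries
k = rng.standard_normal((T, d)).astype(np.float32)  # keys

# Scaled dot-product attention logits, shape (T, T).
scores = q @ k.T / np.sqrt(d)

# Causal mask: True above the diagonal, i.e. the future positions to hide.
mask = np.triu(np.ones((T, T), dtype=bool), 1)
scores = np.where(mask, -np.inf, scores)

# Row-wise softmax turns the masked logits into attention weights.
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

print(np.allclose(attn.sum(axis=-1), 1.0))  # True: each row is a distribution
print(attn[0, 1:])                          # zeros: token 0 sees only itself
```

Because `exp(-inf)` is exactly zero, the masked positions contribute nothing after the softmax, which is what lets GPT-style models train on next-token prediction without leaking future context.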