
From Toy Model to DeepSeek Giant: The Innocence of x + f(x)
An empirical autopsy of what transformers actually learn, conducted via a deliberately unconventional architecture called VibeNet.

Abstract

This document summarises findings from a series of live training experiments on VibeNet — a deliberately stripped-down language model with no QKV projections, no FFN blocks in its original form, and an untied lm_head nicknamed "Karen." Using a custom autopsy toolkit that measures gradient norms, effective rank, attention entropy, and activation statistics at every layer, we discovered that the field's core architectural assumptions — depth, QKV projections, and the residual identity shortcut — are not the source of learning. At best, they are passengers; at worst, they are an actively misleading abstraction that hid the real gradient topology for a decade. The same physics that let a 2-layer toy model reach loss 4.4 without a NaN caused DeepSeek's 27B-parameter model to explode. The innocent equation is the same: x + f(x).

1. The Architecture: VibeNet
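To make the abstract's "innocent equation" concrete, here is a minimal sketch in plain Python (hypothetical toy code, not the VibeNet implementation) of why x + f(x) behaves the same at any scale: the identity shortcut guarantees the derivative is 1 + f'(x), so gradients never vanish through the block — but stacking n blocks multiplies that factor n times, which is the same mechanism whether the model is a 2-layer toy or a 27B-parameter giant.

```python
# Hypothetical toy residual block (not the VibeNet code):
# y = x + f(x), with a linear stand-in sublayer so f'(x) = 0.5.
def f(x):
    return 0.5 * x          # stand-in sublayer

def residual(x):
    return x + f(x)         # the "innocent" equation: x + f(x)

def residual_grad(x, eps=1e-6):
    # central-difference derivative of the residual block
    return (residual(x + eps) - residual(x - eps)) / (2 * eps)

x = 2.0
print(residual(x))                 # 3.0
print(round(residual_grad(x), 6))  # 1.5 = 1 + f'(x): identity path adds 1

# Depth compounds it: n stacked blocks scale gradients by (1 + f')**n,
# so the same equation that is benign at depth 2 explodes at depth 30.
print(1.5 ** 2)                    # ~2.25
print(1.5 ** 30 > 1e4)             # True: exponential growth with depth
```

The point of the sketch is only the shape of the gradient, not the numbers: with any f whose derivative stays bounded away from zero with the same sign, the product over layers grows exponentially, matching the abstract's claim that depth is a passenger on the same physics rather than a separate cause.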
Continue reading on Dev.to



