This is a condensed version of the full paper. The complete version with formal definitions and proofs is available on arXiv (coming soon).
Why Next-Token Prediction Learns Language but Not Dynamical Systems
Next-token prediction (NTP) learns semantics remarkably well for language, video, and audio — yet fails for dynamical systems and protein folding, where reinforcement learning and structured search become necessary. We propose an explanation: the smoothness of the syntax-to-semantics map.
The core idea
In language, most local token perturbations preserve meaning. Replace "cat" with "dog" and the semantics barely shifts. The syntax-semantics map is smooth — syntactic neighborhoods are semantically coherent.
In dynamical systems, a single changed coefficient can trigger bifurcations, chaos, or divergence. Two equations differing by one number can exhibit qualitatively different behaviour. The map from syntax (symbolic equations) to semantics (dynamical behaviour) is sensitive: nearby points in token space can be arbitrarily far apart in behavioural space.
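This is easy to see numerically. Below is a minimal sketch (our toy example, not from the paper): two logistic-map update rules whose strings differ by a single character, one of which settles onto a fixed point while the other remains chaotic.

```python
# Two symbolic equations at token edit distance 1:
#   x' = 2.9*x*(1-x)  -> converges to a stable fixed point
#   x' = 3.9*x*(1-x)  -> chaotic
def simulate(r, x0=0.2, steps=500):
    """Iterate the logistic map x <- r*x*(1-x) and return the trajectory."""
    x, traj = x0, []
    for _ in range(steps):
        x = r * x * (1 - x)
        traj.append(x)
    return traj

tail_a = simulate(2.9)[-50:]  # last 50 iterates at r = 2.9
tail_b = simulate(3.9)[-50:]  # last 50 iterates at r = 3.9

spread_a = max(tail_a) - min(tail_a)  # ~0: the orbit has settled
spread_b = max(tail_b) - min(tail_b)  # large: the orbit still wanders
print(f"spread at r=2.9: {spread_a:.2e}")
print(f"spread at r=3.9: {spread_b:.2f}")
```

One character of syntax separates the two strings; qualitatively different long-run behaviour separates their semantics.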
Language — smooth: small perturbation → small semantic shift.
Dynamical systems — sensitive: small perturbation → divergent behaviour.
Why this matters for learning
When a model minimises next-token prediction loss, it's maximising compression. In smooth domains, the statistical regularities that enable compression are entangled with semantic regularities. The model cannot discover that "mat" and "rug" are distributionally interchangeable without implicitly learning they are semantically similar.
When the map is non-smooth, this coupling breaks. A model can achieve excellent compression of equation strings by learning syntactic patterns — operator frequencies, variable naming conventions, common coefficient ranges — without learning anything about the dynamical behaviour those equations describe. Compression and understanding decouple.
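A toy illustration of the decoupling (our construction; the corpus of equation strings is invented for the example): a character-bigram model trained on a family of logistic-map equations assigns the two strings below exactly the same probability, so they compress equally well, even though one denotes fixed-point dynamics and the other chaos.

```python
from collections import Counter
from math import log

# Hypothetical training corpus: logistic-map equations with varied coefficients.
corpus = [f"x'={d}.{e}*x*(1-x)" for d in "123" for e in "0123456789"]

pair_counts, char_counts = Counter(), Counter()
for s in corpus:
    for a, b in zip(s, s[1:]):
        pair_counts[a, b] += 1
        char_counts[a] += 1
vocab = {c for s in corpus for c in s}

def bigram_logprob(s):
    """Log-probability of s under a Laplace-smoothed character bigram model."""
    return sum(
        log((pair_counts[a, b] + 1) / (char_counts[a] + len(vocab)))
        for a, b in zip(s, s[1:])
    )

lp_a = bigram_logprob("x'=2.9*x*(1-x)")  # denotes fixed-point dynamics
lp_b = bigram_logprob("x'=3.9*x*(1-x)")  # denotes chaos
print(lp_a, lp_b)  # identical: the model compresses both equally well
```

The compression objective is satisfied by syntactic statistics alone; nothing in it distinguishes the two behaviours.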
Sensitivity across domains
Natural language is smooth because languages evolved for robust communication in noisy channels. Most single-token substitutions yield near-synonyms or detectable errors — all with small semantic distance.
Dynamical systems are at the opposite extreme. Increasing $\rho$ in the Lorenz system from $24.05$ to $24.74$ takes it from a stable fixed point to chaos. The Lipschitz constant of the syntax-to-semantics map is unbounded near bifurcation points, and such sensitive points are dense in the space of dynamical systems.
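The transition can be checked directly. The sketch below (a rough numerical illustration with a hand-rolled RK4 integrator; the step size, horizon, and perturbation are arbitrary choices) starts near the fixed point $C_+ = (\sqrt{\beta(\rho-1)}, \sqrt{\beta(\rho-1)}, \rho-1)$ and integrates the Lorenz equations at $\rho = 24.05$ and at the classic chaotic value $\rho = 28$: below the bifurcation the trajectory stays in the $x > 0$ lobe; above it, the orbit jumps between lobes.

```python
from math import sqrt

SIGMA, BETA = 10.0, 8.0 / 3.0

def lorenz(state, rho):
    x, y, z = state
    return (SIGMA * (y - x), x * (rho - z) - y, x * y - BETA * z)

def x_history(state, rho, dt=0.01, steps=10000):
    """Classic 4th-order Runge-Kutta; returns the x-coordinate history."""
    xs = []
    for _ in range(steps):
        k1 = lorenz(state, rho)
        k2 = lorenz(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), rho)
        k3 = lorenz(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), rho)
        k4 = lorenz(tuple(s + dt * k for s, k in zip(state, k3)), rho)
        state = tuple(
            s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        xs.append(state[0])
    return xs

def near_c_plus(rho, eps=0.1):
    """A point slightly displaced from the fixed point C+."""
    q = sqrt(BETA * (rho - 1.0))
    return (q + eps, q + eps, rho - 1.0 + eps)

xs_low = x_history(near_c_plus(24.05), 24.05)  # expected: stays in one lobe
xs_high = x_history(near_c_plus(28.0), 28.0)   # expected: switches lobes

print("crosses lobes at rho=24.05:", min(xs_low) < 0)
print("crosses lobes at rho=28.00:", min(xs_high) < 0)
```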
Why RL and search succeed where NTP fails
NTP computes gradients that point toward better token prediction. When the syntax-semantics map is smooth, this aligns with better semantic quality. When it isn't, the NTP gradient and the semantic gradient can point in completely different directions.
Reinforcement learning resolves this by computing a learning signal that originates in semantic space. It samples complete outputs, evaluates them semantically (simulate the equation, predict the structure, play out the game), and uses the evaluation as reward. Crucially, this never differentiates through the non-smooth map — it only requires the map to be computable, not smooth.
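As a minimal sketch of this loop (our toy setup: the hidden system, the prior, and the reward are invented for illustration), candidate equations are sampled as strings, each is simulated, and the negative trajectory error serves as reward. No gradient ever flows through the string-to-behaviour map; it only has to be computable.

```python
import random

def simulate(r, x0=0.2, steps=100):
    xs, x = [], x0
    for _ in range(steps):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

# "Observed" behaviour produced by a hidden ground-truth coefficient.
# (Stable regime chosen so the outcome is easy to check; the mechanism
# is identical in chaotic regimes, where gradients would be useless.)
target = simulate(2.57)

def reward(equation):
    """Semantic reward: negative simulation error. Requires the
    syntax -> behaviour map to be computable, never differentiable."""
    r = float(equation.split("=")[1].split("*")[0])
    return -sum((a - b) ** 2 for a, b in zip(simulate(r), target))

# Sample candidate equation strings from a crude syntactic prior,
# evaluate each in semantic space, and keep the best (best-of-N).
random.seed(0)
candidates = [f"x'={random.uniform(2.0, 3.0):.2f}*x*(1-x)" for _ in range(200)]
best = max(candidates, key=reward)
best_r = float(best.split("=")[1].split("*")[0])
print("recovered:", best)  # coefficient close to the hidden 2.57
```

Best-of-N sampling is the simplest member of this family; policy-gradient methods use the same reward signal to update the sampler itself.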
Search methods — Monte Carlo tree search, beam search with semantic scoring — go further by exploring the output space globally rather than following local gradients. This is maximally robust to non-smoothness.
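A toy version of semantically scored search (our construction, with a deliberately tiny two-digit coefficient space): candidates are extended digit by digit and scored by simulating the system they denote. With beam width 1 the search commits to a locally attractive first digit and misses the true coefficient; width 2 keeps the alternative alive and recovers it.

```python
def simulate(r, x0=0.2, steps=200):
    xs, x = [], x0
    for _ in range(steps):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

target = simulate(2.6)  # behaviour of the hidden coefficient r = 2.6

def semantic_score(coeff_str):
    """Lower is better: trajectory error of the system the string denotes.
    Partial strings like "3" are scored as-is (i.e. as 3.0)."""
    sim = simulate(float(coeff_str))
    return sum((a - b) ** 2 for a, b in zip(sim, target))

def beam_search(width):
    beam = ["1", "2", "3"]  # choose the integer digit of the coefficient
    beam = sorted(beam, key=semantic_score)[:width]
    # Extend every surviving hypothesis with each possible decimal digit.
    candidates = [s + "." + d for s in beam for d in "0123456789"]
    return min(candidates, key=semantic_score)

print(beam_search(1))  # greedy: commits to the locally best digit, misses 2.6
print(beam_search(2))  # beam keeps the runner-up alive and recovers 2.6
```

The wider beam wins precisely because the score of a prefix is a poor predictor of the score of its completions, which is what non-smoothness means locally.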
Evidence from within language itself
The history of LLM development traces exactly the paradigm progression predicted by the smoothness hypothesis:
Next-Token Prediction
GPT-2 era
Pure NTP produces fluent text and basic factual recall. These are the smooth regions of language — tasks where most token substitutions preserve essential meaning, and compression is a reliable proxy for understanding.
Low sensitivity

RLHF
InstructGPT / ChatGPT era
Instruction following occupies a less smooth region. "Summarise this" vs "don't summarise this" differs by a single token but flips the required behaviour entirely. RLHF provides a semantic learning signal rooted in human preference, not token prediction.
Medium sensitivity

Inference-time search
o1 / extended thinking era
Mathematical reasoning is the least smooth region of language. A single error in a chain of deductions invalidates everything. The latest models use inference-time search — multiple paths, process reward models, backtracking — structurally analogous to MCTS.
High sensitivity

Structured representations + RL + search
What dynamical systems require
The full stack: NTP pretraining for syntactic priors, RL for semantic reward, search for global exploration, and representations engineered to make the syntax-semantics map as smooth as possible.
Very high sensitivity

Chain-of-thought as smoothing
Instead of mapping directly from question to answer (potentially non-smooth), the model decomposes the problem into intermediate steps. Each step is a locally smooth mapping, even when the end-to-end map is not. Chain-of-thought reparameterises the problem to navigate a smoother path through semantic space.
Implications
Compression does not imply understanding — at least not in general. It implies understanding only when the syntax-semantics map is smooth. This explains why scaling protein language models improves sequence metrics without approaching AlphaFold-level structure prediction.
The bitter lesson has a caveat. In smooth domains, scale alone suffices. In non-smooth domains, representation and structure matter — the right inductive biases, search strategies, and compositional languages are prerequisites, not luxuries.
For dynamical systems, the path forward is not simply scaling autoregressive models. It requires direct semantic evaluation, structured search, and representations engineered to make the syntax-semantics map as smooth as possible. Finding the right language for a domain is not convenience — it is a prerequisite for learning.