Gimle Papers Working Paper
The Smoothness Hypothesis
Why does next-token prediction learn language so well, yet fail for dynamical systems? We propose an explanation rooted in the smoothness of the syntax-to-semantics map — and what it means for building foundation models in structure-sensitive domains.
The Puzzle
NTP succeeds spectacularly in some domains — and fails in others
Language models trained via next-token prediction acquire far more than surface statistics — they learn factual knowledge, reasoning patterns, and representations that transfer across tasks. Similar success extends to video, where next-frame prediction yields models with implicit physical understanding, and to audio.
Yet NTP conspicuously fails in other domains. Protein language models learn useful sequence representations but cannot, on their own, predict structure or function. Autoregressive models over symbolic equations require carefully designed curricula beyond raw next-token prediction. Progress in these domains has required reinforcement learning, search, or explicit structural supervision.
NTP succeeds
Language, video, audio — fluent generation, factual recall, implicit physical understanding
NTP fails
Protein folding, dynamical systems, theorem proving — requires RL, search, or structural supervision
The question
What property of a domain determines whether next-token prediction is sufficient for learning?
The Core Idea
It depends on how the syntax-to-semantics map behaves under perturbation
Language — smooth
Small perturbation → small semantic shift
Replace “cat” with “dog” and the semantics barely shifts. Most single-token substitutions yield near-synonyms, coherent alternatives, or detectable errors — all with small semantic distance. This is not accidental: languages evolved for robust communication in noisy channels.
Dynamical systems — sensitive
Small perturbation → divergent behaviour
Change $\rho$ in the Lorenz system from $24.05$ to $24.74$ and the system transitions from a stable fixed point to chaos. Change a “$+$” to a “$-$” in an ODE and stable orbits turn into divergent trajectories. Nearby points in token space can be arbitrarily far apart in behavioural space.
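A minimal numerical sketch of the Lorenz claim, assuming NumPy and SciPy (the diagnostic is ours and deliberately crude): start near the $C_+$ equilibrium and check whether a small perturbation decays or the trajectory escapes to the chaotic attractor. Both $\rho$ values sit very close to bifurcation boundaries, so transients can be long and the integration horizon may need extending.

import numpy as np
from scipy.integrate import solve_ivp

SIGMA, BETA = 10.0, 8.0 / 3.0

def lorenz(t, state, rho):
    x, y, z = state
    return [SIGMA * (y - x), x * (rho - z) - y, x * y - BETA * z]

for rho in (24.05, 24.74):
    c = np.sqrt(BETA * (rho - 1.0))
    fixed_point = np.array([c, c, rho - 1.0])           # the C+ equilibrium
    y0 = fixed_point + np.array([0.0, 0.0, 1.0])        # nudge it slightly
    sol = solve_ivp(lorenz, (0.0, 500.0), y0, args=(rho,), max_step=0.01)
    tail = sol.y[:, sol.t > 400.0]
    drift = np.abs(tail - fixed_point[:, None]).max()
    # Small drift: the perturbation decays back to the equilibrium.
    # Drift of order 10: the trajectory has escaped to the chaotic attractor.
    print(f"rho = {rho}: max deviation from C+ over t in [400, 500] = {drift:.2f}")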
The Smoothness Hypothesis
Formalising the intuition with perturbation sensitivity
Consider a domain with a syntactic space $\mathcal{S}$ (token sequences), a semantic space $\mathcal{M}$ (meanings/behaviours), and a syntax-semantics map $\phi: \mathcal{S} \to \mathcal{M}$.
Since $\mathcal{S}$ is discrete, we define the perturbation sensitivity of a sequence $s$ as the worst-case semantic change under a single-token edit. Let $\mathcal{N}(s) = \{s' \in \mathcal{S} : d_\mathcal{S}(s, s') = 1\}$ be the edit-distance-1 neighbourhood and $d_\mathcal{M}$ a distance on $\mathcal{M}$; then $\kappa(s) = \max_{s' \in \mathcal{N}(s)} d_\mathcal{M}(\phi(s), \phi(s'))$. The average perturbation sensitivity $\bar{\kappa} = \mathbb{E}_{s \sim \mathcal{D}}[\kappa(s)]$ over the domain's natural distribution $\mathcal{D}$ determines whether NTP can learn semantics as a byproduct of compression.
NTP works when $\bar{\kappa}$ is small.
When the average perturbation sensitivity $\bar{\kappa}$ is bounded and small, NTP is an effective learning signal. When $\bar{\kappa}$ is large or unbounded, learning requires paradigms that evaluate semantic correctness directly: RL, search, or constrained generation.
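As a toy illustration of how $\bar{\kappa}$ could be estimated empirically (all names are ours): take the syntactic space to be short arithmetic strings, the semantic map $\phi$ to be their numeric value, a single-token edit to be a one-character substitution, and treat invalid strings as lying outside $\mathcal{S}$.

DIGITS = "0123456789"
OPS = "+-*"
ALPHABET = DIGITS + OPS

def phi(expr):
    """Toy semantic map: an arithmetic string's numeric value (None if syntactically invalid)."""
    try:
        return eval(expr)            # acceptable for a toy alphabet of digits and + - *
    except Exception:
        return None

def kappa(expr):
    """Worst-case semantic change over all valid single-character edits of expr."""
    base = phi(expr)
    worst = 0.0
    for i, old in enumerate(expr):
        for new in ALPHABET:
            if new == old:
                continue
            edited = phi(expr[:i] + new + expr[i + 1:])
            if edited is not None:
                worst = max(worst, abs(edited - base))
    return worst

corpus = ["3+4*2", "7-2+9", "5*5-1", "8+1*6"]        # stand-in for the natural distribution D
kappa_bar = sum(kappa(s) for s in corpus) / len(corpus)
print(f"estimated kappa_bar for toy arithmetic: {kappa_bar:.1f}")

Arithmetic strings come out very sensitive under this measure, as expected; the same recipe applies with a learned embedding distance for language or a simulation-based distance for equations.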
Why Smoothness Enables NTP
Compression becomes a semantic objective when the map is smooth
If $\bar{\kappa}$ is small, then for a typical expression $s$, all single-token edits $s' \in \mathcal{N}(s)$ produce semantically similar outputs. Syntactic neighbourhoods are semantically coherent.
This means the NTP objective — a purely syntactic, token-level loss — becomes a faithful proxy for semantic quality. A model that learns to predict tokens well in a neighbourhood of $s$ is implicitly learning about the meaning of that neighbourhood, because low $\bar{\kappa}$ guarantees that syntactically similar sequences carry similar meaning. Each training example provides semantic information about its entire edit neighbourhood, so sample efficiency improves as $\bar{\kappa}$ decreases.
Compression = understanding
Minimising $\mathcal{L}_{\text{NTP}}$ is equivalent to maximising compression. When $\phi$ is smooth, the statistical regularities that enable compression are entangled with semantic regularities.
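This is the standard cross-entropy / code-length identity: for an autoregressive model $p_\theta$ over sequences $s = (s_1, \dots, s_T)$,

$$\mathcal{L}_{\text{NTP}}(\theta) \;=\; \mathbb{E}_{s \sim \mathcal{D}}\Big[-\sum_{t=1}^{T} \log p_\theta(s_t \mid s_{<t})\Big],$$

which, up to the choice of log base, is the expected code length an arithmetic coder using $p_\theta$ needs to encode $s$; minimising the loss is maximising compression of the training distribution.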
The model cannot discover that “mat” and “rug” are distributionally interchangeable without implicitly learning that they are semantically similar: low $\bar{\kappa}$ guarantees that their distributional similarity reflects semantic similarity, not coincidence.
When smoothness fails
When $\phi$ is non-smooth, two equations differing by a single coefficient can exhibit qualitatively different dynamics. A model can achieve excellent perplexity on symbolic equation strings by learning operator frequencies and syntactic patterns without learning anything about the dynamical behaviour those equations describe. Compression and understanding decouple: the statistical regularities that enable compression are merely surface-level patterns.
Sensitivity Across Domains
From smooth to sensitive — each domain requires a different learning paradigm
Natural language
Low $\bar{\kappa}$. Languages evolved for robust communication in noisy channels. Most single-token substitutions yield near-synonyms or detectable errors — all with small semantic distance. The exceptions (negation, quantifiers) are statistically rare.
Video & audio
Low–moderate $\bar{\kappa}$. Sensory signals are produced by continuous physical processes, so small perturbations in pixels or waveforms correspond to small changes in scene content. Tokenised representations inherit this when the tokeniser is well-trained.
Code
Code appears to contradict the hypothesis: changing “<” to “<=” can completely alter behaviour, suggesting high $\bar{\kappa}$. Yet NTP succeeds because $\bar{\kappa}$ is computed under the natural distribution — real code is concentrated on a submanifold where the map is locally smooth (naming conventions, patterns, equivalent formulations). NTP-only models still fail on the non-smooth tail: edge cases, off-by-one errors, and subtle logical bugs — precisely where execution feedback is needed.
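A familiar concrete instance of that non-smooth tail (a generic off-by-one, not taken from any particular codebase):

def sum_first(xs, n):
    """Sum the first n elements of xs."""
    total = 0
    i = 0
    while i < n:          # replacing `<` with `<=` reads one element too many
        total += xs[i]
        i += 1
    return total

print(sum_first([1, 2, 3, 4], 4))   # 10; with `<=` the same call raises IndexError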
Theorem proving
High $\bar{\kappa}$. A single error in a chain of deductions invalidates the entire argument. The map from proof text to validity is highly non-smooth, so evaluating semantic correctness requires proof search.
Protein sequences
High $\bar{\kappa}$. Single amino acid substitutions can cause misfolding or loss of function. Epistasis makes mutation effects strongly context-dependent. The sequence-to-function landscape is rugged with many local optima.
Dynamical systems
Very high $\bar{\kappa}$. The Lipschitz constant is unbounded near bifurcation points, and such sensitive points are dense in the space of dynamical systems. A single changed coefficient can transform a stable equilibrium into chaos.
The Gradient Alignment Problem
Why NTP gradients become semantically blind in non-smooth domains
NTP computes gradients of the form $\partial \mathcal{L}_{\text{NTP}} / \partial \theta$. These gradients point toward better token prediction — they tell the model how to adjust its parameters so that the next token is more likely under the training distribution.
When $\phi$ is smooth, better token prediction implies better semantic quality: the NTP gradient is aligned with the semantic gradient $\partial \mathcal{L}_{\text{semantic}} / \partial \theta$.
When $\phi$ is non-smooth, this alignment breaks. The NTP gradient and the semantic gradient can point in completely different directions. A parameter update that improves token prediction for an equation string may make the predicted dynamics worse, because the relationship between syntactic likelihood and dynamical behaviour is arbitrary near bifurcation points.
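One way to make the notion of alignment concrete (a formalisation we introduce here for illustration) is the cosine between the two gradients:

$$A(\theta) \;=\; \frac{\big\langle \nabla_\theta \mathcal{L}_{\text{NTP}},\; \nabla_\theta \mathcal{L}_{\text{semantic}} \big\rangle}{\lVert \nabla_\theta \mathcal{L}_{\text{NTP}} \rVert \, \lVert \nabla_\theta \mathcal{L}_{\text{semantic}} \rVert}.$$

Smooth $\phi$ keeps $A(\theta)$ positive on average, so steps that reduce the token loss also tend to reduce the semantic loss; near bifurcation points $A(\theta)$ can hover around zero or turn negative, which is the failure mode described next.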
The consequence
Gradient descent in token space is semantically blind. It optimises a proxy that has lost its connection to the quantity of interest. A model can keep improving its perplexity on equation strings while learning nothing about the dynamics those equations describe.
Why RL and Search Succeed Where NTP Fails
Bypassing the non-smooth map by evaluating semantics directly
RL bypasses the map
Reinforcement learning computes a learning signal that originates in semantic space, not syntactic space. Policy gradient methods (REINFORCE, PPO) work by:
Sample
Generate a complete candidate equation, protein sequence, or game trajectory
Evaluate semantically
Simulate the equation, predict the protein's 3D structure, play out the game to a terminal state
Use as reward
Weight the gradient of the generation probability by the semantic quality of the result
This never differentiates through $\phi$. It requires $\phi$ to be computable, not smooth. A single changed coefficient that produces chaos instead of stability will receive a low reward — something NTP gradients cannot detect.
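A toy sketch of this loop (all names and the target are ours; the “semantic evaluation” here is just reading off the number a digit string denotes, standing in for a simulator or structure predictor). A per-position categorical policy is trained with REINFORCE and a batch-mean baseline:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = "0123456789"
LENGTH, BATCH, LR, STEPS = 3, 64, 0.1, 500
logits = np.zeros((LENGTH, len(VOCAB)))              # per-position categorical policy

def sample(logits):
    """Sample a token sequence and return it with the per-position probabilities."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    tokens = [rng.choice(len(VOCAB), p=p) for p in probs]
    return tokens, probs

def semantic_reward(tokens):
    """Evaluate the candidate in 'semantic space': how close the denoted number is to 314."""
    value = int("".join(VOCAB[t] for t in tokens))
    return -abs(value - 314)

for _ in range(STEPS):
    samples = [sample(logits) for _ in range(BATCH)]
    rewards = np.array([semantic_reward(tokens) for tokens, _ in samples], dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # baseline plus normalisation
    grad = np.zeros_like(logits)
    for (tokens, probs), a in zip(samples, adv):
        for pos, tok in enumerate(tokens):
            g = -probs[pos].copy()
            g[tok] += 1.0                            # d log p(tok) / d logits for a softmax
            grad[pos] += a * g
    logits += LR * grad / BATCH                      # ascend the REINFORCE gradient estimate

tokens, _ = sample(logits)
print("sample after training:", "".join(VOCAB[t] for t in tokens))   # typically close to "314"

Nothing in this loop differentiates through the evaluator: the reward enters only as a scalar weight on the log-probability gradient, which is why the map from string to behaviour can be as discontinuous as it likes.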
Search handles rugged landscapes
Even with a semantic reward signal, the reward landscape over candidate outputs can be rugged: many local optima, plateaus, and discontinuities. A model generating equations token-by-token faces a combinatorial space where most paths lead to semantically poor outputs, and the few good outputs are separated by large syntactic distances.
Search methods — Monte Carlo tree search, beam search with semantic scoring, evolutionary algorithms — explore the output space globally rather than following local gradients. MCTS evaluates many candidate continuations through simulation, then selects the most promising branch. It does not need the value function to be smooth.
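A correspondingly minimal sketch of semantic scoring inside a search procedure, reusing the toy digit-string target from the sketch above (a real system would call a simulator or structure predictor instead of this scorer):

def semantic_score(prefix, target=314, length=3):
    """Hypothetical semantic evaluator: score a (possibly partial) candidate by padding it out."""
    return -abs(int(prefix.ljust(length, "0")) - target)

def beam_search(vocab="0123456789", length=3, beam_width=4):
    beam = [""]
    for _ in range(length):
        candidates = [prefix + tok for prefix in beam for tok in vocab]
        # Rank expansions by semantic evaluation rather than by token likelihood.
        beam = sorted(candidates, key=semantic_score, reverse=True)[:beam_width]
    return beam[0]

print(beam_search())    # "314" under this toy scorer

MCTS replaces the exhaustive one-step expansion with sampled rollouts and a learned value function, but the principle is the same: candidates are ranked by evaluated semantics, not by how likely their tokens are.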
Self-play as adaptive curriculum
Non-smooth domains have vast semantic spaces that no fixed training corpus can cover. In dynamical systems, the boundaries between behaviours are fractal in structure. Self-play — or its analogues, like iterative refinement with simulation feedback — generates training signal where the model currently fails, concentrating learning on semantically difficult regions.
RL + search is more than additive.
RL trains a value function that learns to approximate the non-smooth syntax-semantics map through direct experience. Search then uses this learned value function to explore efficiently. The value function does not need $\phi$ to be globally smooth — it builds a local approximation from semantic evaluations, effective precisely in the regions the search visits.
The Paradigm Spectrum
Ordered by how much of the semantic evaluation they internalise
Next-Token Prediction
Learning signal is purely syntactic
Requires $\phi$ to be smooth for semantic learning to occur as a byproduct of compression. The model learns because syntactic patterns are semantic patterns.
Requires smooth $\phi$
Behavioural Fine-Tuning
Adds a semantic loss (simulation error, structural accuracy)
Still uses gradient-based optimisation through the model. Requires $\phi$ to be at least locally smooth near good solutions.
Requires locally smooth $\phi$
RL with Semantic Reward
Learning signal is semantic, estimated through sampling
Requires $\phi$ to be evaluable, not smooth. Handles non-smooth maps but can struggle with sparse or deceptive rewards.
Requires computable $\phi$
Search with Learned Value Function
The AlphaGo paradigm
Explores the output space globally, guided by a learned semantic evaluator. Maximally robust to non-smoothness. This is what dynamical systems require.
No smoothness required
Observed in practice
Language: NTP alone (smooth $\phi$). Code: NTP + execution feedback (moderately non-smooth). Protein structure: NTP + geometric supervision (non-smooth). Game playing: NTP + RL + MCTS (highly non-smooth). Dynamical systems: the full stack.
Evidence from Within Language Itself
The history of LLM development traces the paradigm progression predicted by the hypothesis
Language is not uniformly smooth — it contains regions of varying perturbation sensitivity. The evolution of large language models traces a path through exactly the paradigm progression predicted by the smoothness hypothesis.
NTP era
GPT-2
Pure NTP produces fluent text, coherent paragraphs, and basic factual recall. These are the smooth regions of language: tasks where most token substitutions preserve essential meaning, and where compression is a reliable proxy for understanding.
Low sensitivity
RLHF era
InstructGPT / ChatGPT
Instruction following occupies a less smooth region. “Summarise this” vs “don't summarise this” differs by a single token but flips the required behaviour entirely. RLHF provided a semantic learning signal rooted in human preference judgments, not token prediction.
Medium sensitivity
Reasoning era
o1 / extended thinking
Mathematical reasoning is the least smooth region of language. A single error in a chain of deductions invalidates everything. The latest models use inference-time search — multiple reasoning paths, process reward models, backtracking — structurally analogous to MCTS.
High sensitivity
Chain-of-thought as smoothing.
Instead of mapping directly from question to answer — a potentially non-smooth map — the model decomposes into intermediate steps: question $\to$ step$_1$ $\to$ step$_2$ $\to \cdots \to$ answer. Each step is a locally smooth mapping, even when the end-to-end map is not.
This smoothing argument is only directionally correct, however. If each step $i$ has sensitivity $\kappa_i$, the composed sensitivity is bounded only by $\prod_i \kappa_i$, which can exceed the end-to-end sensitivity. Decomposition only works when the intermediate representations genuinely factor the problem into locally smooth sub-problems, rather than merely distributing the non-smoothness across more steps.
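In Lipschitz terms (a deliberately rough model of the argument, treating each step as a map with sensitivity $\kappa_i$ and using the appropriate distance on each intermediate space):

$$d\big(\phi_n \circ \cdots \circ \phi_1(q),\; \phi_n \circ \cdots \circ \phi_1(q')\big) \;\le\; \Big(\prod_{i=1}^{n} \kappa_i\Big)\, d(q, q'),$$

so it is the product, not the individual $\kappa_i$, that the end-to-end behaviour inherits.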
Implications
What the smoothness hypothesis means for AI research
Compression $\neq$ understanding
A special case, not a general law
A prominent view holds that compression and intelligence are equivalent. The smoothness hypothesis reveals this as a special case: compression implies understanding only when the syntax-semantics map is smooth.
In non-smooth domains, a model can achieve excellent compression of symbolic equation strings by learning syntactic patterns without learning anything about the dynamical behaviour those equations describe.
This resolves why scaling protein language models improves sequence metrics without approaching AlphaFold-level structure prediction: the sequence-to-structure map is non-smooth.
Nuancing the bitter lesson
Scale alone does not suffice everywhere
Sutton's “bitter lesson” argues that general methods leveraging computation ultimately outperform methods using human domain knowledge. The smoothness hypothesis adds an important qualification.
In smooth domains, scale alone suffices — more compute, more data, bigger models. The bitter lesson holds in its strongest form.
In non-smooth domains, representation and structure matter — the right inductive biases, search strategies, and compositional languages are prerequisites, not luxuries. The lesson is not equally bitter for all domains.
Conclusion
Finding the right language for a domain is not a convenience — it is a prerequisite for learning.
For structure-sensitive domains like dynamical systems, the path forward is not simply scaling autoregressive models. It requires direct semantic evaluation, structured search, and — most importantly — representations engineered to make the syntax-semantics map as smooth as possible.