Gimle Papers Working Paper

The Smoothness Hypothesis

Why does next-token prediction learn language so well, yet fail for dynamical systems? We propose an explanation rooted in the smoothness of the syntax-to-semantics map — and what it means for building foundation models in structure-sensitive domains.

Read the full paper (PDF)

The Puzzle

Next-token prediction (NTP) succeeds spectacularly in some domains — and fails in others

Language models trained via next-token prediction acquire far more than surface statistics — they learn factual knowledge, reasoning patterns, and representations that transfer across tasks. Similar success extends to video, where next-frame prediction yields models with implicit physical understanding, and to audio.

Yet NTP conspicuously fails in other domains. Protein language models learn useful sequence representations but on their own cannot predict structure or function. Autoregressive models over symbolic equations require carefully designed curricula beyond raw next-token prediction. Progress in these domains has required reinforcement learning, search, or explicit structural supervision.

NTP succeeds

Language, video, audio — fluent generation, factual recall, implicit physical understanding

NTP fails

Protein folding, dynamical systems, theorem proving — requires RL, search, or structural supervision

The question

What property of a domain determines whether next-token prediction is sufficient for learning?

The Core Idea

It depends on how the syntax-to-semantics map behaves under perturbation

Language — smooth

Small perturbation → small semantic shift

Replace “cat” with “dog” and the semantics barely shifts. Most single-token substitutions yield near-synonyms, coherent alternatives, or detectable errors — all with small semantic distance. This is not accidental: languages evolved for robust communication in noisy channels.

Dynamical systems — sensitive

Small perturbation → divergent behaviour

Change $\rho$ in the Lorenz system from $24.05$ to $24.74$ and the system transitions from a stable fixed point to chaos. Change “$+$” to “$-$” in an ODE and stable orbits become divergent trajectories. Nearby points in token space can be arbitrarily far apart in behavioural space.
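This sensitivity is easy to reproduce numerically. The sketch below is illustrative only: rather than the near-bifurcation values quoted above, it uses $\rho = 10$ (well inside the stable regime) and $\rho = 28$ (the classic chaotic regime) so the contrast is robust. It integrates the Lorenz system with a hand-rolled RK4 stepper and tracks how a $10^{-6}$ perturbation of the initial state evolves.

```python
import math

# Illustrative Lorenz sensitivity demo (not the paper's experiment).
# Standard parameters sigma = 10, beta = 8/3; rho is varied.
SIGMA, BETA = 10.0, 8.0 / 3.0

def lorenz(state, rho):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (SIGMA * (y - x), x * (rho - z) - y, x * y - BETA * z)

def rk4_step(state, rho, dt):
    """One classical Runge-Kutta 4 step."""
    def add(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz(state, rho)
    k2 = lorenz(add(state, k1, dt / 2), rho)
    k3 = lorenz(add(state, k2, dt / 2), rho)
    k4 = lorenz(add(state, k3, dt), rho)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def late_separation(rho, steps=5000, dt=0.01):
    """Max distance between two trajectories (1e-6 apart initially)
    over the last 1000 steps of the integration."""
    a, b = (1.0, 1.0, 1.0), (1.0 + 1e-6, 1.0, 1.0)
    worst = 0.0
    for i in range(steps):
        a, b = rk4_step(a, rho, dt), rk4_step(b, rho, dt)
        if i >= steps - 1000:
            worst = max(worst, math.dist(a, b))
    return worst
```

At $\rho = 10$ both trajectories collapse onto the same fixed point and the gap vanishes; at $\rho = 28$ the $10^{-6}$ gap is amplified to the size of the attractor.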

The Smoothness Hypothesis

Formalising the intuition with perturbation sensitivity

Consider a domain with a syntactic space $\mathcal{S}$ (token sequences), a semantic space $\mathcal{M}$ (meanings/behaviours), and a syntax-semantics map $\phi: \mathcal{S} \to \mathcal{M}$.

Since $\mathcal{S}$ is discrete, we define perturbation sensitivity as the worst-case semantic change under a single-token edit. Let $\mathcal{N}(s) = \{s' \in \mathcal{S} : d_\mathcal{S}(s, s') = 1\}$ be the edit-distance-1 neighbourhood. The average perturbation sensitivity $\bar{\kappa}$ across the domain determines whether NTP can learn semantics as a byproduct of compression.

$$\kappa(s) = \sup_{s' \in \mathcal{N}(s)} d_\mathcal{M}(\phi(s), \phi(s'))$$ $$\bar{\kappa} = \mathbb{E}_{s \sim p(\mathcal{S})}[\kappa(s)]$$

NTP works when $\bar{\kappa}$ is small.

When the average perturbation sensitivity $\bar{\kappa}$ is bounded and small, NTP is an effective learning signal. When $\bar{\kappa}$ is large or unbounded, learning requires paradigms that evaluate semantic correctness directly: RL, search, or constrained generation.
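The definitions above can be sketched directly in code. The toy domain below is hypothetical (not from the paper): syntax is a decimal string encoding the coefficient $a$ of the linear map $x \mapsto a x$, semantics is the capped long-run magnitude of the orbit, and for simplicity the edit neighbourhood $\mathcal{N}(s)$ is restricted to single-digit substitutions of the trailing digit.

```python
# Toy instance of kappa(s) = sup over the edit neighbourhood of the
# semantic distance. The map x -> a*x bifurcates at a = 1, so kappa is
# tiny deep in the stable region and explodes at the boundary.

def phi(s, n=2000, cap=1e6):
    """Semantics: |x_n| for x_{k+1} = a * x_k, x_0 = 1, capped at `cap`."""
    a, x = float(s), 1.0
    for _ in range(n):
        x *= a
        if abs(x) > cap:
            return cap
    return abs(x)

def neighbours(s):
    """Edit-distance-1 neighbourhood (trailing-digit substitutions only)."""
    return [s[:-1] + d for d in "0123456789" if d != s[-1]]

def kappa(s):
    """Worst-case semantic change over the neighbourhood of s."""
    return max(abs(phi(t) - phi(s)) for t in neighbours(s))
```

`kappa("0.950")` is essentially zero, since every neighbour still decays; `kappa("1.000")` is enormous, since neighbours like `"1.009"` blow past the cap.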

Why Smoothness Enables NTP

Compression becomes a semantic objective when the map is smooth

If $\bar{\kappa}$ is small, then for a typical expression $s$, all single-token edits $s' \in \mathcal{N}(s)$ produce semantically similar outputs. Syntactic neighbourhoods are semantically coherent.

This means the NTP objective — a purely syntactic, token-level loss — becomes a faithful proxy for semantic quality. A model that learns to predict tokens well in a neighbourhood of $s$ is implicitly learning about the meaning of that neighbourhood, because low $\bar{\kappa}$ guarantees that syntactically similar sequences carry similar meaning. Each training example provides semantic information about its entire edit neighbourhood, so sample efficiency improves as $\bar{\kappa}$ decreases.

Compression = understanding

Minimising $\mathcal{L}_{\text{NTP}}$ is equivalent to maximising compression. When $\phi$ is smooth, the statistical regularities that enable compression are entangled with semantic regularities.

The model cannot discover that “mat” and “rug” are distributionally interchangeable without implicitly learning that they are semantically similar: low $\bar{\kappa}$ guarantees that their distributional similarity reflects semantic similarity, not coincidence.

$$s' \in \mathcal{N}(s) \;\implies\; d_\mathcal{M}(\phi(s), \phi(s')) \leq \kappa(s)$$

When smoothness fails

When $\phi$ is non-smooth, two equations differing by a single coefficient can exhibit qualitatively different dynamics. A model can achieve excellent perplexity on symbolic equation strings by learning operator frequencies and syntactic patterns without learning anything about the dynamical behaviour those equations describe. Compression and understanding decouple: the statistical regularities that enable compression are merely surface-level patterns.

Sensitivity Across Domains

From smooth to sensitive — each domain requires a different learning paradigm

Language: NTP
Audio / video: NTP + perceptual losses
Code: NTP + execution feedback
Theorem proving: NTP + proof search
Proteins: NTP + geometric supervision
Dynamical systems: NTP + RL + search

(Ordered from smooth, low $\bar{\kappa}$, to sensitive, high $\bar{\kappa}$.)

Natural language

Low $\bar{\kappa}$. Languages evolved for robust communication in noisy channels. Most single-token substitutions yield near-synonyms or detectable errors — all with small semantic distance. The exceptions (negation, quantifiers) are statistically rare.

Video & audio

Low–moderate $\bar{\kappa}$. Sensory signals are produced by continuous physical processes, so small perturbations in pixels or waveforms correspond to small changes in scene content. Tokenised representations inherit this when the tokeniser is well-trained.

Code

Code appears to contradict the hypothesis: changing < to <= can completely alter behaviour, suggesting high $\bar{\kappa}$. Yet NTP succeeds because $\bar{\kappa}$ is computed under the natural distribution — real code is concentrated on a submanifold where the map is locally smooth (naming conventions, patterns, equivalent formulations). NTP-only models still fail on the non-smooth tail: edge cases, off-by-one errors, and subtle logical bugs — precisely where execution feedback is needed.
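That one-character sensitivity is easy to exhibit (the functions below are hypothetical, not from any real codebase): the two bodies differ by a single token, yet denote different functions.

```python
# A single-token edit (`<` vs `<=`) with a semantic, not cosmetic, effect.

def count_below(xs, t):
    """Count elements strictly below the threshold."""
    return sum(1 for x in xs if x < t)

def count_at_or_below(xs, t):
    """Identical except for one character: threshold is now inclusive."""
    return sum(1 for x in xs if x <= t)

# count_below([1, 2, 3], 2) -> 1; count_at_or_below([1, 2, 3], 2) -> 2
```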

Theorem proving

High $\bar{\kappa}$. A single error in a chain of deductions invalidates the entire argument. The map from proof text to validity is highly non-smooth — requiring proof search to evaluate semantic correctness.

Protein sequences

High $\bar{\kappa}$. Single amino acid substitutions can cause misfolding or loss of function. Epistasis makes mutation effects strongly context-dependent. The sequence-to-function landscape is rugged with many local optima.

Dynamical systems

Very high $\bar{\kappa}$. The Lipschitz constant is unbounded near bifurcation points, and such sensitive points are dense in the space of dynamical systems. A single changed coefficient can transform a stable equilibrium into chaos.

The Gradient Alignment Problem

Why NTP gradients become semantically blind in non-smooth domains

NTP computes gradients of the form $\partial \mathcal{L}_{\text{NTP}} / \partial \theta$. These gradients point toward better token prediction — they tell the model how to adjust its parameters so that the next token is more likely under the training distribution.

When $\phi$ is smooth, better token prediction implies better semantic quality: the NTP gradient is aligned with the semantic gradient $\partial \mathcal{L}_{\text{semantic}} / \partial \theta$.

When $\phi$ is non-smooth, this alignment breaks. The NTP gradient and the semantic gradient can point in completely different directions. A parameter update that improves token prediction for an equation string may make the predicted dynamics worse, because the relationship between syntactic likelihood and dynamical behaviour is arbitrary near bifurcation points.
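A one-parameter toy makes the misalignment concrete. Everything here is an assumed, illustrative setup: the “model” emits the coefficient $a = \theta$ of the map $x \mapsto a x$, the syntactic loss pulls $\theta$ toward an assumed corpus-average coefficient of $1.2$, and the semantic loss cares only about stability.

```python
# Hypothetical 1-D illustration of NTP/semantic gradient misalignment.

CORPUS_MEAN = 1.2  # assumed statistic of the training strings

def l_ntp(theta):
    """Syntactic proxy loss: match the corpus-average coefficient."""
    return (theta - CORPUS_MEAN) ** 2

def l_semantic(theta):
    """Semantic loss: 0 if the map x -> theta*x is stable, else 1."""
    return 0.0 if abs(theta) < 1.0 else 1.0

theta, eps, lr = 0.95, 1e-4, 0.5
# Central-difference estimate of the NTP gradient at theta = 0.95.
g_ntp = (l_ntp(theta + eps) - l_ntp(theta - eps)) / (2 * eps)
theta_after = theta - lr * g_ntp  # one NTP gradient-descent step

# The step lowers the syntactic loss but carries theta across the
# bifurcation at a = 1, so the semantic loss jumps from 0.0 to 1.0.
```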

[Figure: in a smooth domain, the NTP gradient and the semantic gradient in parameter space $\theta$ are aligned; in a non-smooth domain, they are misaligned.]

The consequence

Gradient descent in token space is semantically blind. It optimises a proxy that has lost its connection to the quantity of interest. A model can keep improving its perplexity on equation strings while learning nothing about the dynamics those equations describe.

Why RL and Search Succeed Where NTP Fails

Bypassing the non-smooth map by evaluating semantics directly

RL bypasses the map

Reinforcement learning computes a learning signal that originates in semantic space, not syntactic space. Policy gradient methods (REINFORCE, PPO) work by:

1. Sample: generate a complete candidate equation, protein sequence, or game trajectory.

2. Evaluate semantically: simulate the equation, predict the protein's 3D structure, play out the game to a terminal state.

3. Use as reward: weight the gradient of the generation probability by the semantic quality of the result.

This never differentiates through $\phi$. It requires $\phi$ to be computable, not smooth. A single changed coefficient that produces chaos instead of stability will receive a low reward — something NTP gradients cannot detect.
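The three steps can be sketched with a minimal REINFORCE loop (an illustrative toy, not the paper's experiment): the policy emits a single digit $d$, read as the coefficient $a = d/5$ of the map $x \mapsto a x$, and the reward comes from simulating the dynamics, so the learning signal originates in semantic space and never differentiates through $\phi$.

```python
import math, random

random.seed(0)
logits = [0.0] * 10  # policy parameters over digits 0..9

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reward(d):
    """Semantic evaluation: simulate x -> (d/5)*x; reward stability."""
    a, x = d / 5.0, 1.0
    for _ in range(100):
        x *= a
    return 1.0 if abs(x) < 1.0 else 0.0

lr = 0.5
for _ in range(2000):
    p = softmax(logits)
    d = random.choices(range(10), weights=p)[0]    # 1. sample a candidate
    r = reward(d)                                  # 2. evaluate semantically
    for k in range(10):                            # 3. reward-weighted update
        indicator = 1.0 if k == d else 0.0
        logits[k] += lr * r * (indicator - p[k])   # grad of log p(d)

# After training, probability mass concentrates on the stable digits 0..4.
```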

Search handles rugged landscapes

Even with a semantic reward signal, the reward landscape over candidate outputs can be rugged: many local optima, plateaus, and discontinuities. A model generating equations token-by-token faces a combinatorial space where most paths lead to semantically poor outputs, and the few good outputs are separated by large syntactic distances.

Search methods — Monte Carlo tree search, beam search with semantic scoring, evolutionary algorithms — explore the output space globally rather than following local gradients. MCTS evaluates many candidate continuations through simulation, then selects the most promising branch. It does not need the value function to be smooth.
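One of these, beam search with semantic scoring, can be sketched on a hypothetical toy domain where digit strings denote a coefficient $a$ of the linear map $x \mapsto a x$: candidates are grown token by token and ranked by the simulated behaviour of the system they describe, not by syntactic likelihood.

```python
# Hypothetical beam search guided by a semantic evaluator.

def behaviour(s, n=20):
    """Semantics of digit string "dddd": |x_n| for x -> 0.dddd * x."""
    a = float("0." + s) if s else 0.0
    x = 1.0
    for _ in range(n):
        x *= a
    return x

def beam_search(target, length=4, width=3):
    """Grow strings digit by digit, keeping the `width` candidates whose
    simulated behaviour is closest to `target`."""
    beams = [""]
    for _ in range(length):
        candidates = [b + d for b in beams for d in "0123456789"]
        candidates.sort(key=lambda s: abs(behaviour(s) - target))
        beams = candidates[:width]
    return beams[0]

# Searching for a coefficient whose 20-step behaviour lands near 0.01
# recovers a ~ 0.794, since 0.794**20 is close to 0.01.
best = beam_search(target=0.01)
```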

Self-play as adaptive curriculum

Non-smooth domains have vast semantic spaces that no fixed training corpus can cover. In dynamical systems, the boundaries between behaviours are fractal in structure. Self-play — or its analogues, like iterative refinement with simulation feedback — generates training signal where the model currently fails, concentrating learning on semantically difficult regions.

RL + search is more than additive.

RL trains a value function that learns to approximate the non-smooth syntax-semantics map through direct experience. Search then uses this learned value function to explore efficiently. The value function does not need $\phi$ to be globally smooth — it builds a local approximation from semantic evaluations, effective precisely in the regions the search visits.

The Paradigm Spectrum

Ordered by how much of the semantic evaluation they internalise

1. Next-Token Prediction. Learning signal is purely syntactic. Requires $\phi$ to be smooth for semantic learning to occur as a byproduct of compression: the model learns because syntactic patterns are semantic patterns.

2. Behavioural Fine-Tuning. Adds a semantic loss (simulation error, structural accuracy), but still uses gradient-based optimisation through the model. Requires $\phi$ to be at least locally smooth near good solutions.

3. RL with Semantic Reward. Learning signal is semantic, estimated through sampling. Requires $\phi$ to be evaluable, not smooth: handles non-smooth maps but can struggle with sparse or deceptive rewards.

4. Search with Learned Value Function. The AlphaGo paradigm: explores the output space globally, guided by a learned semantic evaluator, and requires no smoothness at all. Maximally robust to non-smoothness, this is what dynamical systems require.

Observed in practice

Language: NTP alone (smooth $\phi$). Code: NTP + execution feedback (moderately non-smooth). Protein structure: NTP + geometric supervision (non-smooth). Game playing: NTP + RL + MCTS (highly non-smooth). Dynamical systems: the full stack.

Evidence from Within Language Itself

The history of LLM development traces the paradigm progression predicted by the hypothesis

Language is not uniformly smooth — it contains regions of varying perturbation sensitivity. The evolution of large language models traces a path through exactly the paradigm progression predicted by the smoothness hypothesis.

1. NTP era (GPT-2), low sensitivity. Pure NTP produces fluent text, coherent paragraphs, and basic factual recall. These are the smooth regions of language: tasks where most token substitutions preserve essential meaning, and where compression is a reliable proxy for understanding.

2. RLHF era (InstructGPT / ChatGPT), medium sensitivity. Instruction following occupies a less smooth region. “Summarise this” vs “don't summarise this” differs by a single token but flips the required behaviour entirely. RLHF provided a semantic learning signal rooted in human preference judgments, not token prediction.

3. Reasoning era (o1 / extended thinking), high sensitivity. Mathematical reasoning is the least smooth region of language. A single error in a chain of deductions invalidates everything. The latest models use inference-time search — multiple reasoning paths, process reward models, backtracking — structurally analogous to MCTS.

Chain-of-thought as smoothing.

Instead of mapping directly from question to answer — a potentially non-smooth map — the model decomposes into intermediate steps: question $\to$ step$_1$ $\to$ step$_2$ $\to \cdots \to$ answer. Each step is a locally smooth mapping, even when the end-to-end map is not.

However, this conjecture is only directionally correct. If each step $i$ has sensitivity $\kappa_i$, the composed sensitivity is bounded by $\prod_i \kappa_i$, which can exceed the end-to-end sensitivity. Decomposition only works when the intermediate representations genuinely factor the problem into locally smooth sub-problems, rather than merely distributing the non-smoothness across more steps.
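The bound can be stated as a standard composition estimate, under the idealising assumption that each $\kappa_i$ behaves like a local Lipschitz constant for the step map $\phi_i$ (the paper's $\kappa$ is defined over discrete edits, so this is a continuous sketch):

$$d\big(\phi_i(x), \phi_i(x')\big) \leq \kappa_i \, d(x, x') \quad (i = 1, \dots, n)$$ $$\implies\quad d_\mathcal{M}\big(\phi(s), \phi(s')\big) \leq \Big(\prod_{i=1}^{n} \kappa_i\Big) \, d_\mathcal{S}(s, s'), \qquad \phi = \phi_n \circ \cdots \circ \phi_1.$$

The product bound is only informative when each $\kappa_i$ stays close to $1$; otherwise the decomposition merely redistributes the non-smoothness across steps rather than removing it.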

Implications

What the smoothness hypothesis means for AI research

Compression $\neq$ understanding

A special case, not a general law

A prominent view holds that compression and intelligence are equivalent. The smoothness hypothesis reveals this as a special case: compression implies understanding only when the syntax-semantics map is smooth.

In non-smooth domains, a model can achieve excellent compression of symbolic equation strings by learning syntactic patterns without learning anything about the dynamical behaviour those equations describe.

This resolves why scaling protein language models improves sequence metrics without approaching AlphaFold-level structure prediction: the sequence-to-structure map is non-smooth.

Nuancing the bitter lesson

Scale alone does not suffice everywhere

Sutton's “bitter lesson” argues that general methods leveraging computation ultimately outperform methods using human domain knowledge. The smoothness hypothesis adds an important qualification.

In smooth domains, scale alone suffices — more compute, more data, bigger models. The bitter lesson holds in its strongest form.

In non-smooth domains, representation and structure matter — the right inductive biases, search strategies, and compositional languages are prerequisites, not luxuries. The lesson is not equally bitter for all domains.

Conclusion

Finding the right language for a domain is not convenience — it is a prerequisite for learning.

For structure-sensitive domains like dynamical systems, the path forward is not simply scaling autoregressive models. It requires direct semantic evaluation, structured search, and — most importantly — representations engineered to make the syntax-semantics map as smooth as possible.
