Machine learning

From Fermat to Flow Matching

A three-hundred-year path through least action

This started while I was watching a Veritasium video on Maupertuis and the principle of least action. As I watched it, I kept feeling similarities with diffusion models and flow matching, so I tried to follow that intuition from physics into generative modeling.

Jathushan Rajasegaran — UC Berkeley

1. A puzzle about bending light

In 1662, Pierre de Fermat was thinking about a peculiar fact: when light passes from air into water, it bends. The angle of bending obeys a simple rule — Snell's law — that had been known empirically for decades. But Fermat asked a different question: why this rule? What was the light doing?

His answer was strange and beautiful. Light, he proposed, picks the path that takes the least time. Not the shortest path — the fastest one. In air, light moves quickly; in water, more slowly. To get from a point above the surface to a point below, the straight-line path is the shortest, but the quickest path bends at the surface, spending more time in fast air and less in slow water. Work out the geometry, and Snell's law falls out exactly.

It was a strange thing to say. Light, somehow, was choosing. Every photon, before setting out, seemed to survey all possible routes and pick the cheapest one. Fermat himself was uncomfortable with the metaphysics; his contemporaries were worse. But the rule worked.

What no one in 1662 could have guessed is that Fermat had stumbled onto something far larger than optics. The idea that nature picks the cheapest path would, over the next three hundred years, eat physics whole — mechanics, electromagnetism, general relativity, quantum field theory. And, eventually, it would give us a way to derive a 2023 machine learning algorithm from first principles.

This post is about that arc. About how a single idea — nature optimizes — connects a 17th-century optics puzzle to flow matching, the technique used to train modern generative models. We will walk the historical thread quickly, then watch flow matching fall out of the same machinery that Lagrange and Hamilton built for mechanics. The math is well-known to specialists; the view, I think, is worth sharing.

2. Maupertuis and his thrifty universe

Eighty years after Fermat, in 1744, Pierre Louis Maupertuis took the principle and made it cosmic.

Maupertuis was president of the Berlin Academy, a favorite of Frederick the Great, and a man whose ambitions ran ahead of his mathematics. In a paper titled Accord de différentes lois de la nature qui avaient jusqu'ici paru incompatibles — "Agreement of different laws of nature that had hitherto appeared incompatible" — he proposed that Fermat's principle wasn't just about light. It was a universal law. Everything in nature, everywhere, picks the path of least action.

What was "action"? Maupertuis was vague. He defined it as something like $\int m v \, ds$ — mass times velocity, integrated along the path. It worked for some examples and failed for others, but the vision was clear: God had built a thrifty universe. "Nature, in producing its effects, always uses the simplest means."

This was, to put it gently, controversial. Maupertuis got into a vicious priority dispute with followers of Leibniz, who claimed the idea had been Leibniz's all along. Voltaire — never one to miss a fight — wrote a satirical pamphlet, the Diatribe du Docteur Akakia, that mocked Maupertuis mercilessly. Frederick the Great, defending his Academy president, had copies of the pamphlet publicly burned in the streets of Berlin. The whole affair was a mess.

But Maupertuis had the right vision, even when he didn't have the right math.

3. Euler does the math

The math came, quietly, the same year. Leonhard Euler — in the middle of writing what would become one of the most influential treatises in the history of mathematics, the Methodus inveniendi lineas curvas — included an appendix that did, rigorously, what Maupertuis had been waving at.

Euler showed that for a particle moving in a potential field, the quantity $\int v \, ds$ is extremized along the actual trajectory. He gave Maupertuis credit for the idea but did the work. This is a pattern that repeats: Maupertuis announced, Euler proved.

More importantly, Euler invented the tool that made it all possible: the calculus of variations. Hand him a quantity that depends on a path, and he hands back a differential equation whose solutions are the extremal paths. The Euler equation — what we now call the Euler–Lagrange equation — turns the philosophical claim "nature optimizes" into a computational procedure.

Suddenly, "least action" wasn't a slogan. It was a method.

4. Lagrange and the analytical machine

Then came Lagrange.

In 1788, Joseph-Louis Lagrange published Mécanique analytique, a book famous for, among other things, its preface, where he boasted: "No diagrams will be found in this work." Lagrange was determined to reformulate all of mechanics — Newton's laws, planetary motion, vibrating strings, the lot — without recourse to geometric intuition. Just calculus, just symbols.

His central claim was this: for any mechanical system, write down a single function $L = T - V$, where $T$ is the kinetic energy and $V$ is the potential energy. Then the actual motion is the one that extremizes

$$S = \int_{t_0}^{t_1} L\, dt.$$

That's it. Newton's $F = ma$, the orbits of planets, the swing of a pendulum — all of it follows by varying this single integral. The machinery is the same for every problem. You write down $L$, you crank the Euler–Lagrange equation, you get the equation of motion.

This is the moment when "nature is thrifty" becomes industrial. Lagrange's framework would, over the next two centuries, be applied to every physical system anyone could think of. It is, without exaggeration, one of the most powerful ideas in the history of science.

5. Hamilton sees the deep structure

In the 1830s, William Rowan Hamilton — Irish, prodigy, troubled, brilliant — looked at Lagrange's mechanics and noticed something strange.

The equations he was writing down for a particle in a potential looked, mathematically, like the equations for the propagation of light through a medium with varying refractive index. The trajectories of particles and the rays of light obeyed the same kind of equation. There was an analogy between mechanics and optics that ran far deeper than anyone had suspected.

Hamilton's response was to reformulate mechanics yet again, this time in a way that made the optics–mechanics analogy structural. He introduced what we now call the Hamiltonian (the energy, expressed in terms of position and momentum) and a function $\phi(x, t)$ whose gradient gave the momentum field. The level sets of $\phi$ play the role of optical wavefronts, and the function itself evolves according to a partial differential equation now called the Hamilton–Jacobi equation:

$$\partial_t \phi + \frac{1}{2}|\nabla \phi|^2 + V(x) = 0.$$

This is the equation we will meet again, in a moment, in a context Hamilton could not have imagined.

(Worth noting: it was the Hamilton–optics analogy that, ninety years later, gave Schrödinger the idea for wave mechanics. If classical mechanics is like geometric optics, and geometric optics is the high-frequency limit of wave optics, then maybe mechanics is the high-frequency limit of some deeper wave theory. That deeper theory turned out to be quantum mechanics. Hamilton's analogy ran very deep indeed.)

6. Feynman makes it total

Skip to 1948. Richard Feynman, building on his 1942 PhD thesis, proposes that quantum mechanics itself can be reformulated as a least action principle — with a twist.

In Feynman's version, a quantum particle does not pick a single path. It takes all of them. Each path contributes a complex number, $e^{iS/\hbar}$, where $S$ is the classical action of that path. The amplitudes interfere. In the limit where $\hbar$ is small compared to $S$ — that is, for macroscopic objects — the interference is sharply peaked around the path of stationary action. That's the classical path. Newton's $F = ma$ is what you see when quantum interference picks out one trajectory from infinity.

This is the final form of the principle: not "nature picks the least action path" but "nature sums over all paths, weighted by action." Classical mechanics is the limit where the sum collapses to a single trajectory.

By 1948, the arc is complete. Three hundred years after Fermat, every fundamental physical theory we have — classical mechanics, electromagnetism, general relativity, quantum field theory — is written as a least action principle. You pick an action functional, you extremize (or sum over paths), you get the physics.

The question that animates the rest of this post is: why stop at physics?

7. The pivot

Here is the move.

In every example we just walked through — Fermat's light ray, Lagrange's pendulum, Hamilton's wavefronts, Feynman's quantum paths — the configuration that evolves in time is a point, a field, or an amplitude. But the machinery doesn't care. The calculus of variations works on any space of paths. The configuration could be anything that changes in time, as long as we can write down a kinetic energy, a potential, and a meaningful integral.

So: what if the configuration is a probability distribution?

What if we have a density $p(x, t)$ that evolves from $p_0(x)$ at $t = 0$ to $p_1(x)$ at $t = 1$, and we want to know: which evolution is the "right" one? Which one would Maupertuis pick?

This is not an idle question. It is exactly the setting of flow matching, the technique that underlies a large fraction of modern generative AI — from rectified flows to the velocity-field training objectives at the heart of Stable Diffusion 3 and many of its successors. In flow matching, we transport mass from a simple distribution (say, a Gaussian) to a complex one (say, the distribution of photographs of cats) by following a learned velocity field $v(x, t)$. The density evolves according to the continuity equation:

$$\partial_t p + \nabla \cdot (p\, v) = 0,$$

which is just the statement that probability mass is neither created nor destroyed — it only flows.
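To ground this in code, here is a minimal sketch of the most common recipe: straight-line interpolation paths trained with the conditional flow-matching objective (the rectified-flow variant), followed by Euler-step sampling. The tiny network, the toy circle-shaped target, and the step counts are placeholders for illustration, not anyone's reference implementation.

```python
import torch
import torch.nn as nn

# Tiny velocity network v_theta(x, t): input is (x, t) concatenated, output is a velocity.
net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_data(n):
    # Stand-in target distribution p1: points on a circle (instead of photographs of cats).
    angles = 2 * torch.pi * torch.rand(n)
    return torch.stack([angles.cos(), angles.sin()], dim=1)

for step in range(2000):
    x1 = sample_data(256)                     # samples from the target p1
    x0 = torch.randn_like(x1)                 # samples from the source p0, a Gaussian
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                # a point on the straight-line path
    target_v = x1 - x0                        # velocity of that path, d(xt)/dt
    pred_v = net(torch.cat([xt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()  # regress the network onto the path velocity
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: start from Gaussian noise and integrate dx/dt = v(x, t) with Euler steps.
x = torch.randn(1000, 2)
with torch.no_grad():
    for k in range(100):
        t = torch.full((x.shape[0], 1), k / 100)
        x = x + net(torch.cat([x, t], dim=1)) / 100
```

The regression target $x_1 - x_0$ is exactly the velocity of the straight-line path between a noise sample and a data sample, and the sampler is nothing more than integrating the learned ODE $dx/dt = v(x, t)$ from $t = 0$ to $t = 1$.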

There are infinitely many velocity fields $v$ that transport $p_0$ to $p_1$. Which one should we pick?
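To see the non-uniqueness concretely (a standard toy example, not tied to any particular paper): let $p_0 = p_1$ be the standard Gaussian on $\mathbb{R}^2$ and take the rotational field $v(x) = \omega\,(-x_2, x_1)$. Since $\nabla p = -x\, p$ and $\nabla \cdot v = 0$,

$$\nabla \cdot (p\, v) \;=\; p\, \nabla \cdot v \;+\; v \cdot \nabla p \;=\; 0 \;-\; p\,(v \cdot x) \;=\; 0,$$

because $v \cdot x = \omega(-x_2 x_1 + x_1 x_2) = 0$. The density never changes, even though mass is swirling around the origin. Both this $v$ and $v \equiv 0$ carry $p_0$ to $p_1$; the continuity equation alone cannot tell them apart.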

If we are willing to follow Maupertuis — and three hundred years of physics — the answer should be: the one that minimizes some action.

8. Writing down the action

Let's build it.

What is the kinetic energy of a flowing probability distribution? If we think of the density as a continuum of particles, each at position $x$ with velocity $v(x, t)$ and weighted by the density $p(x, t)$, then the total kinetic energy is

$$T \;=\; \frac{1}{2} \int p(x, t)\, |v(x, t)|^2 \, dx.$$

This is the natural choice. It is, literally, the kinetic energy of all the moving probability mass. No magic.
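Said in sampling terms, $T$ is just an expectation under $p$: draw particles and average $\tfrac{1}{2}|v|^2$. A throwaway Monte Carlo check, with a made-up Gaussian density and a made-up constant velocity field:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100_000, 2))        # particles sampled from p(., t): a standard 2D Gaussian
v = lambda x, t: np.full_like(x, 2.0)    # toy velocity field: constant drift (2, 2)

# T = (1/2) E_p[|v|^2], estimated by averaging over the particles
T = 0.5 * np.mean(np.sum(v(x, 0.0) ** 2, axis=1))
print(T)                                 # ~4.0, i.e. 0.5 * (2^2 + 2^2)
```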

For the potential, we have more freedom. In ordinary mechanics, $V(x)$ encodes forces — gravity, springs, electric fields. For probability distributions, what plays the role of a force? A natural choice is a term that pulls the distribution toward (or away from) some reference distribution $p_{\text{ref}}(x, t)$:

$$U \;=\; \lambda \int p(x, t)\, \log \frac{p(x, t)}{p_{\text{ref}}(x, t)} \, dx.$$

This is the Kullback–Leibler divergence, weighted by a coupling constant $\lambda$. When $\lambda = 0$, there is no potential; we just minimize kinetic energy. When $\lambda$ is large, the distribution is strongly attracted toward $p_{\text{ref}}$.

The total action is then

$$S \;=\; \int_0^1 \! \int \left[\,\tfrac{1}{2}\, p\, |v|^2 \;+\; \lambda\, p \log \frac{p}{p_{\text{ref}}}\,\right] dx\, dt.$$

We are free to choose $p_{\text{ref}}$ — it is the "potential landscape" of our probability flow. Different choices, as we will see, recover different generative-modeling regimes.
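Numerically, the whole action is an expectation over particles, averaged along a time grid. Here is a sketch for a deliberately simple path (a unit Gaussian translated at constant speed 3, with a fixed Gaussian reference), chosen so the answer is known in closed form, $S = 4.5 + 1.5\lambda$, and the estimator can be checked. All the specific numbers are illustrative, not part of any algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                   # the coupling constant lambda
n, K = 100_000, 200                         # particles per time slice, time steps

def log_normal(x, mu, sigma):               # log N(x; mu, sigma^2)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Toy path: p_t = N(3t, 1), carried by the constant field v(x, t) = 3,
# with a fixed reference p_ref = N(0, 1).
S = 0.0
for t in (np.arange(K) + 0.5) / K:          # midpoint rule in time
    x = rng.normal(3 * t, 1.0, size=n)      # particles from p_t
    kinetic = 0.5 * 3.0 ** 2                # (1/2)|v|^2 is constant for this field
    kl = np.mean(log_normal(x, 3 * t, 1.0) - log_normal(x, 0.0, 1.0))
    S += (kinetic + lam * kl) / K
print(S)                                    # ~5.25 = 4.5 + 1.5 * 0.5
```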

But there is a constraint. The density $p$ and the velocity $v$ are not independent — they are linked by the continuity equation. Following Lagrange, we enforce the constraint with a Lagrange multiplier $\phi(x, t)$ and minimize the augmented action

$$S \;=\; \int_0^1 \! \int \left[\, \tfrac{1}{2}\, p\, |v|^2 \;+\; \lambda\, p \log \frac{p}{p_{\text{ref}}} \;+\; \phi \big(\partial_t p + \nabla \!\cdot\! (p\, v)\big) \,\right] dx \, dt.$$

This is now a problem Lagrange would have recognized: a single functional, three fields to vary over ($p$, $v$, $\phi$), and a familiar machinery to crank.

9. Cranking the machine

Vary with respect to $v$. The $v$-dependent terms are $\tfrac{1}{2}\, p\, |v|^2$ and $\phi\, \nabla \!\cdot\! (p\, v)$. The latter, after integration by parts, becomes $-p\, \nabla \phi \cdot v$. Setting the variation to zero:

$$p\, v \;-\; p\, \nabla \phi \;=\; 0 \quad\Longrightarrow\quad \boxed{\,v \;=\; \nabla \phi.\,}$$

This is already remarkable. The optimal velocity field is a gradient — the velocity of the probability flow is the gradient of the Lagrange multiplier. Just as in Hamilton's mechanics, where the momentum is the gradient of an action function, here the velocity is the gradient of $\phi$. The mathematical structure is identical.
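For readers who want the step spelled out, the variation of the augmented action in $v$, with boundary terms assumed to vanish, is

$$\delta_v S \;=\; \int_0^1\!\!\int \Big[\, p\, v \cdot \delta v \;+\; \phi\, \nabla\!\cdot\!(p\, \delta v) \,\Big]\, dx\, dt \;=\; \int_0^1\!\!\int p\,\big(v - \nabla \phi\big)\cdot \delta v \;\, dx\, dt,$$

and requiring this to vanish for every perturbation $\delta v$ forces $v = \nabla \phi$ wherever $p > 0$.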

Vary with respect to $p$. The $p$-dependent terms are $\tfrac{1}{2}\, p\, |v|^2$, $\lambda\, p \log(p / p_{\text{ref}})$, and the $\phi$-coupled terms $\phi\, \partial_t p$ and $\phi\, \nabla \!\cdot\! (p\, v)$. After integrating by parts (in time for the first $\phi$ term, in space for the second) and setting the variation to zero, we obtain

$$\tfrac{1}{2} |v|^2 \;+\; \lambda \left(\log \frac{p}{p_{\text{ref}}} + 1\right) \;-\; \partial_t \phi \;-\; \nabla \phi \cdot v \;=\; 0.$$

Substituting $v = \nabla \phi$ and rearranging:

$$\boxed{\;\partial_t \phi \;+\; \tfrac{1}{2}\, |\nabla \phi|^2 \;=\; \lambda \left(\log \frac{p}{p_{\text{ref}}} \;+\; 1\right).\;}$$

Stop and look at this equation. It is the Hamilton–Jacobi equation, with a potential on the right-hand side that depends on the density itself. Hamilton wrote down this equation in 1834 for a particle in a potential field. Here it is again, almost two centuries later, governing the evolution of $\phi$ in a probability flow.

Vary with respect to $\phi$. This simply gives back the continuity equation. So our complete dynamics are:

1. The velocity is the gradient of a potential: $\;v = \nabla \phi.$
2. The potential evolves by a modified Hamilton–Jacobi equation: $\;\partial_t \phi + \tfrac{1}{2} |\nabla \phi|^2 = \lambda\, \big(\log (p / p_{\text{ref}}) + 1\big).$
3. The density evolves by continuity: $\;\partial_t p + \nabla \!\cdot\! (p\, \nabla \phi) = 0.$

Three coupled equations, three unknowns, all derived from a single action principle. Hamilton would have smiled.
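As a quick sanity check on the system, take $\lambda = 0$ and the ansatz $\phi(x, t) = c \cdot x - \tfrac{1}{2}|c|^2\, t$ for a constant vector $c$. Then

$$\partial_t \phi + \tfrac{1}{2}|\nabla \phi|^2 \;=\; -\tfrac{1}{2}|c|^2 + \tfrac{1}{2}|c|^2 \;=\; 0, \qquad v = \nabla \phi = c,$$

and the continuity equation just translates the density rigidly, $p(x, t) = p_0(x - ct)$. Uniform translation at constant speed is, reassuringly, a least-action flow.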

10. The knob

The parameter $\lambda$ is a knob. Turning it sweeps us through a family of generative-modeling problems that, at first glance, look unrelated but are revealed by the action principle to be the same thing with different settings.

$\lambda = 0$ — Optimal transport. With no potential, the action is pure kinetic energy. We are asking: what is the least-effort way to transport mass from $p_0$ to $p_1$? This is the Benamou–Brenier formulation of optimal transport, written down in 2000 and the cornerstone of modern Wasserstein geometry. The Hamilton–Jacobi equation collapses to $\partial_t \phi + \tfrac{1}{2}|\nabla \phi|^2 = 0$ — exactly Hamilton's equation for a free particle. The optimal flows are geodesics in Wasserstein space: straight lines, in a suitable sense, between distributions.

$\lambda > 0$, $p_{\text{ref}}$ uniform — Entropic optimal transport. The KL term collapses to negative entropy. We now want transport that is cheap and keeps the distribution spread out. This is the regime studied by Cuturi, Peyré, and many others. Entropic OT has been the workhorse of computational optimal transport for the past decade.

$\lambda > 0$, $p_{\text{ref}}$ the law of Brownian motion — Schrödinger bridges. This is the most beautiful special case. The Schrödinger bridge problem, posed by Erwin Schrödinger in 1931, asks: given a prior diffusion (say, Brownian motion) and observed densities at two endpoints, what is the most likely path the system took between them? The answer turns out to be precisely a flow of the form we just derived, with $p_{\text{ref}}$ taken as the law of Brownian motion. And — this is the punchline — the Schrödinger bridge is the cleanest mathematical link between flow matching and score-based diffusion models. When De Bortoli and collaborators framed score-based diffusion in 2021 as a diffusion Schrödinger bridge, they were really showing that diffusion models live at a particular point on our $\lambda$-dial.

$\lambda \to \infty$ — The static limit. The kinetic term becomes negligible. The distribution just follows $p_{\text{ref}}$, ignoring the endpoints. Uninteresting, but instructive: it tells you that $\lambda$ trades off "follow the reference" against "actually move mass to where we want it to end up."

| $\lambda$ | $p_{\text{ref}}$ | Regime |
|---|---|---|
| $0$ | — | Optimal transport (Benamou–Brenier) |
| $>0$ | uniform | Entropic OT |
| $>0$ | Brownian motion | Schrödinger bridge ≈ score-based diffusion |
| $\to \infty$ | any | Pure reference (static limit) |

One action, one knob, four regimes. Optimal transport, entropic OT, Schrödinger bridges, and the static limit all fall out of the same variational principle by changing $\lambda$ and $p_{\text{ref}}$.
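To make the $\lambda = 0$ row concrete: in one dimension the optimal map between two Gaussians is a monotone rescaling, and the Benamou–Brenier geodesic moves every particle in a straight line, at constant speed, from $x_0$ to $T(x_0)$. A small sketch (the two Gaussians are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=100_000)     # particles from p0 = N(0, 1)

def T(x):                                    # monotone OT map from N(0, 1) to N(3, 0.5^2)
    return 3.0 + 0.5 * x

# Displacement interpolation: every particle travels in a straight line at constant speed.
for t in (0.0, 0.5, 1.0):
    xt = (1 - t) * x0 + t * T(x0)
    print(f"t={t}: mean={xt.mean():.2f}, std={xt.std():.2f}")
# mean goes 0 -> 1.5 -> 3 and std goes 1 -> 0.75 -> 0.5: the Wasserstein geodesic
```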

This is the unification I was gesturing at in the opening. The view from least action is not new mathematics — every one of these regimes has been derived and re-derived from its own angle, often by people who would not have framed their work in these terms. But it is, I think, the cleanest way to see how they fit together. Three hundred years after Maupertuis, the principle still has the power to make different-looking problems look the same.

11. Coda

There is something funny about all this.

Maupertuis was, in many ways, wrong. He was wrong about the precise form of the action. He was wrong about the theology — the universe, as far as we can tell, was not designed to be efficient by anyone. He was wrong about Leibniz. He was bad at math and worse at politics. Voltaire was, by most accounts, basically right to mock him.

But he was right about one thing, and it was a big thing.

He was right that nature, somehow, picks the elegant path. That whenever something must change — light bending at the water's surface, a planet orbiting the sun, a quantum particle interfering with itself, a probability distribution morphing from noise into a photograph of a cat — there is a quantity, the action, that the change extremizes. He could not have anticipated what the configurations would turn out to be, or what the actions would look like, or that one of them would be a generative model running on a GPU. But he was right that the principle would generalize. He was right that nature is thrifty in all its actions.

Flow matching, derived from least action, is the latest entry in a three-hundred-year list. It will not be the last.

A note on "least" versus "stationary"

A pedant's footnote, which physicists will demand: the principle is more accurately called the principle of stationary action. The physical path is often a saddle point of the action rather than a minimum. "Least action" persists as a name for historical reasons — Maupertuis really did want a minimum — and because in the simplest examples (a free particle, a geodesic) the action is genuinely minimized. Everything in this post goes through with "stationary" substituted for "least"; the rhetorical convenience of the older name is too good to give up.

Further reading

- Flow matching. Lipman, Chen, Ben-Hamu, Nickel, Le, Flow Matching for Generative Modeling, ICLR 2023.
- Dynamic optimal transport. Benamou and Brenier, A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem, 2000. The original dynamical formulation that all of this descends from.
- Schrödinger bridges. Léonard, A survey of the Schrödinger problem and some of its connections with optimal transport, 2014. The clearest modern survey.
- Diffusion meets Schrödinger. De Bortoli, Thornton, Heng, Doucet, Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling, NeurIPS 2021.
- Optimal transport, the textbook. Villani, Optimal Transport: Old and New, Springer 2009. Heavy but the canonical reference.
- The history. Yourgrau and Mandelstam, Variational Principles in Dynamics and Quantum Theory (Dover reprint, 1979), is the classic short history of least action from Fermat through Feynman.