The Cross-Universe Symbolic Regression Tournament: Survival of the Fittest Laws

Joshua Kyan Aalampour

The Big Idea

Instead of trying to guess the right equation, CU-SRT lets equations compete for survival.

CU-SRT is a tournament-style algorithm for discovering scientific laws from data. Instead of treating overfitting as a vice, we embrace it as a generative force. The basic idea is to overfit on many independent datasets (“universes”), then pit those equations against each other. The ones that can’t generalize get eliminated, and the true law survives.

Phase A overfits locally in each universe, creating a diverse candidate pool. Phase B cross-tests every candidate against every universe and eliminates the weak. Phase C crowns a champion. The master equation:

$$\mathcal{L}^\star = \arg\max_{\varphi \in \mathcal{F}} \left\{ \bar{G}(\varphi) - \lambda\, \ell(\varphi) \right\}$$

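The objective rewards cross-universe fitness $\bar{G}(\varphi)$ and penalizes complexity $\ell(\varphi)$, with the weight $\lambda$ setting the trade-off between the two. Natural selection as an argmax. The full derivation, convergence proofs, and all theoretical guarantees are in the PDF.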
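The three phases can be sketched in a few lines of Python. This is only a toy illustration, not the paper's implementation: the ground-truth law `2*x**2`, the hand-written candidate pool standing in for Phase A, the fitness `1 / (1 + MSE)`, the token-count lengths, and the elimination threshold `tau` are all assumptions made for this example.

```python
from statistics import mean

def true_law(x):
    # Hypothetical ground-truth law, used only to generate toy data.
    return 2.0 * x ** 2

def make_universe(lo, hi, n=21):
    """One 'universe': noise-free samples of the law on its own x-range."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return [(x, true_law(x)) for x in xs]

UNIVERSES = [make_universe(0.0, 1.0), make_universe(1.0, 2.0), make_universe(2.0, 3.0)]

# Stand-in for Phase A: a hand-written pool instead of real local overfitting.
# Each entry is (expression, callable, assumed length in tokens).
CANDIDATES = [
    ("2*x**2", lambda x: 2.0 * x ** 2, 5),
    ("2*x", lambda x: 2.0 * x, 3),
    ("4*x - 2", lambda x: 4.0 * x - 2.0, 5),
    ("2*x**2 + 0.01*x**3", lambda x: 2.0 * x ** 2 + 0.01 * x ** 3, 11),
]

def fitness(f, universe):
    """Assumed per-universe fitness in (0, 1]: 1 / (1 + mean squared error)."""
    mse = mean((f(x) - y) ** 2 for x, y in universe)
    return 1.0 / (1.0 + mse)

def run_tournament(candidates, universes, lam=0.02, tau=0.5):
    """Phase B: cross-test every candidate on every universe and drop those
    whose mean fitness falls below tau.
    Phase C: return the survivor that maximizes G_bar - lam * length."""
    survivors = []
    for name, f, length in candidates:
        g_bar = mean(fitness(f, u) for u in universes)
        if g_bar >= tau:
            survivors.append((g_bar - lam * length, name))
    if not survivors:
        raise ValueError("every candidate was eliminated; lower tau")
    return max(survivors)[1]

if __name__ == "__main__":
    print(run_tournament(CANDIDATES, UNIVERSES))
```

In this toy setup the two local approximations (`2*x` and `4*x - 2`) fit one x-range tolerably and collapse on the others, so they are eliminated in Phase B. The longer expression survives cross-testing but loses to the shorter law on the complexity penalty.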

[Figures: Pipeline; Cross-Testing Schematic]

Live Tournament Simulator

Select an equation to discover, then hit Run Tournament. Watch the terminal as CU-SRT evaluates candidates across universes and eliminates the impostors.

[Interactive simulator. Stages: Mutate, Select, Output. Controls: Universes (shown at 5) and λ (complexity) (shown at 0.020). Results table columns: φ(x), U1 to U5, G̅, T. Terminal session shown: cusrt -- newton_s_cooling_law]

Things to try:

Increase the number of universes to 7 or 8. Notice how adding universes makes it almost impossible for a specialist to hide. This is the exponential decay guarantee in action.
Crank the complexity weight ($\lambda$) up to 0.05. Watch how longer expressions get penalized even if they’re accurate. Set it to 0 and see pure accuracy without parsimony.
Run multiple tournaments with different seeds. The true law (the one with consistent cross-universe fitness) should win nearly every time.
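To see why the complexity weight matters, it helps to compute where two candidates tie under the objective $\bar{G}(\varphi) - \lambda\,\ell(\varphi)$. The sketch below does that for two made-up candidates; the fitness values and lengths are hypothetical numbers chosen for illustration, not simulator output.

```python
def score(g_bar, length, lam):
    """Tournament objective for one candidate: fitness minus complexity penalty."""
    return g_bar - lam * length

def break_even_lambda(g_long, len_long, g_short, len_short):
    """Value of lambda at which the two scores tie.
    Above it, the shorter expression wins despite lower fitness."""
    return (g_long - g_short) / (len_long - len_short)

# Hypothetical candidates: (cross-universe fitness, expression length).
LONG = (0.99, 11)   # slightly more accurate, much longer
SHORT = (0.95, 5)   # slightly less accurate, much shorter

def winner(lam):
    return "long" if score(*LONG, lam) > score(*SHORT, lam) else "short"

if __name__ == "__main__":
    print(break_even_lambda(*LONG, *SHORT))
    for lam in (0.0, 0.02, 0.05):
        print(lam, winner(lam))
```

With these numbers the tie sits well below 0.02, so the longer expression only wins when the penalty is switched off or nearly so.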

Key Guarantees

Exponential decay. The probability that a spurious formula survives decays exponentially with the number of universes. For any non-true candidate $\tilde{\varphi}$ deviating from $\mathcal{L}$ by at least $\Delta > 0$:

$$\Pr\!\big\{\bar{G}(\tilde{\varphi}) \geq \bar{G}(\mathcal{L}) - \zeta\big\} \leq \exp(-2N\zeta^2)$$

More universes means exponentially less chance of being fooled.
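A quick way to get a feel for the bound is to evaluate its right-hand side, $\exp(-2N\zeta^2)$, for a few values of $N$. The value $\zeta = 0.3$ below is an arbitrary choice for illustration; the paper defines how $\zeta$ is actually set.

```python
import math

def survival_bound(n_universes, zeta):
    """Right-hand side of the bound: exp(-2 * N * zeta^2)."""
    return math.exp(-2.0 * n_universes * zeta ** 2)

if __name__ == "__main__":
    zeta = 0.3  # hypothetical margin, chosen only for this example
    for n in (5, 7, 8, 20):
        print(n, round(survival_bound(n, zeta), 4))
```

Each added universe multiplies the bound by the same factor, $\exp(-2\zeta^2)$, which is what "exponential decay" means here.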

Finite sample guarantee. If the number of universes satisfies:

$$N \geq \frac{\log|\mathcal{C}| + \log(1/\beta)}{2\Delta^2}$$

then CU-SRT selects the true law $\mathcal{L}$ with probability at least $1 - \beta$.
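The condition above translates directly into a small calculator for the number of universes needed. The function name and the example inputs (a pool of 1000 candidates, $\beta = 0.05$, and two values of $\Delta$) are illustrative assumptions.

```python
import math

def min_universes(pool_size, beta, delta):
    """Smallest integer N with N >= (log|C| + log(1/beta)) / (2 * delta^2)."""
    if pool_size < 1 or not 0.0 < beta < 1.0 or delta <= 0.0:
        raise ValueError("need pool_size >= 1, 0 < beta < 1, delta > 0")
    bound = (math.log(pool_size) + math.log(1.0 / beta)) / (2.0 * delta ** 2)
    return math.ceil(bound)

if __name__ == "__main__":
    print(min_universes(1000, 0.05, 0.5))   # wide gap between law and impostors
    print(min_universes(1000, 0.05, 0.25))  # half the gap, four times the bound
```

The pool size enters only through its logarithm, while the gap $\Delta$ enters squared, so a small gap is far more costly than a large candidate pool.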

Geometric contraction. With adaptive thresholds, the candidate pool after $t$ rounds satisfies $|\mathcal{C}^{(t)}| \leq |\mathcal{C}^{(1)}|(1-q)^t$, decaying geometrically.
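As a rough illustration of the contraction bound, the snippet below counts how many rounds it takes for the upper bound $|\mathcal{C}^{(1)}|(1-q)^t$ to drop to a single candidate. Reading $q$ as a per-round elimination fraction is an interpretation of the formula, and the pool size and $q$ values are made up for the example.

```python
def pool_bound(initial_size, q, t):
    """Upper bound on the pool size after t rounds: |C1| * (1 - q)^t."""
    return initial_size * (1.0 - q) ** t

def rounds_until(initial_size, q, target=1.0):
    """Smallest t for which the bound is at or below `target`."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    t = 0
    while pool_bound(initial_size, q, t) > target:
        t += 1
    return t

if __name__ == "__main__":
    print(rounds_until(1000, 0.5))
    print(rounds_until(1000, 0.2))
```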

Optional Extensions

The paper introduces four plug-in modules, each preserving all theoretical guarantees:

Universe-Weighted Scores. Noisier universes get down-weighted via inverse-variance weighting, so data-rich, clean universes steer the tournament.
Stochastic Grammar Annealing. Useful primitives get sampled more often. Useless operators are demoted but never deleted, preserving exploration.
Causal-Graph Pruning. Equations that violate known causal sign constraints are culled before cross-universe testing even begins.
Bayesian Tournament Scoring. Replace accuracy with full Bayesian marginal likelihood, injecting an automatic Occam factor.
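Of the four modules, universe-weighted scoring is the simplest to sketch. The snippet below is a minimal version of inverse-variance weighting, assuming each universe comes with a known noise variance; the function name and the example numbers are hypothetical, and the paper's exact formulation may differ.

```python
def weighted_fitness(fitness_per_universe, noise_variances):
    """Inverse-variance weighted mean of per-universe fitness values.
    Each universe gets weight 1 / variance, so noisy universes count less."""
    if len(fitness_per_universe) != len(noise_variances):
        raise ValueError("need one variance per universe")
    weights = [1.0 / v for v in noise_variances]
    total = sum(weights)
    return sum(w * g for w, g in zip(weights, fitness_per_universe)) / total

if __name__ == "__main__":
    # A clean universe (variance 0.01) and a noisy one (variance 1.0).
    print(weighted_fitness([0.9, 0.5], [0.01, 1.0]))
```

With equal variances this reduces to the plain average, so the unweighted tournament is recovered as a special case.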

See the PDF for complete formulations and proofs of all extensions.
