
Why Collapse Is an Optimization Problem (we just added a predictor)

Note: written and developed in parallel with “Entropy’s Flaw”

In the previous post, we argued that entropy-based self-supervised learning (SSL) fails for P¬P because entropy is a statistical constraint, not a geometric one. High entropy does not guarantee meaningful distances, and uncertainty bands collapse silently even when losses remain finite.

That critique lives at the level of objectives.

This post addresses a different, orthogonal failure mode (the two analyses were developed in parallel):

Even with a well-designed geometric objective, training can still collapse, because collapse can be a stable fixed point of optimization.

This distinction matters.

And it is exactly where the predictor head enters the picture.


1. Collapse is not just a bad solution — it can be a stable attractor

Consider a non-contrastive SSL setup:

$$\mathcal{L}_{\text{align}} = \mathbb{E}[1 - z_s \cdot z_t]$$

The collapsed solution

$$z(x) = c \quad \forall x$$

is trivial and obvious.

What is less obvious — and more dangerous — is this:

Collapse is often not only a minimum, but a dynamically stable one.

That means:

- once the representation drifts near the collapsed solution, gradients pull it further in rather than pushing it out;
- the alignment loss stays finite (and often small) throughout, so nothing in the loss curve signals the failure.

This is why collapse often appears late, quietly, and irreversibly.
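The fixed-point claim can be checked numerically. The sketch below (NumPy, with illustrative dimensions) evaluates the cosine alignment loss and its analytic gradient at the collapsed solution z(x) = c: the loss sits at its global minimum and the gradient with respect to the raw embeddings vanishes, so gradient descent has no reason to leave.

```python
import numpy as np

def cos_align_loss(zs, zt):
    # L_align = E[1 - cos(z_s, z_t)]: cosine alignment between the two views
    zs_n = zs / np.linalg.norm(zs, axis=1, keepdims=True)
    zt_n = zt / np.linalg.norm(zt, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(zs_n * zt_n, axis=1)))

def grad_wrt_zs(zs, zt):
    # analytic gradient of the loss above w.r.t. the raw (unnormalized) z_s
    ns = np.linalg.norm(zs, axis=1, keepdims=True)
    zs_n = zs / ns
    zt_n = zt / np.linalg.norm(zt, axis=1, keepdims=True)
    cos = np.sum(zs_n * zt_n, axis=1, keepdims=True)
    return -(zt_n - cos * zs_n) / ns / zs.shape[0]

rng = np.random.default_rng(0)
c = rng.normal(size=8)
collapsed = np.tile(c, (16, 1))   # z(x) = c for every input x

loss = cos_align_loss(collapsed, collapsed)
grad_norm = float(np.linalg.norm(grad_wrt_zs(collapsed, collapsed)))
print(loss, grad_norm)   # both ~0: collapse minimizes the loss AND kills the gradient
```

Zero loss and zero gradient together are exactly what makes collapse a critical point that alignment alone cannot escape.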


2. Why geometric regularizers alone are not enough

In P¬P SSL, we explicitly add geometric regularizers:

- a variance / covariance term that enforces global spread of the embedding;
- a repulsion term that enforces inter-sample topology.

These remove collapse as a global optimum.

But this does not automatically remove collapse as a stable trajectory.

Empirically, we observed the following failure mode repeatedly:

- every regularizer term remains numerically satisfied;
- the total loss stays finite and stable;
- yet pairwise distances between embeddings shrink until they carry almost no signal.

Formally, the system settles into a narrow manifold that satisfies all constraints but preserves almost no semantic distance.

This revealed a deeper issue:

Preventing collapse as a solution is not the same as preventing collapse as a training outcome
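As one concrete instance of such a regularizer, here is a VICReg-style variance hinge (a sketch; the threshold `gamma` and the stabilizer `eps` are illustrative, not P¬P's actual values). It assigns the collapsed solution a large penalty, removing it as a global optimum, while a well-spread batch pays almost nothing:

```python
import numpy as np

def variance_penalty(z, gamma=1.0, eps=1e-4):
    # hinge on the per-dimension std: penalize any embedding dimension
    # whose spread across the batch falls below gamma
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 8))                  # healthy batch, per-dimension std ~ 1
collapsed = np.tile(rng.normal(size=8), (256, 1))   # z(x) = c: zero variance everywhere

print(variance_penalty(spread))      # near 0: constraint satisfied
print(variance_penalty(collapsed))   # near gamma: collapse is heavily penalized
```

This changes where the minima are; as the rest of this section argues, it does not by itself change how optimization moves.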


3. The missing piece: optimization asymmetry

Most non-contrastive SSL methods share a structural asymmetry:

- the student branch is updated by gradient descent;
- the teacher branch receives no gradient (stop-gradient), typically tracking the student as an exponential moving average.

However, if the student embedding is matched directly to the teacher embedding, this asymmetry is insufficient.

The reason is subtle but fundamental:

Direct alignment creates a symmetric gradient flow in representation space, even if the networks are not identical.

This symmetry restores collapse as an attractor.


4. Enter the predictor

The predictor is a small network placed only on the student side, after the projector:

$$z_s = g(f(x)), \quad p_s = q(z_s)$$

The loss becomes:

$$\mathcal{L} = \mathbb{E}[1 - p_s \cdot z_t]$$

Crucially:

- the predictor q exists only on the student branch;
- the teacher target z_t receives no gradient.

The predictor is usually:

- shallow (a small MLP), sometimes even linear;
- far too weak to add meaningful representational capacity.

This tells us something important.
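The wiring can be sketched as follows (NumPy, forward pass only; layer sizes are illustrative and the teacher's EMA update is omitted). The predictor sits only between the student projection and the loss, while the teacher branch produces a constant target:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(d_in, d_hid, d_out):
    # weights of a tiny two-layer MLP; stands in for backbone f, projector g, predictor q
    return [rng.normal(scale=0.1, size=(d_in, d_hid)),
            rng.normal(scale=0.1, size=(d_hid, d_out))]

def mlp(w, x):
    return np.maximum(0.0, x @ w[0]) @ w[1]   # ReLU hidden layer, linear output

f_s, g_s, q = init(16, 32, 16), init(16, 32, 8), init(8, 16, 8)
f_t = [w.copy() for w in f_s]   # teacher: frozen (EMA) copy of the student, never receives gradients
g_t = [w.copy() for w in g_s]

x_a = rng.normal(size=(4, 16))  # view A of a batch
x_b = rng.normal(size=(4, 16))  # view B of the same batch

z_s = mlp(g_s, mlp(f_s, x_a))   # student path: z_s = g(f(x))
p_s = mlp(q, z_s)               # predictor ONLY on the student side: p_s = q(z_s)
z_t = mlp(g_t, mlp(f_t, x_b))   # teacher target, treated as a constant (stop-gradient)

def unit(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

loss = float(np.mean(1.0 - np.sum(unit(p_s) * unit(z_t), axis=1)))
print(p_s.shape, z_t.shape, round(loss, 3))
```

Note that p_s and z_t live in the same space: the predictor changes the path the gradient takes, not the shape of the comparison.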


5. What the predictor does not do

The predictor does not succeed by adding representational capacity.

If it did, a linear predictor would fail.

But linear predictors work.

So its role must be different.


6. What the predictor actually does: reshape gradient flow

With a predictor, gradients propagate as:

$$\frac{\partial \mathcal{L}}{\partial z_s} = J_q(z_s)^\top (q(z_s) - z_t)$$

Instead of:

$$\frac{\partial \mathcal{L}}{\partial z_s} = z_s - z_t$$

This matters enormously.

The predictor introduces a learnable Jacobian between alignment pressure and representation space.

As a result:

- the direction of the alignment pull in representation space is modulated by J_q, which is itself learned;
- near collapse, the effective update no longer points toward the constant solution.

The trivial solution still exists, but SGD no longer flows into it.
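The Jacobian claim is easy to verify numerically. The sketch below uses a linear predictor and a squared-error alignment loss (swapped in for the cosine loss so the gradient takes exactly the form J_q^T (q(z_s) - z_t)), and checks the analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d))     # linear predictor q(z) = W z, so J_q(z) = W everywhere
z_s = rng.normal(size=d)
z_t = rng.normal(size=d)        # teacher target, held constant (stop-gradient)

def loss(z):
    # squared-error alignment through the predictor: L = 0.5 * ||q(z) - z_t||^2
    return 0.5 * np.sum((W @ z - z_t) ** 2)

analytic = W.T @ (W @ z_s - z_t)      # the claimed gradient: J_q^T (q(z_s) - z_t)

eps = 1e-6                            # central finite differences for verification
numeric = np.array([(loss(z_s + eps * e) - loss(z_s - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

err = float(np.max(np.abs(analytic - numeric)))
print(err)   # tiny: the predictor's Jacobian really does mediate the gradient
```

Setting `W = np.eye(d)` recovers the direct-alignment gradient z_s - z_t: without a predictor, the pull in representation space is unmodulated.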


7. Collapse as an unstable fixed point

Theoretical analyses (e.g. Tian et al.) show:

With a predictor and stop-gradient, the collapsed solution remains a solution of the loss, but becomes an unstable fixed point of the training dynamics: small perturbations grow away from it instead of decaying back.

This is rare.

Most regularizers change where minima are.

The predictor changes how optimization moves.

Same loss.

Different dynamics.


8. Why this matters for P¬P

P¬P explicitly models an irreducible uncertainty band.

Early in training:

- the band estimates are noisy and poorly calibrated;
- the model cannot yet separate reducible from irreducible uncertainty.

Direct alignment collapses this band too early.

The predictor delays commitment.

It allows:

- the student to match the teacher through a learned transform, rather than in raw representation space;
- the embedding itself to keep its spread while alignment improves.

In P¬P terms:

The predictor preserves epistemic uncertainty during optimization.


9. Objectives shape geometry. Predictors shape trajectories.

This leads to a clean separation of roles:

| Component | Role |
| --- | --- |
| Alignment | Enforces invariance |
| Variance / covariance | Enforces global spread |
| Repulsion | Enforces inter-sample topology |
| Predictor | Destabilizes collapse during training |

The predictor is not a hack. It is an admission that optimization dynamics matter as much as objectives.
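The separation of roles can be assembled into one illustrative objective (a sketch only: the loss weights are arbitrary, the predictor is a random linear stand-in, and the repulsion term from the table is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
z_s = rng.normal(size=(64, 8))                     # student embeddings
z_t = z_s + 0.1 * rng.normal(size=(64, 8))         # teacher targets (another view)
p_s = z_s @ rng.normal(scale=0.3, size=(8, 8))     # stand-in predictor output q(z_s)

def alignment(p, t):
    # enforces invariance: predicted student output should match the teacher target
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

def variance(z, gamma=1.0):
    # enforces global spread: each dimension must keep std above gamma
    return float(np.mean(np.maximum(0.0, gamma - np.sqrt(z.var(axis=0) + 1e-4))))

def covariance(z):
    # decorrelates dimensions: penalize off-diagonal entries of the covariance
    zc = z - z.mean(axis=0)
    c = (zc.T @ zc) / (len(z) - 1)
    return float(np.sum((c - np.diag(np.diag(c))) ** 2) / z.shape[1])

# illustrative weights; each term plays the role listed in the table above
total = alignment(p_s, z_t) + 25.0 * variance(z_s) + 1.0 * covariance(z_s)
print(total)
```

Each term shapes the loss landscape; only the predictor (threaded through `p_s`) shapes the trajectory taken across it.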


10. Takeaway

Entropy-based SSL fails because it confuses statistics with geometry.

Non-contrastive SSL collapses because it ignores dynamics.

P¬P requires both:

- geometric objectives, which define where the good minima are;
- dynamics-aware mechanisms like the predictor, which determine whether training actually reaches them.

The predictor is not about learning better features.

It is about making sure learning happens at all.