Why Collapse Is an Optimization Problem (we just added a predictor)
Note: Written and developed in parallel with “Entropy’s Flaw”.
In the previous post, we argued that entropy-based self-supervised learning (SSL) fails for P¬P because entropy is a statistical constraint, not a geometric one. High entropy does not guarantee meaningful distances, and uncertainty bands collapse silently even when losses remain finite.
That critique lives at the level of objectives.
This post addresses a different, orthogonal failure mode (the two arguments were developed in parallel):
Even with a well-designed geometric objective, training can still collapse — because collapse can be a stable fixed point of optimization
This distinction matters.
And it is exactly where the predictor head enters the picture.
1. Collapse is not just a bad solution — it can be a stable attractor
Consider a non-contrastive SSL setup:
- Two augmented views of the same sample
- A student embedding
- A teacher embedding (EMA or stop-gradient)
- An alignment loss in embedding space
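The setup above can be sketched in a few lines of numpy. This is a toy version, assuming a linear encoder and an EMA teacher; all names, shapes, and constants are illustrative, not the actual P¬P implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear "encoder": student is trained, teacher is an EMA copy
# of the student and receives no gradients.
W_student = 0.1 * rng.normal(size=(4, 8))
W_teacher = W_student.copy()

def ema_update(W_teacher, W_student, tau=0.99):
    # Teacher trails the student slowly; no gradient flows into it.
    return tau * W_teacher + (1.0 - tau) * W_student

x = rng.normal(size=8)
view1 = x + 0.05 * rng.normal(size=8)   # augmented view 1
view2 = x + 0.05 * rng.normal(size=8)   # augmented view 2

z_student = W_student @ view1           # student embedding (gets gradients)
z_teacher = W_teacher @ view2           # teacher embedding (stop-gradient target)
loss = float(np.sum((z_student - z_teacher) ** 2))  # alignment loss
W_teacher = ema_update(W_teacher, W_student)
```

The only asymmetry here is how the two branches are updated: one by gradients, one by averaging.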
The collapsed solution, in which every input maps to the same constant embedding, is trivial and obvious.
What is less obvious — and more dangerous — is this:
Collapse is often not only a minimum, but a dynamically stable one
That means:
- Small perturbations decay
- SGD trajectories flow toward collapse
- Training “works” until representations lose rank
This is why collapse often appears late, quietly, and irreversibly.
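The “quietly” part can be made numerical: the alignment loss alone cannot distinguish a healthy representation from a collapsed one; only a geometric quantity like per-dimension spread can. A minimal sketch with synthetic embeddings (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def alignment_loss(Z1, Z2):
    # Mean squared distance between two views' embeddings.
    return float(np.mean((Z1 - Z2) ** 2))

def view_noise(shape):
    # Small augmentation-like perturbation.
    return 0.01 * rng.normal(size=shape)

# Healthy: embeddings spread over the space.
Z = rng.normal(size=(512, 32))
healthy_loss = alignment_loss(Z + view_noise(Z.shape), Z + view_noise(Z.shape))
healthy_spread = float(Z.std(axis=0).mean())

# Collapsed: every sample sits (almost) at the same point.
C = np.ones((512, 32)) + 0.001 * rng.normal(size=(512, 32))
collapsed_loss = alignment_loss(C + view_noise(C.shape), C + view_noise(C.shape))
collapsed_spread = float(C.std(axis=0).mean())

# Both losses are finite and nearly identical; only the spread
# reveals that one geometry is unusable.
```

Monitoring spread (or rank) rather than loss is what makes the failure visible before it is irreversible.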
2. Why geometric regularizers alone are not enough
In P¬P SSL, we explicitly add:
- variance floors,
- covariance penalties,
- optional soft repulsion.
These remove collapse as a global optimum.
But this does not automatically remove collapse as a stable trajectory.
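To make the first two regularizers concrete, here is a minimal numpy sketch in the spirit of VICReg-style penalties. Function names, the floor value, and the epsilon are illustrative, not the exact P¬P forms:

```python
import numpy as np

def variance_floor_penalty(Z, floor=1.0, eps=1e-4):
    # Hinge on per-dimension std: only dimensions whose spread
    # falls below the floor are penalized.
    std = np.sqrt(Z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, floor - std)))

def covariance_penalty(Z):
    # Off-diagonal covariance energy: pushes embedding dimensions
    # to decorrelate instead of co-varying along one direction.
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / (len(Zc) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum() / Z.shape[1])

rng = np.random.default_rng(0)
spread_Z = rng.normal(size=(1024, 16))     # penalties near zero
collapsed_Z = np.zeros((1024, 16))         # variance penalty near its maximum
```

Note what these penalties do and do not do: they make the collapsed configuration expensive, but they say nothing about the path SGD takes to get there.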
Empirically, we observed the following failure mode repeatedly:
- Alignment pulls representations together
- Variance penalties hold them barely apart
- The system converges to a low-rank, low-curvature equilibrium
- Loss remains finite
- Geometry is unusable
Formally, the system settles into a narrow manifold that satisfies all constraints but preserves almost no semantic distance.
This revealed a deeper issue:
Preventing collapse as a solution is not the same as preventing collapse as a training outcome
3. The missing piece: optimization asymmetry
Most non-contrastive SSL methods share a structural asymmetry:
- The student is updated by gradients
- The teacher is not (EMA or stop-gradient)
However, if the student embedding is matched directly to the teacher embedding, this asymmetry is insufficient.
The reason is subtle but fundamental:
Direct alignment creates a symmetric gradient flow in representation space, even if the networks are not identical.
This symmetry restores collapse as an attractor.
4. Enter the predictor
The predictor is a small network placed only on the student side, after the projector:

q = predictor(z_student)

The loss becomes:

L = || q - sg(z_teacher) ||^2

where sg denotes stop-gradient.
Crucially:
- The teacher never sees q
- The teacher embedding is the target
- The student embedding is not directly aligned
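A minimal sketch of this asymmetry, assuming a purely linear predictor and a squared-error alignment loss (names are illustrative; stop-gradient is mimicked by a plain copy, since numpy has no autograd):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Linear predictor, on the student side only.
W_pred = 0.1 * rng.normal(size=(d, d))

def predictor(z):
    return z @ W_pred.T

def stop_grad(z):
    # Stands in for detaching the teacher target from the graph.
    return z.copy()

z_student = rng.normal(size=(32, d))   # student projector output
z_teacher = rng.normal(size=(32, d))   # teacher projector output

q = predictor(z_student)               # the teacher never sees q
loss = float(np.mean((q - stop_grad(z_teacher)) ** 2))
```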
The predictor is usually:
- shallow,
- low-capacity,
- sometimes even linear.
This tells us something important.
5. What the predictor does not do
The predictor:
- does not add representational power,
- does not encode semantics,
- does not increase expressivity.
If it did, a linear predictor would fail.
But linear predictors work.
So its role must be different.
6. What the predictor actually does: reshape gradient flow
With a predictor, alignment gradients propagate as:

loss → q → predictor → student embedding

Instead of:

loss → student embedding
This matters enormously.
The predictor introduces a learnable Jacobian between alignment pressure and representation space.
As a result:
- alignment pressure is absorbed by the predictor,
- representation space is no longer pulled directly toward collapse,
- collapse becomes dynamically unstable.
The trivial solution still exists — but SGD no longer flows into it.
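The reshaped flow can be written down directly. Assuming a linear predictor W and a squared-error loss (illustrative choices), the pull on the student embedding is filtered through the predictor’s Jacobian, while the predictor itself absorbs alignment pressure through its own gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)              # student embedding
t = rng.normal(size=d)              # stop-gradient teacher target
W = 0.1 * rng.normal(size=(d, d))   # small linear predictor

# Direct alignment, L = ||z - t||^2: the gradient on z points
# straight at the target -- the symmetric pull that enables collapse.
grad_direct = 2.0 * (z - t)

# Through the predictor, L = ||W z - t||^2: the gradient on z is
# filtered through the predictor Jacobian W^T, and the predictor
# receives its own gradient, absorbing alignment pressure.
residual = W @ z - t
grad_through_predictor = 2.0 * W.T @ residual
grad_on_predictor = 2.0 * np.outer(residual, z)
```

With a small W, the pull on z is much weaker than the direct pull, and its direction depends on what the predictor has learned rather than on the target alone.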
7. Collapse as an unstable fixed point
Theoretical analyses (e.g. Tian et al.) show:
- Without a predictor:
  - collapse is a stable fixed point
- With a predictor:
  - collapse remains a solution
  - but becomes unstable
  - non-trivial representations become attractors
This is rare among training interventions.
Most regularizers change where minima are.
The predictor changes how optimization moves.
Same loss.
Different dynamics.
8. Why this matters for P¬P
P¬P explicitly models an irreducible uncertainty band.
Early in training:
- uncertainty must be preserved,
- representations must not prematurely commit,
- geometry must remain flexible.
Direct alignment collapses this band too early.
The predictor delays commitment.
It allows:
- uncertainty to remain extended,
- variance forces to act globally,
- repulsion to sculpt topology over time.
In P¬P terms:
The predictor preserves epistemic uncertainty during optimization.
9. Objectives shape geometry. Predictors shape trajectories.
This leads to a clean separation of roles:
| Component | Role |
|---|---|
| Alignment | Enforces invariance |
| Variance / covariance | Enforces global spread |
| Repulsion | Enforces inter-sample topology |
| Predictor | Destabilizes collapse during training |
The predictor is not a hack. It is an admission that optimization dynamics matter as much as objectives.
10. Takeaway
Entropy-based SSL fails because it confuses statistics with geometry.
Non-contrastive SSL collapses because it ignores dynamics.
P¬P requires both:
- explicit geometric constraints,
- explicit optimization asymmetry.
The predictor is not about learning better features.
It is about making sure learning happens at all.