Why Collapse Is an Optimization Problem (we just added a predictor)
Note: Written and developed in parallel with “Entropy’s Flaw”.
In the previous post, we argued that entropy-based self-supervised learning (SSL) fails for P¬P because entropy is a statistical constraint, not a geometric one. High entropy does not guarantee meaningful distances, and uncertainty bands collapse silently even when losses remain finite.
That critique lives at the level of objectives.
This post addresses a different, orthogonal failure mode (the two arguments were developed in parallel):
Even with a well-designed geometric objective, training can still collapse — because collapse can be a stable fixed point of optimization
This distinction matters.
And it is exactly where the predictor head enters the picture.
1. Collapse is not just a bad solution — it can be a stable attractor
Consider a non-contrastive SSL setup:
- Two augmented views of the same sample
- A student embedding
- A teacher embedding (EMA or stop-gradient)
- An alignment loss in embedding space
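The setup above can be sketched in a few lines of numpy. This is a toy version, assuming a linear encoder and an EMA teacher; all names, shapes, and constants are illustrative, not the actual P¬P implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear "encoder": student is trained, teacher is an EMA copy
# of the student and receives no gradients.
W_student = 0.1 * rng.normal(size=(4, 8))
W_teacher = W_student.copy()

def ema_update(W_teacher, W_student, tau=0.99):
    # Teacher trails the student slowly; no gradient flows into it.
    return tau * W_teacher + (1.0 - tau) * W_student

x = rng.normal(size=8)
view1 = x + 0.05 * rng.normal(size=8)   # augmented view 1
view2 = x + 0.05 * rng.normal(size=8)   # augmented view 2

z_student = W_student @ view1           # student embedding (gets gradients)
z_teacher = W_teacher @ view2           # teacher embedding (stop-gradient target)
loss = float(np.sum((z_student - z_teacher) ** 2))  # alignment loss
W_teacher = ema_update(W_teacher, W_student)
```

The only asymmetry here is how the two branches are updated: one by gradients, one by averaging.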
The collapsed solution, in which every input maps to the same constant embedding, is trivial and obvious.
What is less obvious — and more dangerous — is this:
Collapse is often not only a minimum, but a dynamically stable one
That means:
- Small perturbations decay
- SGD trajectories flow toward collapse
- Training “works” until representations lose rank
This is why collapse often appears late, quietly, and irreversibly.
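The “quietly” part can be made numerical: the alignment loss alone cannot distinguish a healthy representation from a collapsed one; only a geometric quantity like per-dimension spread can. A minimal sketch with synthetic embeddings (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def alignment_loss(Z1, Z2):
    # Mean squared distance between two views' embeddings.
    return float(np.mean((Z1 - Z2) ** 2))

def view_noise(shape):
    # Small augmentation-like perturbation.
    return 0.01 * rng.normal(size=shape)

# Healthy: embeddings spread over the space.
Z = rng.normal(size=(512, 32))
healthy_loss = alignment_loss(Z + view_noise(Z.shape), Z + view_noise(Z.shape))
healthy_spread = float(Z.std(axis=0).mean())

# Collapsed: every sample sits (almost) at the same point.
C = np.ones((512, 32)) + 0.001 * rng.normal(size=(512, 32))
collapsed_loss = alignment_loss(C + view_noise(C.shape), C + view_noise(C.shape))
collapsed_spread = float(C.std(axis=0).mean())

# Both losses are finite and nearly identical; only the spread
# reveals that one geometry is unusable.
```

Monitoring spread (or rank) rather than loss is what makes the failure visible before it is irreversible.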
2. Why geometric regularizers alone are not enough
In P¬P SSL, we explicitly add:
- variance floors,
- covariance penalties,
- optional soft repulsion.
These remove collapse as a global optimum.
But this does not automatically remove collapse as a stable trajectory.
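To make the first two regularizers concrete, here is a minimal numpy sketch in the spirit of VICReg-style penalties. Function names, the floor value, and the epsilon are illustrative, not the exact P¬P forms:

```python
import numpy as np

def variance_floor_penalty(Z, floor=1.0, eps=1e-4):
    # Hinge on per-dimension std: only dimensions whose spread
    # falls below the floor are penalized.
    std = np.sqrt(Z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, floor - std)))

def covariance_penalty(Z):
    # Off-diagonal covariance energy: pushes embedding dimensions
    # to decorrelate instead of co-varying along one direction.
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / (len(Zc) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum() / Z.shape[1])

rng = np.random.default_rng(0)
spread_Z = rng.normal(size=(1024, 16))     # penalties near zero
collapsed_Z = np.zeros((1024, 16))         # variance penalty near its maximum
```

Note what these penalties do and do not do: they make the collapsed configuration expensive, but they say nothing about the path SGD takes to get there.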
Empirically, we observed the following failure mode repeatedly:
- Alignment pulls representations together
- Variance penalties hold them barely apart
- The system converges to a low-rank, low-curvature equilibrium
- Loss remains finite
- Geometry is unusable
Formally, the system settles into a narrow manifold that satisfies all constraints but preserves almost no semantic distance.
This revealed a deeper issue:
Preventing collapse as a solution is not the same as preventing collapse as a training outcome
3. The missing piece: optimization asymmetry
Most non-contrastive SSL methods share a structural asymmetry:
- The student is updated by gradients
- The teacher is not (EMA or stop-gradient)
However, if the student embedding is matched directly to the teacher embedding, this asymmetry is insufficient.
The reason is subtle but fundamental:
Direct alignment creates a symmetric gradient flow in representation space, even if the networks are not identical.
This symmetry restores collapse as an attractor.
4. Enter the predictor
The predictor is a small network placed only on the student side, after the projector:

q = predictor(z_student)

The loss becomes:

L = || q - sg(z_teacher) ||^2

where sg denotes stop-gradient.
Crucially:
- The teacher never sees q
- The teacher embedding is the target
- The student embedding is not directly aligned
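A minimal sketch of this asymmetry, assuming a purely linear predictor and a squared-error alignment loss (names are illustrative; stop-gradient is mimicked by a plain copy, since numpy has no autograd):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Linear predictor, on the student side only.
W_pred = 0.1 * rng.normal(size=(d, d))

def predictor(z):
    return z @ W_pred.T

def stop_grad(z):
    # Stands in for detaching the teacher target from the graph.
    return z.copy()

z_student = rng.normal(size=(32, d))   # student projector output
z_teacher = rng.normal(size=(32, d))   # teacher projector output

q = predictor(z_student)               # the teacher never sees q
loss = float(np.mean((q - stop_grad(z_teacher)) ** 2))
```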
The predictor is usually:
- shallow,
- low-capacity,
- sometimes even linear.
This tells us something important.
5. What the predictor does not do
The predictor:
- does not add representational power,
- does not encode semantics,
- does not increase expressivity.
If it did, a linear predictor would fail.
But linear predictors work.
So its role must be different.
6. What the predictor actually does: reshape gradient flow
With a predictor, alignment gradients propagate as:

loss → q → predictor → student embedding

Instead of:

loss → student embedding
This matters enormously.
The predictor introduces a learnable Jacobian between alignment pressure and representation space.
As a result:
- alignment pressure is absorbed by the predictor,
- representation space is no longer pulled directly toward collapse,
- collapse becomes dynamically unstable.
The trivial solution still exists — but SGD no longer flows into it.
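The reshaped flow can be written down directly. Assuming a linear predictor W and a squared-error loss (illustrative choices), the pull on the student embedding is filtered through the predictor’s Jacobian, while the predictor itself absorbs alignment pressure through its own gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)              # student embedding
t = rng.normal(size=d)              # stop-gradient teacher target
W = 0.1 * rng.normal(size=(d, d))   # small linear predictor

# Direct alignment, L = ||z - t||^2: the gradient on z points
# straight at the target -- the symmetric pull that enables collapse.
grad_direct = 2.0 * (z - t)

# Through the predictor, L = ||W z - t||^2: the gradient on z is
# filtered through the predictor Jacobian W^T, and the predictor
# receives its own gradient, absorbing alignment pressure.
residual = W @ z - t
grad_through_predictor = 2.0 * W.T @ residual
grad_on_predictor = 2.0 * np.outer(residual, z)
```

With a small W, the pull on z is much weaker than the direct pull, and its direction depends on what the predictor has learned rather than on the target alone.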
7. Collapse as an unstable fixed point
Theoretical analyses (e.g. Tian et al.) show:
- Without a predictor:
  - collapse is a stable fixed point
- With a predictor:
  - collapse remains a solution
  - but becomes unstable
  - non-trivial representations become attractors
This is rare among training interventions.
Most regularizers change where minima are.
The predictor changes how optimization moves.
Same loss.
Different dynamics.
8. Why this matters for P¬P
P¬P explicitly models an irreducible uncertainty band.
Early in training:
- uncertainty must be preserved,
- representations must not prematurely commit,
- geometry must remain flexible.
Direct alignment collapses this band too early.
The predictor delays commitment.
It allows:
- uncertainty to remain extended,
- variance forces to act globally,
- repulsion to sculpt topology over time.
In P¬P terms:
The predictor preserves epistemic uncertainty during optimization.
9. Objectives shape geometry. Predictors shape trajectories.
This leads to a clean separation of roles:
| Component | Role |
|---|---|
| Alignment | Enforces invariance |
| Variance / covariance | Enforces global spread |
| Repulsion | Enforces inter-sample topology |
| Predictor | Destabilizes collapse during training |
The predictor is not a hack. It is an admission that optimization dynamics matter as much as objectives.
10. Takeaway
Entropy-based SSL fails because it confuses statistics with geometry.
Non-contrastive SSL collapses because it ignores dynamics.
P¬P requires both:
- explicit geometric constraints,
- explicit optimization asymmetry.
The predictor is not about learning better features.
It is about making sure learning happens at all.