Att + GMLP Isn’t Enough
1. The problem we hit
The P¬P model started from a clear architectural intuition:
Alternate attention (ATT) and generalized MLP (GMLP) blocks to progressively eliminate uncertainty in a P vs ¬P decision.
Architecturally, this makes sense:
- Attention isolates where information lives.
- GMLP aggregates what information matters.
- Iteration should narrow the uncertainty band.
Yet empirically, something was off.
Despite increasing depth, tuning heads, and stabilizing training, Att + GMLP alone plateaued. The model converged, but not toward better general representations — especially when tested under distributional pressure.
This raised a deeper question:
Is the limitation architectural, or objective-level?
That question is what led to SSL.
2. Why architecture alone is insufficient
Binary classifiers trained purely supervised learn decision boundaries, not representation geometry.
Formally, with labels y ∈ {0, 1}, supervised training optimizes the binary cross-entropy:
L_sup = -[ y log σ(f(x)) + (1 - y) log(1 - σ(f(x))) ]
This only constrains:
- points near the boundary,
- along directions relevant to the labels.
It says nothing about:
- distances between samples of the same class,
- distances across classes away from the boundary,
- geometry of ambiguous or uncertain samples.
As a result, even strong architectures can learn locally correct but globally fragile representations.
This is fatal for P¬P, whose entire philosophy is to explicitly reason about uncertainty rather than hide it behind a sigmoid.
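The boundary-only nature of the supervised objective is easy to demonstrate numerically. In the sketch below (purely illustrative: the weights, bias, and data are made up), every sample is shifted arbitrarily far along a direction orthogonal to the decision boundary; the representation geometry changes drastically, yet the supervised loss is untouched.

```python
import numpy as np

def bce_loss(z, w, b, y):
    """Binary cross-entropy on logits w @ z + b."""
    logits = z @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0])   # boundary depends on the first coordinate only
b = 0.0
z = rng.normal(size=(8, 2))
y = (z @ w + b > 0).astype(float)

# Shift samples far along the direction orthogonal to w: the geometry
# is destroyed, but the supervised loss cannot see the difference.
z_shifted = z + rng.normal(scale=100.0, size=(8, 1)) * np.array([0.0, 1.0])

loss_a = bce_loss(z, w, b, y)
loss_b = bce_loss(z_shifted, w, b, y)
```

The two losses are identical, which is exactly the sense in which the objective leaves off-boundary geometry unconstrained.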
3. Why Self-Supervised Learning (SSL) was the natural next step
SSL adds a missing constraint:
The representation space itself must be structured, not just the classifier head.
We chose a DINO-style teacher–student setup because it aligns philosophically with P¬P:
- No labels required.
- Uses entropy and agreement, not class supervision.
- Encourages structure without collapsing to trivial solutions (in theory).
Core idea (DINO-style)
Given multiple views of the same image x:
- A teacher produces a probability distribution p_t.
- A student produces a distribution p_s.
- The student is trained to match the teacher across views.
Loss (simplified):
L = - Σ_k p_t(k) log p_s(k), averaged over view pairs.
With temperature scaling and centering, this avoids trivial uniform solutions.
The hope was clear:
If P¬P learns good geometry first, binary supervision will only need to “read out” the decision.
4. What we expected to happen
The intuition was:
- SSL would maximize entropy early, spreading representations.
- Attention + GMLP would discover invariant structure.
- Fine-tuning would:
- collapse entropy selectively,
- sharpen P vs ¬P boundaries,
- outperform purely supervised training.
In short:
SSL → structure first, decision later
5. What actually happened (results)
5.1 SSL dynamics (the good news)
Initially, SSL worked:
- Loss escaped the trivial plateau.
- Teacher and student entropies diverged meaningfully.
- Representation variance (std_logits) increased.
- Linear probes showed some signal (AUC ≈ 0.57–0.58).
This proved something important:
The SSL machinery itself is correct.
The architecture can learn non-trivial geometry.
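The linear-probe check mentioned above can be sketched with a frozen encoder, a closed-form linear readout, and a rank-based AUC. Everything here is illustrative (synthetic features, probe fit and scored on the same data for brevity), not the actual experiment:

```python
import numpy as np

def linear_probe_auc(features, labels):
    """Least-squares linear readout on frozen features, scored by ROC AUC."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    scores = X @ w
    # AUC = P(score_pos > score_neg), via ranks (Mann-Whitney U statistic).
    ranks = scores.argsort().argsort()        # 0-indexed rank of each score
    pos, neg = labels == 1, labels == 0
    u = ranks[pos].sum() - pos.sum() * (pos.sum() - 1) / 2
    return u / (pos.sum() * neg.sum())

rng = np.random.default_rng(0)
n, d = 200, 32
labels = (rng.random(n) > 0.5).astype(float)
# Features carry only a weak class signal, mimicking AUC barely above chance.
features = rng.normal(size=(n, d))
features[:, 0] += 0.3 * labels

auc = linear_probe_auc(features, labels)
```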
5.2 Then collapse (the bad news)
After a few epochs:
- Entropy started rising again.
- Student variance collapsed.
- Teacher and student distributions re-aligned.
- Loss increased toward a high-entropy regime.
We implemented:
- temperature warm-ups,
- center freezing,
- collapse guards,
- best-checkpoint selection,
- linear probe monitoring.
Yet collapse kept returning.
This is not a bug — it’s a fixed point.
6. The critical insight: entropy ≠ separation
Here’s the key realization:
High entropy does not imply large inter-sample distances.
DINO (and similar SSL methods) ensure:
- agreement between views of the same image,
- non-collapse via entropy.
But they do not explicitly push different images apart.
Mathematically, nothing in the loss enforces ||f(x_i) - f(x_j)|| to be large for distinct samples x_i ≠ x_j, unless separation is implicitly induced by data diversity.
So the model can settle into a state where:
- representations are non-uniform,
- entropy is high,
- but all samples live in a narrow manifold.
This explains:
- oscillating entropy,
- repeated collapse,
- weak downstream performance.
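This decoupling of entropy and separation is easy to exhibit with synthetic values: a batch whose output distributions are all near-uniform (so per-sample entropy is close to its maximum, log K) while the distributions are nearly identical to each other (so inter-sample distances are tiny).

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each row of a stochastic matrix."""
    return -np.sum(p * np.log(p), axis=-1)

rng = np.random.default_rng(0)
n, K = 32, 16
# Every sample maps to almost the same near-uniform distribution.
p = np.full((n, K), 1.0 / K) + 1e-4 * rng.normal(size=(n, K))
p = np.abs(p)
p /= p.sum(axis=-1, keepdims=True)

mean_entropy = entropy(p).mean()                 # close to log K (maximal)
pairwise = np.linalg.norm(p[:, None] - p[None, :], axis=-1)
max_dist = pairwise.max()                        # yet samples barely differ
```

High entropy everywhere, near-zero distances everywhere: exactly the narrow-manifold fixed point described above.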
7. The decisive experiment: supervised baseline
When we compared against pure supervised training:
Supervised (no SSL)
- Accuracy ≈ 94%
- Youden J ≈ 0.88
- Strong precision/recall balance
SSL → fine-tune
- Accuracy ≈ 76%
- Youden J ≈ 0.55
This confirmed the uncomfortable truth:
For this dataset, SSL hurt performance.
Not because SSL is bad — but because the geometry it learned was misaligned with the task.
8. What this means for P¬P
This result is not a failure of the project.
It tells us something fundamental:
P¬P cannot rely on entropy alone to shape representation space.
If uncertainty is central, then distance must be explicit.
9. Where this leads (next blog)
The next step is clear:
Explicit separation in SSL
Instead of hoping entropy creates separation, we will force it.
Upcoming directions:
- Contrastive or repulsive terms.
- Explicit inter-sample distance maximization.
- Three-region geometry:
- P
- ¬P
- explicit uncertainty band.
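As a preview of what an explicit repulsive term might look like, the sketch below penalizes pairwise Gaussian-kernel similarity between embeddings of distinct samples in a batch. The kernel width and the way it would be weighted against the SSL term are placeholders, not a finalized design.

```python
import numpy as np

def repulsion_loss(z, sigma=1.0):
    """Mean Gaussian-kernel similarity over distinct pairs: close to 1 when
    embeddings crowd together, near 0 when they are spread apart."""
    d2 = np.sum((z[:, None] - z[None, :]) ** 2, axis=-1)
    sim = np.exp(-d2 / (2 * sigma ** 2))
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))   # exclude self-pairs

rng = np.random.default_rng(0)
crowded = 0.01 * rng.normal(size=(16, 8))    # narrow manifold
spread = 10.0 * rng.normal(size=(16, 8))     # well-separated samples

# Hypothetical combined objective (lambda_rep to be determined):
# total_loss = dino_term + lambda_rep * repulsion_loss(z)
```

Minimizing such a term directly rewards inter-sample distance, which is precisely the constraint the entropy-based objective leaves out.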
In other words:
Uncertainty must be carved, not inferred.
That’s what the next post will be about.
10. Final takeaway
Attention + GMLP isn’t enough — not because it’s weak, but because objectives define geometry.
SSL showed us:
- the architecture can learn structure,
- entropy prevents trivial collapse,
- but without explicit repulsion, representations stagnate.
This is exactly the kind of negative result that moves theory forward.