Att + GMLP Isn’t Enough
1. The problem we hit
The P¬P model started from a clear architectural intuition:
Alternate attention (ATT) and generalized MLP (GMLP) blocks to progressively eliminate uncertainty in a P vs ¬P decision.
Architecturally, this makes sense:
- Attention isolates where information lives.
- GMLP aggregates what information matters.
- Iteration should narrow the uncertainty band.
Yet empirically, something was off.
Despite increasing depth, tuning heads, and stabilizing training, Att + GMLP alone plateaued. The model converged, but not toward better general representations — especially when tested under distributional pressure.
This raised a deeper question:
Is the limitation architectural, or objective-level?
That question is what led to SSL.
2. Why architecture alone is insufficient
Binary classifiers trained purely supervised learn decision boundaries, not representation geometry.
Formally, with labels y ∈ {0, 1}, supervised training optimizes the binary cross-entropy:
L_sup = -[ y log σ(f(x)) + (1 - y) log(1 - σ(f(x))) ]
This only constrains:
- points near the boundary,
- along directions relevant to the labels.
It says nothing about:
- distances between samples of the same class,
- distances across classes away from the boundary,
- geometry of ambiguous or uncertain samples.
As a result, even strong architectures can learn locally correct but globally fragile representations.
This is fatal for P¬P, whose entire philosophy is to explicitly reason about uncertainty rather than hide it behind a sigmoid.
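The boundary-only nature of the supervised objective is easy to demonstrate numerically. In the sketch below (purely illustrative: the weights, bias, and data are made up), every sample is shifted arbitrarily far along a direction orthogonal to the decision boundary; the representation geometry changes drastically, yet the supervised loss is untouched.

```python
import numpy as np

def bce_loss(z, w, b, y):
    """Binary cross-entropy on logits w @ z + b."""
    logits = z @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0])   # boundary depends on the first coordinate only
b = 0.0
z = rng.normal(size=(8, 2))
y = (z @ w + b > 0).astype(float)

# Shift samples far along the direction orthogonal to w: the geometry
# is destroyed, but the supervised loss cannot see the difference.
z_shifted = z + rng.normal(scale=100.0, size=(8, 1)) * np.array([0.0, 1.0])

loss_a = bce_loss(z, w, b, y)
loss_b = bce_loss(z_shifted, w, b, y)
```

The two losses are identical, which is exactly the sense in which the objective leaves off-boundary geometry unconstrained.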
3. Why Self-Supervised Learning (SSL) was the natural next step
SSL adds a missing constraint:
The representation space itself must be structured, not just the classifier head.
We chose a DINO-style teacher–student setup because it aligns philosophically with P¬P:
- No labels required.
- Uses entropy and agreement, not class supervision.
- Encourages structure without collapsing to trivial solutions (in theory).
Core idea (DINO-style)
Given multiple views of the same image x:
- A teacher produces a probability distribution p_t.
- A student produces a distribution p_s.
- The student is trained to match the teacher across views.
Loss (simplified):
L = - Σ_k p_t(k) log p_s(k), averaged over view pairs.
With temperature scaling and centering, this avoids trivial uniform solutions.
The hope was clear:
If P¬P learns good geometry first, binary supervision will only need to “read out” the decision.
4. What we expected to happen
The intuition was:
- SSL would maximize entropy early, spreading representations.
- Attention + GMLP would discover invariant structure.
- Fine-tuning would:
- collapse entropy selectively,
- sharpen P vs ¬P boundaries,
- outperform purely supervised training.
In short:
SSL → structure first, decision later
5. What actually happened (results)
5.1 SSL dynamics (the good news)
Initially, SSL worked:
- Loss escaped the trivial plateau.
- Teacher and student entropies diverged meaningfully.
- Representation variance (std_logits) increased.
- Linear probes showed some signal (AUC ≈ 0.57–0.58).
This proved something important:
The SSL machinery itself is correct.
The architecture can learn non-trivial geometry.
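The linear-probe check mentioned above can be sketched with a frozen encoder, a closed-form linear readout, and a rank-based AUC. Everything here is illustrative (synthetic features, probe fit and scored on the same data for brevity), not the actual experiment:

```python
import numpy as np

def linear_probe_auc(features, labels):
    """Least-squares linear readout on frozen features, scored by ROC AUC."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    scores = X @ w
    # AUC = P(score_pos > score_neg), via ranks (Mann-Whitney U statistic).
    ranks = scores.argsort().argsort()        # 0-indexed rank of each score
    pos, neg = labels == 1, labels == 0
    u = ranks[pos].sum() - pos.sum() * (pos.sum() - 1) / 2
    return u / (pos.sum() * neg.sum())

rng = np.random.default_rng(0)
n, d = 200, 32
labels = (rng.random(n) > 0.5).astype(float)
# Features carry only a weak class signal, mimicking AUC barely above chance.
features = rng.normal(size=(n, d))
features[:, 0] += 0.3 * labels

auc = linear_probe_auc(features, labels)
```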
5.2 Then collapse (the bad news)
After a few epochs:
- Entropy started rising again.
- Student variance collapsed.
- Teacher and student distributions re-aligned.
- Loss increased toward a high-entropy regime.
We implemented:
- temperature warm-ups,
- center freezing,
- collapse guards,
- best-checkpoint selection,
- linear probe monitoring.
Yet collapse kept returning.
This is not a bug — it’s a fixed point.
6. The critical insight: entropy ≠ separation
Here’s the key realization:
High entropy does not imply large inter-sample distances.
DINO (and similar SSL methods) ensure:
- agreement between views of the same image,
- non-collapse via entropy.
But they do not explicitly push different images apart.
Mathematically, nothing in the loss enforces ||f(x_i) - f(x_j)|| to be large for distinct samples x_i ≠ x_j, unless separation is implicitly induced by data diversity.
So the model can settle into a state where:
- representations are non-uniform,
- entropy is high,
- but all samples live in a narrow manifold.
This explains:
- oscillating entropy,
- repeated collapse,
- weak downstream performance.
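This decoupling of entropy and separation is easy to exhibit with synthetic values: a batch whose output distributions are all near-uniform (so per-sample entropy is close to its maximum, log K) while the distributions are nearly identical to each other (so inter-sample distances are tiny).

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each row of a stochastic matrix."""
    return -np.sum(p * np.log(p), axis=-1)

rng = np.random.default_rng(0)
n, K = 32, 16
# Every sample maps to almost the same near-uniform distribution.
p = np.full((n, K), 1.0 / K) + 1e-4 * rng.normal(size=(n, K))
p = np.abs(p)
p /= p.sum(axis=-1, keepdims=True)

mean_entropy = entropy(p).mean()                 # close to log K (maximal)
pairwise = np.linalg.norm(p[:, None] - p[None, :], axis=-1)
max_dist = pairwise.max()                        # yet samples barely differ
```

High entropy everywhere, near-zero distances everywhere: exactly the narrow-manifold fixed point described above.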
7. The decisive experiment: supervised baseline
When we compared against pure supervised training:
Supervised (no SSL)
- Accuracy ≈ 94%
- Youden J ≈ 0.88
- Strong precision/recall balance
SSL → fine-tune
- Accuracy ≈ 76%
- Youden J ≈ 0.55
This confirmed the uncomfortable truth:
For this dataset, SSL hurt performance.
Not because SSL is bad — but because the geometry it learned was misaligned with the task.
8. What this means for P¬P
This result is not a failure of the project.
It tells us something fundamental:
P¬P cannot rely on entropy alone to shape representation space.
If uncertainty is central, then distance must be explicit.
9. Where this leads (next blog)
The next step is clear:
Explicit separation in SSL
Instead of hoping entropy creates separation, we will force it.
Upcoming directions:
- Contrastive or repulsive terms.
- Explicit inter-sample distance maximization.
- Three-region geometry:
- P
- ¬P
- explicit uncertainty band.
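As a preview of what an explicit repulsive term might look like, the sketch below penalizes pairwise Gaussian-kernel similarity between embeddings of distinct samples in a batch. The kernel width and the way it would be weighted against the SSL term are placeholders, not a finalized design.

```python
import numpy as np

def repulsion_loss(z, sigma=1.0):
    """Mean Gaussian-kernel similarity over distinct pairs: close to 1 when
    embeddings crowd together, near 0 when they are spread apart."""
    d2 = np.sum((z[:, None] - z[None, :]) ** 2, axis=-1)
    sim = np.exp(-d2 / (2 * sigma ** 2))
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))   # exclude self-pairs

rng = np.random.default_rng(0)
crowded = 0.01 * rng.normal(size=(16, 8))    # narrow manifold
spread = 10.0 * rng.normal(size=(16, 8))     # well-separated samples

# Hypothetical combined objective (lambda_rep to be determined):
# total_loss = dino_term + lambda_rep * repulsion_loss(z)
```

Minimizing such a term directly rewards inter-sample distance, which is precisely the constraint the entropy-based objective leaves out.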
In other words:
Uncertainty must be carved, not inferred.
That’s what the next post will be about.
10. Final takeaway
Attention + GMLP isn’t enough — not because it’s weak, but because objectives define geometry.
SSL showed us:
- the architecture can learn structure,
- entropy prevents trivial collapse,
- but without explicit repulsion, representations stagnate.
This is exactly the kind of negative result that moves theory forward.