
Att + GMLP Isn’t Enough


1. The problem we hit

The P¬P model started from a clear architectural intuition:

Alternate attention (ATT) and generalized MLP (GMLP) blocks to progressively eliminate uncertainty in a P vs ¬P decision.

Architecturally, this makes sense. Yet empirically, something was off.

Despite increasing depth, tuning heads, and stabilizing training, Att + GMLP alone plateaued. The model converged, but not toward better general representations — especially when tested under distributional pressure.

This raised a deeper question:

Is the limitation architectural, or objective-level?

That question is what led to SSL.


2. Why architecture alone is insufficient

Binary classifiers trained purely with supervision learn decision boundaries, not representation geometry.

Formally, with labels $y \in \{0,1\}$, supervised training optimizes:

$$\min_\theta \; \mathbb{E}_{(x,y)} \left[ \mathcal{L}(f_\theta(x), y) \right]$$

This only constrains the classifier's outputs on labeled points.

It says nothing about the geometry of the representation space: how samples are arranged, how far apart they sit, or how uncertainty is encoded.

As a result, even strong architectures can learn locally correct but globally fragile representations.
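To make this concrete, here is a small illustrative sketch (NumPy, hypothetical one-dimensional embeddings): a nearly collapsed representation can achieve exactly the same supervised loss as a well-separated one, because the head's weight can absorb the difference in scale.

```python
import numpy as np

# Two hypothetical embedding sets that a linear head maps to the
# same decisions: one well-separated, one nearly collapsed.
z_separated = np.array([[ 2.0], [-2.0]])   # class 1 vs class 0, far apart
z_collapsed = np.array([[ 0.2], [-0.2]])   # same labels, almost identical

y = np.array([1.0, 0.0])

def bce(z, w):
    """Binary cross-entropy of a linear head sigmoid(w * z)."""
    p = 1.0 / (1.0 + np.exp(-w * z[:, 0]))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A larger head weight compensates for the collapsed geometry:
# both settings reach the same (low) supervised loss.
print(bce(z_separated, w=3.0))
print(bce(z_collapsed, w=30.0))
```

The loss is identical in both cases, even though one embedding space is ten times more spread out: the supervised objective simply cannot see the difference.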

This is fatal for P¬P, whose entire philosophy is to explicitly reason about uncertainty rather than hide it behind a sigmoid.


3. Why Self-Supervised Learning (SSL) was the natural next step

SSL adds a missing constraint:

The representation space itself must be structured, not just the classifier head.

We chose a DINO-style teacher–student setup because it aligns philosophically with P¬P.

Core idea (DINO-style)

Given multiple views $v_i(x)$ of the same image $x$, a student network is trained to match the teacher's output distribution on each view.

Loss (simplified):

$$\mathcal{L} = - \sum_i p_t^{(i)} \cdot \log p_s^{(i)}$$

With temperature scaling and centering, this avoids trivial uniform solutions.
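A minimal sketch of such a loss (NumPy; simplified relative to the full DINO recipe, which also maintains the teacher by momentum updates and the center as a running mean of teacher outputs):

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax along the last axis."""
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions.

    Centering (subtracting a running mean from teacher logits) combined
    with a low teacher temperature is what discourages the trivial
    uniform solution.
    """
    p_t = softmax(teacher_logits - center, tau_t)   # centered, sharpened
    log_p_s = np.log(softmax(student_logits, tau_s))
    return -np.sum(p_t * log_p_s, axis=-1).mean()
```

The temperature asymmetry (tau_t < tau_s) makes the teacher's targets peakier than the student's predictions, so the student is always pulled toward committing, while centering pulls it away from collapsing onto a single dimension.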

The hope was clear:

If P¬P learns good geometry first, binary supervision will only need to “read out” the decision.


4. What we expected to happen

The intuition was:

  1. SSL would maximize entropy early, spreading representations.
  2. Attention + GMLP would discover invariant structure.
  3. Fine-tuning would:
    • collapse entropy selectively,
    • sharpen P vs ¬P boundaries,
    • outperform purely supervised training.

In short:

SSL → structure first, decision later


5. What actually happened (results)

5.1 SSL dynamics (the good news)

Initially, SSL worked.

This proved something important:

The SSL machinery itself is correct.

The architecture can learn non-trivial geometry.


5.2 Then collapse (the bad news)

After a few epochs, however, the representations collapsed.

We implemented the standard countermeasures, yet collapse kept returning.

This is not a bug — it’s a fixed point.


6. The critical insight: entropy ≠ separation

Here’s the key realization:

High entropy does not imply large inter-sample distances.

DINO (and similar SSL methods) ensure that different views of the same image map to similar output distributions.

But they do not explicitly push different images apart.

Mathematically, nothing in the loss enforces:

$$\| z_i - z_j \| \text{ large for } i \neq j$$

unless implicitly induced by data diversity.
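A tiny numeric illustration of the gap between entropy and separation (NumPy; toy three-dimensional embeddings):

```python
import numpy as np

# Two DIFFERENT inputs mapped to the same embedding: the output
# distribution has maximal entropy, yet inter-sample distance is zero.
z_i = np.array([0.0, 0.0, 0.0])
z_j = np.array([0.0, 0.0, 0.0])

p = np.exp(z_i) / np.exp(z_i).sum()      # uniform distribution
entropy = -(p * np.log(p)).sum()         # equals log(3): maximal

print(entropy)                           # maximal entropy...
print(np.linalg.norm(z_i - z_j))         # ...yet zero separation
```

Per-sample entropy is at its theoretical maximum while the distance between distinct samples is exactly zero: the entropy criterion is fully satisfied by a collapsed geometry.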

So the model can settle into a state where every output distribution has high entropy while the embeddings of distinct images sit nearly on top of one another.

This explains both the promising early dynamics and the stubbornly recurring collapse.


7. The decisive experiment: supervised baseline

When we compared the two pipelines, the supervised baseline (no SSL) outperformed SSL → fine-tune.

This confirmed the uncomfortable truth:

For this dataset, SSL hurt performance.

Not because SSL is bad — but because the geometry it learned was misaligned with the task.


8. What this means for P¬P

This result is not a failure of the project.

It tells us something fundamental:

P¬P cannot rely on entropy alone to shape representation space.

If uncertainty is central, then distance must be explicit.


9. Where this leads (next blog)

The next step is clear:

Explicit separation in SSL

Instead of hoping entropy creates separation, we will force it.
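As one illustration of what "forcing it" could look like (an assumption on my part, not the project's announced method), a uniformity-style pairwise repulsion term explicitly penalizes embeddings that sit close together:

```python
import numpy as np

def separation_penalty(z, t=2.0):
    """Pairwise repulsion: log of the mean Gaussian potential over all
    pairs of embeddings. Minimizing it pushes distinct samples apart;
    a fully collapsed batch attains the worst possible value, 0."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # squared dists
    n = len(z)
    mask = ~np.eye(n, dtype=bool)                        # drop i == j
    return np.log(np.exp(-t * d2[mask]).mean())

collapsed = np.zeros((4, 8))                  # all embeddings identical
spread = np.random.default_rng(0).normal(size=(4, 8))
print(separation_penalty(collapsed))          # 0.0 (worst case)
print(separation_penalty(spread))             # negative: separated
```

Unlike entropy, this term is zero exactly when the batch has collapsed, so adding it to an SSL objective makes the failure mode of section 5.2 directly visible to the optimizer.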

Upcoming directions center on adding explicit separation terms to the SSL objective.

In other words:

Uncertainty must be carved, not inferred.

That’s what the next post will be about.


10. Final takeaway

Attention + GMLP isn’t enough — not because it’s weak, but because objectives define geometry.

SSL showed us that entropy alone does not create separation, and that the objective, not the architecture, decides whether it appears.

This is exactly the kind of negative result that moves theory forward.