Entropy’s Flaw in SSL for the P¬P Model
Note: For distances between semantically different images to be maximized implicitly, the model must reach a critical scale at which entropy alone guarantees that SSL training avoids trivial solutions (like all-zero weights) and does not fail to extract useful characteristics. Since we cannot count on the compute needed to reach that entropic regime, we must make the maximization explicit. This article is about that.
1. The hidden assumption behind entropy-based SSL
Most modern self-supervised learning (SSL) methods rely on a quiet but powerful assumption:
If representations have high entropy, meaningful separation will emerge implicitly.
This assumption underlies methods like DINO, BYOL, SimSiam, and their variants. They differ in mechanics, but share the same belief:
- Align different views of the same sample.
- Prevent trivial collapse via entropy, centering, or variance heuristics.
- Let geometry emerge.
For many benchmarks, this works well enough.
For P¬P, it does not.
2. Why entropy is insufficient for P¬P
The P¬P framework is not merely about classification accuracy. It is about explicitly reasoning over uncertainty.
Formally, P¬P assumes that the input space is not binary but tri-partitioned:
- P — clearly positive
- ¬P — clearly negative
- an irreducible uncertainty band between them
The goal is not to erase this band, but to minimize its measure without collapsing it.
Entropy-based SSL fails here for a subtle but fundamental reason:
High entropy does not imply large inter-sample distances.
A model can satisfy all entropy constraints while mapping all samples into a narrow, low-curvature manifold. Representations remain non-uniform, loss stays finite, but distances between semantically different samples are small.
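To make this concrete, here is a minimal numpy sketch (illustrative, not from the article) of the degenerate case: a mapping that sends every sample to the uniform distribution has maximal output entropy, yet zero distance between any two samples.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

# Degenerate mapping: every sample lands on the uniform distribution.
k = 10                                           # number of output classes
outputs = np.tile(np.full(k, 1.0 / k), (5, 1))   # 5 "different" samples

entropies = [entropy(p) for p in outputs]
pairwise = np.linalg.norm(outputs[0] - outputs[1])

print(entropies[0], np.log(k))  # entropy is maximal: log(10) ≈ 2.303
print(pairwise)                 # yet inter-sample distance is exactly 0
```

Any entropy-only criterion accepts this representation; any distance-based criterion rejects it.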
From the perspective of P¬P, this is catastrophic:
- Uncertainty is not carved.
- The uncertainty band is compressed, not reduced.
- Downstream classifiers must invent separation that the representation never learned.
This is exactly the failure anticipated in the note above: we cannot count on the compute needed to reach a regime where entropy implicitly maximizes distances between semantically different images, so the maximization must be made explicit.
3. Why “implicit repulsion” is a fragile hope
Contrastive-free SSL often justifies itself by appeal to implicit competition:
- Normalization constrains space.
- Finite batch sizes induce competition.
- Alignment forces trade-offs.
This works only if the model reaches a critical regime where entropy and normalization force separation.
That regime requires:
- sufficient batch size,
- sufficient dimensional pressure,
- sufficient optimization stability,
- sufficient compute.
In our setting, we do not have the luxury to assume this regime will be reached.
So the implicit strategy becomes a gamble:
“If entropy is high enough, geometry will sort itself out.”
For P¬P, this gamble repeatedly failed.
4. The decisive shift: from entropy to geometry
The conclusion was unavoidable:
If separation matters, it must be enforced explicitly.
Rather than relying on entropy to incidentally spread representations, we redesigned SSL around a different principle:
Alignment must be local; dispersion must be global.
This required abandoning probability-space SSL (softmax, temperature schedules, centering) and working directly in embedding space.
5. P¬P SSL: making distance explicit
The resulting method — P¬P SSL — operates on L2-normalized embeddings, not probability distributions.
5.1 Alignment (P force)
As before, different augmentations of the same image must agree.
For student embedding z_s and teacher embedding z_t (both L2-normalized), the alignment term is their squared distance, ‖z_s − z_t‖², equivalently 2 − 2·cos(z_s, z_t).
This preserves invariances and anchors representation identity.
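A runnable sketch of this alignment term, assuming (as in BYOL/SimSiam-style objectives) the squared distance between L2-normalized student and teacher embeddings; the function names are illustrative:

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    """Project each embedding onto the unit sphere."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def alignment_loss(z_student, z_teacher):
    """Mean squared distance between normalized views of the same images.
    Equivalent (up to constants) to 2 - 2 * cosine similarity."""
    zs = l2_normalize(z_student)
    zt = l2_normalize(z_teacher)
    return np.mean(np.sum((zs - zt) ** 2, axis=-1))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
print(alignment_loss(z, z))  # identical views -> 0.0
```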
5.2 Variance constraint (anti-collapse)
To prevent collapse and enforce global spread, we require each embedding dimension to maintain sufficient variance: for every dimension j, we penalize max(0, γ − Std(z_j)), where γ is a fixed variance floor.
This is not entropy.
It is a geometric floor.
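A sketch of this geometric floor, assuming a VICReg-style hinge on the per-dimension standard deviation; the constant gamma and the epsilon are assumptions, not values from the article:

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on per-dimension standard deviation:
    mean over dims of max(0, gamma - std_j). Only dimensions whose
    spread falls below the floor gamma are penalized."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

rng = np.random.default_rng(0)
spread = rng.normal(scale=2.0, size=(256, 8))   # well-spread batch
collapsed = np.zeros((256, 8))                  # fully collapsed batch
print(variance_loss(spread))     # ~0: floor satisfied everywhere
print(variance_loss(collapsed))  # ~gamma: maximal penalty
```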
5.3 Covariance constraint (decorrelation)
To avoid degenerate axes and correlated collapse, we penalize the squared off-diagonal entries of the batch covariance matrix C(Z), i.e. Σ_{i≠j} C(Z)_ij².
This forces information to distribute across dimensions rather than hiding in a few.
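A sketch of this decorrelation term, assuming (as in VICReg-style methods) the sum of squared off-diagonal covariance entries scaled by the embedding dimension; the scaling choice is an assumption:

```python
import numpy as np

def covariance_loss(z):
    """Sum of squared off-diagonal entries of the batch covariance,
    divided by the embedding dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / d

rng = np.random.default_rng(0)
independent = rng.normal(size=(512, 8))
duplicated = np.repeat(rng.normal(size=(512, 1)), 8, axis=1)  # one axis copied 8x
print(covariance_loss(independent))  # near 0: dimensions carry distinct info
print(covariance_loss(duplicated))   # large: information hides in one axis
```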
5.4 Soft repulsion
Optionally, embeddings that become too similar are softly pushed apart: any pair whose cosine similarity exceeds a margin m incurs a penalty of max(0, cos(z_i, z_j) − m)².
Not contrastive. Not binary. Just a guardrail.
It turns out this single repulsive force does a great deal to prevent collapse. Variance and decorrelation alone prevent trivial collapse, but they do not prevent semantic collapse. Without explicit inter-sample forces, the model repeatedly converges to a low-variance equilibrium that satisfies alignment while paying a constant variance penalty. This falsifies the hypothesis that geometry alone is sufficient. Soft repulsion provides the missing topological pressure and keeps variance stable across many epochs.
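A sketch of such a soft repulsion term, assuming a squared hinge on pairwise cosine similarity above a margin; the margin value and the exact form are assumptions:

```python
import numpy as np

def repulsion_loss(z, margin=0.5):
    """Softly push apart pairs whose cosine similarity exceeds a margin:
    mean over pairs of max(0, cos_sim - margin)^2. Pairs already far
    apart contribute nothing, so this is a guardrail, not a contrastive term."""
    zn = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = zn @ zn.T
    mask = ~np.eye(len(z), dtype=bool)          # exclude self-similarity
    excess = np.maximum(0.0, sim[mask] - margin)
    return np.mean(excess ** 2)

rng = np.random.default_rng(0)
tight = np.ones((8, 16)) + 0.01 * rng.normal(size=(8, 16))  # crowded batch
orth = np.eye(8, 16)                                        # well-separated batch
print(repulsion_loss(tight))  # ≈ (1 - margin)^2: crowding is penalized
print(repulsion_loss(orth))   # 0.0: separated embeddings pay nothing
```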
5.5 The full objective
This objective removes collapse as a fixed point by construction.
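Putting the four forces together, a hedged end-to-end sketch of such an objective; the weights `w`, the variance floor, and the margin are illustrative assumptions, as the article does not specify them:

```python
import numpy as np

def pnp_ssl_loss(z_s, z_t, gamma=1.0, margin=0.5, w=(25.0, 25.0, 1.0, 1.0)):
    """Sketch of the combined objective: alignment + variance floor
    + decorrelation + soft repulsion, with illustrative weights w."""
    n, d = z_s.shape
    zs = z_s / (np.linalg.norm(z_s, axis=1, keepdims=True) + 1e-8)
    zt = z_t / (np.linalg.norm(z_t, axis=1, keepdims=True) + 1e-8)

    # Local alignment: squared distance between normalized views.
    align = np.mean(np.sum((zs - zt) ** 2, axis=1))

    # Global spread: hinge on per-dimension standard deviation.
    std = np.sqrt(z_s.var(axis=0) + 1e-4)
    var = np.mean(np.maximum(0.0, gamma - std))

    # Decorrelation: squared off-diagonal covariance entries.
    zc = z_s - z_s.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    cov_loss = (np.sum(cov ** 2) - np.sum(np.diag(cov) ** 2)) / d

    # Soft repulsion: hinge on pairwise cosine similarity.
    sim = zs @ zs.T
    mask = ~np.eye(n, dtype=bool)
    rep = np.mean(np.maximum(0.0, sim[mask] - margin) ** 2)

    return w[0] * align + w[1] * var + w[2] * cov_loss + w[3] * rep

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 16))
collapsed = np.tile(rng.normal(size=(1, 16)), (64, 1))  # every sample identical
loss_spread = pnp_ssl_loss(spread, spread)
loss_collapsed = pnp_ssl_loss(collapsed, collapsed)
print(loss_spread, loss_collapsed)  # collapse pays the variance penalty
```

With any positive weight on the variance term, a collapsed batch pays a constant penalty that a spread batch does not, which is what removes collapse as a fixed point.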