When SSL Stops Working: Characteristics, Not Pixels
date: 2026-02-06
0. Context
This iteration was a stress test of a simple thesis:
“SSL + fine-tuning should generalize across segmentation problems.”
It did for fire segmentation.
It did not for this new segmentation task (IoU plateau ~0.37–0.38 despite many interventions).
This post documents what we changed, what we observed, and what it implies for the next model iteration.
1. What the whiteboard formalizes
Let X be the input (image). The decision is not triggered by the raw input directly, but by latent characteristics inside it.
- The problem is a ternary decision (P, ¬P, U), but underneath, any decision can be written as: decision = interpret(extract(X)).
The key realization is: X itself does not imply the problem.
Rather, the decision is implied by the characteristics extracted from X.
So the model must be decomposed into two functional roles:
- Backbone / feature extractor (f): extracts latent characteristics f(X) from the raw input.
- Head / feature interpreter (g): decodes/interprets those characteristics into the task output (mask, label, boxes, etc.).
So the minimal abstraction is: y = g(f(X)).
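This two-piece abstraction can be sketched in code. Both pieces below are toy stand-ins (a crude edge map and a thresholding head), not the actual backbone or head from this run:

```python
import numpy as np

# Minimal sketch of y = g(f(X)). Both pieces are toy stand-ins,
# not the actual backbone/head from this iteration.

def f(x: np.ndarray) -> np.ndarray:
    """Backbone: extract latent characteristics (here, a crude edge map)."""
    return np.abs(np.diff(x, axis=-1))

def g(c: np.ndarray) -> np.ndarray:
    """Head: interpret characteristics into a task output (here, a binary mask)."""
    return (c > c.mean()).astype(np.uint8)

x = np.random.rand(4, 16)  # stand-in "image"
y = g(f(x))                # the head never sees x directly, only f(x)
```

The point of the split is that f and g can fail independently: this iteration's evidence says f, not g, is the bottleneck.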
This iteration’s lesson: different problems require extracting different kinds of characteristics.
2. The characteristic taxonomy (why SSL “works” sometimes)
Not all characteristics are equal. At least three families show up:
1. Geometric (edges, shapes, spatial continuity)
2. Morphological / textural (local patterns, gradients, blobs)
3. Semantic / contextual (what the object is, not just how it looks locally)
Most SSL pretraining (especially vision-only, augmentation-driven) is disproportionately strong on (1) + (2).
Fire segmentation is dominated by (1)+(2): high-contrast, consistent textures, strong local cues.
This harder task leaks into (3): boundaries are ambiguous, cues are contextual, and “what counts” can depend on scene semantics.
So the failure is not “segmentation is hard”.
The failure is: the needed characteristics are not primarily geometric/morphological anymore.
3. What we actually did in this run (engineering log)
3.1 Pipeline
- Loaded teacher SSL checkpoint (`pnotp_teacher_best.npz`)
- Fine-tuned a segmentation head: `Head = coarse`, `PatchMix = ON`, `channels = 16`
3.2 Class imbalance control
- `pos_weight = 2.0`
- Foreground prior estimated from data: `foreground prior ≈ 0.2921`
- Initialized the segmentation bias from this prior
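The bias initialization can be reproduced with the standard logit trick, so that the head predicts the foreground rate before seeing any data. The prior value is from the run log; the helper name is ours:

```python
import math

# Hedged sketch: initialize the segmentation logit bias from the
# estimated foreground prior, so sigmoid(bias) ≈ prior at init.
# The prior (0.2921) is from the run log; bias_from_prior is our name.

def bias_from_prior(p: float) -> float:
    """Inverse sigmoid (logit) of the foreground prior."""
    return math.log(p / (1.0 - p))

prior = 0.2921                 # foreground prior estimated from data
b = bias_from_prior(prior)     # ≈ -0.885; sigmoid(b) recovers the prior
```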
3.3 Loss shaping
We trained with a compound objective:
- BCE (weighted)
- Tversky loss
- Semantic alignment loss (small weight, `w_sem = 0.02`)
- Regularization terms (TV-like smoothing)
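A hedged sketch of the compound objective, assuming standard formulations of each term. `pos_weight = 2.0` and `w_sem = 0.02` are from the log; the Tversky `alpha`/`beta` and the TV weight are illustrative placeholders, not values from the run:

```python
import numpy as np

# Hedged sketch of the compound objective. pos_weight = 2.0 and
# w_sem = 0.02 are from the run log; Tversky alpha/beta and the
# TV weight are illustrative placeholders.

def weighted_bce(p, y, pos_weight=2.0, eps=1e-7):
    """BCE on probabilities with an up-weighted positive class."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

def tversky_loss(p, y, alpha=0.5, beta=0.5, eps=1e-7):  # alpha/beta assumed
    """1 - Tversky index; alpha penalizes FP, beta penalizes FN."""
    tp = np.sum(p * y)
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tv_smoothing(p):
    """Total-variation-like penalty on neighboring predictions."""
    return np.mean(np.abs(np.diff(p, axis=0))) + np.mean(np.abs(np.diff(p, axis=1)))

def total_loss(p, y, sem_loss=0.0, w_sem=0.02, w_tv=0.01):  # w_tv assumed
    """BCE + Tversky + small semantic alignment term + TV regularizer."""
    return weighted_bce(p, y) + tversky_loss(p, y) + w_sem * sem_loss + w_tv * tv_smoothing(p)

rng = np.random.default_rng(0)
p = np.clip(rng.random((8, 8)), 0.01, 0.99)   # fake predicted probabilities
y = np.zeros((8, 8)); y[2:6, 2:6] = 1.0       # fake ground-truth mask
loss = total_loss(p, y)
```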
3.4 Sampling / optimization tricks
- Hard negative mining (HNM)
- Threshold sweeps logged every epoch (precision/recall/FP across thresholds)
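The per-epoch threshold sweep can be sketched as follows; function and variable names are ours, not from the codebase:

```python
import numpy as np

# Sketch of the per-epoch threshold sweep logged during training:
# precision / recall / FP count across candidate thresholds.
# Names are ours, not from the codebase.

def threshold_sweep(probs, targets, thresholds=np.linspace(0.1, 0.9, 9)):
    rows = []
    for t in thresholds:
        pred = probs >= t
        tp = np.sum(pred & targets)
        fp = np.sum(pred & ~targets)
        fn = np.sum(~pred & targets)
        precision = tp / max(tp + fp, 1)   # guard against empty predictions
        recall = tp / max(tp + fn, 1)
        rows.append((float(t), float(precision), float(recall), int(fp)))
    return rows

probs = np.array([0.2, 0.4, 0.6, 0.8])
targets = np.array([False, False, True, True])
rows = threshold_sweep(probs, targets)
```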
4. The observed behavior
4.1 IoU ceiling (the main fact)
Across many epochs, IoU oscillates and improves slightly, but does not break out:
- Best IoU reaches ~0.3786
- Dice tracks it (~0.54–0.55)
- The model keeps swinging between two failure modes:
  - over-predicting foreground (high recall, huge FP)
  - under-predicting foreground (precision rises, recall collapses)
This is visible directly in logs like:
- `Val foreground true ≈ 0.2466`
- `Val foreground pred` swings from ~0.16 up to ~0.61 depending on epoch
So it’s not stable convergence to a better representation — it’s threshold/coverage instability.
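As a consistency check on the logged metrics: for the same binary mask, Dice and IoU are tied by Dice = 2·IoU / (1 + IoU), so IoU ≈ 0.3786 and Dice ≈ 0.54–0.55 are two views of the same overlap, not independent signals:

```python
# Dice and IoU measure the same overlap: Dice = 2*IoU / (1 + IoU).
iou = 0.3786
dice = 2 * iou / (1 + iou)  # falls in the logged Dice range
```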
4.2 Why this is not “just train longer”
Training longer helps when:
- representation quality is still improving, and
- the head is capacity-limiting.
Here, the behavior is different:
- losses improve,
- but the model remains fundamentally uncertain about what constitutes foreground,
- which is a characteristic extraction problem, not an optimization schedule problem.
5. “Is it scale?” — How we can tell
When I say “scale”, I mean one of these (often multiple):
- Data scale / diversity: not enough examples of the rare boundary cases
- Label scale / quality: mask ambiguity or noisy annotation caps IoU
- Model scale / capacity: backbone can’t represent the needed characteristic family
- Pretraining scale / modality: SSL didn’t include the semantic signals required
What would count as evidence it’s scale (and not a bug)
You’re looking for this pattern:
- Training is stable (no collapse, no NaNs)
- Threshold curves improve slowly but saturate early
- The model alternates FP vs FN regimes rather than becoming consistently correct
- Improvements from loss shaping (pos_weight, Tversky, bias init) are incremental, not transformative
That’s exactly what happened.
So: yes, the ceiling looks structural.
6. Why fire segmentation was easy (scientific, concise)
Fire segmentation succeeds with the vanilla SSL→FT pipeline because the task is dominated by low-level, local, and consistent cues:
- high signal-to-noise,
- strong local texture + color/brightness patterns,
- relatively coherent boundaries,
- less dependence on scene context.
This new task appears to require higher-order / semantic disambiguation:
- ambiguous boundaries (even humans disagree),
- foreground definition depends on context,
- local texture is insufficient to decide membership,
- class imbalance makes FP/FN tradeoffs unstable.
So the difference is not “segmentation vs segmentation”.
It’s which characteristic family determines the label.
7. The implication for the next iteration
7.1 The new backbone question
How do we build a backbone that extracts all characteristic types?
A single SSL-trained vision backbone is not guaranteed to encode semantic/contextual characteristics.
So the next iteration likely needs one of:
- multiple backbones (specialists per characteristic family)
- task-conditioned backbone routing (choose which extractor dominates)
- multimodal or semantic supervision (text/image alignment, prompts, weak labels)
- multi-task pretraining (segmentation + classification + retrieval) to force semantic structure
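The first option (specialist extractors per characteristic family) can be sketched as feature fusion before the head. All three extractors below are toy stand-ins for real backbones:

```python
import numpy as np

# Hedged sketch of the "multiple backbones" option: one specialist
# extractor per characteristic family, fused before the head.
# All three extractors are toy stand-ins for real backbones.

def geometric_features(x):
    return np.abs(np.diff(x, axis=-1, prepend=0.0))   # edge-like cues

def textural_features(x):
    return x - x.mean(axis=-1, keepdims=True)         # local-contrast cues

def semantic_features(x):
    return np.full_like(x, x.mean())                  # global-context stand-in

def fused_backbone(x):
    """Concatenate per-family characteristics for the head to interpret."""
    return np.concatenate(
        [geometric_features(x), textural_features(x), semantic_features(x)],
        axis=0,
    )

x = np.random.rand(3, 8)      # stand-in "image"
feats = fused_backbone(x)     # three characteristic families stacked
```

Routing (option 2) would replace the plain concatenation with a learned weighting over the three families, conditioned on the task.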
7.2 The stable abstraction
Keep the two-piece model:
- f: characteristic extractor(s)
- g: characteristic interpreter (seg head / cls head / mapping head)
But stop assuming that SSL alone defines f for every domain.
8. Takeaway
This iteration produced a clean conclusion:
SSL is great at extracting geometric/morphological characteristics, but some P/¬P problems are governed by semantic/contextual characteristics — and those require either scale, modality, or architecture changes in the extractor.
Fire segmentation was a geometry problem.
This one is drifting into semantics.
So the next step is not “more loss hacks”.
It’s upgrading the characteristic extractor.