
GMLP: Solving the Uncertainty Issue


1. Ontology of a P¬P problem

Let

$$Y=1 \;\text{means}\; P, \qquad Y=0 \;\text{means}\; \neg P.$$

We assume the existence of a (possibly unknown) Bayes–optimal scoring function

$$f^\star : X \to \mathbb{R}$$

such that

$$\sigma(f^\star(x)) = \mathbb{P}(P \mid x),$$

with $\sigma$ the logistic sigmoid.

A P¬P model is any measurable $f : X \to \mathbb{R}$ intended as an approximation of $f^\star$, interpreted through $\sigma \circ f$.


2. Uncertainty band and the three-valued semantics

The key move is not to read $\sigma(f(x))$ as mere “confidence”, but to introduce a structural third region.

Fix parameters $\alpha, \beta \in (0, \tfrac{1}{2})$ and define:

$$R_P := \{x : \sigma(f(x)) \ge 1-\alpha\}, \qquad R_{\neg P} := \{x : \sigma(f(x)) \le \beta\}, \qquad U := X \setminus (R_P \cup R_{\neg P}).$$

We read these as a three-valued semantics for the model: $x \in R_P$ is read as “$P$ holds”, $x \in R_{\neg P}$ as “$\neg P$ holds”, and $x \in U$ as “undecided”.

The point of P¬P is:

A “better” model is one that, subject to constraints on error, reduces the measure of the uncertainty band $U$, rather than simply pushing more mass toward 1.

Formally, we would like to make small:

$$\mu(U) := \mathbb{P}(X \in U)$$

while controlling false positives and false negatives:

$$\alpha_{FP} := \mathbb{P}(X \in R_P,\; Y=0), \qquad \alpha_{FN} := \mathbb{P}(X \in R_{\neg P},\; Y=1).$$
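As a minimal sketch of this three-valued readout (the thresholds, the `decide` helper, and the sample scores are all illustrative, not from the text), the regions and the band mass $\mu(U)$ can be estimated directly from model scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decide(f_x, alpha=0.1, beta=0.1):
    """Three-valued readout: R_P when p >= 1-alpha, R_notP when p <= beta,
    otherwise the uncertainty band U."""
    p = sigmoid(f_x)
    if p >= 1.0 - alpha:
        return "P"
    if p <= beta:
        return "notP"
    return "U"

# Empirical estimate of the band mass mu(U) on a sample of scores:
scores = np.array([3.0, -2.5, 0.2, -0.1, 4.0])
labels = [decide(s) for s in scores]
mu_U = labels.count("U") / len(labels)  # fraction of points left undecided
```

A “better” model in the P¬P sense drives `mu_U` down while keeping the empirical false-positive and false-negative rates within budget.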


3. Characteristics vs properties

Let a characteristic be any measurable subset $C \subseteq X$ that is encoded in the data (e.g. “has a bright blob in the upper-left patch”, “has a particular phrase”, etc.).

Crucial distinction: a property is a fact about the label ($P$ or $\neg P$ itself), while a characteristic is merely a feature of the input; a characteristic may correlate with a property without determining it.

Nothing prevents a characteristic from satisfying something like

$$C \subseteq P \cup U, \qquad C \cap \neg P = \emptyset,$$

i.e. whenever $C$ appears, the label is either $P$ or “ambiguous”, but never safely $\neg P$.

If the model naïvely overweights such a $C$, many points are pushed toward the middle of the sigmoid, inflating $U$.

Thus the C–problem:

Some characteristics are “uncertainty-generating”: they correlate with $P \cup U$ rather than cleanly with $P$ or $\neg P$.

The architecture must be able to represent such Cs (to avoid blind spots) but not allow them to dominate the decision boundary.
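A hypothetical diagnostic sketch for this C–problem: given per-example region labels and a boolean mask for a characteristic, flag the characteristic as uncertainty-generating when, on the sample, it never co-occurs with $\neg P$ but does reach the band $U$. The function name and the toy data are illustrative:

```python
import numpy as np

def uncertainty_generating(has_C, region):
    """Flag C as uncertainty-generating on this sample:
    C ∩ notP is empty, yet C actually reaches the band U.
    has_C: boolean mask per example; region: 'P', 'notP', or 'U' per example."""
    has_C = np.asarray(has_C)
    region = np.asarray(region)
    never_notP = not np.any(has_C & (region == "notP"))
    hits_band = np.any(has_C & (region == "U"))
    return bool(never_notP and hits_band)

regions = np.array(["P", "U", "notP", "U", "P"])
c_mask  = np.array([True, True, False, True, False])
flag = uncertainty_generating(c_mask, regions)  # C avoids notP but inflates U
```

A characteristic that trips this flag is exactly the kind the architecture must represent without letting it dominate the decision boundary.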


4. Representation operators: ATT and GMLP

We assume a tokenization of inputs:

4.1 Attention operator

Define an attention operator

$$ATT : (\mathbb{R}^d)^m \to (\mathbb{R}^d)^m$$

of the familiar form (abstractly):

$$ATT(t_1,\dots,t_m)_i = \sum_{j=1}^{m} \operatorname{softmax}_j\!\left(\frac{\langle W_Q t_i,\, W_K t_j \rangle}{\sqrt{d}}\right) W_V t_j,$$

possibly with multiple heads and layers, but conceptually:

ATT contextualizes tokens: each token is updated based on all others.

Thus ATT expands purely local characteristics into structured relational information.
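A single-head sketch of this operator (weight shapes and the `att` helper are illustrative; real implementations add heads, layers, and normalization):

```python
import numpy as np

def att(Z, Wq, Wk, Wv):
    """Scaled dot-product attention over m tokens (rows of Z).
    Each output token is a softmax-weighted mix of all value-projected tokens."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # (m, m) token affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V                               # contextualized tokens

rng = np.random.default_rng(0)
d, m = 4, 3
Z = rng.normal(size=(m, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]      # Wq, Wk, Wv
out = att(Z, *W)                                     # shape (m, d), like input
```

Because every output row depends on all input rows, purely local characteristics are expanded into relational information, as stated above.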

4.2 General MLP (GMLP)

Define a shared per-token operator

$$g_\theta : \mathbb{R}^d \to \mathbb{R}^{d'}$$

and lift it to token sets:

$$GMLP(T) = (g_\theta(t_1),\dots,g_\theta(t_m)).$$

Think of $g_\theta$ as a semantic filter: it re-expresses each token in terms of “what counts as evidence”, independently of where the token came from.

The crucial constraint is sharing: the same $g_\theta$ applies to all tokens, regardless of their origin. This is what gives P¬P its “general” character: one operator for “what counts as evidence” across all tokens.
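The sharing constraint can be made concrete in a few lines (layer sizes and parameter names are illustrative). Because the same weights act on every row, GMLP commutes with any permutation of the tokens, which the final assertion checks:

```python
import numpy as np

def gmlp(Z, W1, b1, W2, b2):
    """Shared two-layer MLP g_theta applied identically to every token row."""
    H = np.maximum(Z @ W1 + b1, 0.0)   # ReLU hidden layer
    return H @ W2 + b2                  # same parameters for all tokens

rng = np.random.default_rng(1)
d, d_out, m = 4, 4, 3
params = (rng.normal(size=(d, 8)), np.zeros(8),
          rng.normal(size=(8, d_out)), np.zeros(d_out))
Z = rng.normal(size=(m, d))
out = gmlp(Z, *params)

# Sharing check: permuting tokens commutes with GMLP (no cross-token mixing).
perm = [2, 0, 1]
assert np.allclose(gmlp(Z[perm], *params), out[perm])
```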


5. The P¬P operator: structural definition

We define a sequence of representations:

$$Z_0 = T(x), \qquad Z_{k+1} = GMLP(ATT(Z_k)), \quad k = 0, \dots, K-1.$$

So we alternate local purification (GMLP) and global contextualization (ATT).

After $K$ steps, the tokens are pooled and scored:

$$f(x) = h\big(POOL(Z_K)\big),$$

where $POOL$ aggregates the $m$ tokens into a single vector and $h$ is a scalar head.

Definition (P¬P operator).

An instance of the P¬P model class is any map $f : X \to \mathbb{R}$ realizable as

$$f = h \circ POOL \circ (GMLP \circ ATT)^K \circ T$$

for some integer $K \ge 1$ and parameters of $T$, $ATT$, $GMLP$, $POOL$, $h$.
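A toy end-to-end sketch of this composition, using the $GMLP(ATT(\cdot))$ alternation written out in §7, with mean pooling and a linear head; every shape, helper, and parameter here is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def att_step(Z, W):
    """Minimal self-attention: tokens remixed by softmaxed affinities."""
    S = (Z @ W) @ Z.T / np.sqrt(Z.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ Z

def gmlp_step(Z, W1, W2):
    """Shared per-token MLP: identical weights applied to every row."""
    return np.maximum(Z @ W1, 0.0) @ W2

def pnp_score(tokens, W_att, W1, W2, w_head, K=2):
    """f = h ∘ POOL ∘ (GMLP ∘ ATT)^K ∘ T, with mean pooling and linear h."""
    Z = tokens                                # T(x): tokens already extracted
    for _ in range(K):
        Z = gmlp_step(att_step(Z, W_att), W1, W2)
    return float(Z.mean(axis=0) @ w_head)     # sigmoid(score) gives p_hat

rng = np.random.default_rng(2)
d, m = 4, 5
toks = rng.normal(size=(m, d))
score = pnp_score(toks, rng.normal(size=(d, d)),
                  rng.normal(size=(d, 8)), rng.normal(size=(8, d)),
                  rng.normal(size=d))
p_hat = sigmoid(score)                        # a probability in (0, 1)
```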


6. Learning objective in terms of P, ¬P, and the band

We distinguish three “target behaviors”, writing $\hat{p}(x) := \sigma(f(x))$:

  1. Positive behavior: for $x$ whose true state is safely $P$, we want

     $$\hat{p}(x) \to 1.$$

  2. Negative behavior: for safely $\neg P$,

     $$\hat{p}(x) \to 0.$$

  3. Ambiguous behavior: for inherently “C-like” cases that belong cleanly to neither $P$ nor $\neg P$, we want

     $$\hat{p}(x) \to 0.5.$$

The architecture plus training criterion induce a trade-off: shrinking the band $U$ forces borderline points into $R_P$ or $R_{\neg P}$, which can raise $\alpha_{FP}$ and $\alpha_{FN}$.

In abstract terms, the learning problem is:

Find parameters such that, for a suitable loss functional L,

$$\min_f L(f) \quad \text{subject to} \quad \alpha_{FP},\, \alpha_{FN} \le \epsilon$$

where $L$ includes a term punishing mass in the uncertainty band, i.e. a surrogate for $\mu(U)$.

This embeds the philosophy:

“Train the model so that uncertainty is both explicit and rare.”
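One way to render the constrained objective as a penalized surrogate. The weight `lam` and the smooth band penalty $4\hat{p}(1-\hat{p})$ (which peaks at $\hat{p}=0.5$ and vanishes at the endpoints) are illustrative choices, not prescribed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pnp_loss(scores, y, lam=0.5):
    """Cross-entropy plus a smooth penalty on mass near the band center.
    4*p*(1-p) is largest at p = 0.5, so its mean proxies mu(U)."""
    p = np.clip(sigmoid(scores), 1e-7, 1 - 1e-7)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    band = np.mean(4.0 * p * (1.0 - p))   # surrogate for mu(U)
    return bce + lam * band

scores = np.array([4.0, -4.0, 0.0])       # the third example sits in the band
y = np.array([1.0, 0.0, 1.0])
loss = pnp_loss(scores, y)
```

Note the tension with target behavior 3: this penalty discourages *all* mid-band predictions, so in practice `lam` (or a per-example mask) would have to spare genuinely ambiguous cases; the error constraints $\alpha_{FP}, \alpha_{FN} \le \epsilon$ would be enforced separately, e.g. via threshold calibration.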


7. Philosophical rationale of the ATT–GMLP alternation

Conceptually: ATT spreads information globally across tokens, while GMLP judges each token locally against a single shared criterion of evidence.

By alternating them,

$$Z_{k+1} = GMLP(ATT(Z_k)),$$

we realize a sort of dialectic:

  1. Spread all available information (including misleading or ambiguous cues).
  2. Subject that expanded representation to a global notion of “what counts as evidence”, shared across tokens.
  3. Repeat, gradually purifying the space of characteristics so that what survives multiple rounds is increasingly aligned with the $P / \neg P$ distinction rather than with $P \cup U$.

The final MLP head does not “discover” $P$ vs $\neg P$ from scratch; it merely draws a scalar cut in a space already filtered by the repeated ATT–GMLP interplay.

In this view, the P¬P operator is not a feature extractor with a classifier bolted on, but an iterated purification process whose output space makes the $P / \neg P$ cut simple.

Results

Results: Diabetic Retinopathy

Results: Pneumonia

Results: Tuberculosis

Results: Brain MRI