GMLP: Solving the Uncertainty Issue
1. Ontology of a P¬P problem
Let
- $\mathcal{X}$ be a set of possible inputs (images, texts, vectors, multimodal tensors).
- $(\mathcal{X}, \mathcal{F}, \mu)$ a probability space generating samples $x \sim \mu$.
- $y \in \{0, 1\}$ a binary label with semantics
$$
y = 1 \iff P(x) \text{ holds}, \qquad y = 0 \iff \neg P(x) \text{ holds}.
$$
We assume the existence of a (possibly unknown) Bayes–optimal scoring function
$$
s^{*}: \mathcal{X} \to \mathbb{R}
$$
such that
$$
\Pr(y = 1 \mid x) = \sigma(s^{*}(x)),
$$
with $\sigma$ the logistic sigmoid.
A P¬P model is any measurable $f: \mathcal{X} \to \mathbb{R}$ intended as an approximation of $s^{*}$, interpreted through $\hat{p}(x) = \sigma(f(x))$.
2. Uncertainty band and the three-valued semantics
The key move is not to read $\hat{p}(x)$ as merely “confidence”, but to introduce a structural third region.
Fix parameters $0 < \tau_{-} < 0.5 < \tau_{+} < 1$ and define:
- Positive region (decide $P$): $R_{P} = \{x \in \mathcal{X} : \hat{p}(x) \ge \tau_{+}\}$
- Negative region (decide $\neg P$): $R_{\neg P} = \{x \in \mathcal{X} : \hat{p}(x) \le \tau_{-}\}$
- Uncertainty band: $U = \{x \in \mathcal{X} : \tau_{-} < \hat{p}(x) < \tau_{+}\}$
We read these as a three-valued semantics for the model:
- $x \in R_{P}$: the system asserts $P(x)$.
- $x \in R_{\neg P}$: the system asserts $\neg P(x)$.
- $x \in U$: the system withholds commitment and returns the “indeterminate” value.
The point of P¬P is:
A “better” model is one that, subject to constraints on error, reduces the measure $\mu(U)$ of the uncertainty band, rather than simply pushing more mass toward 1.
Formally, we would like to make small:
$$
\mu(U) = \mu\big(\{x : \tau_{-} < \hat{p}(x) < \tau_{+}\}\big),
$$
while controlling false positives and false negatives:
$$
\Pr(y = 0 \mid x \in R_{P}) \le \alpha, \qquad \Pr(y = 1 \mid x \in R_{\neg P}) \le \beta.
$$
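The three-valued decision rule above can be sketched as a thin wrapper around any logit-producing model. This is a minimal illustration, not the paper's implementation; the threshold values 0.25 and 0.75 are illustrative choices for $\tau_{-}$ and $\tau_{+}$:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def decide(logit: float, tau_minus: float = 0.25, tau_plus: float = 0.75) -> str:
    """Map a model logit f(x) to the three-valued verdict.

    Returns "P" on the positive region R_P, "not-P" on the negative
    region, and "indeterminate" on the uncertainty band U.
    """
    p_hat = sigmoid(logit)
    if p_hat >= tau_plus:
        return "P"
    if p_hat <= tau_minus:
        return "not-P"
    return "indeterminate"

# A confident positive, a confident negative, and a borderline case.
print(decide(3.0))   # p_hat ~ 0.95 -> "P"
print(decide(-3.0))  # p_hat ~ 0.05 -> "not-P"
print(decide(0.2))   # p_hat ~ 0.55 -> "indeterminate"
```

Note that shrinking the band (moving $\tau_{-}$ and $\tau_{+}$ toward 0.5) trades fewer abstentions for more decision errors, which is exactly the constraint structure stated above.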
3. Characteristics vs properties
Let a characteristic $C$ be any measurable subset $C \subseteq \mathcal{X}$ that is encoded in the data (e.g. “has a bright blob in the upper-left patch”, “has a particular phrase”, etc.).
Crucial distinction:
- A property is semantic: whether $P(x)$ holds is a matter of the true label $y$.
- A characteristic is only a pattern in the input.
Nothing prevents a characteristic from satisfying something like
$$
x \in C \implies \Pr(y = 1 \mid x) < \tau_{+},
$$
i.e. whenever $C$ appears, the label is either $\neg P$ or “ambiguous”, but never safely $P$.
If the model naïvely overweights such a $C$, many points are pushed toward the middle of the sigmoid, inflating $\mu(U)$.
Thus the C-problem:
Some characteristics are “uncertainty-generating”: they correlate with $U$ rather than cleanly with $P$ or $\neg P$.
The architecture must be able to represent such $C$s (to avoid blind spots) but not allow them to dominate the decision boundary.
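A toy numeric illustration of an uncertainty-generating characteristic (the counts are synthetic, not from the text): a pattern that occurs equally often under $P$ and $\neg P$ carries no evidence either way, so conditioning on it yields a posterior of exactly 0.5, inside the band:

```python
# Synthetic dataset of (has_characteristic_c, label) pairs.
# The characteristic c occurs in half the P examples and half the
# not-P examples, so it discriminates nothing.
data = [
    (True, 1), (True, 0), (True, 1), (True, 0),    # c present, labels mixed
    (False, 1), (False, 1), (False, 1), (False, 0)  # c absent, mostly P
]

def posterior(with_c: bool) -> float:
    """Empirical Pr(y = 1 | presence/absence of c)."""
    labels = [y for c, y in data if c == with_c]
    return sum(labels) / len(labels)

print(posterior(True))   # 0.5  -> lands in the uncertainty band U
print(posterior(False))  # 0.75 -> usable evidence toward P
```

A model that gives such a $C$ large weight drags every example containing it toward $\hat{p} = 0.5$, which is the band-inflation failure mode described above.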
4. Representation operators: ATT and GMLP
We assume a tokenization of inputs:
- For each $x \in \mathcal{X}$, we obtain a finite multiset of tokens $T(x) = \{t_{1}, \dots, t_{m}\}$, with each $t_{i} \in \mathbb{R}^{d}$.
4.1 Attention operator
Define an attention operator
$$
\mathrm{ATT}: (\mathbb{R}^{d})^{m} \to (\mathbb{R}^{d})^{m}
$$
of the familiar form (abstractly):
- queries, keys, and values $Q t_{i}$, $K t_{j}$, $V t_{j}$ linear in the tokens,
- $\mathrm{ATT}(t_{1}, \dots, t_{m})_{i} = \sum_{j} \mathrm{softmax}_{j}\!\left(\frac{\langle Q t_{i}, K t_{j} \rangle}{\sqrt{d}}\right) V t_{j}$,
possibly with multiple heads and layers, but conceptually:
ATT contextualizes tokens: each token is updated based on all others.
Thus ATT expands purely local characteristics into structured relational information.
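The contextualization step can be sketched in plain Python. This is a minimal single-head version that, as a simplification, uses identity $Q$, $K$, $V$ projections (the operator above assumes learned linear maps):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Each output token is a softmax-weighted mixture of ALL tokens,
    so purely local characteristics get spread across the whole set.
    """
    d = len(tokens[0])
    out = []
    for t_i in tokens:
        scores = [dot(t_i, t_j) / math.sqrt(d) for t_j in tokens]
        weights = softmax(scores)
        mixed = [sum(w * t_j[k] for w, t_j in zip(weights, tokens))
                 for k in range(d)]
        out.append(mixed)
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(tokens):
    print([round(v, 3) for v in row])
```

Since every output is a convex combination of the inputs, no token's information is discarded here; that is exactly the "expansive" role the text assigns to ATT.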
4.2 General MLP (GMLP)
Define a shared per-token operator
$$
g_{\theta}: \mathbb{R}^{d} \to \mathbb{R}^{d}
$$
and lift it to token sets:
$$
\mathrm{GMLP}(t_{1}, \dots, t_{m}) = \big(g_{\theta}(t_{1}), \dots, g_{\theta}(t_{m})\big).
$$
Think of $g_\theta$ as a semantic filter:
- It learns to suppress embeddings corresponding to uncertainty-generating $C$s,
- and to amplify embeddings that contribute to sharp separation.
The crucial constraint is sharing: the same $g_\theta$ applies to all tokens, regardless of their origin. This is what gives P¬P its “general” character: one operator for “what counts as evidence” across all tokens.
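A minimal sketch of the shared operator and its lift to token sets. The two-layer weights here are hand-picked for illustration (keeping coordinate 0, zeroing coordinate 1, as if coordinate 1 encoded an uncertainty-generating direction), not learned:

```python
def g_theta(token, w1, b1, w2, b2):
    """A tiny per-token MLP d -> hidden -> d with ReLU."""
    hidden = [max(0.0, sum(w * t for w, t in zip(row, token)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def gmlp(tokens, params):
    """Lift g_theta to token sets: the SAME parameters for every token."""
    return [g_theta(t, *params) for t in tokens]

# Hand-picked parameters: hidden = relu(x0); output = [hidden, 0].
# The second coordinate is suppressed no matter which token it came from.
params = ([[1.0, 0.0]], [0.0],         # w1, b1
          [[1.0], [0.0]], [0.0, 0.0])  # w2, b2

print(gmlp([[2.0, 5.0], [3.0, -1.0]], params))  # [[2.0, 0.0], [3.0, 0.0]]
```

The sharing constraint is visible in `gmlp`: there is one parameter tuple, applied uniformly, which is what makes the filter a global notion of "what counts as evidence" rather than a per-position rule.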
5. The P¬P operator: structural definition
We define a sequence of representations:
- $H^{0} = T(x)$ (initial tokens).
- For $k = 1, \dots, K$,
$$
H^{k} = \mathrm{ATT}\big(\mathrm{GMLP}(H^{k-1})\big).
$$
So we alternate local purification (GMLP) and global contextualization (ATT).
After $K$ steps:
- Tokens are aggregated via a permutation-invariant pooling operator POOL:
$$
z = \mathrm{POOL}(H^{K}).
$$
- A final scalar head produces a logit:
$$
f(x) = h(z), \qquad \hat{p}(x) = \sigma(f(x)).
$$
Definition (P¬P operator).
An instance of the P¬P model class is any map $f: \mathcal{X} \to \mathbb{R}$ realizable as
$$
f(x) = h\Big(\mathrm{POOL}\big((\mathrm{ATT} \circ \mathrm{GMLP})^{K}(T(x))\big)\Big)
$$
for some integer $K \ge 1$ and parameters of $T$, ATT, GMLP, POOL, $h$.
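The structural definition can be exercised end to end with toy stand-ins. The assumptions here are all simplifications for illustration: mean pooling for POOL, a linear map for $h$, identity attention projections, and a degenerate diagonal $g_\theta$ (a fixed per-coordinate reweighting rather than a learned MLP):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def att(tokens):
    """Global contextualization: identity-projection self-attention."""
    d = len(tokens[0])
    out = []
    for t_i in tokens:
        w = softmax([sum(a * b for a, b in zip(t_i, t_j)) / math.sqrt(d)
                     for t_j in tokens])
        out.append([sum(wj * t_j[k] for wj, t_j in zip(w, tokens))
                    for k in range(d)])
    return out

def gmlp(tokens, scale):
    """Local purification: the same per-token reweighting everywhere."""
    return [[s * v for s, v in zip(scale, t)] for t in tokens]

def pnp_operator(tokens, K, scale, head_w, head_b):
    """p_hat = sigma(h(POOL((ATT o GMLP)^K (T(x)))))."""
    H = tokens
    for _ in range(K):
        H = att(gmlp(H, scale))
    d = len(H[0])
    z = [sum(t[k] for t in H) / len(H) for k in range(d)]  # mean pooling
    logit = sum(w * zk for w, zk in zip(head_w, z)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))

p_hat = pnp_operator([[1.0, 0.5], [0.2, 1.0]], K=2,
                     scale=[1.0, 0.0],        # suppress coordinate 1
                     head_w=[4.0, 0.0], head_b=-2.0)
print(round(p_hat, 3))
```

The alternation is literal in the loop body: each round first filters tokens with the shared operator, then remixes them globally, before a single scalar cut is drawn at the end.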
6. Learning objective in terms of P, ¬P, and the band
We distinguish three “target behaviors”:
- Positive behavior: for $x$ whose true state is safely $P$, we want
$$
\hat{p}(x) \to 1.
$$
- Negative behavior: for safely $\neg P$,
$$
\hat{p}(x) \to 0.
$$
- Ambiguous behavior: for inherently “C-like” cases that belong neither cleanly to $P$ nor $\neg P$, we want
$$
\hat{p}(x) \to 0.5.
$$
The architecture plus training criterion induce a trade-off:
- We do not want more and more points thrown into ambiguity;
- we want the class of truly ambiguous cases to be represented but minimal.
In abstract terms, the learning problem is:
Find parameters $\theta$ such that, for a suitable loss functional $L$,
$$
\theta^{*} \in \arg\min_{\theta} \; \mathbb{E}_{x \sim \mu}\big[L(\hat{p}_{\theta}(x), y)\big],
$$
where $L$ includes terms punishing
- deviation from 1 or 0 on clear points, and
- deviation from 0.5 on deliberately designated ambiguous points, plus
- the size (measure) $\mu(U)$ of the induced uncertainty band.
This embeds the philosophy:
“Train the model so that uncertainty is both explicit and rare.”
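One concrete, hypothetical instantiation of such an $L$, combining the three terms (this sketch uses squared deviations rather than cross-entropy, and a constant per-sample band penalty as a crude surrogate for $\mu(U)$; none of these choices are fixed by the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pnp_loss(logit, target, lam=0.1, tau_minus=0.25, tau_plus=0.75):
    """Per-example loss encoding the three target behaviors.

    target: 1 (safely P), 0 (safely not-P), or None (designated ambiguous).
    Clear points pay squared deviation from 1 or 0; designated ambiguous
    points pay deviation from 0.5; any point falling inside the band pays
    a flat penalty lam, the per-sample surrogate for mu(U).
    """
    p = sigmoid(logit)
    if target is None:
        fit = (p - 0.5) ** 2          # ambiguous: pulled toward 0.5
    else:
        fit = (p - target) ** 2       # clear: pulled toward 0 or 1
    band = lam if tau_minus < p < tau_plus else 0.0
    return fit + band

print(pnp_loss(4.0, 1))     # clear P, confident: tiny loss, no band penalty
print(pnp_loss(0.0, 1))     # clear P, undecided: fit loss plus band penalty
print(pnp_loss(0.0, None))  # designated ambiguous at 0.5: only band penalty
```

The band term is what distinguishes this from ordinary calibration: an undecided prediction is penalized even when its fit term is acceptable, so ambiguity stays explicit but rare.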
7. Philosophical rationale of the ATT–GMLP alternation
Conceptually:
- ATT is expansive: it propagates all characteristics, including “bad” $C$s, across the representation space. It embodies the assumption that anything might relate to anything.
- GMLP is contractive: it performs a kind of semantic projection, discarding or underweighting those aspects of tokens that systematically fail to discriminate $P$ from $\neg P$ and instead feed the band $U$.
By alternating them, we realize a sort of dialectic:
- Spread all available information (including misleading or ambiguous cues).
- Subject that expanded representation to a global notion of “what counts as evidence”, shared across tokens.
- Repeat, gradually purifying the space of characteristics so that what survives multiple rounds is increasingly aligned with the $P$ / $\neg P$ distinction rather than with $U$.
The final MLP head does not “discover” $P$ from scratch; it merely draws a scalar cut in a space already filtered by the repeated ATT–GMLP interplay.
In this view, the P¬P operator is:
- a universal schema for binary judgement,
- explicitly three-valued in its semantics ($P$, $\neg P$, indeterminate),
- and architecturally committed to keeping the third value small but principled, not merely a byproduct of calibration failure.