GMLP: Solving the Uncertainty Issue
1. Ontology of a P¬P problem
Let
- $\mathcal{X}$ be a set of possible inputs (images, texts, vectors, multimodal tensors).
- $(\mathcal{X}, \mathcal{F}, \mu)$ a probability space generating samples $x \sim \mu$.
- $y \in \{0, 1\}$ a binary label with semantics
$$
y = 1 \iff P(x) \text{ holds}, \qquad y = 0 \iff \neg P(x) \text{ holds}.
$$
We assume the existence of a (possibly unknown) Bayes–optimal scoring function
$$
s^{*}: \mathcal{X} \to \mathbb{R}
$$
such that
$$
\Pr(y = 1 \mid x) = \sigma(s^{*}(x)),
$$
with $\sigma$ the logistic sigmoid.
A P¬P model is any measurable $f: \mathcal{X} \to \mathbb{R}$ intended as an approximation of $s^{*}$, interpreted through $\hat{p}(x) = \sigma(f(x))$.
2. Uncertainty band and the three-valued semantics
The key move is not to read $\hat{p}(x)$ as merely “confidence”, but to introduce a structural third region.
Fix parameters $0 < \tau_{-} < 0.5 < \tau_{+} < 1$ and define:
- Positive region (decide $P$): $R_{P} = \{x \in \mathcal{X} : \hat{p}(x) \ge \tau_{+}\}$
- Negative region (decide $\neg P$): $R_{\neg P} = \{x \in \mathcal{X} : \hat{p}(x) \le \tau_{-}\}$
- Uncertainty band: $U = \{x \in \mathcal{X} : \tau_{-} < \hat{p}(x) < \tau_{+}\}$
We read these as a three-valued semantics for the model:
- $x \in R_{P}$: the system asserts $P(x)$.
- $x \in R_{\neg P}$: the system asserts $\neg P(x)$.
- $x \in U$: the system withholds commitment and returns the “indeterminate” value.
The point of P¬P is:
A “better” model is one that, subject to constraints on error, reduces the measure $\mu(U)$ of the uncertainty band, rather than simply pushing more mass toward 1.
Formally, we would like to make small:
$$
\mu(U) = \mu\big(\{x : \tau_{-} < \hat{p}(x) < \tau_{+}\}\big),
$$
while controlling false positives and false negatives:
$$
\Pr(y = 0 \mid x \in R_{P}) \le \alpha, \qquad \Pr(y = 1 \mid x \in R_{\neg P}) \le \beta.
$$
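The three-valued decision rule above can be sketched as a thin wrapper around any logit-producing model. This is a minimal illustration, not the paper's implementation; the threshold values 0.25 and 0.75 are illustrative choices for $\tau_{-}$ and $\tau_{+}$:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def decide(logit: float, tau_minus: float = 0.25, tau_plus: float = 0.75) -> str:
    """Map a model logit f(x) to the three-valued verdict.

    Returns "P" on the positive region R_P, "not-P" on the negative
    region, and "indeterminate" on the uncertainty band U.
    """
    p_hat = sigmoid(logit)
    if p_hat >= tau_plus:
        return "P"
    if p_hat <= tau_minus:
        return "not-P"
    return "indeterminate"

# A confident positive, a confident negative, and a borderline case.
print(decide(3.0))   # p_hat ~ 0.95 -> "P"
print(decide(-3.0))  # p_hat ~ 0.05 -> "not-P"
print(decide(0.2))   # p_hat ~ 0.55 -> "indeterminate"
```

Note that shrinking the band (moving $\tau_{-}$ and $\tau_{+}$ toward 0.5) trades fewer abstentions for more decision errors, which is exactly the constraint structure stated above.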
3. Characteristics vs properties
Let a characteristic $C$ be any measurable subset $C \subseteq \mathcal{X}$ that is encoded in the data (e.g. “has a bright blob in the upper-left patch”, “has a particular phrase”, etc.).
Crucial distinction:
- A property is semantic: whether $P(x)$ holds is a matter of the true label $y$.
- A characteristic is only a pattern in the input.
Nothing prevents a characteristic from satisfying something like
$$
x \in C \implies \Pr(y = 1 \mid x) < \tau_{+},
$$
i.e. whenever $C$ appears, the label is either $\neg P$ or “ambiguous”, but never safely $P$.
If the model naïvely overweights such a $C$, many points are pushed toward the middle of the sigmoid, inflating $\mu(U)$.
Thus the C-problem:
Some characteristics are “uncertainty-generating”: they correlate with $U$ rather than cleanly with $P$ or $\neg P$.
The architecture must be able to represent such $C$s (to avoid blind spots) but not allow them to dominate the decision boundary.
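A toy numeric illustration of an uncertainty-generating characteristic (the counts are synthetic, not from the text): a pattern that occurs equally often under $P$ and $\neg P$ carries no evidence either way, so conditioning on it yields a posterior of exactly 0.5, inside the band:

```python
# Synthetic dataset of (has_characteristic_c, label) pairs.
# The characteristic c occurs in half the P examples and half the
# not-P examples, so it discriminates nothing.
data = [
    (True, 1), (True, 0), (True, 1), (True, 0),    # c present, labels mixed
    (False, 1), (False, 1), (False, 1), (False, 0)  # c absent, mostly P
]

def posterior(with_c: bool) -> float:
    """Empirical Pr(y = 1 | presence/absence of c)."""
    labels = [y for c, y in data if c == with_c]
    return sum(labels) / len(labels)

print(posterior(True))   # 0.5  -> lands in the uncertainty band U
print(posterior(False))  # 0.75 -> usable evidence toward P
```

A model that gives such a $C$ large weight drags every example containing it toward $\hat{p} = 0.5$, which is the band-inflation failure mode described above.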
4. Representation operators: ATT and GMLP
We assume a tokenization of inputs:
- For each $x \in \mathcal{X}$, we obtain a finite multiset of tokens $T(x) = \{t_{1}, \dots, t_{m}\}$, with each $t_{i} \in \mathbb{R}^{d}$.
4.1 Attention operator
Define an attention operator
$$
\mathrm{ATT}: (\mathbb{R}^{d})^{m} \to (\mathbb{R}^{d})^{m}
$$
of the familiar form (abstractly):
- queries, keys, and values $Q t_{i}$, $K t_{j}$, $V t_{j}$ linear in the tokens,
- $\mathrm{ATT}(t_{1}, \dots, t_{m})_{i} = \sum_{j} \mathrm{softmax}_{j}\!\left(\frac{\langle Q t_{i}, K t_{j} \rangle}{\sqrt{d}}\right) V t_{j}$,
possibly with multiple heads and layers, but conceptually:
ATT contextualizes tokens: each token is updated based on all others.
Thus ATT expands purely local characteristics into structured relational information.
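The contextualization step can be sketched in plain Python. This is a minimal single-head version that, as a simplification, uses identity $Q$, $K$, $V$ projections (the operator above assumes learned linear maps):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Each output token is a softmax-weighted mixture of ALL tokens,
    so purely local characteristics get spread across the whole set.
    """
    d = len(tokens[0])
    out = []
    for t_i in tokens:
        scores = [dot(t_i, t_j) / math.sqrt(d) for t_j in tokens]
        weights = softmax(scores)
        mixed = [sum(w * t_j[k] for w, t_j in zip(weights, tokens))
                 for k in range(d)]
        out.append(mixed)
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(tokens):
    print([round(v, 3) for v in row])
```

Since every output is a convex combination of the inputs, no token's information is discarded here; that is exactly the "expansive" role the text assigns to ATT.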
4.2 General MLP (GMLP)
Define a shared per-token operator
$$
g_{\theta}: \mathbb{R}^{d} \to \mathbb{R}^{d}
$$
and lift it to token sets:
$$
\mathrm{GMLP}(t_{1}, \dots, t_{m}) = \big(g_{\theta}(t_{1}), \dots, g_{\theta}(t_{m})\big).
$$
Think of $g_\theta$ as a semantic filter:
- It learns to suppress embeddings corresponding to uncertainty-generating $C$s,
- and to amplify embeddings that contribute to sharp separation.
The crucial constraint is sharing: the same $g_\theta$ applies to all tokens, regardless of their origin. This is what gives P¬P its “general” character: one operator for “what counts as evidence” across all tokens.
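A minimal sketch of the shared operator and its lift to token sets. The two-layer weights here are hand-picked for illustration (keeping coordinate 0, zeroing coordinate 1, as if coordinate 1 encoded an uncertainty-generating direction), not learned:

```python
def g_theta(token, w1, b1, w2, b2):
    """A tiny per-token MLP d -> hidden -> d with ReLU."""
    hidden = [max(0.0, sum(w * t for w, t in zip(row, token)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def gmlp(tokens, params):
    """Lift g_theta to token sets: the SAME parameters for every token."""
    return [g_theta(t, *params) for t in tokens]

# Hand-picked parameters: hidden = relu(x0); output = [hidden, 0].
# The second coordinate is suppressed no matter which token it came from.
params = ([[1.0, 0.0]], [0.0],         # w1, b1
          [[1.0], [0.0]], [0.0, 0.0])  # w2, b2

print(gmlp([[2.0, 5.0], [3.0, -1.0]], params))  # [[2.0, 0.0], [3.0, 0.0]]
```

The sharing constraint is visible in `gmlp`: there is one parameter tuple, applied uniformly, which is what makes the filter a global notion of "what counts as evidence" rather than a per-position rule.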
5. The P¬P operator: structural definition
We define a sequence of representations:
- $H^{0} = T(x)$ (initial tokens).
- For $k = 1, \dots, K$,
$$
H^{k} = \mathrm{ATT}\big(\mathrm{GMLP}(H^{k-1})\big).
$$
So we alternate local purification (GMLP) and global contextualization (ATT).
After $K$ steps:
- Tokens are aggregated via a permutation-invariant pooling operator POOL:
$$
z = \mathrm{POOL}(H^{K}).
$$
- A final scalar head produces a logit:
$$
f(x) = h(z), \qquad \hat{p}(x) = \sigma(f(x)).
$$
Definition (P¬P operator).
An instance of the P¬P model class is any map $f: \mathcal{X} \to \mathbb{R}$ realizable as
$$
f(x) = h\Big(\mathrm{POOL}\big((\mathrm{ATT} \circ \mathrm{GMLP})^{K}(T(x))\big)\Big)
$$
for some integer $K \ge 1$ and parameters of $T$, ATT, GMLP, POOL, $h$.
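The structural definition can be exercised end to end with toy stand-ins. The assumptions here are all simplifications for illustration: mean pooling for POOL, a linear map for $h$, identity attention projections, and a degenerate diagonal $g_\theta$ (a fixed per-coordinate reweighting rather than a learned MLP):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def att(tokens):
    """Global contextualization: identity-projection self-attention."""
    d = len(tokens[0])
    out = []
    for t_i in tokens:
        w = softmax([sum(a * b for a, b in zip(t_i, t_j)) / math.sqrt(d)
                     for t_j in tokens])
        out.append([sum(wj * t_j[k] for wj, t_j in zip(w, tokens))
                    for k in range(d)])
    return out

def gmlp(tokens, scale):
    """Local purification: the same per-token reweighting everywhere."""
    return [[s * v for s, v in zip(scale, t)] for t in tokens]

def pnp_operator(tokens, K, scale, head_w, head_b):
    """p_hat = sigma(h(POOL((ATT o GMLP)^K (T(x)))))."""
    H = tokens
    for _ in range(K):
        H = att(gmlp(H, scale))
    d = len(H[0])
    z = [sum(t[k] for t in H) / len(H) for k in range(d)]  # mean pooling
    logit = sum(w * zk for w, zk in zip(head_w, z)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))

p_hat = pnp_operator([[1.0, 0.5], [0.2, 1.0]], K=2,
                     scale=[1.0, 0.0],        # suppress coordinate 1
                     head_w=[4.0, 0.0], head_b=-2.0)
print(round(p_hat, 3))
```

The alternation is literal in the loop body: each round first filters tokens with the shared operator, then remixes them globally, before a single scalar cut is drawn at the end.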
6. Learning objective in terms of P, ¬P, and the band
We distinguish three “target behaviors”:
- Positive behavior: for $x$ whose true state is safely $P$, we want
$$
\hat{p}(x) \to 1.
$$
- Negative behavior: for safely $\neg P$,
$$
\hat{p}(x) \to 0.
$$
- Ambiguous behavior: for inherently “C-like” cases that belong neither cleanly to $P$ nor $\neg P$, we want
$$
\hat{p}(x) \to 0.5.
$$
The architecture plus training criterion induce a trade-off:
- We do not want more and more points thrown into ambiguity;
- we want the class of truly ambiguous cases to be represented but minimal.
In abstract terms, the learning problem is:
Find parameters $\theta$ such that, for a suitable loss functional $L$,
$$
\theta^{*} \in \arg\min_{\theta} \; \mathbb{E}_{x \sim \mu}\big[L(\hat{p}_{\theta}(x), y)\big],
$$
where $L$ includes terms punishing
- deviation from 1 or 0 on clear points, and
- deviation from 0.5 on deliberately designated ambiguous points, plus
- the size (measure) $\mu(U)$ of the induced uncertainty band.
This embeds the philosophy:
“Train the model so that uncertainty is both explicit and rare.”
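One concrete, hypothetical instantiation of such an $L$, combining the three terms (this sketch uses squared deviations rather than cross-entropy, and a constant per-sample band penalty as a crude surrogate for $\mu(U)$; none of these choices are fixed by the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pnp_loss(logit, target, lam=0.1, tau_minus=0.25, tau_plus=0.75):
    """Per-example loss encoding the three target behaviors.

    target: 1 (safely P), 0 (safely not-P), or None (designated ambiguous).
    Clear points pay squared deviation from 1 or 0; designated ambiguous
    points pay deviation from 0.5; any point falling inside the band pays
    a flat penalty lam, the per-sample surrogate for mu(U).
    """
    p = sigmoid(logit)
    if target is None:
        fit = (p - 0.5) ** 2          # ambiguous: pulled toward 0.5
    else:
        fit = (p - target) ** 2       # clear: pulled toward 0 or 1
    band = lam if tau_minus < p < tau_plus else 0.0
    return fit + band

print(pnp_loss(4.0, 1))     # clear P, confident: tiny loss, no band penalty
print(pnp_loss(0.0, 1))     # clear P, undecided: fit loss plus band penalty
print(pnp_loss(0.0, None))  # designated ambiguous at 0.5: only band penalty
```

The band term is what distinguishes this from ordinary calibration: an undecided prediction is penalized even when its fit term is acceptable, so ambiguity stays explicit but rare.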
7. Philosophical rationale of the ATT–GMLP alternation
Conceptually:
- ATT is expansive: it propagates all characteristics, including “bad” $C$s, across the representation space. It embodies the assumption that anything might relate to anything.
- GMLP is contractive: it performs a kind of semantic projection, discarding or underweighting those aspects of tokens that systematically fail to discriminate $P$ from $\neg P$ and instead feed the band $U$.
By alternating them, we realize a sort of dialectic:
- Spread all available information (including misleading or ambiguous cues).
- Subject that expanded representation to a global notion of “what counts as evidence”, shared across tokens.
- Repeat, gradually purifying the space of characteristics so that what survives multiple rounds is increasingly aligned with the $P$ / $\neg P$ distinction rather than with $U$.
The final MLP head does not “discover” $P$ from scratch; it merely draws a scalar cut in a space already filtered by the repeated ATT–GMLP interplay.
In this view, the P¬P operator is:
- a universal schema for binary judgement,
- explicitly three-valued in its semantics ($P$, $\neg P$, indeterminate),
- and architecturally committed to keeping the third value small but principled, not merely a byproduct of calibration failure.