# A report on "Inferential Models", a framework for prior-free imprecise-probabilistic inference

Posted on June 6, 2021 by Ryan Martin (edited by Ignacio Montes)


## Abstract

I was delighted when Ignacio Montes, SIPTA’s Executive Editor, invited me to share some details here about the monograph,
*Inferential Models: Reasoning with Uncertainty*, co-authored by Chuanhai Liu and myself, and published by Chapman & Hall/CRC Press back in 2015.
The book develops a new framework for statistical inference, one that assumes no prior, but returns (something like) a posterior probability distribution that
can be used for drawing inferences. Moreover, that probability-like output satisfies a strong calibration property, called *validity*, which, among other
things, implies that procedures derived from it provably control frequentist error rates. How is this related to imprecise probability? Indeed, this is not
a book about imprecise probability, but it turns out that the inferential model’s “probability-like” output must be imprecise in order to achieve the validity
property. Therefore, random sets, belief/plausibility functions, etc., play an important role in the development of this theory of statistical inference.

Below I first explain how the project originally got started—which is interesting in this case, given that we didn’t start in the world of imprecise probability but ended up there anyway. Next, I give some detail about what kind of results can be found in the book, and then about a few results that are missing from the book, i.e., relevant things that we know now but didn’t know then. Finally, I say a few words about what’s next, where my current attention is focused.

## How’d the book project get started?

As a PhD student at Purdue University, I was trained in both classical/frequentist and Bayesian statistics, but I had no formal training in imprecise probability; I wasn’t even aware of its existence. What set me on my long and indirect path to imprecise probability was learning about Fisher's fiducial inference [Efron1998] in an unlikely place: Chuanhai’s computational statistics course in Fall 2007.

Roughly, Fisher argued that a sort of “posterior distribution” could be derived from data and a statistical model alone, no need for a prior or Bayes’s theorem. Naturally, there was skepticism surrounding Fisher’s claims, and rightfully so: ultimately, the fiducial argument doesn’t work. While some say this is Fisher's biggest blunder, the many attempts to understand, refine, and/or debunk Fisher’s claims led to fundamental developments in statistics, most notably, Neyman's confidence intervals and Dempster's lower/upper probabilities, both of which are closely tied to imprecise probability. While Fisher’s attempt at posterior probabilistic inference without Bayes’s theorem fell short, this remains a sort of “Holy Grail” that, at least to some, may not be out of reach [Efron2013]:

…perhaps the most important unresolved problem in statistical inference is the use of Bayes theorem in the absence of prior information.

I was inspired by Chuanhai’s lectures on Fisher’s and Dempster’s theories, so we began discussing research in this direction. As a student, I didn’t know what I was getting into; I certainly didn’t know that we were starting a Grail quest. All I knew was that

- for various reasons, statisticians desire a sort of “posterior distribution” on which to base inference, but specifying a full prior distribution and using Bayes’s theorem is often too much of a burden, and
- both Fisher and Dempster offered good ideas for constructing a prior-free posterior but these were generally unsatisfactory to statisticians.

Our jumping-off point was trying to write down directly what we want the inferential output to satisfy. The posterior probabilities themselves are the primitive, so the desired property should be focused on these rather than on procedures derived from them. Moreover, the property should be strong enough to imply that procedures derived from the output provably control frequentist error rates. We eventually arrived at the following more specific version of Efron’s “most important unresolved problem”.

If “\(Y \sim \mathsf{P}_{Y|\theta}\)” is a model for data \(Y\), where \(\theta\) is an unknown parameter in \(\Theta\) about which inference is sought, then the goal is to construct a “posterior distribution” \(\Pi_Y\), depending on data \(Y\), without requiring a prior or Bayes’s theorem, such that \[\label{eq}\sup_{\theta \not\in A} \mathsf{P}_{Y|\theta}\{\Pi_Y(A) > 1-\alpha\} \leq \alpha \quad \forall \; \alpha \in [0,1], \quad \forall \; A \subseteq \Theta. \tag{$\star$}\]

In words, our aim was to construct something like a posterior distribution such that the “probability” assigned by it to any false hypothesis being large is
a rare event with respect to the posited model \(Y \sim \mathsf{P}_{Y|\theta}\). It can be shown that, e.g., the derived test that
rejects a hypothesis “\(H_0: \theta \in A\)” if \(\Pi_Y(A^c) > 1-\alpha\) controls the
frequentist Type I error rate at level \(\alpha\). This goes beyond the construction of a procedure for testing
“\(H_0: \theta \in A\)” at a particular \(A\) because it actually returns an entire
“posterior distribution” from which valid inference about *any relevant feature \(\phi(\theta)\) of
\(\theta\)* can be derived; this is an important consequence of the “for all \(A\)” part of
\eqref{eq}.
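
The Type I error claim above follows from \eqref{eq} in one line. As a sketch, using a strict inequality in the rejection rule so that it matches \eqref{eq} exactly: fix \(A\) and any \(\theta \in A\). Then \(\theta \not\in A^c\), so applying \eqref{eq} to the set \(A^c\) gives
\[ \mathsf{P}_{Y|\theta}\{\Pi_Y(A^c) > 1-\alpha\} \leq \sup_{\vartheta \not\in A^c} \mathsf{P}_{Y|\vartheta}\{\Pi_Y(A^c) > 1-\alpha\} \leq \alpha. \]
The left-hand side is the probability of rejecting a true \(H_0\), and the bound holds for every \(\theta \in A\).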

At the start, we had no reason to doubt that our intended target, \(\Pi_Y\), would be a precise probability distribution, but
we were open to going in whatever direction was necessary. It didn’t take us long, however, to realize that *a condition like \eqref{eq} can’t be satisfied by
a precise probability distribution*. Ultimately, we realized that it can be achieved using suitably constructed *random sets*, leading to lower/upper probabilities
defined on \(\Theta\). The insights behind the construction of these random sets, how they’re converted to lower/upper
probabilities for quantifying uncertainty about \(\theta\), and what statistical properties this construction ensures were the
book’s genesis. These details started to get fleshed out during Chuanhai’s special topics course on the subject in Fall 2008.

## What’s in the book?

Problems in data science involve data (of course), models, prior information, and unknowns that are believed to be relevant to the phenomenon under investigation.
The data scientist’s goal is to convert this input into a meaningful quantification of uncertainty about the unknowns. An inferential model, or IM, is a mapping
from these inputs to a quantification of uncertainty about the unknowns. The book is mainly about how, in
Efron’s important special case of *no prior information*,
to construct this mapping so that the output is guaranteed to be valid in the sense of \eqref{eq}. Despite the efforts by Bayes, Fisher, and others to quantify
uncertainty using ordinary probability, our notion of validity is incompatible with such additive output. Therefore, the IM’s output must be an imprecise probability and, in
particular, random sets play an important role in the book’s developments.

More specifically, we develop a framework that starts with a particular representation of the statistical model, in the form of what we call an *association* that
relates the observable data \(Y\) and unknown model parameter \(\theta\) with an unobservable auxiliary variable
\(U\) having a known distribution. In math, this is written as
\[ Y = a(\theta,U), \quad U \sim \mathsf{P}_U, \]
where \(\mathsf{P}_U\) is a known distribution. In the impractical case where the value \(u\) of
\(U\) were observed, along with data \(Y=y\), then one could solve the above equation for
\(\theta\) (perhaps giving a set of \(\theta\) values) and achieve the best possible inference, i.e.,
\[ \theta \in \Theta_y(u) := \{\vartheta \in \Theta: y=a(\vartheta,u)\}. \]
While \(U\) is actually unobservable, the fact that we know its distribution provides some opportunities. Following Fisher, Dempster,
and others, one could assume \(U \sim \mathsf{P}_U\) both before and *after* \(Y=y\) is observed, and
consider the random variable or random set \(\Theta_y(U)\), where \(U \sim \mathsf{P}_U\). However, the
resulting fiducial distribution, or lower/upper probabilities in Dempster’s case, do not achieve the desired validity property in \eqref{eq}. The problem is that,
even with the imprecision generally found in Dempster’s solution, the inferential output is still too precise. That is, more imprecision is required in order to
achieve \eqref{eq}. This is where the random sets mentioned above come in. Our idea was to introduce a random set \(\mathcal{S}\)
to quantify uncertainty about the *unobserved value of \(U\)*. This is different from Fisher/Dempster who quantify the uncertainty
about \(U\) the same before and after \(Y=y\) is observed. The motivation for this difference is
based on the above intuition, which says that what’s important is the actual value of \(U\) corresponding to the observation
\(Y=y\) and the unknown \(\theta\). From this perspective, it doesn’t make sense to quantify
uncertainty about a *particular realization* of a random variable using its a priori distribution. Our choice was to use a suitable random set
\(\mathcal{S}\) defined on the \(U\)-space and push it forward to a random set on the
\(\theta\)-space:
\[ \Theta_y(\mathcal{S}) := \bigcup_{u \in \mathcal{S}} \Theta_y(u). \]
At least intuitively, it’s clear that this random set would not be smaller than Dempster’s, so we have effectively created more imprecision. But this
imprecision-creation is done in a strategic way so that the validity property \eqref{eq} is achieved. A specific result (Theorem 4.3) states that there is
no benefit to the use of a non-nested random set \(\mathcal{S}\). But if \(\mathcal{S}\) is
nested, then \(\Theta_y(\mathcal{S})\) will be nested too, which implies that the IM output—derived from the distribution of
\(\Theta_y(\mathcal{S})\), as a function of the random \(\mathcal{S}\), for fixed
\(y\)—takes the form of a consonant belief/plausibility function or, equivalently, a necessity/possibility measure. These
developments make up the majority of Chapters 2-5 in the book.
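
To make the construction concrete, here is a minimal Python sketch for a toy problem. Everything specific in it is an assumption chosen for illustration rather than taken from the text above: a single observation \(Y \sim \mathsf{N}(\theta, 1)\), the association \(Y = \theta + \Phi^{-1}(U)\) with \(U \sim \mathsf{Unif}(0,1)\), and the nested random set \(\mathcal{S} = \{u: |u - 0.5| \leq |U' - 0.5|\}\) with \(U' \sim \mathsf{Unif}(0,1)\). Under these assumptions the plausibility contour has the closed form \(1 - |2\Phi(y - \vartheta) - 1|\). The function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; its cdf plays the role of Phi


def plausibility_contour(y, vartheta):
    """P{Theta_y(S) contains vartheta} for the toy normal-mean IM (assumed setup)."""
    # u solves the association y = vartheta + Phi^{-1}(u)
    u = STD_NORMAL.cdf(y - vartheta)
    # S covers u exactly when |U' - 0.5| >= |u - 0.5|, with U' ~ Unif(0, 1)
    return 1.0 - abs(2.0 * u - 1.0)


def plausibility_interval(y, alpha):
    """Endpoints of {vartheta : plausibility_contour(y, vartheta) > alpha}."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (y - z, y + z)


def false_alarm_rate(theta, alpha, n_sims=20000, seed=0):
    """Monte Carlo estimate of P_theta{ pl_Y(theta) <= alpha }."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        y = theta + rng.gauss(0.0, 1.0)  # draw data from the posited model
        if plausibility_contour(y, theta) <= alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print(plausibility_interval(y=0.3, alpha=0.05))
    for alpha in (0.05, 0.10, 0.25):
        # Each estimate should be close to alpha
        print(alpha, false_alarm_rate(theta=1.5, alpha=alpha))
```

In this sketch the plausibility assigned to the true \(\theta\) is uniformly distributed under the model, so it falls at or below \(\alpha\) with probability \(\alpha\). Any false assertion \(A\) has belief at most one minus the plausibility of the true \(\theta\), which is how a bound of the form \eqref{eq} arises in this toy case.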

After a development of the general ideas described above, the book considers some more refined questions about the IM’s efficiency, i.e., how sharp can the
inferences be. A critical element to the efficiency question is the dimension of the auxiliary variable \(U\). There will be
cases when the dimension of \(U\) can be reduced, and this will lead to a gain in efficiency. There are two general and
practically important cases where this dimension reduction is needed, namely, when combining information about a common unknown from multiple sources and
when only certain features of the full parameter are of interest. These efficiency-seeking extensions are developed in Chapters 6-7 of the book, respectively.
Note that, although we are working with belief/plausibility functions like in Dempster’s framework, the famous
*Dempster's rule of combination* does not
preserve validity (it’s not designed for that), so new ideas are needed.

The remainder of the book focuses on applications and extensions of the above ideas. In particular, a notion of validity is defined for prediction problems and
a corresponding IM achieving this is constructed in Chapter 9. In Chapter 10 we develop an IM approach for the *multiple testing problem* that involves constructing
a random set that is “optimal” with respect to a collection of assertions simultaneously. Chapter 11 develops an important generalization of the basic IM
construction that relaxes the requirement that the association “\(Y=a(\theta,U)\)” completely characterize the distribution of
\(Y\). This generalization has proved to be useful recently in the construction of IMs in certain nonparametric or model-free
applications (e.g., [CellaMartin2020]).

## What’s not in the book?

There have been a number of important developments in the years after the book was published that are worth mentioning here.

- It was mentioned above that an IM whose output is additive, i.e., a probability measure, cannot achieve validity. This was established more formally in [Balch2019], via the so-called *false confidence theorem*. The point is that, when the IM output is additive, there will always be false assertions about \(\theta\) that will tend to be assigned high probability. This highlights the importance of imprecision when one desires strong control on the operating characteristics of the IM output. For more details, see [Martin2019].
- It is easy to show that validity in the sense of \eqref{eq} implies that procedures derived from the valid IM have frequentist error rate control. What about the converse, i.e., are there procedures that control error rates that can’t be derived from a valid IM? I have recently shown [Martin2021] that the answer is, effectively, no. That is, for *any test or confidence procedure* that provably controls frequentist error rates, there exists a valid IM whose derived procedure is at least as efficient as the one given. Therefore, there are no good frequentist procedures that are beyond the reach of valid IMs. Even good Bayesian credible intervals can be re-expressed as being derived from a valid/non-additive IM.
- The validity property is focused exclusively on frequentist calibration and reliability. Much of the imprecise probability literature, on the other hand, takes a more subjective or personal view of probability, so the focus is largely on properties like *coherence* that ensure a gambler is internally rational in the sense that, relative to his beliefs and available information, he can’t be made a sure loser. At least intuitively, it shouldn’t be possible for something that’s reliable to be internally irrational, so one would expect some connections between these two ideas. There has been some recent work along these lines (e.g., [CellaMartin2020, Martin2021b]), but there is no mention of coherence or related notions in the book.
- Chuanhai and I were working mostly from first principles so, regrettably, there is very little mention of the relevant imprecise probability developments in the book. I’ve since made this connection with the imprecise probability literature, so I’m trying to tie things together in my current work. But it’s particularly exciting for me to see that so much of what Chuanhai and I developed on our own closely aligns with things I’ve been reading about recently in the imprecise probability literature. That’s a clear sign that we were on the right track back in 2008!
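
The false confidence phenomenon in the first bullet can be seen in a small simulation. The sketch below rests on assumptions of my own choosing, not on the cited papers: one observation \(Y \sim \mathsf{N}(\theta_0, 1)\), the flat-prior posterior \(\mathsf{N}(y, 1)\) as the additive output, and the false assertion \(A = \{\vartheta: |\vartheta - \theta_0| > \varepsilon\}\), which excludes the true value. The function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_prob_outside(y, theta0, eps):
    """Flat-prior posterior N(y, 1) probability of the FALSE assertion
    A = {vartheta : |vartheta - theta0| > eps} (illustrative assumption)."""
    inside = STD_NORMAL.cdf(theta0 + eps - y) - STD_NORMAL.cdf(theta0 - eps - y)
    return 1.0 - inside


def false_confidence_rate(theta0, eps, alpha, n_sims=20000, seed=0):
    """Monte Carlo estimate of P_theta0{ Pi_Y(A) > 1 - alpha } for the false A."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_sims):
        y = theta0 + rng.gauss(0.0, 1.0)  # data generated at the true theta0
        if posterior_prob_outside(y, theta0, eps) > 1.0 - alpha:
            count += 1
    return count / n_sims


if __name__ == "__main__":
    # Compare the estimated rate with the nominal bound alpha = 0.1
    print(false_confidence_rate(theta0=0.0, eps=0.1, alpha=0.1))
```

With \(\varepsilon = 0.1\), the posterior mass of the interval around \(\theta_0\) never exceeds \(2\Phi(0.1) - 1 \approx 0.08\), so the false assertion \(A\) receives posterior probability above 0.9 for every data set, and the rate in \eqref{eq} is 1 rather than at most \(\alpha = 0.1\).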

## What’s next?

While the *no-prior-information* setup is practically relevant, one could also argue that every problem has at least partial prior information available. So as an
alternative to the all-or-nothing Bayes vs. frequentist dichotomy that exists in the statistics literature, one can easily imagine a spectrum where the more I’m
willing to assume (in the form of prior information), the more precise I can be in my inferential statements. In the case of complete prior information, an
additive IM can be given in the form of the precise Bayesian posterior probability; for the classical frequentist case with no prior information, what’s described above
(and in more detail in the book) leads to a non-additive IM that’s generally in the form of an imprecise necessity/possibility measure.

Then there are cases in between the two extremes, ones with partial prior information available. The problems I have in mind are those where the parameter is high-dimensional but is known to have (or we’re willing to assume that it has) a certain low-dimensional or low-complexity structure, such as sparsity. One might be willing to specify a prior distribution for the “complexity” of the parameter but not be willing to make precise probability statements about the particular features of the parameter at a given complexity level. One option would be to ignore the prior information and achieve validity by using the IM developments described above and in the book. But presumably there’s an opportunity to improve the IM’s efficiency by incorporating this partial prior information in some sense. My current focus is on incorporating this partial prior information in an efficient way that retains (some sensible version of) the validity property.

## Main references

[Efron1998] B.Efron. “R. A. Fisher in the 21st century (Invited paper presented at the 1996 R. A. Fisher Lecture)”. In:
*Statistical Science* 13(2) (1998), pp. 95-122.

[Efron2013] B.Efron. “Discussion”. In: *International Statistical Review* 81(1) (2013), pp. 41-42.

[CellaMartin2020] L.Cella and R.Martin. “Validity, consonant plausibility measures, and conformal prediction”. In:
*Researchers.One* (2021).

[Balch2019] M.S.Balch, R.Martin and S.Ferson. “Satellite conjunction analysis and the false confidence theorem”. In: *Proceedings of the Royal Society A - Mathematical,
Physical and Engineering Sciences* 475(2227) (2019).

[Martin2019] R.Martin. “False confidence, non-additive beliefs, and valid statistical inference”. In:
*Researchers.One* (2019).

[Martin2021] R.Martin. “An imprecise-probabilistic characterization of frequentist statistical inference”. In:
*Researchers.One* (2021).

[Martin2021b] R.Martin. “Towards a theory of valid inferential models with partial prior information”. In:
*Researchers.One* (2021).