# Tathagata Basu’s PhD thesis on High dimensional statistical modelling under limited information

Posted on September 12, 2022 by Tathagata Basu (edited by Henna Bains)


After three years of research and extensive brainstorming with my supervisors, Jochen Einbeck and Matthias Troffaes, I finally defended my thesis on 15th December 2020. The thesis, entitled “High dimensional statistical modelling under limited information”, was examined by Dr Hailiang Du and Dr Erik Quaghebeur in the presence of Dr Ostap Hryniv.

## Motivation

My doctoral thesis focused on the statistical relationship between a response variable and a set of input variables when the number of input variables exceeds the number of observations. Classical regularisation-type methods for this setting usually rest on certain asymptotic properties of variable selection problems. However, these methods may not be very reliable, because those asymptotic properties rely on the assumption that the observed data are a true representation of the population. This motivated us to perform a sensitivity analysis on the regularising component of LASSO-type methods to check the robustness of the variable selection. As many statisticians argue, Bayesian approaches are inherently regularisation methods. A very natural way to check this sensitivity in the Bayesian paradigm is therefore to perform a robust Bayesian analysis over a set of shrinkage priors on the regression coefficients.

## Likelihood-based approaches

The thesis starts by investigating the different frequentist methods, or ‘likelihood-based approaches’. These methods add one or more penalty terms to the likelihood function to force some of the regression coefficients to be zero [4], which gives us automatic variable selection. An obvious way to check the robustness of such methods is to perform a sensitivity analysis on the penalty term: we track how the dimensionality of the selected model changes with the penalty term, which, from a decision-theoretic perspective, can also be seen as a constraint on the cost function. These sensitivity-analysis-based techniques also give us a robust binary classifier for logistic regression problems with sparsity constraints [1]. For continuous inputs, this classifier regularly performs better than the IDM-based classifiers, as we don’t need to categorise the inputs in the first place.
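To make the idea of tracking dimensionality against the penalty concrete, here is a minimal sketch in plain Python. It is not the code used in the thesis: it assumes the standard LASSO objective (1/(2n))·‖y − Xβ‖² + λ‖β‖₁ for linear regression, fits it by cyclic coordinate descent, and counts the non-zero coefficients over a grid of λ values. The function names (`lasso_cd`, `support_path`, `lambda_max`) and the toy data are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; equals y because beta starts at 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lambda_max(X, y):
    """Smallest penalty at which the all-zero vector solves the problem."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))


def support_path(X, y, lambdas):
    """Number of non-zero coefficients for each penalty value."""
    return [(lam, sum(1 for b in lasso_cd(X, y, lam) if b != 0.0)) for lam in lambdas]


# Hypothetical toy data with more inputs (p = 30) than observations (n = 10).
rng = random.Random(0)
n, p = 10, 30
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * X[i][0] - 1.5 * X[i][1] + rng.gauss(0.0, 0.1) for i in range(n)]

lam_max = lambda_max(X, y)
for lam, size in support_path(X, y, [lam_max * f for f in (1.0, 0.5, 0.1, 0.01)]):
    print(f"lambda = {lam:.4f}: {size} selected variable(s)")
```

Variables that stay selected over a wide range of λ can be read as robustly selected, while those that enter only for very small penalties are more sensitive to the choice of penalty. The logistic regression classifier of [1] follows the same spirit but is not reproduced here.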

## Robust Bayesian variable selection

Likelihood-based approaches give us an idea about the quality of the data, but they are still a little limiting, as they do not allow us to incorporate any prior information. A very natural way to incorporate subjective information in our analysis is to tackle the problem in a Bayesian setting, so we looked into different Bayesian variable selection routines. After an extensive literature review, we realised that indicator-based selection methods make it convenient to perform robust Bayesian analyses in an interpretable manner. We therefore decided to use spike and slab type priors [3] for the regression coefficients. Additionally, we used imprecise beta distributions to specify bounds on our prior expectation of the selection probability of each covariate, that is, the selection probability of the selection indicators. Throughout, we considered conjugate priors, which give us closed-form expressions for the posteriors.

Early on, we noticed that in the orthogonal design case, the posterior selection probability of each covariate is monotone with respect to our choice of prior hyperparameter for the selection indicators (or simply, α). This was interesting from a theoretical point of view, as it sheds some light on the importance of prior specification for variable selection, but the setting was not very realistic. Later on, whilst writing a paper [2], we were able to generalise this result to the general design case and show that the posterior model selection probability is monotone with respect to α.
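The role of monotonicity can be illustrated with a deliberately simplified sketch, which is not the model from the thesis. It assumes a single coefficient in an orthogonal design with estimate `z`, a known noise variance `sigma2`, a point-mass spike at zero, a zero-mean normal slab with variance `tau2`, and a prior expected selection probability `alpha` that is only known to lie in an interval. All names, default values and the three-way decision rule are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal distribution with variance var."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_inclusion(z, alpha, sigma2=1.0, tau2=10.0):
    """Posterior selection probability for prior expected selection probability alpha.

    Assumes z ~ N(beta, sigma2), beta = 0 under the spike and
    beta ~ N(0, tau2) under the slab (a simplification for illustration).
    """
    m1 = normal_pdf(z, sigma2 + tau2)  # marginal density of z under the slab
    m0 = normal_pdf(z, sigma2)         # marginal density of z under the spike
    return alpha * m1 / (alpha * m1 + (1.0 - alpha) * m0)


def inclusion_bounds(z, alpha_low, alpha_up, sigma2=1.0, tau2=10.0):
    """Lower and upper posterior selection probabilities over [alpha_low, alpha_up].

    Because the posterior is increasing in alpha, the bounds are attained
    at the two end points of the interval.
    """
    return (posterior_inclusion(z, alpha_low, sigma2, tau2),
            posterior_inclusion(z, alpha_up, sigma2, tau2))


def decide(z, alpha_low, alpha_up, threshold=0.5):
    """Illustrative rule: select, exclude, or report indeterminacy."""
    lower, upper = inclusion_bounds(z, alpha_low, alpha_up)
    if lower > threshold:
        return "selected"
    if upper < threshold:
        return "excluded"
    return "indeterminate"


for z in (0.0, 2.0, 5.0):
    lower, upper = inclusion_bounds(z, 0.2, 0.8)
    print(f"z = {z}: [{lower:.3f}, {upper:.3f}] -> {decide(z, 0.2, 0.8)}")
```

In this toy setting, monotonicity means that evaluating the posterior at the two extreme values of α is enough to bound it over the whole set of priors, which is what makes the robust analysis tractable. When the interval of posterior probabilities straddles the threshold, the data alone do not settle whether the covariate should be selected.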

## Issues

One aspect of linear regression is model fitting, where we are interested in the goodness of fit. In robust Bayesian analysis, we have a set of posteriors, which makes model fitting non-trivial. We introduced two different measures for this purpose. However, these are rather crude ways of describing goodness of fit and indeterminacy in model fitting. We would like a more sophisticated way of defining measures of accuracy that can also be compared with other methods. A utility-based trade-off may be an interesting direction to explore.

## References

[1] Basu, Tathagata, Troffaes, Matthias C. M. and Einbeck, Jochen. ‘Binary Credal Classification Under Sparsity Constraints’, Information Processing and Management of Uncertainty in Knowledge-Based Systems, Proceedings of the 18th International Conference IPMU 2020, Springer, pp. 82–95, 2020. https://doi.org/10.1007/978-3-030-50143-3_7

[2] Basu, Tathagata, Troffaes, Matthias C. M. and Einbeck, Jochen. ‘A robust Bayesian analysis of variable selection under prior ignorance’, Sankhya A (accepted for publication). https://arxiv.org/abs/2204.13341

[3] Ishwaran, Hemant and Rao, J S. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, Apr 2005. ISSN 0090-5364. https://doi.org/10.1214/009053604000001147

[4] Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996. ISSN 00359246. https://www.jstor.org/stable/2346178