We encounter artificial intelligence (AI) all around us. Sometimes it is easy to recognize, as in chatbots; sometimes it is hidden, as in the classifiers that guess which movie we might want to see next. Both variants of AI can make data analysis in the empirical sciences more efficient. At the Thomas Bayes Institute, we are excited to research how all kinds of AI can assist data analysis.
Classifiers have made a substantial impact in the analysis of complex and large data sets, such as neuroimaging data. The basic idea is simple: if two groups in a data set, for example younger and older participants, differ, then a classifier can be trained to predict from the data whether a participant is young or old. The reverse also holds: if both groups are drawn from the same distribution, even the best classifier cannot distinguish them from their data. Hence, if we can show that a classifier is reliably better than chance at telling two groups apart, the groups must differ. The standard way to test whether a classifier beats chance is permutation testing. The idea is to repeatedly shuffle the outcome labels of the data set relative to the predictors, train the classifier, and measure its accuracy. Over a large number of repetitions, this estimates the distribution of the accuracy under the null hypothesis of no group difference, and the accuracy on the correctly labeled data is then compared to this distribution.
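The permutation-testing procedure described above can be sketched in a few lines. This is a minimal illustration with made-up toy data and an arbitrarily chosen classifier (logistic regression via scikit-learn); the number of permutations and the cross-validation scheme are illustrative choices, not a prescription.

```python
# Minimal sketch of a permutation test for classifier accuracy.
# Toy data and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy data: two groups of 50 whose feature means differ slightly.
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)),
               rng.normal(0.5, 1.0, (50, 5))])
y = np.repeat([0, 1], 50)

def cv_accuracy(X, y):
    """Cross-validated accuracy of a simple classifier."""
    return cross_val_score(LogisticRegression(), X, y, cv=5).mean()

observed = cv_accuracy(X, y)

# Null distribution: retrain after shuffling the labels.
n_perm = 200
null_acc = np.array([cv_accuracy(X, rng.permutation(y))
                     for _ in range(n_perm)])

# One-sided p-value: fraction of permutations at least as accurate.
p = (np.sum(null_acc >= observed) + 1) / (n_perm + 1)
print(f"accuracy = {observed:.2f}, p = {p:.3f}")
```

Note that the classifier is retrained once per permutation, which is exactly the computational cost discussed below.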
This procedure has two essential shortcomings. (1) It is very compute-intensive, since the classifier must be retrained for every permutation, and (2) the distribution of the accuracy can only be estimated at the single point where the classifier knows nothing, i.e., under the null hypothesis that both groups are identical. The procedure therefore does not allow testing, for instance, whether the classifier's accuracy exceeds 60%. Nor does it allow Bayesian methods to be applied, which is an important requirement for modern data analysis.
Fortunately, the Thomas Bayes Institute has developed an alternative procedure called Independent Validation. This method estimates the accuracy with a known distribution for every true accuracy, which allows us to compute the likelihood, and hence the posterior, of the classifier accuracy. AI methods can thus be used within a Bayesian framework. Rather than only testing the hypothesis that the classifier performs better than an uninformed guess, it is just as easy to find the posterior probability that the classifier performs better than 60%, or any other chosen level of performance. Best of all, the method is much more time-efficient, allowing for much better precision.
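For the full details of Independent Validation, see the institute's publications. As a simplified illustration of the Bayesian step, suppose the validation stage yields independent correct/incorrect trials, so the true accuracy has a binomial likelihood; with a uniform Beta(1, 1) prior, the posterior is again a Beta distribution, and the posterior probability of exceeding any performance level follows directly. The counts below are hypothetical.

```python
# Simplified illustration (not the full Independent Validation method):
# assume independent validation trials, so the accuracy has a binomial
# likelihood and a Beta(1, 1) prior yields a Beta posterior.
from scipy.stats import beta

correct, total = 72, 100                # hypothetical validation outcomes
posterior = beta(1 + correct, 1 + total - correct)

# Posterior probability that the true accuracy exceeds 60%.
p_above_60 = 1 - posterior.cdf(0.60)
print(f"P(accuracy > 0.60 | data) = {p_above_60:.3f}")
```

This is exactly the kind of question the permutation test cannot answer: a posterior statement about any chosen performance level, not just about chance.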
The most prominent application of classifiers is to "Big Data," where data are abundant. At the Thomas Bayes Institute, we are also researching how classifier analyses behave with small sample sizes. Here, too, classifiers prove more flexible than classical methods, provide the same or better statistical power, and allow modern Bayesian approaches even for ordinal or categorical data that previously had to be analyzed with non-parametric methods. Classifiers thus provide a framework that unifies a large number of classical tests (e.g., U-tests, chi-square tests, and many more), which makes these methods simpler both to apply and to teach.
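The unification idea can be illustrated on a small example. Below, hypothetical ordinal ratings from two groups are compared both the classical way (a Mann-Whitney U-test) and the classifier way (can group membership be predicted from the rating?). The data, classifier, and cross-validation scheme are all illustrative assumptions; the point is only that both routes ask whether the groups differ.

```python
# Illustration of the unification idea on made-up ordinal data:
# a classical U-test next to a classifier-based group comparison.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Ordinal ratings (1-5) from two groups with shifted distributions.
a = rng.choice([1, 2, 3, 4, 5], size=60, p=[0.3, 0.3, 0.2, 0.1, 0.1])
b = rng.choice([1, 2, 3, 4, 5], size=60, p=[0.1, 0.1, 0.2, 0.3, 0.3])

# Classical route: Mann-Whitney U-test on the ratings.
u_p = mannwhitneyu(a, b).pvalue

# Classifier route: predict group membership from the rating alone.
X = np.concatenate([a, b]).reshape(-1, 1).astype(float)
y = np.repeat([0, 1], 60)
acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"U-test p = {u_p:.4f}, classifier accuracy = {acc:.2f}")
```

Both analyses detect the group difference; the classifier route additionally admits the Bayesian treatment described above.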
Another exciting area of research is the application of AI to data analysis in the form of Large Language Models (LLMs). Our goal is to help researchers translate informal models into empirically falsifiable hypotheses. At the Thomas Bayes Institute, in collaboration with the University of Virginia, we are researching the possibility of using LLMs as 'assistants' in creating statistical models or even whole research designs.
At the Thomas Bayes Institute, we are thrilled to be working on a number of research projects. These include:
(1) Developing efficient and flexible methods to estimate classifier accuracy and draw conclusions based on it.
(2) Using classifiers for small numbers of participants.
(3) Unifying classical non-parametric tests in a simple, uniform, and Bayesian inference framework.
(4) Applying LLMs in creating analysis models and study designs.