The Trade-off Between Advanced AI/ML Models and Simple Linear/Logistic Regressions for Accuracy
Understanding the Algorithms
Conventional, complex machine learning relies on enormous amounts of data to train AI. For example, MNIST, a popular computer-vision dataset designed to train models to read human handwriting, contains 60,000 images of handwritten digits from 0 to 9. Models like ChatGPT require billions of words to learn to produce human-like text.
These methods are expensive and demand immense computational power, not to mention extensive data preprocessing. ImageNet, for example, an image dataset used to train visual-recognition tools, contains thousands of manually sorted categories.
Advances in deep learning promise smarter models. For example, an approach known as “transfer learning” makes it possible to train an AI model to find kidneys in ultrasound images using only 45 example images. Recently, a tool designed by a group of MIT researchers demonstrated the ability to learn to recognize handwritten digits by condensing the above-mentioned MNIST database to only 10 images of each digit.
Yet, as ML algorithms learn to generalize from smaller datasets and acquire more human-like reasoning, they also become more complex and harder to interpret. Data scientists call them black-box ML models because of the lack of transparency in how they reach their conclusions.
In high-stakes decisions, as in medicine, finance, or criminal justice, this complexity becomes an obstacle: you can never predict when a complex algorithm will produce a wrong or even harmful decision. More importantly, the mechanisms for auditing such models are scarce.
On the other hand, we have predictable algorithms like linear regression. Despite being around for more than a century, this tried-and-proven statistical method is still widely used for its many benefits. Linear regression is easy to interpret; its transparency places it among the glass-box ML models. It also benefits from a large body of audit methods developed over decades to check its accuracy. For these reasons, regression models are preferred when high-stakes decisions (like saving human lives) are made.
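To illustrate that transparency, here is a minimal sketch using scikit-learn and synthetic, made-up data (the two feature names are hypothetical). Every prediction of a linear regression is just an intercept plus a weighted sum of the inputs, so it can be recomputed and audited by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two hypothetical features (e.g., age and dosage, both scaled).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Glass box: each prediction is intercept + weighted sum of inputs,
# so it can be verified manually, line by line.
x_new = np.array([0.5, -1.0])
manual = model.intercept_ + model.coef_ @ x_new
print("learned weights:", model.coef_, "intercept:", model.intercept_)
print("model prediction:", model.predict([x_new])[0], "manual check:", manual)
```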
Data Availability and Quality
When it comes to data availability and quality, advanced ML algorithms may do a better job than simpler, more traditional models.
Present-day AI applications face three data-related challenges: lack of structure, scarcity, and uneven distribution.
Although the amount of data worldwide is growing at a compound annual growth rate (CAGR) of 23% and is projected to reach 175 ZB by 2025, it will not suffice to train new algorithms by the mid-2020s. Moreover, even in the following years, about 80% of accumulated online data will still be unstructured. There are also areas where assembling big data is complicated or impossible, such as rare natural phenomena, while other domains benefit from decades of data collection and processing.
Complex AI algorithms effectively address the lack of data structure and data scarcity. Advanced small-data algorithms are therefore likely to be effective in cases such as:
Building speech recognition tools for languages with small learning corpora
Predicting unconventional weather phenomena
Calculating disease risk in populations with no digital health records.
Yet, there are areas where fail-proof safety guarantees and the ability to verify the final output matter most.
Given the abundance of data and the demand for algorithm transparency, traditional domains such as healthcare, banking, and finance are a perfect setting for regression algorithms. These areas set the stage for scenarios like the following (a minimal sketch follows the list):
Insurance companies use regression analysis to assess their clients’ health and flag high-risk cases
E-commerce companies use regression algorithms to predict peak loads and the correlation between sales and loyalty programs
Stock traders employ regression to predict market risks.
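As a rough illustration of the insurance scenario above, here is a minimal sketch, assuming synthetic, made-up data and hypothetical feature names, of a logistic regression that flags high-risk clients. The fitted weights stay open to inspection, which is exactly why regulated industries favor such models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: hypothetical features [age_scaled, bmi_scaled, smoker_flag].
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Toy ground-truth rule, used only to generate labels for this example.
risk = 0.8 * X[:, 0] + 0.5 * X[:, 1] + 1.2 * X[:, 2]
y = (risk + rng.normal(scale=0.5, size=500) > 0).astype(int)  # 1 = high-risk

clf = LogisticRegression().fit(X, y)

# Each weight shows how a feature shifts the log-odds of being high-risk.
for name, coef in zip(["age_scaled", "bmi_scaled", "smoker_flag"], clf.coef_[0]):
    print(f"{name}: {coef:+.2f}")

# Score a new client (hypothetical values) and get a probability of high risk.
new_client = np.array([[0.4, 1.1, 1.0]])
print("high-risk probability:", clf.predict_proba(new_client)[0, 1])
```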
Accuracy and Performance
Before the emergence of small-data learning algorithms, the amount of data needed to train an algorithm was the main factor determining its success. Now, things may be different. Below, we look at two use cases where complex AI models and simple regression algorithms each perform at their best:
Use case 1: Recently, a group of researchers from the Hasso Plattner Institute and the University of Potsdam, Germany, used an advanced transfer-learning algorithm to build a German-language speech recognition tool from a relatively small dataset.
The findings: using only 383 hours of recordings, the algorithm reached the efficacy of an older model after 10 hours of training instead of the previous 60 hours. Its error rate was 15.05% versus the simpler model’s 22.78%. Over the same 25-hour span, the transfer-learning algorithm used only 5.5 GB of computing memory, compared to 10.4 GB for the traditional model.
Thus, in the language-model training scenario, the more advanced small-data algorithm allowed for considerable cuts in learning time and computational power while increasing the accuracy of the output.
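The study itself is not reproduced here, but the general transfer-learning recipe it relies on looks roughly like the following minimal sketch (Keras, an assumed image-classification task with hypothetical data): a large pretrained network is frozen, and only a small new head is trained on the scarce examples.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Load a network pretrained on ImageNet and freeze its weights, so the
# scarce new data only has to teach a small classification head.
base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False

num_classes = 3  # hypothetical number of target classes
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stand-in for a tiny labeled dataset (random tensors here, just to show the flow).
X_small = np.random.rand(45, 160, 160, 3).astype("float32")
y_small = np.random.randint(0, num_classes, size=45)
model.fit(X_small, y_small, epochs=3, batch_size=8)
```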
Use case 2: A joint study by Inholland University, Netherlands, and the University of Toronto compared traditional models, mainly linear regression and regression trees, with more advanced support vector machines, neural nets, and random forests in predicting disease development in 6,194 patients with various brain diseases.
The findings: the traditional linear regression model showed more stable performance given the same amount of data. Although all the models were considerably data-hungry, in the case of medical prediction, the simpler algorithms turned out to be more trustworthy.
Trade-offs and Decision-Making
To make a sound decision on which model type to choose, you need a clear understanding of the problem specifics and a clear vision of the end result.
Advanced ML algorithms, such as transfer learning, may prove effective in scenarios where data are scarce, where a business doesn’t have access to large amounts of computational power, or where the time for training the model is limited.
Traditional approaches like regression algorithms, although simpler to build, often require more data, training time, and computing power. Yet this is not a huge trade-off when a use case demands a high level of safety. Additionally, the interpretability of simpler methods makes them the first choice for privacy compliance. For example, the EU’s GDPR guarantees citizens transparency about how their personal data are used.
Applications and Use Cases
Modern industries effectively make use of both advanced and simpler machine learning algorithms, depending on the specifics of each domain.
Here are several examples of how small data algorithms may be used in practice:
Smarter robotics: Mark Zuckerberg, Jeff Bezos, and Marc Benioff have recently invested in a California-based startup that promises to develop artificial general intelligence that teaches robots to generalize from a few examples, which should eventually cut the cost of training industrial robots.
Better industrial equipment control: in situations where data are scarce for both human and artificial intelligence, advanced mechanisms that use top-down reasoning can provide faster decision-making. For example, Siemens uses such an algorithm at its plants to control gas combustion processes.
Dealing with uncertain scenarios: Google’s parent company, Alphabet, launched an initiative to provide underserved regions with stable internet connections. Its Project Loon uses huge antenna balloons hovering in the stratosphere to form a connection network. Each balloon is equipped with an AI mechanism that reacts to highly unpredictable stratospheric winds and navigates the balloon to keep it hovering over the target area.
Now, here’s how linear regression may be used in practice:
Revenue prediction: e-commerce companies use linear regression algorithms to estimate how their spending will affect revenues (a minimal sketch follows this list).
Drug dosage calculations: pharma researchers apply regression to estimate how drug dosage will influence blood pressure or other vital signs.
Sports performance prediction: in sports, regression algorithms are used to predict an athlete’s performance based on previous games. For example, Thomas Tuchel, former Chelsea Football Club manager, explained that the club brought on goalkeeper Kepa Arrizabalaga late in extra time based on his penalty-saving statistics.
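As a rough illustration of the revenue-prediction item above, here is a minimal sketch, again with synthetic, made-up numbers (hypothetical monthly ad spend versus revenue), showing how a fitted line turns directly into a forecast:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly figures: ad spend (in $1,000s) vs. revenue (in $1,000s).
ad_spend = np.array([[10], [15], [20], [25], [30], [35]])
revenue = np.array([120, 150, 185, 210, 245, 275])

model = LinearRegression().fit(ad_spend, revenue)

# The slope answers the business question directly:
# "How much extra revenue does one more $1,000 of ad spend bring?"
print(f"revenue per extra $1k of spend: {model.coef_[0]:.1f}k")
print(f"forecast at $40k spend: {model.predict([[40]])[0]:.0f}k")
```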
As we see, each type of model is applied in the field where it is the best fit. Yet, in the near future, it may become possible to avoid such trade-offs thanks to the development of hybrid approaches.
Future of AI and ML in Data-Scarce Environments
So far, AI has mostly been deployed in data-rich environments, but its expansion into the real world will inevitably lead to scenarios where data are scarce or insufficient.
For example, self-driving cars trained to avoid pedestrians turned out to be unable to avoid children dressed in unusual Halloween outfits. An issue of the same nature is the inability of the iPhone X’s facial recognition to read its users’ faces in the morning.
Situations like these drive the development of small-data AI algorithms with top-down reasoning that is closer to how humans generalize from scarce data. Yet these algorithms still remain black boxes.
Therefore, a third line of AI research aims to take the best of both worlds: keep the benefits of advanced algorithms while preserving the interpretability of simpler approaches. For example, a group of researchers from Warsaw University has presented a machine learning framework that promises to interpret machine learning algorithms of any complexity.
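The article does not name that framework, but a widely used building block of such model-agnostic interpretability is permutation importance. A minimal sketch with scikit-learn and synthetic, made-up data shows how even a black-box model’s behavior can be probed after training:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: three hypothetical features, only the first two actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A black-box model: hundreds of trees, no single readable formula.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Model-agnostic probe: shuffle each feature and measure how much accuracy drops.
result = permutation_importance(black_box, X_test, y_test, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: importance {imp:.3f}")
```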
Such hybrid approaches seem to be the most promising way to balance the safety and performance that AI research needs.