Big data is all the rage today, and rightfully so. State-of-the-art language models powered by big data, like GPT-3, can write convincing prose, create realistic news articles, translate text, generate working code in many programming languages, and more. Meanwhile, state-of-the-art vision models trained on massive datasets are moving us toward level 5 (fully autonomous) self-driving cars.
While big data can fuel astonishing results, organizations can gain value from “small data” as well. In this article, I’ll highlight four ways to circumvent the need for big data.
1. Exploratory Analysis
Whether you’re working with big or small data, you should understand your data before you try to gain deep insights from it. This includes calculating simple descriptive statistics, like count, mean, quartiles, the minimum, the maximum, and so on.
Slightly more complex analyses include histograms, scatterplots, pie charts, and so forth. You can also run correlation analyses to confirm or reject hypotheses about how the data is related. Finally, you'll want to assess data quality and deal with problems like missing data and outliers.
Anything that helps you understand the data itself should be done at this stage.
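As a rough sketch, a few lines of pandas cover most of this groundwork. The file name (sales.csv) and the column names (ad_spend, revenue) below are placeholders for your own data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Load a small dataset (the file name is a placeholder).
df = pd.read_csv("sales.csv")

# Descriptive statistics: count, mean, quartiles, min, max.
print(df.describe())

# Data quality: count missing values per column.
print(df.isna().sum())

# Correlation matrix for the numeric columns, to confirm or
# reject hypotheses about how the data is related.
print(df.select_dtypes("number").corr())

# Quick visual checks: histograms and a scatterplot
# (the column names are placeholders).
df.hist(figsize=(8, 6))
df.plot.scatter(x="ad_spend", y="revenue")
plt.show()
```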
2. Basic Machine Learning Models
Machine learning is a lot more than just deep learning. Alternative techniques like decision trees are far simpler, more explainable, and more resource-efficient, and they work well with less data.
Slightly more complex techniques, like random forests and support vector machines, also perform well on smaller datasets while being much easier to set up than neural networks.
Highly complex techniques like deep learning shine for tasks like image classification and natural language processing, and for those problems more data is almost always better. That said, there are even ways to combine these approaches: neural-backed decision trees, for example, offer high accuracy on tasks like image classification while maintaining the relative simplicity and explainability of decision trees.
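To make this concrete, here is a minimal scikit-learn sketch that trains a decision tree and a random forest on the classic 150-row Iris dataset. The hyperparameters are illustrative, not tuned:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A small, classic dataset: only 150 rows.
X, y = load_iris(return_X_y=True)

# A single decision tree: simple, explainable, fast to train.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("Decision tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())

# A random forest: slightly more complex, often more accurate.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print("Random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```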
3. Transfer Learning
Another method is transfer learning, which allows you to transfer the knowledge learned in one dataset and apply it to another dataset. As a result, you don’t have to start from scratch, and you can train machine learning models with far less data.
For example, companies can currently beta test OpenAI's GPT-3 model, which can generate natural language for a wide range of tasks without any task-specific training data at all. This is an example of zero-shot learning. To improve the model's accuracy for your specific use case, you can instead supply it with a handful of your own examples, a technique known as few-shot learning.
In either case, the model has already been trained on a massive corpus of text drawn from across the Internet, and that learning is available to you out of the box as an accurate language model.
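As a rough illustration of few-shot prompting with the beta-era OpenAI Python client (the API key, engine name, and sentiment-classification prompt are assumptions for this example, and the client library may have changed since):

```python
import openai  # beta-era OpenAI Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot learning: a handful of labeled examples go directly
# into the prompt; no training run is needed.
prompt = """Classify the sentiment of each review.
Review: "Great product, works perfectly." Sentiment: positive
Review: "Broke after two days." Sentiment: negative
Review: "Exceeded my expectations." Sentiment:"""

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].text.strip())
```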
For other tasks, like image classification, you can apply transfer learning using models like VGG16 or ResNet50.
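For instance, a minimal Keras sketch might freeze a ResNet50 base pre-trained on ImageNet and train only a small classification head on your own images. The input size and the ten output classes below are assumptions for illustration:

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load ResNet50 pre-trained on ImageNet, without its classification head.
base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Add a small head for your own classes (10 here, as an example).
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(10, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # your small dataset
```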
4. AutoML
Another way to quickly deploy AI without big data is to use turn-key automated machine learning (AutoML) solutions, which automate model selection and tuning and, in some cases, build on models pre-trained on large datasets.
Some products include Google Cloud’s AutoML, Salesforce Einstein AutoML, Microsoft Azure AI, and Amazon AutoGluon. With so many options to choose from, AutoML is a great way to implement AI in your organization, even if you don’t have big data.
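As one hedged example, AutoGluon's tabular API can train and ensemble many models in a few lines. The file names and the "churned" label column below are placeholders for your own data:

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Load a small tabular dataset (file and column names are placeholders).
train = TabularDataset("train.csv")

# AutoGluon tries many model types and ensembles them automatically.
predictor = TabularPredictor(label="churned").fit(train)

# Evaluate on held-out data and inspect which models it chose.
test = TabularDataset("test.csv")
print(predictor.evaluate(test))
print(predictor.leaderboard(test))
```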
Conclusion
It’s a common misconception that machine learning needs big data. Statisticians have been working with small data for decades, and techniques like exploratory analysis, classical machine learning, and AutoML are great ways to gain insights from any data set, no matter the size.
About the Author
Shanif Dhanani is a former Twitter data scientist and engineer turned CEO of Apteo. Apteo is a no-code analytics platform that anyone can use in a matter of minutes to extract deep insights from their data.