Despite the obvious influence of the most salient macro-level trends on data science, including Artificial Intelligence, cloud computing, and the Internet of Things, the ends of this discipline remain largely unchanged from when it initially emerged nearly 10 years ago.
The goal has always been to equip the enterprise with tailored solutions spanning technological approaches that not only justify, but also maximize the use of data for fulfilling the most meaningful business objectives at hand.
Oftentimes, those involve the upper end of the analytics continuum in the form of predictive and prescriptive measures. Currently, cognitive computing deployments factor substantially into data scientists’ abilities to complete this task.
Ergo, the most profound developments affecting this space in 2022 reduce the traditional impediments to devising the underlying models that support applications of Natural Language Processing, cognitive search, image recognition, and other advanced analytics manifestations.
There are relatively new, established, and resurgent data science approaches that make it much easier to work with unstructured data, reduce the sheer quantities of training data required to build models, and decrease the manual efforts for providing labels for that data.
Most exciting of all, many of these techniques operate at the nexus point between supervised and unsupervised learning, the two conventional methods underpinning most machine learning solutions. The impending collapse of this divide is unfolding a new world of opportunities that make data science more accessible and facile than it’s ever been.
Plus, by relying less on strictly supervised learning approaches, this data science trend is furthering AI’s march towards replicating human intelligence, since it’s primarily “a combination of this supervised and unsupervised learning,” reflected Wayne Thompson, SAS Chief Data Scientist. “Most of us humans learn through an unsupervised type way.”
Intersecting Supervised and Unsupervised Learning
Unaided, supervised learning requires tremendous data quantities and time-consuming annotations of business outcomes or the factors influencing them. Unsupervised learning also involves inordinate training data, yet identifies patterns or features in it without annotations. Between these two approaches lies a range of techniques that draw on one, the other, or both, or on related methods, to reduce the amount of training data or the number of labels involved. These methods include:
- Self-Supervised Learning: According to Thompson, this approach enables machine learning with “no labeled data whatsoever.”
- Semi-Supervised Learning: This technique is somewhat similar to self-supervised learning but “you have to provide a small amount of labeled data, even if you do this artificially where you inject small amounts of labeled data into the unsupervised system,” Thompson noted.
- Generative Adversarial Networks: GANs are part of what Gartner is terming Generative AI. These networks can generate data that other forms of machine learning can use for learning.
- Representation Learning: This learning technique encompasses a variety of approaches for ascertaining data representations where it’s easier to find prominent features for building predictors like classifiers. It can do so by discerning features in unlabeled data by training models on another supervised learning task.
- Contrastive Learning: Contrastive learning is a form of representation learning in which similar items in a dataset are mapped close to each other while dissimilar items are mapped far apart. It’s an effective approach for semantics in which, if there’s a relationship between ‘man’ and ‘king’ and the model is given the term ‘woman’, “That gives this parametric mapping of the raw data into this feature vector where that resolves to ‘queen’,” Thompson explained.
- Manifold Learning: This form of learning provides non-linear dimensionality reduction, projecting data into a lower-dimensional space. It’s frequently used in NLP with word vectors.
- Reinforcement Learning: Although training data isn’t necessary with reinforcement learning (which hinges on agents learning from dynamically interacting with environments), reinforcement learners can learn from synthetic or generated datasets simulating environments.
- Transfer Learning: This time-honored approach conveys the learning of one model, what Indico Data CEO Tom Wilde termed a “generalized model”, into another that’s oftentimes specific to an organization’s individual use case for things like computer vision or text analytics.
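The ‘man’ to ‘king’ example in the contrastive learning entry above can be illustrated with a minimal sketch. It does not train anything contrastively; it only shows the kind of vector arithmetic that a well-formed embedding space makes possible. The three-dimensional vectors, the `EMBEDDINGS` table, and the `analogy` helper are all hypothetical and hand-made for illustration, whereas real embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical, hand-made embeddings (assumption: not learned from any data).
# The dimensions loosely stand for (royalty, male, female).
EMBEDDINGS = {
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "child": [0.1, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Answer 'a is to b as c is to ?' by nearest neighbour to b - a + c."""
    target = [
        eb - ea + ec
        for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so the answer is a new term.
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))


print(analogy("man", "king", "woman"))
```

With these toy vectors the nearest remaining term to `king - man + woman` is ‘queen’, mirroring the relationship Thompson described.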
Training Data Complexities
The amount of training data necessary to build credible machine learning models for business applications is inordinately large and serves as the main inhibitor for applications of supervised or unsupervised learning. Certain domains simply don’t have enough such data, which can derail data science efforts in them. Approaches involving transfer learning, GANs, and reinforcement learning ameliorate this issue by either decreasing the amount of training data required or generating enough data on which to teach models. These methods also help with the labeled data issue discussed below. “With supervised learning the barrier has historically been the supervision,” Wilde observed. “You need tens of thousands of examples before the machine learns what you’re trying to teach it. Transfer learning cuts that down to a few hundred.”
The generative prowess of GANs is ideal for creating data with which reinforcement learners can interact in a simulated environment. GANs are also responsible for the deep fake phenomenon and its lifelike fabricated images, an example of generative AI going awry. Within the confines of the enterprise, however, “you’re seeing a combining of GANs with reinforcement learners to put this synthetic tabular data generation into the reinforcement learning process,” Thompson commented. If that process happens to be around a business objective like converting sales prospects into customers “you can use GANs to simulate new data to train the reinforcement learner,” Thompson added.
Annotating Data
The other caveat for data science projects involving supervised learning (which comprises the majority of such endeavors) is the extreme amount of work and money required to label data, assuming enough data can be found in the first place. Aside from the strategies for transfer learning, GANs, and reinforcement learning identified above, other approaches for expediting the labeling of data involve:
- Unsupervised Learning: With unsupervised learning, models find characteristics in the data themselves about a specific business problem with techniques like clustering. According to Katana Graph CEO Keshav Pingali, “Louvain clustering is a popular algorithm for building a hierarchical representation of data, and is likely the most popular cluster technique fintech is using today.” Coupling unsupervised learning with supervised learning diminishes the amount of labeling otherwise needed.
- Neuro-Symbolic AI: This form of AI utilizes its statistical and knowledge foundations in tandem to greatly reduce the reliance on labeling data. Kyndi CEO Ryan Welsh mentioned that “symbolic AI, because of the knowledge representation you’re able to learn from the data, allows you to then transfer that representation to different tasks, not requiring you to label data anymore.” This approach is critical for saving the time and energy of subject matter experts who are otherwise endlessly labeling data. “So you save time in general and save money because the time of the experts is more expensive and costs more than other people’s,” emphasized expert.ai CTO Marco Varone.
- Encoding and Embedding: As previously mentioned, contrastive learning doesn’t require annotated data. Instead, these models encode data into a lower-dimensional space to glean relationships between its attributes. According to Thompson, “With contrastive learning and getting these embeddings, you can now train these models, many of which are like deep learning models. Google did this on a whole corpus of Wikipedia so you can imagine how good this model is given how many trends it can evaluate.” Manifold learning and its embeddings also deliver utility for unlabeled datasets.
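The idea of coupling unsupervised with supervised learning to cut down on labeling can be sketched as a “cluster, then label” routine: cluster every record without labels, then let a handful of annotated records name each cluster. This sketch uses a plain one-dimensional k-means rather than the Louvain algorithm mentioned above, purely to stay self-contained. The transaction amounts, the labels ‘routine’ and ‘review’, and the helper names are hypothetical.

```python
from collections import Counter


def kmeans_1d(points, initial_centroids, iterations=10):
    """Plain k-means on 1-D points; returns the cluster index of each point."""
    centroids = list(initial_centroids)

    def nearest(point):
        return min(range(len(centroids)), key=lambda k: abs(point - centroids[k]))

    for _ in range(iterations):
        assignments = [nearest(p) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return [nearest(p) for p in points]


def propagate_labels(assignments, labeled):
    """Give every point the majority label of the labeled points in its cluster."""
    votes = {}
    for index, label in labeled.items():
        votes.setdefault(assignments[index], Counter())[label] += 1
    cluster_label = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    # Points in a cluster with no labeled member stay unlabeled (None).
    return [cluster_label.get(a) for a in assignments]


# Hypothetical transaction amounts; only two of the eight are annotated.
amounts = [12, 15, 14, 13, 980, 1020, 995, 16]
labeled = {0: "routine", 4: "review"}  # index -> human-provided label

clusters = kmeans_1d(amounts, [min(amounts), max(amounts)])
labels = propagate_labels(clusters, labeled)
print(labels)
```

Here two annotations are enough to label all eight records, because the clustering step has already grouped the similar ones together. In practice the propagated labels would still need spot checks before being used to train a supervised model.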
Structuring the Unstructured
Regardless of how it’s invoked, almost any machine learning technique for unstructured content like images or text is able to provide structure to it so “you can leverage your RPA investments and analytics by bringing unstructured data, which is typically opaque and difficult to wrangle, into those existing investments,” Wilde indicated. The digital software agents powering RPA are instrumental in this regard when, equipped with machine learning, they transfer unstructured text data into structured data systems. Wilde identified a use case in which a well-known insurer could “ingest annuity documents, classify and extract data, then analyze it to see if annuities are in good order.”
There are other techniques for rendering what’s widely considered unstructured text into a conventionally structured tabular format. “Text is structured; people don’t think it is but the way we do that is through counting,” Thompson propounded. With this methodology, each row in a table is a document while each column is a term that appears in the documents. “You just count how often each term appears across each row of documents and you total those up,” Thompson disclosed. “Suddenly, you’ve basically taken textual data and converted that into a numerical representation that you can then model.”
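The counting approach Thompson describes can be sketched as a small document-term matrix, with one row per document, one column per term, and per-term totals. The three sample documents, the simple lowercase tokenizer, and the `document_term_matrix` helper are assumptions made for illustration; production pipelines typically add steps such as stop-word removal and weighting.

```python
import re
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, rows): one row of term counts per document."""
    # Assumption: a naive tokenizer that lowercases and keeps letter runs.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical documents, loosely echoing the annuity use case.
documents = [
    "The annuity is in good order",
    "The annuity is missing a signature",
    "Good order, good annuity",
]

vocabulary, rows = document_term_matrix(documents)
# Column totals: how often each term appears across all documents.
totals = [sum(column) for column in zip(*rows)]

print(vocabulary)
for row in rows:
    print(row)
print(totals)
```

Each row is now a numerical vector that a conventional model can consume, which is the sense in which the text has been “converted” into structured data.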
The Point
Data science has always been characterized by innovation and opportunity. The above developments, which hybridize supervised and unsupervised learning to overcome the training data and annotation issues plaguing the former, make it markedly easier for organizations to leverage advanced analytics models. Consequently, data science barriers (including unstructured data) are systematically falling while machine intelligence is gaining on human intelligence, to the benefit of the enterprise.
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.