This article is the fourth in an editorial series reviewing how predictive analytics helps your organization predict with confidence what will happen next, so that you can make smarter decisions and improve business outcomes.
Predictive analytics can be grouped into two general classes, as mentioned in last week's article: supervised learning methods (e.g., regression and classification) and unsupervised methods (e.g., clustering).
There is a vast array of predictive analytics tools, but not all are created equal. Software differs widely in terms of capability and usability — not all solutions can address all types of advanced analytics needs. There are different classes of analytics users — some need to build statistical models, others just need to use them.
For the advanced user, tool selection centers on the ability to put proprietary models into the hands of business users (front-line decision makers) so that they can act competitively with predictive analytics, while the complexity of those models stays hidden under the hood.
Business users possess the domain knowledge necessary to understand the business answer they are looking for from predictive analytics, but they do not need to, do not want to, or cannot develop the models themselves. The optimal tool therefore provides an easy way to put the data scientist's expertise in the hands of front-line decision makers, often in the form of a guided analytic application with the predictive model encapsulated under the covers. This enables best-practice use of advanced analytics (taking the risk out of having business people try to develop their own models) and broad deployment of secret-sauce analytics.
When selecting the right tool for your organization, ensure that it offers depth and breadth of capability, from simple out-of-the-box functionality for the easiest problems to the advanced statistical capabilities that data scientists need, so that competitive models can be embedded in business users' analytic dashboards for day-to-day use.
Effective predictive analytics tools provide a wide variety of algorithms and methods to cover the data characteristics and business problems that users encounter, along with the flexibility to apply those algorithms as needed. Extensibility, the ability to easily integrate new analytic methods as they become available, is also critical for maximizing competitive advantage. An important criterion when selecting a tool is making sure the feature set matches your business data characteristics and that the tool will benefit your data analysts. The right tool typically combines powerful data integration and transformation capabilities, exploratory features, and analytic algorithms, all behind an intuitive interface. In essence, three ingredients provide the recipe for success with predictive analytics: (i) the data scientist builds the most competitive model, (ii) the analytic application author embeds that model in the analytic application, and (iii) the business user engages the model as part of the regular flow of business.
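The three-role handoff described above can be sketched in a few lines of base R. This is a minimal, illustrative example only: the data set, file name, and `score_revenue()` function are hypothetical, and real deployments would use an analytic application platform rather than a shared file.

```r
# (i) Data scientist: build and serialize a model (hypothetical sales data)
set.seed(42)
history <- data.frame(ad_spend = runif(100, 0, 50))
history$revenue <- 3 * history$ad_spend + rnorm(100, sd = 5)
model <- lm(revenue ~ ad_spend, data = history)
saveRDS(model, "revenue_model.rds")

# (ii) Analytic application author: embed the model behind a simple function,
# keeping its internals "under the covers"
score_revenue <- function(new_data, model_path = "revenue_model.rds") {
  m <- readRDS(model_path)
  predict(m, newdata = new_data)
}

# (iii) Business user: engage the model without seeing how it works
forecast <- score_revenue(data.frame(ad_spend = c(10, 25, 40)))
print(round(forecast, 1))
```

The serialization step (`saveRDS()`/`readRDS()`) stands in for whatever encapsulation mechanism the chosen tool provides; the point is that the business user only ever calls the scoring function.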
Here is a short list of characteristics and considerations to focus on when evaluating a predictive analytics tool:
- Consider the processing capabilities of the analytics tool for addressing the needs of the predictive analytics cycle — data munging, exploratory data analysis, predictive modeling techniques such as forecasting, clustering, and scoring, as well as model evaluation.
- Find a tool that supports combining the analyst’s business and data knowledge with predefined procedures and tools, and graphical workflows to simplify and streamline the path from preparation to prediction.
- A good tool must easily integrate with the data sources required to answer critical business questions.
- The tool should be readily usable by all classes of users: business users, business analysts, data analysts, data scientists, application developers and system administrators.
- Consider tools that minimize the need for IT professionals and data scientists to set up integration with multiple data sources.
The goal in selecting a robust tool is to secure a broad range of predictive analytics capabilities — from the simplest, such as trend lines and a forecast tool, all the way through to leveraging an entire ecosystem of statistical capabilities where you have the full depth of capability in creating and executing any type of statistical model or algorithm. Out-of-the-box/standard algorithms aren’t going to gain you a competitive advantage once your competitors start using those same tools. You need the tools to create your own proprietary models that will allow you to build that competitive advantage by leveraging your enterprise data assets.
A best-practice choice is a solution that integrates predictive analytics within the entire analytic decision-making process, allowing it to be incorporated where appropriate into self-service dashboards and exploratory data discovery. This orientation provides advanced analytics access to all analytic users, giving them the tools necessary to spot new opportunities, manage risks, and swiftly react to unforeseen events. Further, professionals managing mission-critical departments and global processes have the ability to immediately and intuitively ask questions and get answers from their data — anticipating what’s next, taking quick and educated actions.
R as the Choice for Predictive Analytics
Although there are many choices for performing tasks related to data analysis, data modeling, and predictive analytics, R has become the overwhelming favorite today. This is largely due to the widespread use of R in academia over commercial products such as SAS and SPSS, which means new graduates enter industry with a firm knowledge of R.
There are currently spirited debates between the R user community and both the SAS and Python communities as to which is the best tool for data science. R has compelling justifications, including the availability of free, open source R, a widely used extensible analytical environment, over 5,000 packages available on CRAN to extend the functionality of R, and top-rated visualization capabilities using ggplot2. In addition, R enjoys a thriving user community flush with local Meetup groups, online courses, and specialty blogs (see top blogs via the aggregator r-bloggers.com).
Open source R is a logical first choice for predictive analytics modeling: the statistical environment provides a number of algorithms in base R and its recommended packages, along with contributed packages that extend this functionality. Commonly used options include:
- Linear regression using lm()
- Logistic regression using glm()
- Regression with regularization using the glmnet package
- Neural networks using nnet() from the nnet package
- Support vector machines using the e1071 package
- Naïve Bayes models using the e1071 package
- K-nearest-neighbors classification using the knn() function from the class package
- Decision trees using tree() from the tree package
- Ensembles of trees using the randomForest package
- Gradient boosting using the gbm package
- Clustering using kmeans(), hclust()
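As a quick illustration of two of the base-R entries above, the following sketch fits a logistic regression with glm() and a k-means clustering with kmeans() on R's built-in iris data set. The variable names and the particular predictors are illustrative choices, not a recommended modeling recipe.

```r
# Logistic regression with glm(): predict whether a flower is virginica
iris$is_virginica <- as.integer(iris$Species == "virginica")
logit_fit <- glm(is_virginica ~ Petal.Length + Petal.Width,
                 data = iris, family = binomial)
probs <- predict(logit_fit, type = "response")        # fitted probabilities
accuracy <- mean((probs > 0.5) == iris$is_virginica)  # in-sample accuracy

# K-means clustering with kmeans(): group flowers by petal measurements
set.seed(1)
clusters <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)
table(clusters$cluster, iris$Species)  # compare clusters to known species
```

Both functions ship with base R's stats package, so this runs without installing any of the contributed packages listed above.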
The main issue with the open source R engine is its inherent limitation as a scalable production environment. R is notoriously memory-bound: data sets must fit in the RAM of the machine running the analysis, which constrains how much data a single R process can handle. A good best practice for implementing R in production is to leverage a commercial, enterprise-grade platform for running the R language, such as TERR (TIBCO Enterprise Runtime for R), in order to get the benefits of R while avoiding the scalability challenges.
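To make the memory constraint concrete, base R's object.size() reports an object's in-memory footprint; the data frame below is purely illustrative.

```r
# A data frame of two numeric columns of one million rows each lives
# entirely in RAM: 2 columns x 1e6 doubles x 8 bytes, roughly 16 MB,
# before any copies made during modeling are counted.
x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
print(object.size(x), units = "Mb")
```

Intermediate copies created during model fitting can multiply this footprint several times over, which is why data sets approaching available RAM become impractical in a single open source R process.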
Next week’s article will look at Data Access and Exploratory Data Analysis. If you prefer, the complete insideAI News Guide to Predictive Analytics is available for download in PDF from the insideAI News White Paper Library, courtesy of TIBCO Software.