In this special guest feature, Bill Franks from Teradata talks about how to get the most out of the R open source statistical environment when doing business analytics. He also tempers expectations by addressing several perceived limitations of R and how to work around them. Franks is Chief Analytics Officer at Teradata Corporation. He is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution.
The popularity of the open source statistical software R has been growing rapidly in recent years. While it is true that the lack of license fees for R can save money, don’t assume that R will be a magic bullet for your analytics processes. R can drive immense value, but only when it is utilized in the right way and with the right planning and expectations. Let’s explore a few points to keep in mind as you consider utilizing R.
First, as the opening line on R’s website says, “R is a language and environment for statistical computing and graphics.” Driving value from R, just as with any other tool, will require work. The most common way of using R is to write code directly. The R community does not focus on building user interfaces nearly as much as on creating packages with new functionality. This means that while R packages probably exist to do just about anything you need done, someone on your team must write the code to leverage those packages. There are some user interfaces becoming available that sit on top of R. However, most of them are nowhere near as robust as other commercially available tools today. If you and your team don’t want to code, you will have some challenges with R.
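To make that concrete, here is a minimal sketch of what “writing the code” looks like in practice (the randomForest package and the built-in iris data are used purely for illustration):

# A minimal sketch; the randomForest package and the built-in iris data
# are chosen purely for illustration.
install.packages("randomForest")   # fetch the package from CRAN
library(randomForest)              # load it into the session
model <- randomForest(Species ~ ., data = iris, ntree = 500)
print(model)                       # summarize the fitted forest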
Next, R runs in-memory by default. Given the hype around in-memory computing, this might at first sound like a great thing. However, by default R runs single-threaded and holds its data in the memory of the local environment you are using. If that environment is a laptop, or even a quite powerful server, then the amount of data that can be processed will be disappointingly small. If you need to analyze anything approaching big data, then R won’t help much out of the box. There are commercial options available, however, that help bypass R’s memory constraints while also parallelizing its computations so that users can leverage R against any size of data. Revolution Analytics and Teradata Aster R are two examples. You’ll certainly want to leverage an option that scales R to make use of it at an enterprise level, which can take some additional time to research and cost some money. However, once a scalable R solution is in place, you won’t have to worry further about scaling your processes.
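For instance, base R ships the parallel package, which can spread independent work across local cores, but it is only a partial workaround: it does not remove the single-machine, in-memory limit. A minimal sketch (the task here is a trivial stand-in):

# A minimal sketch using base R's parallel package; this spreads work across
# local cores but does not remove the in-memory, single-machine limit.
library(parallel)
cl <- makeCluster(max(1, detectCores() - 1))          # leave one core free
results <- parLapply(cl, 1:100, function(i) sqrt(i))  # trivial stand-in task
stopCluster(cl)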
Finally, R is oriented towards power users who understand how to analyze data and how to leverage common statistical, predictive, and machine learning algorithms. We discussed above the need for programming prowess to use R. However, programming skills are not enough. In addition to knowing how to program, users will need to be well versed in analytic methodologies if they are to drive value with R. The power of R is in the many highly sophisticated algorithms available; the challenge is ensuring you use them correctly. If you don’t have people with the analytic skills in place, then you’re going to struggle to make effective use of R. Worse, you risk having algorithms applied incorrectly or inappropriately due to the skill gap.
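As one small illustration of that discipline, here is a minimal sketch in base R of evaluating a model on held-out data rather than on the records it was trained on (the mtcars data, the variables, and the 70/30 split are purely illustrative):

# A minimal sketch, using only base R and the built-in mtcars data:
# holding out data guards against judging a model on the same records
# it was trained on.
set.seed(42)
train_idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit  <- glm(am ~ hp + wt, data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")
mean((pred > 0.5) == test$am)      # out-of-sample accuracy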
None of the above discussion is meant to take anything away from the power and potential of R for organizations of any size. Rather, the discussion points are meant to be reminders to ensure that you don’t dive into R based on market hype without having a solid understanding of what you’re getting into and what it will take to succeed. If you install R on the machine of a business user with limited programming skill, limited analysis experience, and no understanding of how R goes about processing the necessary computations on your data, then you’ll end up far short of your expectations. For analytic professionals, on the other hand, who already know how to code, understand analytics methodologies, and are comfortable navigating the process of making R scalable at an enterprise level, R can provide a lot of value very quickly. If you want to succeed, make sure that you and your organization know what you R doing with R!
From R you can use the open-source pmml package from CRAN to export R models to the Predictive Model Markup Language (PMML). Using Zementis’ solutions you can then immediately deploy your R model on any technology platform. Zementis takes care of the optimization and scaling of the model. We eliminate the costs and delays of having IT re-code a model developed in R to enable it to operate in a production environment.
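As a minimal sketch of that export step (assuming the pmml and XML packages from CRAN; the linear model and file name are purely illustrative):

# A minimal sketch, assuming the pmml and XML packages from CRAN;
# the model and file name are purely illustrative.
library(pmml)
library(XML)
fit <- lm(mpg ~ wt + hp, data = mtcars)
saveXML(pmml(fit), file = "mtcars_model.pmml")   # export the model as PMML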
More information is available at http://www.zementis.com.
Thanks for the info Bill – I’m very glad to see Teradata incorporating R into the stack! Distributing the R analytics is key to scaling one’s solution. At useR 2014, 0xdata’s H2O (an open source in-memory prediction engine) was honored with a keynote mention by John Chambers as one of three promising projects in the R community. Here is a snippet of what he had to say:
https://www.youtube.com/watch?v=_hcpuRB5nGs#t=2188
From R, one can install H2O from CRAN by running: install.packages("h2o")
Another install option is to go to our website and grab the latest build from 0xdata.com/download.
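Once installed, getting started from R looks roughly like this (a minimal sketch assuming a recent h2o release; the iris data is just an illustration):

# A minimal sketch, assuming the h2o package installed as described above.
library(h2o)
h2o.init()            # start (or connect to) a local H2O instance
hex <- as.h2o(iris)   # push an R data frame into H2O
# models can then be fit with functions such as h2o.glm() or h2o.gbm()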
At any rate, thank you Bill for the catalyst for this thought provoking topic!
Cheers,
Max Schloemer
http://www.0xdata.com
max@0xdata.com