Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideAI News feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a big data question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who asks about the role of exploratory data analysis in data science.
Q: What is the role of exploratory data analysis in data science?
A: Once the often laborious task of data munging is complete, the next step in the data science process is to become intimately familiar with the data set by performing what’s called Exploratory Data Analysis (EDA). The way to gain this level of familiarity is to use the features of the statistical environment you’re working in (R, MATLAB, SAS, Python, etc.) that support this effort: numeric summaries, aggregations, distributions, densities, a review of all the levels of factor variables, general statistical methods, and exploratory and expository plots.
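As a quick illustration, here is a minimal sketch of what that first pass might look like in Python with pandas. The file name and column names (customers.csv, region, annual_spend) are made up purely for the example; substitute your own data set.

```python
import pandas as pd

# Load a hypothetical dataset; the file name and column names are assumptions
# used only for illustration.
df = pd.read_csv("customers.csv")

# Numeric summaries for every numeric column (count, mean, std, quartiles)
print(df.describe())

# Review all levels of a factor (categorical) variable, including missing values
print(df["region"].value_counts(dropna=False))

# Aggregations: average spend per region
print(df.groupby("region")["annual_spend"].mean())

# Exploratory plots: distribution and density of a numeric variable
# (the density overlay requires SciPy)
ax = df["annual_spend"].plot.hist(bins=30, density=True, alpha=0.5)
df["annual_spend"].plot.density(ax=ax)
ax.figure.savefig("annual_spend_distribution.png")
```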
It is always a good idea to explore a data set with multiple exploratory techniques, especially when they can be done together for comparison. Every data scientist should compile a cookbook of exploratory data analysis techniques. Once you fully understand your data set, you may find you need to revisit one or more data munging tasks to refine or transform the data further. The goal of exploratory data analysis is to gain enough confidence in your data that you’re ready to engage a machine learning algorithm.
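For example, looking at the same variable through two lenses at once often reveals more than either view alone. The sketch below (again using hypothetical column names) puts a histogram and a box plot side by side: the histogram shows the overall shape of the distribution, while the box plot makes outliers easy to spot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data file and column name, for illustration only.
df = pd.read_csv("customers.csv")

# Two views of the same variable, side by side for comparison.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["annual_spend"].plot.hist(bins=30, ax=ax1, title="Histogram")
df["annual_spend"].plot.box(ax=ax2, title="Box plot")
fig.savefig("annual_spend_comparison.png")
```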
Another benefit of EDA is that it refines your selection of the feature variables that will later be used for machine learning. Once you gain deep familiarity with your data set, you may need to revisit the feature engineering step: you may find that the features you selected do not serve their intended purpose, or you may discover other features that add to the overall picture the data presents. Once you complete the EDA stage, you should have a firm feature set to use for supervised and unsupervised statistical learning.
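One simple way to start that refinement is to look at how each numeric feature relates to the target variable. The sketch below assumes a hypothetical numeric target column named churned; in practice you would combine this kind of check with domain knowledge rather than rely on correlations alone.

```python
import pandas as pd

# Hypothetical data file and target column; the names are assumptions
# used only for illustration.
df = pd.read_csv("customers.csv")

# Correlation of each numeric feature with the target flags candidate
# features that add little information on their own.
correlations = df.corr(numeric_only=True)["churned"].drop("churned")
print(correlations.sort_values(key=abs, ascending=False))

# The full correlation matrix also reveals highly correlated feature pairs;
# one of each redundant pair can often be dropped before modeling.
print(df.corr(numeric_only=True))
```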
In a hurry to get to the machine learning stage, some data scientists skip the exploratory process entirely or do only a perfunctory job. This is a mistake with many implications: generating inaccurate models; generating accurate models on the wrong data; failing to create the right types of variables during data preparation; and using resources inefficiently, because you discover only after generating models that the data is skewed, contains outliers or too many missing values, or holds inconsistent values.
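A few lines of exploratory checks are usually enough to surface these issues before any model is built. The sketch below (again with hypothetical column names) looks for missing values, skew, outliers, and inconsistent factor levels.

```python
import pandas as pd

# Hypothetical data file and column names, for illustration only.
df = pd.read_csv("customers.csv")

# Missing values per column
print(df.isna().sum())

# Skewness of numeric columns; large absolute values suggest a transform
# (e.g., log) may be needed before modeling
print(df.skew(numeric_only=True))

# Simple outlier check on one column using the 1.5 * IQR rule
q1, q3 = df["annual_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["annual_spend"] < q1 - 1.5 * iqr) |
              (df["annual_spend"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers in annual_spend")

# Inconsistent values often show up as unexpected factor levels
# (e.g., "West", "west", and " West " counted separately)
print(df["region"].str.strip().str.lower().value_counts())
```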
If you have a question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com.
Data Scientist: Daniel D. Gutierrez – Managing Editor, insideAI News