Ask a Data Scientist: The Data Science Process

Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideAI News feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a big data question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who wonders if there is a general process for conducting data science projects.

Q: Is there a typical “data science process?”

A: Great question! There is indeed a general formula that data scientists follow to achieve best practices in a data science project. The figure below encapsulates the high-level steps in the so-called “data science pipeline” that contribute to the success of a project: understanding the goal of the project, data access, data munging, exploratory data analysis, feature engineering, model selection, model validation, deployment, visualization, and communication of the results.

[Figure: The data science process]

Data science involves understanding and preparing the data, defining the statistical learning model, and following the data science process. Statistical learning models can assume many shapes and sizes, depending on their complexity and the application for which they are designed. The first step is to understand what questions you are trying to answer for your organization. The level of detail and complexity of your questions will increase as you become more comfortable with the analytic process. The most important steps in the data science process are as follows:

  • Define the project outcomes and deliverables, state the scope of the effort, establish business objectives, and identify the data sets to be used.
  • Undertake data collection and data understanding. Some data scientists believe that domain knowledge is superfluous, but in my experience, having a domain expert available to consult can be an important factor for success.
  • Perform data munging – the process of inspecting, cleaning, and transforming the data.
  • Utilize techniques of exploratory data analysis (EDA): use graphical techniques to discover useful information and arrive at conclusions. Apply standard statistical methods to validate assumptions and test hypotheses.
  • Apply statistical modeling principles to automatically create accurate predictive models.
  • Evaluate the model to verify its robustness and make mid-course corrections. Test models on existing data and apply predictions to new data.
  • Select a deployment option that opens up the analytical results to everyday decision making and automates decisions based on the modeling.
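
To make the munging, EDA, modeling, and evaluation steps above more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy `raw_rows` data, the function names, the choice of a straight-line least-squares model, and the simple train/test split are hypothetical and are not taken from any particular project or library.

```python
from statistics import mean

# Hypothetical toy records (x, y) as they might arrive from a source system.
# None and "n/a" stand in for the dirty values that data munging must handle.
raw_rows = [
    ("1.0", "3.1"), ("2.0", "4.9"), ("3.0", None), ("4.0", "9.2"),
    ("n/a", "5.0"), ("5.0", "10.8"), ("6.0", "13.1"), ("7.0", "15.0"),
]


def munge(rows):
    """Data munging: drop rows with missing or non-numeric fields, cast to float."""
    clean = []
    for x, y in rows:
        try:
            clean.append((float(x), float(y)))
        except (TypeError, ValueError):
            continue  # skip rows that cannot be parsed
    return clean


def summarize(rows):
    """EDA: a few summary statistics to inspect before modeling."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {"n": len(rows), "mean_x": mean(xs), "mean_y": mean(ys),
            "min_y": min(ys), "max_y": max(ys)}


def fit_line(rows):
    """Modeling: ordinary least squares fit of y = slope * x + intercept."""
    mx = mean([x for x, _ in rows])
    my = mean([y for _, y in rows])
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_squared_error(rows, slope, intercept):
    """Evaluation: average squared prediction error on the given rows."""
    return mean([(y - (slope * x + intercept)) ** 2 for x, y in rows])


def run_pipeline(rows, n_train=4):
    """Run the steps in order: munge, explore, fit on one part, test on the rest."""
    clean = munge(rows)
    train, test = clean[:n_train], clean[n_train:]
    slope, intercept = fit_line(train)
    return {
        "summary": summarize(clean),
        "slope": slope,
        "intercept": intercept,
        "test_mse": mean_squared_error(test, slope, intercept),
    }


if __name__ == "__main__":
    print(run_pipeline(raw_rows))
```

In a real project each function would be far more involved, and the held-out evaluation would typically use a randomized split or cross-validation rather than the fixed split assumed here.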

Each of the above steps can be considered iterative and may be revisited as needed. Note that the data munging step is often very time-consuming, depending on the cleanliness of the incoming data, and can take up to 70% of the overall project timeline.

If you have a question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com.

Data Scientist: Daniel D. Gutierrez – Managing Editor, insideAI News


Comments

  1. I like the definitions here and would only add that the steps of data preparation (which you touch on in “Undertake data collection and data understanding” and “Perform data munging”) can now be accelerated thanks to modern machine learning and algorithms that aid the scientist in understanding, cleaning, and integrating their data, regardless of how much of it there is or where it came from. It should include the ability to reuse the data prep work that is done, the ability to collaborate with others as data is being organized and shaped, and an emergent governance capability (the system logs and learns as data prep work is being performed so that actual standards and practices can emerge in real time).