I recently ran across a blog post that discusses a very important characteristic of machine learning solutions: generalization. If you’ve ever wondered about the primary reason why machines can learn, generalization is the concept you need to understand. It is the premise underlying all statistical learning, and it goes something like this.
We start by training an algorithm using data for which we already know the answer. This “labeled” data set serves as the training set for the algorithm. The goal is to obtain coefficients for the algorithm such that it makes correct predictions, and because we already know the answers, we can check the algorithm’s performance. For regression problems a common score is the Mean Squared Error (MSE), the average of the squared differences between the predicted and actual values, and the goal is to minimize it. Before training, we hold out a portion of the labeled data (commonly 20% to 40%) and use it later during the cross validation process. Using this cross validation set, you can tune your algorithm to make better predictions. You don’t want to score too well on the training data, though. If you do, the algorithm becomes over-fit and will actually make worse predictions on previously unseen data, known as the test set.
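To make the hold-out step concrete, here is a minimal sketch in plain Python. It is not taken from the blog article: the synthetic data (a noisy straight line), the 60/40 split, and the helper names `mse` and `fit_line` are all assumptions chosen for illustration.

```python
import random


def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic labeled data (an assumption): y = 2x + 1 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(100)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in xs]

# Hold out 40% before fitting; the model never sees these points.
random.shuffle(data)
cut = len(data) * 6 // 10
train, valid = data[:cut], data[cut:]

train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)
intercept, slope = fit_line(train_x, train_y)


def predict(x_values):
    """Apply the fitted coefficients to new x values."""
    return [intercept + slope * x for x in x_values]


train_mse = mse(train_y, predict(train_x))
valid_mse = mse(valid_y, predict(valid_x))
print(f"training MSE:   {train_mse:.3f}")
print(f"validation MSE: {valid_mse:.3f}")
```

For a simple model like this one, the two scores should come out close to each other, which is what good generalization looks like. A large gap between them is the warning sign.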
The graph below comes from the blog article, although it is a very common graph that all data scientists should be familiar with. It shows that as the MSE of your training set keeps falling, there is a point where the MSE of your cross validation set starts to rise, which means the predictive power on new data is getting worse. This is the point where the algorithm becomes over-fit and loses the power to generalize. This is where machine learning becomes very nuanced and requires the insight and experience of a data scientist.
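The same pattern can be reproduced with a few lines of plain Python. This is a sketch under my own assumptions, not the method behind the blog article's graph: I use k-nearest-neighbour regression on a noisy sine curve, where a smaller `k` means a more flexible model. With `k=1` each training point is its own nearest neighbour, so the training MSE is exactly zero, which is over-fitting in its purest form.

```python
import math
import random


def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def knn_predict(train, x, k):
    """Average the labels of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


random.seed(1)  # fixed seed so the run is repeatable

# Synthetic data (an assumption): a sine curve plus Gaussian noise,
# sampled on an evenly spaced grid so every x value is distinct.
data = [(i / 10, math.sin(i / 10) + random.gauss(0, 0.3)) for i in range(100)]
random.shuffle(data)
train, valid = data[:60], data[60:]

train_y = [y for _, y in train]
valid_y = [y for _, y in valid]

results = {}
for k in (1, 3, 5, 10, 20, 40):
    # Score the same model on data it has seen and on data it has not.
    train_mse = mse(train_y, [knn_predict(train, x, k) for x, _ in train])
    valid_mse = mse(valid_y, [knn_predict(train, x, k) for x, _ in valid])
    results[k] = (train_mse, valid_mse)
    print(f"k={k:2d}  training MSE={train_mse:.3f}  validation MSE={valid_mse:.3f}")
```

Reading the printed table from the largest `k` to the smallest traces the graph from left to right: the training MSE keeps dropping all the way to zero, while the validation MSE typically bottoms out at a moderate `k` and then climbs again.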
The blog article refers to generalization as a “trick” but I’m not sure I’d use that word to describe it. Generalization is simply the nature of machine learning algorithms. And to the author of the article who is incensed that not enough authors discuss generalization in introductory materials, I promise that I will put it in the book I’m working on now – Introduction to Machine Learning with R!