TECH TIP: The Importance of Training and Test Set Separation

One of the most important principles in machine learning is to train your algorithm on a training set that is separate and distinct from the test set on which you will gauge its accuracy. Failing to do so can produce a model that does not generalize to as-yet-unseen data. Unfortunately, not every research team understands this principle. The most recent high-profile case in point is the now-debunked classifier developed by a team of Australian researchers as a genetic test for autism.

An article in The Scientist, “Genetic Test for Autism Refuted” by Ed Yong, explains how the original research results were flawed. According to the article:

The Melbourne team initially tested the accuracy of their risk classifier on the same group of people whom they used to identify their SNP set. This is bad practice. To appropriately assess the accuracy of a classifier, the sample which is used to develop it must be fully distinct from the sample on which it is tested.
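To see why this matters, here is a small illustrative sketch (hypothetical data, not the Melbourne team's; it uses scikit-learn's `SelectKBest` and `LogisticRegression`): even with purely random "genotype" features and random labels, selecting the most label-correlated features and then scoring the classifier on the same samples yields impressively inflated accuracy, while fresh samples drawn from the same random process expose the illusion.

```python
# Hypothetical demonstration: random "SNP" features, random labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 5000)).astype(float)  # random genotypes
y = rng.integers(0, 2, size=100)                        # random labels

# Select "predictive" SNPs using ALL samples, then fit and score on
# those very same samples -- the mistake described above.
selector = SelectKBest(f_classif, k=20).fit(X, y)
clf = LogisticRegression().fit(selector.transform(X), y)
print("same-sample accuracy:", clf.score(selector.transform(X), y))

# Score on genuinely new samples drawn from the same (random) process;
# accuracy collapses toward chance, revealing the earlier number was noise.
X_new = rng.integers(0, 3, size=(100, 5000)).astype(float)
y_new = rng.integers(0, 2, size=100)
print("fresh-sample accuracy:", clf.score(selector.transform(X_new), y_new))
```

The same-sample score looks far better than the fresh-sample score, even though there is no real signal at all: the feature selection step has simply memorized noise in the evaluation data.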

As any machine learning practitioner will tell you, this is poor practice because it “contaminates” the model with information about the test set, which can easily lead to poor generalization. The moral of the story: always hold out a completely separate data set on which to test your final model, after all parameter tuning and training has been done. For more information about the division of training and test sets, see HERE.
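As a minimal sketch of the recommended workflow, the snippet below (using scikit-learn on a synthetic dataset; the dataset and parameter grid are illustrative assumptions, not from the original study) holds out a test set before any tuning, tunes via cross-validation on the training portion only, and touches the test set exactly once at the end.

```python
# Sketch: keep the test set untouched until all tuning is finished.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out the test set FIRST; it plays no role in model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# All parameter tuning happens via cross-validation on the training set only.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Only after tuning is complete do we score once on the held-out test set.
print("held-out test accuracy:", search.score(X_test, y_test))
```

Because the test rows never enter the feature selection, fitting, or tuning steps, the final score is an honest estimate of how the model will behave on unseen data.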