I take it with a grain of salt when a book author makes a comment like “This is the first book on synthetic data for deep learning, and its breadth of coverage may render this book as the default reference on synthetic data for years to come.” An author declaring his book the seminal text on the subject is a heady claim, but in this case, after carefully going through the book from cover to cover, I think I can agree with the premise. This book does in fact serve as a comprehensive survey of this burgeoning field. So if you’re interested in synthetic data, spending time with this book is definitely worthwhile.
“Synthetic Data for Deep Learning,” by Sergey I. Nikolenko (published by Springer), represents a very good academic treatment of the subject. But what gives the book more street cred is the fact that the author is also Chief Research Officer for Synthesis AI, a start-up company pioneering this accelerating field. It’s nice to know the book represents both the academic and practical perspectives of the topic.
The rising importance of synthetic data is real. A recent industry prediction by Gartner projects that by 2025, synthetic data will reduce personal customer data collection, avoiding 70% of privacy violation sanctions. The new methods use a radically different approach compared to classic graphics tools, and achieve new highs of photo realism.
“As synthetic data, which is generated using AI techniques, grows in popularity, it can serve as a proxy for real data, reducing or eliminating risks of exposing private consumer information or sensitive data. Because it’s not real data, it lessens regulatory concerns and can actually provide more precise insights as AI can better model often-unpredictable customer behavior. Synthetic data can be used to train and test AI models to handle unplanned disruptions, unexpected events and scenario planning — ultimately creating a more resilient organization.”
Synthetic data refers to computer-generated images and simulations used to train computer vision models. Synthetic data is emerging as an essential element in building accurate and capable AI models, as it provides developers with vast amounts of perfectly labeled data on demand.
The book’s coverage of synthetic data in support of deep learning, specifically computer vision, is solid. It takes a three-pronged approach to describing the use of synthetic data in machine learning:
- Using synthetically generated data sets to train machine learning models directly. Most of the book is devoted to this approach, which is often taken in computer vision. The discussion covers training models on synthetic data with the intention of using them on real data (see Chapter 6 – Synthetic Data for Basic Computer Vision Problems, Chapter 7 – Synthetic Simulated Environments, and Chapter 8 – Synthetic Data Outside Computer Vision). Also discussed is training generative models that refine synthetic data to make it more suitable for training, or that adapt the model so it can be trained on synthetic data (see Chapter 10 – Synthetic-to-Real Domain Adaptation and Refinement).
- Using synthetic data to augment existing real data sets so that the resulting hybrid data sets are better suited for training the models. In this case, synthetic data is usually employed to cover parts of the data distribution that are not sufficiently represented in the real data set, with the main purpose being to alleviate data set bias. The synthetic data can be generated separately with computer-generated imagery (CGI)-based methods for computer vision (see Chapter 3 – Deep Neural Networks for Computer Vision, and Chapter 7).
- Using synthetic data for a variety of use cases including healthcare, finance, and social sciences. Included is a discussion of privacy guarantees (see Chapter 11 – Privacy Guarantees in Synthetic Data).
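
As a rough illustration of the second approach in the list above (augmenting a real data set so that under-represented classes are better covered), here is a minimal Python sketch. It is not taken from the book: the helper name `augment_with_synthetic`, the `target_per_class` parameter, and the toy samples are all assumptions made for the example.

```python
from collections import Counter


def augment_with_synthetic(real, synthetic, target_per_class):
    """Hypothetical helper: top up under-represented classes.

    real and synthetic are lists of (features, label) pairs. Synthetic
    samples are added only while a class has fewer than target_per_class
    samples, so well-covered classes stay purely real.
    """
    counts = Counter(label for _, label in real)
    # Tag every sample with its origin so the mix can be inspected later.
    hybrid = [(x, label, "real") for x, label in real]
    for x, label in synthetic:
        if counts[label] < target_per_class:
            hybrid.append((x, label, "synthetic"))
            counts[label] += 1
    return hybrid


# Toy data (made up): "pedestrian" is under-represented in the real set.
real = [([0.1], "car"), ([0.2], "car"), ([0.3], "car"), ([0.9], "pedestrian")]
synthetic = [
    ([0.80], "pedestrian"),
    ([0.85], "pedestrian"),
    ([0.95], "pedestrian"),
    ([0.15], "car"),
]

hybrid = augment_with_synthetic(real, synthetic, target_per_class=3)
label_counts = Counter(label for _, label, _ in hybrid)
print(label_counts)
```

In this toy setup only the minority class receives synthetic samples, which mirrors the idea of using synthetic data to reduce data set bias rather than to replace the real data.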
As an added bonus for those new to deep learning, Chapter 2 – Deep Learning and Optimization is a brief introduction to the field, including a nice overview in Section 2.2, “A (Very) Brief Introduction to Machine Learning.”
I would judge this book as typical of the texts I encountered in graduate school, so as an academic, I felt right at home consuming the material. There is a fair amount of the mathematics you typically see in the field, requiring an understanding of statistics (Bayesian), linear algebra, and calculus, all of which I appreciated.
If I had to come up with a criticism of the book, it would be its lack of an index. In a field as nomenclature-rich as deep learning, a well-crafted index would go a long way. In fact, I don’t believe I have ever seen a Springer book without one. But on the plus side, the book includes a comprehensive 50+ page References section that contains a number of real gems, including citations for seminal papers that constitute the genesis of deep learning. I’ve already located a handful of papers I did not have in my personal collection.
All in all, this text will be a valuable addition to your professional library. Highly recommended.
Contributed by Daniel D. Gutierrez, Editor-in-Chief and Resident Data Scientist for insideAI News. In addition to being a tech journalist, Daniel is also a consultant in data science, an author, and an educator, and he sits on a number of advisory boards for various start-up companies.