Ask a Data Scientist: Recommender Systems

Welcome to the first in a series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideAI News feature as a valuable resource for getting up to speed in this flourishing area of technology. If you have a big data question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who wants to know more about recommender systems.

Q: What is a recommender system? How does it work?

A: A recommender system is usually where I start when I’m explaining what I do as a data scientist, since it is the application of data science that people identify with most readily. Think of your experience using Amazon or Netflix when you receive cross-sell recommendations – suggested book titles and movies, respectively. The recommendations are the result of the online business recording customer behavior (e.g. when you make a purchase or submit a rating) and then using a machine learning algorithm to predict what other products you might like.

Benefits of recommender systems to the businesses using them include:

  • Unique, personalized service for the customer
  • Increased trust and customer loyalty
  • Increased sales, click-through rates, conversions, etc.
  • Opportunities for promotion and persuasion
  • More knowledge about customers

Recommender systems have changed the way people find products, information, and even other people. They study patterns of behavior to predict what someone will prefer from among a collection of things he or she has never experienced.

The technology behind recommender systems has evolved over the past 20 years into a rich collection of tools that enable the practitioner or researcher to develop effective recommenders. The algorithms used in recommender systems include content-based filtering, user-user collaborative filtering, item-item collaborative filtering, dimensionality reduction, and interactive critique-based recommenders.

How does a recommender system work? In the classical model of a recommender system there are “users” and “items.” Users (often customers) have associated metadata (or content) such as age, gender, race and other demographic information. Items (often products) also have metadata such as text description, price, weight, etc. On top of that, there are interactions (e.g. transactions) between users and items: user A downloads or purchases movie B, user X gives product Y a rating of 5, and so on.
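
As a rough illustration of this model, here is a minimal Python sketch of users, items, and the interactions between them. The class names, field names, and sample values are all assumptions made up for this example; they are not taken from any particular recommender library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class User:
    user_id: str
    metadata: dict = field(default_factory=dict)  # e.g. demographic information


@dataclass
class Item:
    item_id: str
    metadata: dict = field(default_factory=dict)  # e.g. description, price, weight


@dataclass
class Interaction:
    user_id: str
    item_id: str
    kind: str                       # "purchase", "download", "rating", ...
    rating: Optional[float] = None  # only set when the user gave an explicit rating


# Hypothetical sample data mirroring the examples in the text.
users = [User("A", {"age_group": "25-34"}), User("X", {"age_group": "35-44"})]
items = [Item("B", {"type": "movie"}), Item("Y", {"type": "product", "price": 24.50})]
interactions = [
    Interaction("A", "B", kind="purchase"),            # user A purchases movie B
    Interaction("X", "Y", kind="rating", rating=5.0),  # user X rates product Y a 5
]


def items_touched_by(user_id, log):
    """Return the sorted ids of all items the user has interacted with."""
    return sorted({event.item_id for event in log if event.user_id == user_id})
```

Keeping the interaction log separate from the user and item metadata is one reasonable design: collaborative filtering only needs the log, while content-based methods would draw on the metadata.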

Collaborative filtering is the most prominent approach to generating recommendations. It is used by large, commercial e-commerce sites and is well understood, with various algorithms and variations. It is also applicable to many domains, e.g. books, movies, apparel, etc. The general approach is to use the wisdom of the crowd to recommend items.

The basic idea and assumptions behind collaborative filtering are that customers give ratings to catalog items, either explicitly or implicitly, and that customers who had similar tastes in the past will have similar tastes in the future. In this approach, we look purely at the interactions between users and items and use them to make our recommendations.

The interaction data can be represented as a matrix (usually very sparse). Each cell represents the interaction between one user and one item. For example, the cell can contain the rating that the user gave to the item, or it can be just an indicator of whether an interaction between that user and item has happened.

The algorithm then guesses what a given missing value in the matrix should be. For example, to guess how user X will rate item A when we know this user’s rating of item B, we can look at all users (or just those in the same age group as user X) who have rated both item A and item B and compute an average from their ratings. We can then use that average to estimate user X’s rating of item A given his or her rating of item B.
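
The guess described above can be sketched in a few lines of Python. This is only one possible reading of “compute an average”: it averages how much more (or less) the co-raters liked item A than item B and applies that difference to user X’s rating of item B, in the spirit of the Slope One family of predictors. The ratings and the function names `co_raters` and `predict_rating` are invented for illustration.

```python
from statistics import mean

# Sparse user-item matrix: only the observed cells are stored.
# All ratings below are made up for this example.
ratings = {
    "U1": {"A": 5.0, "B": 4.0},
    "U2": {"A": 3.0, "B": 3.0},
    "U3": {"A": 4.0, "B": 2.0},
    "U4": {"B": 5.0},  # has not rated A, so cannot help with this guess
    "X": {"B": 3.0},   # the cell (X, A) is the missing value we want to fill
}


def co_raters(ratings, item_a, item_b, exclude=None):
    """Users (other than `exclude`) who have rated both items."""
    return [
        user
        for user, row in ratings.items()
        if user != exclude and item_a in row and item_b in row
    ]


def predict_rating(ratings, user, target, known):
    """Guess `user`'s rating of `target` from their rating of `known`.

    Returns None when the user has not rated `known` or nobody has rated both.
    """
    if known not in ratings.get(user, {}):
        return None
    peers = co_raters(ratings, target, known, exclude=user)
    if not peers:
        return None
    # Average difference between the two items among the co-raters.
    avg_diff = mean([ratings[p][target] - ratings[p][known] for p in peers])
    return ratings[user][known] + avg_diff


prediction = predict_rating(ratings, "X", "A", "B")
print(prediction)
```

With this sample data the co-raters are U1, U2 and U3, who on average rated item A one point higher than item B, so the sketch adds one point to user X’s rating of item B. A real system would also clip the result to the rating scale and could restrict the co-raters to a demographic group, as the text suggests.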

Making recommendations at scale is an important characteristic of today’s online landscape. To achieve this level of scalability, you might want to choose an architecture like Hadoop and use a machine learning library like Apache Mahout to construct the recommendation model.

If you have a question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com.

Data Scientist: Daniel D. Gutierrez – Managing Editor, insideAI News

Comments

  1. Great article. Could you explain the relative benefits of open vs. proprietary Hadoop systems? I would like to understand the benefits of and concerns about the two offerings in the market. Thanks! Kevin

    • This is an ongoing debate in the Hadoop ecosystem, purely open source versus proprietary, and it echoes debates in other software genres going back years. On one side are open source purists who believe that data infrastructure software should be free and open, with revenue generated solely from services. On the other side are those who feel the only viable long-term business model is selling proprietary software built on top of an open core. An example of the former is Hortonworks, and an example of the latter is Cloudera, based on comments the management of both firms have made publicly. It’s too early to tell which model will prevail. But there is one major point on which Cloudera and Hortonworks agree: open source software has proven that it can significantly disrupt entrenched and highly lucrative markets.