Recommender System (For Movies)
What is a Recommendation System?
It’s one of the most popular data science applications. It’s a system that predicts the likelihood that a user would prefer an item, based on their past behavior. That can be done by employing a machine learning algorithm, which can predict a user’s preferences for a particular entity. There is a wide variety of applications for recommendation systems, and they are used by many of the big technology companies to recommend products to their customers. For instance, Amazon uses recommendation systems for product recommendations, YouTube for video recommendations, Netflix and IMDb for movie recommendations, and Facebook and Twitter for friend recommendations.
- The diagram below demonstrates the recommender systems method.
Recommendation System Mechanism:
The engine of the recommendation system filters the data using different machine learning algorithms and, based on that filtering, predicts the most relevant entities to recommend. After studying the users’ previous behavior, it recommends products or services that the user may be interested in.
The recommendation engine works in these 3 steps:
1- Data Collection:
The techniques that can be used to collect data are:
- Explicit, where data are provided intentionally as information (e.g. user input such as movie ratings)
- Implicit, where data are not provided intentionally but are gathered from available data streams (e.g. search history, clicks, order history, etc.)
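As a rough illustration of the two collection styles, the sketch below keeps an explicit rating next to an interest score inferred from implicit events. The event names and their weights are hypothetical, chosen only for this example; a real system would tune them.

```python
from collections import defaultdict

# Explicit feedback: the user states a preference directly (e.g. a rating).
explicit_ratings = [
    {"user_id": 1, "movie_id": 10, "rating": 4.0},
]

# Implicit feedback: logged behavior, not a stated opinion.
implicit_events = [
    {"user_id": 1, "movie_id": 20, "event": "click"},
    {"user_id": 1, "movie_id": 20, "event": "watch"},
    {"user_id": 1, "movie_id": 30, "event": "search"},
]

# Hypothetical weights: how strongly each event type hints at interest.
EVENT_WEIGHTS = {"search": 0.5, "click": 1.0, "watch": 3.0}


def implicit_scores(events, weights=EVENT_WEIGHTS):
    """Sum the event weights per (user, movie) pair into a rough interest score."""
    scores = defaultdict(float)
    for event in events:
        key = (event["user_id"], event["movie_id"])
        scores[key] += weights.get(event["event"], 0.0)
    return dict(scores)


scores = implicit_scores(implicit_events)
```

Here the click and the watch on movie 20 add up to a higher score than the single search for movie 30, even though the user never rated either movie.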
2- Data Storage:
The data can be stored in cloud storage such as an SQL database, a NoSQL database, or some kind of object storage, depending on the type and amount of data. The more data the storage can hold for the model, the better the recommendation system can be.
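To make the SQL option concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name, columns and rows are assumptions made for the example, not the storage layout of the project.

```python
import sqlite3

# In-memory database for the example; a real system would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ratings (
        user_id  INTEGER NOT NULL,
        movie_id INTEGER NOT NULL,
        rating   REAL    NOT NULL,
        PRIMARY KEY (user_id, movie_id)
    )
    """
)

# Hypothetical rows: (user_id, movie_id, rating).
rows = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0)]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
conn.commit()

# Average rating per movie, the kind of aggregate a model might read back.
avg_by_movie = dict(
    conn.execute(
        "SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id"
    ).fetchall()
)
conn.close()
```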
3- Recommendation System Methods:
There are several methods in recommendation systems, but two major approaches are used to filter the data:
- Collaborative Filtering
It makes recommendations based on a combination of your own experience and the experiences of other people.
- Content-Based Filtering (the one that I used in implementing my movie recommendation system)
It is based on product attributes, that is, the item descriptions and the preferences in the user’s profile. It calculates the similarity between different products on the basis of their attributes. It treats recommendation as a user-specific classification problem and learns a classifier for the user’s likes and dislikes based on product features.
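The similarity calculation described above can be sketched with cosine similarity over binary genre vectors. This is only one possible similarity measure, and the genre vocabulary and the three movies are hypothetical; the source does not state which measure or attributes the project used.

```python
import math

# Hypothetical genre vocabulary and movies; 1 means the movie has that genre.
GENRES = ["Action", "Comedy", "Drama", "Romance"]
movies = {
    "Movie A": [1, 0, 1, 0],
    "Movie B": [1, 0, 1, 1],
    "Movie C": [0, 1, 0, 1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two attribute vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(title, catalog, top_n=2):
    """Rank every other movie in the catalog by similarity to `title`."""
    target = catalog[title]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in catalog.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


recommendations = most_similar("Movie A", movies)
```

Movie B shares two genres with Movie A and so ranks first, while Movie C shares none and gets a similarity of zero.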
- The diagram below demonstrates content-based filtering recommender systems.
Recommendation System Applications:
Recommendation systems have a wide variety of applications, especially in the data science field. For example, music and video companies such as Netflix, YouTube and Spotify use them to generate music and video recommendations. Amazon uses them for product recommendations. Social media platforms such as Facebook and Twitter use them for friend and content recommendations. Restaurants and hotels use them to generate food-related recommendations. They are also used for research articles, financial services and life insurance.
Implementing a Movie Recommendation System in Python
The movie recommender system I developed uses the correlation between movie attributes: it finds the similarities between movies in order to make suitable recommendations for the user. It uses the MovieLens data from Kaggle and employs a machine learning algorithm to filter the data with the content-based filtering method, in order to make those evaluations and predictions. It also uses the K-nearest neighbor classifier model, which finds the k most similar items to a particular instance based on a given distance metric.
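To show what "the k most similar items based on a given distance metric" means in practice, here is a small hand-written K-nearest neighbor classifier. It is a sketch for illustration, not the project's code: the Euclidean metric, the two-feature vectors and the "high"/"low" labels are all assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's distance to the query with its label.
    by_distance = sorted(
        ((distance(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical features: [is_drama, is_comedy]; hypothetical labels: rating bucket.
train_X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
train_y = ["high", "high", "low", "low", "high"]

pred_drama = knn_predict(train_X, train_y, [1, 0], k=3)
pred_comedy = knn_predict(train_X, train_y, [0, 1], k=3)
```

With k=3, the query [1, 0] has three "high" neighbors closest to it, while [0, 1] has two "low" neighbors and one "high", so the majority vote differs.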
- The diagram below demonstrates the K-nearest neighbor classifier model.
After doing some Exploratory Data Analysis (EDA), I found that there are only 6 features in the 2 merged datasets. Thus, I decided to extract as many new features as possible from the given ones. Here are some things I noticed while exploring the dataset:
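The source does not list which features were derived, so the following is only a guess at the kind of extraction that is possible. It assumes rows in the usual MovieLens movie format (a title ending in the release year, and pipe-separated genres) and derives a year, a genre count, and one indicator column per genre. The feature names are hypothetical.

```python
import re

# Example rows in the MovieLens movie format: title with year, pipe-separated genres.
movies = [
    {"movieId": 1, "title": "Toy Story (1995)",
     "genres": "Adventure|Animation|Children|Comedy|Fantasy"},
    {"movieId": 2, "title": "Jumanji (1995)",
     "genres": "Adventure|Children|Fantasy"},
]

# Matches a four-digit year in parentheses at the end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_features(movie, genre_vocab):
    """Derive year, genre count and one-hot genre columns from one movie row."""
    match = YEAR_PATTERN.search(movie["title"])
    movie_genres = set(movie["genres"].split("|"))
    features = {
        "movieId": movie["movieId"],
        "year": int(match.group(1)) if match else None,
        "n_genres": len(movie_genres),
    }
    for genre in genre_vocab:
        features[f"genre_{genre}"] = int(genre in movie_genres)
    return features


# Build the genre vocabulary from the data, then featurize every movie.
genre_vocab = sorted({g for m in movies for g in m["genres"].split("|")})
feature_rows = [extract_features(m, genre_vocab) for m in movies]
```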
About the dataset:
• Number of Movies in the Dataset: 10325 movies
• Number of Users in the Dataset: 668 users
Plot 1:
- Most of the rated movies have a rating of 4.0
- Only 1198 movies have a rating of 0.5 (the lowest rating)
Plot 2:
- It shows the number of movies in each of the top 10 genres in this dataset.
- The genre with the highest number of movies is Drama.
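A count like the one in Plot 2 can be produced roughly as follows, assuming the genres are stored as pipe-separated strings. The five rows below are made up for the example, and a movie with several genres is counted once under each of them.

```python
from collections import Counter

# Hypothetical genre strings, one per movie.
genre_strings = [
    "Drama|Romance",
    "Comedy",
    "Drama",
    "Action|Drama|Thriller",
    "Comedy|Romance",
]

# Split each string and count every genre occurrence.
genre_counts = Counter(
    genre for row in genre_strings for genre in row.split("|")
)

# Keep at most the 10 most frequent genres, most frequent first.
top_genres = genre_counts.most_common(10)
```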
Project Results:
After using the K-nearest neighbor classifier as the model to make predictions, its accuracy score was 48.5%, which beat the baseline by 48.2%.
I tried to optimize the model using the best parameters found by GridSearchCV, but the accuracy did not increase.
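The source does not say how the baseline was computed. One common choice, assumed here, is a majority-class baseline that always predicts the most frequent label in the training data. The sketch below compares such a baseline with a model's accuracy on made-up labels; the numbers are illustrative and are not the project's results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for each of n test rows."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Hypothetical rating labels.
y_train = [4.0, 4.0, 3.0, 5.0, 4.0, 3.5]
y_test = [4.0, 3.0, 4.0, 5.0]
model_pred = [4.0, 3.0, 4.0, 4.0]  # hypothetical model output

baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
model_acc = accuracy(y_test, model_pred)
improvement = model_acc - baseline_acc  # difference in accuracy (as a fraction)
```

Reporting the improvement as a difference between the two accuracies makes it clear how much of the score comes from the model rather than from the label distribution.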
Further Recommendations:
Although I extracted 20 more features from the original 6, there was still a shortage of information about the movies and their details. So, I believe that the accuracy score could be better if I had more details related to the movies (e.g. actors and director).
Links:
GitHub Repo of the project
Pager of the project